DORA Metric · MTTR · Mean Time to Recovery · Mean Time to Repair · MTTR Service

Mean Time to Restore

Mean Time to Restore (MTTR) measures the average time it takes to recover from a production failure or incident. In the DORA framework, MTTR specifically measures how long it takes to restore service after a degradation or outage: from the moment an incident is triggered to the moment service is fully restored. It is the DORA metric most directly tied to user-facing reliability.

Formula

MTTR = Sum of (Incident resolved time − Incident triggered time) ÷ Number of incidents

MTTR components:
- Detection time: Alert triggered → team acknowledges (commonly tracked on its own as MTTA, mean time to acknowledge)
- Response time: Acknowledgment → active remediation begins
- Recovery time: Remediation begins → service restored

Time to restore (per incident) = Detection time + Response time + Recovery time
MTTR is the mean of this per-incident value across all incidents.

Example: 3 incidents with durations of 45 min, 2 hours, and 1.5 hours
MTTR = (45 + 120 + 90) ÷ 3 = 85 minutes
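
The formula can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `mean_time_to_restore` and the `(triggered, resolved)` tuple layout are assumptions made for the example, and the timestamps are invented to reproduce the three durations above.

```python
from datetime import datetime, timedelta


def mean_time_to_restore(incidents):
    """Return the mean of (resolved - triggered) across incidents as a timedelta.

    `incidents` is a list of (triggered, resolved) datetime pairs.
    """
    if not incidents:
        raise ValueError("MTTR is undefined when there are no incidents")
    total = sum((resolved - triggered for triggered, resolved in incidents), timedelta())
    return total / len(incidents)


# Hypothetical incidents lasting 45 minutes, 2 hours, and 1.5 hours.
start = datetime(2024, 1, 1, 9, 0)
incidents = [
    (start, start + timedelta(minutes=45)),
    (start, start + timedelta(hours=2)),
    (start, start + timedelta(hours=1, minutes=30)),
]

mttr = mean_time_to_restore(incidents)
print(mttr.total_seconds() / 60)  # 85.0
```

Keeping the result as a `timedelta` rather than a bare number avoids unit mix-ups when the value is later compared against hour- or day-based thresholds.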

Industry Benchmarks

Performance Level    Benchmark (time to restore)
Elite                Less than one hour
High                 Less than one day
Medium               Less than one week
Low                  More than one week

Benchmarks from the 2023 State of DevOps Report (DORA). MTTR correlates strongly with on-call process maturity, runbook completeness, and observability tooling. Elite MTTR (< 1 hour) requires automated alerting, clear escalation paths, and well-practiced incident playbooks.
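
As a rough sketch, the benchmark table can be turned into a lookup that labels a computed MTTR. The function name `dora_performance_level` is hypothetical, and the handling of an MTTR of exactly one week is an assumption, since the table leaves that boundary unspecified.

```python
from datetime import timedelta


def dora_performance_level(mttr):
    """Map an MTTR (timedelta) to the performance levels in the benchmark table."""
    if mttr < timedelta(hours=1):
        return "Elite"
    if mttr < timedelta(days=1):
        return "High"
    if mttr < timedelta(weeks=1):
        return "Medium"
    # Assumption: exactly one week is treated as Low; the table does not say.
    return "Low"


print(dora_performance_level(timedelta(minutes=85)))  # High
```

The 85-minute MTTR from the worked example lands in the High tier: it misses the Elite threshold by 25 minutes.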

Data Sources

  • PagerDuty: incident triggered, acknowledged, and resolved timestamps
  • incident.io: incident declared, mitigated, and resolved timeline events
  • Opsgenie (legacy): alert open, ack, and close timestamps (Koalr migration wizard preserves history)
  • GitHub: hotfix PR creation and merge timestamps as a proxy for remediation time

Why Mean Time to Restore matters

MTTR is the metric that engineering leaders show to boards and customers as evidence of reliability. A system that goes down rarely but takes 6 hours to restore is often harder to operate than one that goes down more often but restores in 15 minutes. MTTR reflects the maturity of your on-call process, the quality of your observability tooling, and the effectiveness of your runbooks. High MTTR is also a leading indicator of on-call burnout: responders who spend hours firefighting per incident accumulate stress, context-switching cost, and sleep debt that compounds into attrition.

Common measurement mistakes

Measuring MTTR from the moment a developer starts working on a fix, not from when the incident was triggered. MTTR must include detection time: if an incident sits unacknowledged for hours after it is triggered, that entire window counts toward MTTR, as it should. Excluding detection time masks alerting latency problems.
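
To make the size of this error concrete, the sketch below splits one incident into the detection, response, and recovery components from the Formula section and compares the two clocks. The `IncidentTimeline` class, its field names, and the timestamps are all hypothetical and do not correspond to any incident tool's API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class IncidentTimeline:
    # Illustrative field names, not taken from any incident tool's API.
    triggered: datetime
    acknowledged: datetime
    remediation_started: datetime
    resolved: datetime

    @property
    def detection_time(self) -> timedelta:
        return self.acknowledged - self.triggered

    @property
    def response_time(self) -> timedelta:
        return self.remediation_started - self.acknowledged

    @property
    def recovery_time(self) -> timedelta:
        return self.resolved - self.remediation_started

    @property
    def time_to_restore(self) -> timedelta:
        # Clock starts at trigger, so detection and response are included.
        return self.resolved - self.triggered


# Hypothetical overnight incident that went unacknowledged for 2.5 hours.
incident = IncidentTimeline(
    triggered=datetime(2024, 3, 5, 2, 0),
    acknowledged=datetime(2024, 3, 5, 4, 30),
    remediation_started=datetime(2024, 3, 5, 4, 45),
    resolved=datetime(2024, 3, 5, 5, 15),
)

full = incident.time_to_restore    # 3 h 15 min, measured from trigger
fix_only = incident.recovery_time  # 30 min, measured from start of the fix
hidden = full - fix_only           # 2 h 45 min that the wrong clock hides
print(full, fix_only, hidden)
```

In this example, starting the clock at remediation would report 30 minutes for an incident that affected users for over three hours.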

Averaging MTTR across all severity levels without segmenting. A P1 outage mixed with a P4 minor degradation produces a meaningless average. Always segment MTTR by severity level (for example, P1 through P4) and report each level separately.
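
A small sketch shows how a blended average hides the picture. The function name `mttr_by_severity` and the incident durations are invented for illustration; durations are in minutes.

```python
from collections import defaultdict
from statistics import mean


def mttr_by_severity(incidents):
    """Return {severity: mean minutes to restore} for (severity, minutes) pairs."""
    buckets = defaultdict(list)
    for severity, minutes in incidents:
        buckets[severity].append(minutes)
    return {severity: mean(values) for severity, values in sorted(buckets.items())}


# Hypothetical data: two long P1 outages and four short P4 degradations.
incidents = [
    ("P1", 240), ("P1", 180),
    ("P4", 10), ("P4", 20), ("P4", 15), ("P4", 15),
]

blended = mean(minutes for _, minutes in incidents)  # 80 minutes
by_severity = mttr_by_severity(incidents)            # P1: 210 minutes, P4: 15 minutes
print(blended, by_severity)
```

The blended figure of 80 minutes describes neither group: P1 incidents take 210 minutes on average, while P4 incidents take 15.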

Declaring incidents resolved before service is actually restored. "Resolved" should mean the system is back to its normal operating state, not that the engineering team has stopped actively responding. Premature resolution timestamps deflate MTTR.

How Koalr measures Mean Time to Restore

Koalr calculates MTTR from PagerDuty or incident.io incident timelines. The clock starts when an incident is triggered (or declared, in incident.io) and stops when it is resolved. Koalr segments MTTR by severity level, by team (based on on-call schedule ownership), and by time period. MTTR trends are displayed alongside deployment frequency and change failure rate (CFR) charts so that teams can see whether their operational load is improving alongside their delivery metrics. For teams migrating from Opsgenie, Koalr's migration wizard preserves historical incident data so MTTR trends survive the platform switch.
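
For readers who want to approximate this kind of team-and-period segmentation on their own exported incident data, one possible approach is sketched below. It is not Koalr's implementation: the function name `mttr_by_team_and_month`, the tuple layout, the team name, and the choice to bucket each incident by the month it was triggered are all assumptions.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def mttr_by_team_and_month(incidents):
    """Return {(team, 'YYYY-MM'): mean time to restore} as timedeltas.

    `incidents` is an iterable of (team, triggered, resolved) tuples.
    Assumption: an incident belongs to the month in which it was triggered.
    """
    buckets = defaultdict(list)
    for team, triggered, resolved in incidents:
        key = (team, triggered.strftime("%Y-%m"))
        buckets[key].append(resolved - triggered)
    return {
        key: sum(durations, timedelta()) / len(durations)
        for key, durations in buckets.items()
    }


# Hypothetical incidents for a single team across two months.
incidents = [
    ("payments", datetime(2024, 1, 10, 9, 0), datetime(2024, 1, 10, 10, 0)),
    ("payments", datetime(2024, 1, 22, 14, 0), datetime(2024, 1, 22, 14, 30)),
    ("payments", datetime(2024, 2, 3, 8, 0), datetime(2024, 2, 3, 8, 20)),
]

trend = mttr_by_team_and_month(incidents)
print(trend[("payments", "2024-01")])  # 0:45:00
print(trend[("payments", "2024-02")])  # 0:20:00
```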

See your Mean Time to Restore in Koalr

Connect GitHub and your deployment platform to start tracking all four DORA metrics automatically. No manual data entry, no spreadsheets.