Mean Time to Restore
Mean Time to Restore (MTTR) is the average time it takes to recover from a production failure or incident. In the DORA framework, MTTR covers the window from the moment an incident is detected to the moment service is fully restored after a degradation or outage. It is the DORA metric most directly tied to user-facing reliability.
Formula
MTTR = Sum of (incident resolved time − incident triggered time) ÷ Number of incidents

MTTR breaks down into three components:
- Detection time: alert triggered → team acknowledges (also called MTTD)
- Response time: acknowledgment → active remediation begins
- Recovery time: remediation begins → service restored

MTTR = Detection time + Response time + Recovery time

Example: 3 incidents with durations of 45 minutes, 2 hours, and 1.5 hours
MTTR = (45 + 120 + 90) ÷ 3 = 85 minutes
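The formula above is a straight average of incident durations. A minimal sketch of the calculation, using the three incidents from the worked example (timestamps are illustrative):

```python
from datetime import datetime, timedelta

def mttr_minutes(incidents):
    """Mean Time to Restore in minutes.

    `incidents` is a list of (triggered_at, resolved_at) datetime pairs.
    """
    if not incidents:
        return 0.0
    total_seconds = sum(
        (resolved - triggered).total_seconds()
        for triggered, resolved in incidents
    )
    return total_seconds / len(incidents) / 60

# The three incidents from the example: 45 min, 2 h, 1.5 h
t0 = datetime(2024, 1, 1, 9, 0)
incidents = [
    (t0, t0 + timedelta(minutes=45)),
    (t0, t0 + timedelta(hours=2)),
    (t0, t0 + timedelta(hours=1, minutes=30)),
]
print(mttr_minutes(incidents))  # → 85.0
```

Note that the clock starts at the triggered timestamp, not at acknowledgment, so detection latency is included in the average.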
Industry Benchmarks
| Performance Level | MTTR Benchmark |
|---|---|
| Elite | Less than one hour |
| High | Less than one day |
| Medium | Less than one week |
| Low | More than one week |
Benchmarks from the 2023 State of DevOps Report (DORA). MTTR correlates strongly with on-call process maturity, runbook completeness, and observability tooling. Elite MTTR (< 1 hour) requires automated alerting, clear escalation paths, and well-practiced incident playbooks.
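The benchmark tiers above map cleanly onto threshold checks. A small helper (the function name is ours, not part of any DORA tooling) that classifies a team's MTTR against the 2023 State of DevOps levels:

```python
def dora_mttr_tier(mttr_hours: float) -> str:
    """Map an MTTR value in hours to a DORA performance level,
    using the 2023 State of DevOps benchmarks."""
    if mttr_hours < 1:
        return "Elite"       # less than one hour
    if mttr_hours < 24:
        return "High"        # less than one day
    if mttr_hours < 24 * 7:
        return "Medium"      # less than one week
    return "Low"             # more than one week

# The 85-minute MTTR from the earlier example lands in "High"
print(dora_mttr_tier(85 / 60))  # → High
```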
Data Sources
- PagerDuty: incident triggered, acknowledged, and resolved timestamps
- incident.io: incident declared, mitigated, and resolved timeline events
- Opsgenie (legacy): alert open, ack, and close timestamps (Koalr migration wizard preserves history)
- GitHub: hotfix PR creation and merge timestamps as a proxy for remediation time
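Whichever source you pull from, the computation reduces to subtracting a triggered timestamp from a resolved timestamp. A sketch over hypothetical incident records (the field names below are an assumed export shape, not the exact PagerDuty or incident.io API schema):

```python
from datetime import datetime

# Hypothetical records shaped like an incident-tool export;
# field names are an assumption, not a real API schema.
incidents = [
    {"id": "A1", "triggered_at": "2024-03-01T02:10:00Z",
     "resolved_at": "2024-03-01T02:55:00Z"},
    {"id": "B2", "triggered_at": "2024-03-04T14:00:00Z",
     "resolved_at": "2024-03-04T16:00:00Z"},
]

def parse_ts(ts: str) -> datetime:
    # Normalize the trailing "Z" for fromisoformat on older Pythons
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))

durations_min = [
    (parse_ts(i["resolved_at"]) - parse_ts(i["triggered_at"])).total_seconds() / 60
    for i in incidents
]
mttr = sum(durations_min) / len(durations_min)
print(mttr)  # → 82.5 (average of 45 min and 120 min)
```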
Why Mean Time to Restore matters
MTTR is the metric that engineering leaders show to boards and customers as evidence of reliability. A system that goes down rarely but takes 6 hours to restore is often harder to operate than one that goes down more often but restores in 15 minutes. MTTR reflects the maturity of your on-call process, the quality of your observability tooling, and the effectiveness of your runbooks. High MTTR is also a leading indicator of on-call burnout: responders who spend hours firefighting per incident accumulate stress, context-switching cost, and sleep debt that compounds into attrition.
Common measurement mistakes
Measuring MTTR from the moment a developer starts working on a fix, not from when the incident was triggered. MTTR must include detection time — incidents that go undetected for hours before being acknowledged have that entire window counted against MTTR, which is the correct behavior. Excluding detection time masks alerting latency problems.
Averaging MTTR across all severity levels without segmenting. A P1 outage mixed with a P4 minor degradation produces a meaningless average. Always segment MTTR by severity level (P1, P2, P3) and report each separately.
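Segmenting by severity is a simple group-by before averaging. A sketch, assuming each incident carries a severity label and a duration in minutes:

```python
from collections import defaultdict

def mttr_by_severity(incidents):
    """Average incident durations per severity bucket, so a P1 outage
    never blends with P4 noise in a single meaningless number.

    `incidents` is a list of (severity, duration_minutes) pairs.
    """
    buckets = defaultdict(list)
    for severity, duration_min in incidents:
        buckets[severity].append(duration_min)
    return {sev: sum(d) / len(d) for sev, d in buckets.items()}

incidents = [("P1", 240), ("P1", 60), ("P3", 10), ("P3", 20)]
print(mttr_by_severity(incidents))  # → {'P1': 150.0, 'P3': 15.0}
```

Reporting each bucket separately also makes the P1 number actionable: a 150-minute P1 MTTR is a clear signal, while the blended 82.5-minute average would hide it.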
Declaring incidents resolved before service is actually restored. "Resolved" should mean the system is back to its normal operating state, not that the engineering team has stopped actively responding. Premature resolution timestamps deflate MTTR.
How Koalr measures Mean Time to Restore
Koalr calculates MTTR from PagerDuty or incident.io incident timelines. The clock starts when an incident is triggered (or declared, in incident.io) and stops when it is resolved. Koalr segments MTTR by severity level, by team (based on on-call schedule ownership), and by time period. MTTR trends are displayed alongside deployment frequency and CFR charts so that teams can see whether their operational load is improving alongside their delivery metrics. For teams migrating from Opsgenie, Koalr's migration wizard preserves historical incident data so MTTR trends survive the platform switch.
Related DORA metrics
See your Mean Time to Restore in Koalr
Connect GitHub and your deployment platform to start tracking all four DORA metrics automatically. No manual data entry, no spreadsheets.