DORA Metrics · March 16, 2026 · 11 min read

MTTR Deep Dive: How to Measure and Reduce Mean Time to Restore

Mean time to restore is the DORA metric that most teams measure incorrectly. They measure from alert time instead of impact start. They exclude weekends. They close incidents before the service is actually stable. The result is an MTTR number that looks impressive and hides a broken incident response process. This guide covers how to measure it correctly, what the benchmarks actually mean, and the specific tactics that move it.

What this guide covers

What MTTR actually measures (detection + response + resolution), DORA benchmarks by performance band, the three phases of MTTR and where teams lose time in each, metrics to track within each phase, tactics to reduce MTTR, the incident-to-DORA feedback loop, and common measurement mistakes.

What MTTR Actually Measures

Mean time to restore (MTTR) — sometimes called mean time to recovery — measures the average elapsed time from the start of a service degradation to the point where the service is restored to normal operation. The DORA research defines it specifically as the time to restore service after a production incident.

The key word in that definition is "start." MTTR begins when the incident begins: when customers first experience the degradation, not when your monitoring system detects it, and not when an engineer acknowledges the page. The detection lag (the time between impact start and alert firing) is part of MTTR, not separate from it. Teams that measure only from alert time systematically understate their true MTTR by anywhere from minutes to hours.

The formula is straightforward:

MTTR = mean(resolved_at − impact_start_at), averaged across incidents

The challenge is instrumentation: most incident tools record the time the incident was created (often by the monitoring alert) rather than the time the impact actually started. Accurate MTTR requires either automated impact start detection or a discipline of manually backdating incident creation to the actual start of user impact.
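As a minimal sketch, the calculation over a list of incident records might look like the following. The field names (`impact_start_at`, `resolved_at`) are illustrative, not taken from any particular incident tool's API:

```python
from datetime import datetime, timedelta

# Hypothetical incident records; field names are illustrative.
# Note: impact_start_at is when users were first affected, which may be
# earlier than when the incident record was created.
incidents = [
    {"impact_start_at": datetime(2026, 3, 1, 9, 12),
     "resolved_at":     datetime(2026, 3, 1, 10, 2)},   # 50 minutes
    {"impact_start_at": datetime(2026, 3, 8, 22, 40),
     "resolved_at":     datetime(2026, 3, 9, 1, 10)},   # 150 minutes
]

def mttr(incidents: list[dict]) -> timedelta:
    """Mean of (resolved_at - impact_start_at) across incidents."""
    durations = [i["resolved_at"] - i["impact_start_at"] for i in incidents]
    return sum(durations, timedelta()) / len(durations)

print(mttr(incidents))  # 1:40:00 (mean of 50 and 150 minutes)
```

If your tool only records `created_at`, substituting it here is exactly the measurement mistake discussed later: the detection lag silently disappears from the number.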

DORA Benchmarks: Where Does Your Team Stand?

DORA classifies MTTR performance into four bands based on the research data:

| Performance band | MTTR | What it requires |
| --- | --- | --- |
| Elite | Less than one hour | SLO-based alerting, automated rollbacks, practiced runbooks, healthy on-call rotation |
| High | Less than one day | Alert coverage on critical paths, documented runbooks, consistent escalation path |
| Medium | Less than one week | Some alerting, inconsistent runbooks, escalation driven by manual detection |
| Low | One month or more | Reactive detection (customers report issues), no runbooks, ad hoc response |

It is worth noting that MTTR alone does not tell the full story. A team with a two-hour MTTR that has 20 incidents per month is worse off in terms of total user impact than a team with a six-hour MTTR and two incidents per month. MTTR should always be read alongside change failure rate — CFR tells you how often you fail, MTTR tells you how quickly you recover when you do.
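The arithmetic behind that comparison is worth making explicit, because "lower MTTR" and "less total downtime" are not the same thing:

```python
# Total user-impact hours = incidents per month x MTTR.
# Figures below are the hypothetical teams from the text.
team_a = {"incidents_per_month": 20, "mttr_hours": 2}
team_b = {"incidents_per_month": 2,  "mttr_hours": 6}

def total_impact_hours(team: dict) -> int:
    return team["incidents_per_month"] * team["mttr_hours"]

print(total_impact_hours(team_a))  # 40 hours of monthly user impact
print(total_impact_hours(team_b))  # 12 hours of monthly user impact
```

Team A recovers three times faster per incident yet inflicts more than three times the total downtime, which is why MTTR and change failure rate have to be read together.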

The Three Phases of MTTR

MTTR is not a single event — it is the sum of three distinct phases, each with different root causes and different improvement levers. Teams that try to improve MTTR as a single number often miss which phase is actually dominating their total.

Phase 1: Detection

Detection time is the elapsed time from when the incident started affecting users to when your monitoring system fired an alert. It is the most preventable component of MTTR and often the most neglected.

Teams with poor detection time typically have one of three problems: alert thresholds that are too high (waiting for 5% error rate before alerting when 1% already degrades user experience), no alerting on critical user journeys (only infrastructure metrics, not product metrics), or alert fatigue (so many noisy low-priority alerts that the real one is buried in the noise).

SLO-based alerting — alerting when your error budget burn rate exceeds a threshold, rather than when a metric crosses an absolute threshold — is the most effective approach to reducing detection time. SLO burn rate alerts fire earlier, fire less often, and are better calibrated to actual user impact than threshold-based alerts.

Phase 2: Response

Response time is the elapsed time from alert firing to the first engineer taking meaningful action on the incident. It is driven by on-call coverage, paging configuration, acknowledgment practices, and how quickly the on-call engineer can understand the scope of the incident.

Response time problems are usually visible in the data: high acknowledgment latency (engineers not acknowledging pages promptly), high escalation rates (the primary on-call cannot diagnose the issue and has to escalate), or high incident commander assignment lag (no one picks up the coordination role quickly).

On-call rotation health is the primary lever for response time. Burned-out, overloaded on-call engineers respond more slowly, escalate more, and make more errors under pressure. Monitoring on-call burden — pages per engineer per week, incident acknowledgment latency, overtime hours during on-call shifts — is a prerequisite for improving response time sustainably.

Phase 3: Resolution

Resolution time is the elapsed time from first meaningful action to service restored. It is the phase where runbooks, rollback capability, and system observability have the most leverage.

The fastest resolutions happen when the on-call engineer already knows what to do — because this class of failure has happened before and there is a runbook that was tested and updated after the last occurrence. The slowest resolutions happen when the engineer is diagnosing a novel failure in a system they do not fully understand, without adequate observability to narrow down the cause.

Every incident is an opportunity to improve resolution time for the next occurrence of the same failure class. The blameless postmortem is the mechanism — but it only produces improvement if the action items include updating the relevant runbook and adding the missing observability that slowed diagnosis this time.

| Phase | Metric to track | Primary lever |
| --- | --- | --- |
| Detection | Alert lag (impact start → alert fired) | SLO-based alerting, synthetic monitoring |
| Response | Acknowledgment latency (alert → ack) | On-call rotation health, escalation paths |
| Resolution | Time to resolve (ack → service restored) | Runbooks, rollback capability, observability |
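Given the four timestamps above, the per-phase breakdown is a simple subtraction. A sketch, with hypothetical timestamp names:

```python
from datetime import datetime

# One incident's timeline; field names are illustrative.
incident = {
    "impact_start_at": datetime(2026, 3, 1, 9, 12),  # users first affected
    "alert_fired_at":  datetime(2026, 3, 1, 9, 20),  # monitoring detected it
    "acked_at":        datetime(2026, 3, 1, 9, 24),  # on-call acknowledged
    "resolved_at":     datetime(2026, 3, 1, 10, 2),  # service back within SLO
}

phases = {
    "detection":  incident["alert_fired_at"] - incident["impact_start_at"],
    "response":   incident["acked_at"] - incident["alert_fired_at"],
    "resolution": incident["resolved_at"] - incident["acked_at"],
}
for phase, duration in phases.items():
    print(phase, duration)  # detection 0:08, response 0:04, resolution 0:38
```

Averaging each phase separately across incidents tells you which of the three is actually dominating your total, which a single MTTR number cannot.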

Tactics to Reduce MTTR

SLO-Based Alerting

Replace threshold-based alerts ("alert when error rate > 5%") with burn rate alerts ("alert when error budget is being consumed at 14× the normal rate"). Burn rate alerts fire earlier on real user impact, fire less frequently on noise, and are naturally calibrated to your reliability targets. Google's SRE workbook provides a practical implementation framework for multi-window burn rate alerts that has become the industry standard.
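A minimal sketch of the multi-window idea, assuming a 99.9% availability SLO. The 14.4× threshold over a one-hour window (paired with a short confirmation window) follows the SRE Workbook's example policy; the error rates are hypothetical inputs from your metrics store:

```python
SLO_TARGET = 0.999
ERROR_BUDGET = 1 - SLO_TARGET  # 0.1% of requests may fail

def burn_rate(error_rate: float) -> float:
    """How many times faster than sustainable the error budget is burning."""
    return error_rate / ERROR_BUDGET

def should_page(error_rate_1h: float, error_rate_5m: float) -> bool:
    # Long window catches sustained burn; the short window confirms it is
    # still happening, so the alert resolves quickly once the burn stops.
    return burn_rate(error_rate_1h) >= 14.4 and burn_rate(error_rate_5m) >= 14.4

print(should_page(error_rate_1h=0.02, error_rate_5m=0.03))    # True: ~20x burn
print(should_page(error_rate_1h=0.0005, error_rate_5m=0.01))  # False: brief spike
```

The second call shows the calibration benefit: a short spike that never sustains over the long window does not page anyone, whereas a fixed 5% threshold would either miss the sustained 2% burn or fire on every spike.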

Runbook Discipline

A runbook is only valuable if it is current, specific, and has been tested. Stale runbooks that describe how a system worked six months ago are worse than no runbook — they send the on-call engineer down the wrong diagnostic path. Every postmortem action item should include "update runbook X to cover this failure mode." Every major system change should include "update the relevant runbook before merging."

On-Call Rotation Health

The on-call engineer is the variable that determines response and resolution speed for every incident. Burned-out on-call engineers produce slower, lower-quality incident responses. Tracking on-call burden metrics — pages per on-call shift, incidents during business hours vs. outside hours, time spent on incident response per week — gives you the data to have a concrete conversation about whether your rotation is sustainable before it produces attrition.

Blameless Postmortems

A blameless postmortem is not simply an incident review where no one gets blamed; it is a review whose goal is system improvement rather than individual accountability. The distinction matters because individuals who fear blame hide information that is essential for accurate root cause analysis. Blameless postmortems produce better root cause analysis, more specific action items, and higher team trust over time. Teams that practice them consistently drive their MTTR down over multi-quarter timelines.

The incident → DORA feedback loop

Every resolved incident should trigger an update to two DORA metrics: MTTR (did this incident change your rolling average?) and change failure rate (was this incident caused by a deployment?). The feedback loop from incident → postmortem → process improvement → DORA trend is the core improvement engine for high-performing engineering organizations. If your incidents are not automatically updating your DORA dashboard, you are missing half the signal.

Common MTTR Measurement Mistakes

Most teams that are dissatisfied with their MTTR trend are, at least in part, measuring the wrong thing. The most common mistakes:

Measuring from alert time, not impact start. If your monitoring fires five minutes after users started experiencing errors, those five minutes are part of your MTTR. Measuring from alert time produces a number that feels better and hides detection lag as an improvement opportunity.

Excluding weekends and holidays. If a Friday-evening incident takes until Monday morning to resolve, the MTTR for that incident is 60+ hours — not two hours of active work on Monday. Excluding off-hours from MTTR calculations produces a number that is irrelevant to your customers, who experienced the degradation for the full duration.

Closing incidents before service is stable. Some teams resolve incidents as soon as the immediate crisis is over — the site is back up, the page is acknowledged — even if the root cause has not been addressed and the service is operating in a degraded state. Incidents should be marked resolved when service is restored to its SLO, not when the on-call engineer stops actively working on it.

Using mean instead of median. MTTR distributions are heavily right-skewed — most incidents resolve quickly, but a small number of catastrophic incidents take days or weeks. Reporting the mean is dominated by those outliers. Reporting the median gives a better picture of typical recovery time; report the 90th percentile to track your worst-case scenario separately.
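The skew effect is easy to demonstrate with Python's standard library on a hypothetical set of resolution times:

```python
import statistics

# Resolution times in minutes; the final outlier is a two-day incident.
durations = [12, 18, 25, 30, 41, 55, 70, 90, 110, 2880]

mean = statistics.mean(durations)      # 333.1 -> dominated by the outlier
median = statistics.median(durations)  # 48.0  -> typical recovery time
p90 = statistics.quantiles(durations, n=10)[-1]  # tracks the worst case

print(f"mean={mean:.0f}m median={median:.0f}m p90={p90:.0f}m")
```

Nine of ten incidents resolved in under two hours, yet the mean reports five and a half hours. Reporting median and p90 side by side separates "typical" from "worst case" instead of blending them.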

Not linking incidents to deployments. An MTTR dashboard without deployment attribution cannot tell you whether your incident rate is improving because your deployments are getting safer or because you are shipping less frequently. The incident-to-deployment link is the connection between MTTR and change failure rate — the two stability metrics in DORA — and it requires either manual tagging or automated attribution logic.

Auto-compute MTTR in your DORA dashboard

Koalr integrates with PagerDuty, OpsGenie, and Incident.io to automatically calculate MTTR from incident timestamps — including deployment attribution so you know which releases are driving your change failure rate. Connect in under five minutes.