The four key metrics — deployment frequency, lead time for changes, change failure rate, and mean time to restore — were popularized by the Accelerate book and the annual State of DevOps Report. But most engineering leaders who track them make the same handful of mistakes: measuring the wrong thing, chasing numbers without addressing root causes, or treating them as individual KPIs rather than a system.
This guide covers each metric in depth: what it actually measures, where teams go wrong, the elite benchmarks you should be targeting, and the highest-leverage actions to improve your numbers.
1. Deployment Frequency
Deployment frequency measures how often your team releases code to production. It's the single best proxy for batch size, risk tolerance, and team autonomy. Elite performers deploy multiple times per day. High performers deploy between once per day and once per week. Medium performers deploy weekly to monthly. Low performers deploy less than once per month.
The most common mistake is measuring deployments per service rather than per team. A team that owns 12 microservices and deploys each one twice a week is not the same as a team that deploys a monolith twice a week — but both might report the same aggregate number. Normalize by measuring deployable units: how many independent production changes does the team ship per week?
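Counting deployable units per team is a small aggregation exercise. A minimal sketch, assuming deploy events are available as (team, service, ISO week) tuples pulled from your CD system's audit log (the sample data here is hypothetical):

```python
from collections import Counter

# Hypothetical deploy log: (team, service, ISO week) tuples.
deploys = [
    ("payments", "api", "2024-W10"),
    ("payments", "worker", "2024-W10"),
    ("payments", "api", "2024-W10"),
    ("platform", "monolith", "2024-W10"),
]

def deploys_per_team_week(events):
    """Count independent production changes per (team, week),
    regardless of how many services the team owns."""
    return dict(Counter((team, week) for team, _service, week in events))

print(deploys_per_team_week(deploys))
# {('payments', '2024-W10'): 3, ('platform', '2024-W10'): 1}
```

Both teams in the sample own very different service topologies, but the per-team count makes them directly comparable.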
Why high deployment frequency matters
Small, frequent deployments reduce blast radius when something goes wrong. They also compress the feedback loop between writing code and seeing it in production, which accelerates learning and makes debugging dramatically easier. Teams that deploy daily spend significantly less time in incident review than teams that batch into weekly releases.
To increase deployment frequency, attack batch size first. If your PRs average 800 lines of diff, no amount of CI investment will get you to daily deployments. Target PRs under 400 lines. Use feature flags to decouple deployment from release — code can go to production behind a flag long before it's enabled for users. Invest in zero-downtime deployment infrastructure (blue-green, canary) so engineers stop dreading the deploy button.
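The flag mechanism is simple enough to sketch in a few lines. This is a toy illustration, not a production pattern: `FLAGS` stands in for a real flag service (LaunchDarkly, Unleash, or similar), and the checkout functions are hypothetical:

```python
# Deployed to production with the flag off: the new path is dark-launched.
FLAGS = {"new_checkout": False}

def checkout(cart):
    if FLAGS.get("new_checkout"):
        return new_checkout_flow(cart)   # new code, already in production
    return legacy_checkout_flow(cart)    # current behavior for all users

def legacy_checkout_flow(cart):
    return {"flow": "legacy", "items": len(cart)}

def new_checkout_flow(cart):
    return {"flow": "new", "items": len(cart)}

print(checkout(["a", "b"]))   # {'flow': 'legacy', 'items': 2}
FLAGS["new_checkout"] = True  # "release" without a new deployment
print(checkout(["a", "b"]))   # {'flow': 'new', 'items': 2}
```

The deploy (shipping the code) and the release (flipping the flag) become independent events, which is what lets deployment frequency rise without forcing release decisions.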
2. Lead Time for Changes
Lead time measures the elapsed time from code commit to code running in production. Elite teams achieve lead times under one hour. High performers land between one hour and one day. Medium performers take one day to one week. Low performers take more than a week.
The most common mistake is measuring from PR open to merge, ignoring the time code sits in a release branch or deployment queue after merge. True lead time includes: time to first review, review-to-approval latency, merge-to-deploy queue time, and CI/CD pipeline duration. Most teams are surprised to discover that 60–70% of their lead time accumulates after merge, in deployment gates and release queues they assumed were fast.
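Measuring the post-merge share directly makes the surprise concrete. A minimal sketch, assuming you can extract commit, merge, and deploy timestamps for a change (the timestamps below are illustrative):

```python
from datetime import datetime

FMT = "%Y-%m-%dT%H:%M"

def post_merge_share(commit, merged, deployed):
    """Fraction of total lead time (commit -> production) that
    accumulates after merge, in queues and deployment gates."""
    t0 = datetime.strptime(commit, FMT)
    t1 = datetime.strptime(merged, FMT)
    t2 = datetime.strptime(deployed, FMT)
    return (t2 - t1).total_seconds() / (t2 - t0).total_seconds()

# Committed Monday 09:00, merged Monday 17:00, deployed Wednesday 09:00:
share = post_merge_share("2024-03-04T09:00", "2024-03-04T17:00",
                         "2024-03-06T09:00")
print(f"{share:.0%}")  # 83%
```

A change that looked "done" on Monday afternoon spent five sixths of its lead time waiting after merge.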
Decompose lead time into stages
Break lead time into: coding time (commit to PR open), review wait time (PR open to first review), review cycle time (first review to approval), and deployment time (merge to production). Each stage has different root causes and different owners. Coding time is a planning problem. Review wait time is a team culture problem. Deployment time is an infrastructure problem.
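The four stages above can be computed from per-change timestamps. A sketch, assuming your VCS and CD system can emit the six timestamps named in the dict (field names are my own, not a standard schema):

```python
from datetime import datetime

FMT = "%Y-%m-%dT%H:%M"

def lead_time_stages(ts):
    """Break one change's lead time into stages, given timestamps for:
    commit, pr_open, first_review, approved, merged, deployed."""
    t = {k: datetime.strptime(v, FMT) for k, v in ts.items()}
    return {
        "coding":       t["pr_open"] - t["commit"],
        "review_wait":  t["first_review"] - t["pr_open"],
        "review_cycle": t["approved"] - t["first_review"],
        "deployment":   t["deployed"] - t["merged"],
    }

stages = lead_time_stages({
    "commit": "2024-03-04T09:00", "pr_open": "2024-03-04T11:00",
    "first_review": "2024-03-04T16:00", "approved": "2024-03-05T10:00",
    "merged": "2024-03-05T10:30", "deployed": "2024-03-05T14:30",
})
print(max(stages, key=stages.get))  # the stage to attack first
```

Aggregating these per-stage durations across a month of changes tells you which owner (planning, culture, or infrastructure) gets the next investment.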
The highest-leverage intervention for most teams is eliminating the PR review queue. If engineers are waiting more than four hours for a first review, establish a team norm around review SLAs. A simple rule — review any PR flagged for review within two hours during business hours — can halve lead time for many teams without any infrastructure investment.
3. Change Failure Rate
Change failure rate is the percentage of deployments that cause a production incident, rollback, or hotfix. Elite and high performers maintain a rate between 0% and 15%. Medium performers land at 16–30%. Low performers exceed 30%.
The measurement trap here is definition. Many teams only count incidents that page on-call, missing the silent failures: deployments that require a same-day hotfix, deployments that trigger a customer support spike, or deployments that are quietly rolled back without a formal incident. A better definition: any deployment that requires an unplanned action within 24 hours counts as a failure.
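That broader definition is mechanical to apply once unplanned actions are logged with timestamps. A sketch under that assumption (sample data is hypothetical):

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%dT%H:%M"

def is_failed_deploy(deployed_at, unplanned_actions):
    """A deployment counts as failed if any unplanned action (hotfix,
    rollback, incident, support spike) occurs within 24 hours of it."""
    t0 = datetime.strptime(deployed_at, FMT)
    return any(
        timedelta(0) <= datetime.strptime(a, FMT) - t0 <= timedelta(hours=24)
        for a in unplanned_actions
    )

def change_failure_rate(deploys):
    """deploys: list of (deployed_at, [unplanned_action_timestamps])."""
    failed = sum(is_failed_deploy(t, acts) for t, acts in deploys)
    return failed / len(deploys)

sample = [
    ("2024-03-04T10:00", ["2024-03-04T12:00"]),  # hotfix 2h later: failed
    ("2024-03-05T10:00", []),                     # clean deploy
]
rate = change_failure_rate(sample)
print(f"{rate:.0%}")  # 50%
```

The key design choice is attaching unplanned actions to deployments, not to incidents, so silent rollbacks and same-day hotfixes are counted even when nobody was paged.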
High change failure rate is almost always caused by one of three things: insufficient test coverage (code ships that nobody verified works), large batch size (more surface area per deployment means more ways to fail), or missing deployment signals (no canary, no SLO burn rate alerting, no automated rollback). The fix is rarely "write more tests" — it's understanding which specific changes are failing and why.
Track failure attribution, not just rate
When a deployment fails, record why: infrastructure change, dependency update, database migration, feature code, configuration change. After 30 days you'll see patterns. Most teams discover that 70% of failures trace to a small category of change type — often DDL migrations or third-party dependency upgrades — that can be addressed with targeted process changes.
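Attribution is just a tally over tagged failures. A minimal sketch, assuming each failed deployment gets a category tag at incident-review time (the log below is hypothetical):

```python
from collections import Counter

# Hypothetical 30-day failure log, tagged during incident review.
failures = [
    "database migration", "dependency update", "database migration",
    "feature code", "database migration", "dependency update",
    "configuration change", "database migration",
]

def attribution_report(tags):
    """Most-common failure categories first, with share of total."""
    total = len(tags)
    return [(tag, n, n / total) for tag, n in Counter(tags).most_common()]

for tag, n, share in attribution_report(failures):
    print(f"{tag}: {n} ({share:.0%})")
```

In this sample, database migrations alone account for half the failures, which is exactly the kind of concentration that justifies a targeted process change rather than a blanket "be more careful."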
4. Mean Time to Restore (MTTR)
MTTR measures how quickly your team recovers when something breaks in production. Elite performers restore service in under one hour. High performers in under one day. Medium performers in one to seven days. Low performers take more than a week.
MTTR is the most politically sensitive of the four metrics because it surfaces on-call burden, incident response maturity, and the quality of your observability stack simultaneously. Teams with high MTTR often have three overlapping problems: they don't know something is broken until a customer reports it (detection gap), they can't quickly identify which deployment caused the regression (isolation gap), and they can't roll back safely without manual database intervention (recovery gap).
The detection gap is fixed by SLO-based alerting that fires before customers notice. The isolation gap is fixed by correlated deployment + incident dashboards — when an alert fires, engineers should see exactly which services were deployed in the last two hours alongside error rate graphs. The recovery gap is fixed by investing in one-click rollback and ensuring database migrations are always backward-compatible.
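Closing the isolation gap starts with a query your dashboard should answer automatically: which deployments landed in the window before the alert? A sketch of that lookup, assuming deploy records are available as (service, timestamp) pairs (sample data is made up):

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%dT%H:%M"

def recent_deploys(alert_time, deploys, hours=2):
    """When an alert fires, surface every deployment from the preceding
    window so responders can isolate the likely culprit."""
    t = datetime.strptime(alert_time, FMT)
    lo = t - timedelta(hours=hours)
    return [(svc, ts) for svc, ts in deploys
            if lo <= datetime.strptime(ts, FMT) <= t]

deploys = [
    ("checkout-api", "2024-03-04T13:40"),
    ("search",       "2024-03-04T09:15"),
    ("auth",         "2024-03-04T14:05"),
]
suspects = recent_deploys("2024-03-04T14:30", deploys)
print(suspects)  # checkout-api and auth, not search
```

In a real setup this lookup lives in the incident dashboard, rendered next to the error-rate graphs so nobody has to run it by hand mid-incident.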
Using the Four Metrics as a System
The research finding that surprises most engineering leaders: deployment frequency and lead time are not in tension with change failure rate and MTTR. Elite teams are fast and stable. The intuition that "deploying more often means more risk" is wrong. Frequent, small deployments are easier to verify, easier to roll back, and expose failures faster with less blast radius.
When you look at all four metrics together, failure patterns become visible. High change failure rate combined with low deployment frequency suggests teams are batching to compensate for unreliable deployments — a vicious cycle where large batches cause failures, so teams batch less often, making each deployment even riskier. The fix is to shrink batch size and invest in deployment safety in parallel, not sequentially.
Common anti-pattern: gaming the metrics
Once teams know they're being measured, deployment frequency is easy to game: deploy config changes as "code deployments," split trivial changes into many small PRs, or exclude failing services from the calculation. The solution is to track all four metrics together and display the raw data — not rolled-up scores — so gaming one number makes the others look worse.
Review the four metrics monthly in your engineering leadership sync, not weekly. Weekly fluctuations are noisy. You want to see the 30-day rolling trend for each metric and use it to prioritize the next quarter's infrastructure and process investments. When lead time spikes, that's the signal to invest in CI performance or review culture. When change failure rate rises, that's the signal to tighten deployment gates or shrink batch sizes. Let the metrics tell you where to invest.
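Smoothing out the weekly noise is a one-function job. A sketch of the 30-day rolling mean described above, assuming you have one value per day for a metric (here a made-up series):

```python
def rolling_mean(daily_values, window=30):
    """Rolling mean of a daily metric series; emits one smoothed
    point per day once a full window is available."""
    return [
        sum(daily_values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(daily_values))
    ]

# 30 quiet days, then one day with a large lead-time spike:
series = [1.0] * 30 + [31.0]
trend = rolling_mean(series)
print(trend)  # [1.0, 2.0] -- the spike moves the trend, not a cliff
```

A single bad day shifts the smoothed trend gently instead of whipsawing the chart, which is what makes the monthly review a prioritization tool rather than a firefighting one.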