Engineering Metrics · March 16, 2026 · 13 min read

Software Delivery Performance: The Research-Backed Framework for High-Performing Teams

After eight years of research spanning 32,000+ technology professionals, the DORA research program produced one finding that changed how engineering leaders think about their work: software delivery performance is not a technical metric — it is a business performance predictor. Teams in the elite tier grow revenue 4x faster and report 50% less burnout. Here is everything the research tells us about why, and what to do about it.

What this guide covers

The origin and definition of software delivery performance from the Accelerate book, the four DORA metrics with 2026 benchmarks, what the research found predicts performance, the four performance clusters (Elite to Low), why most teams are stuck at Medium, 90-day action plans for moving up, and how to instrument all four metrics from your existing toolchain.

What is Software Delivery Performance?

Software delivery performance is a term coined in the 2018 book Accelerate: The Science of Lean Software and DevOps by Dr. Nicole Forsgren, Jez Humble, and Gene Kim. The book synthesized four years of research from the annual State of DevOps report and introduced a rigorous, measurement-based definition of what it means to be a high-performing software team.

Before Accelerate, most engineering leaders evaluated team performance using activity proxies: lines of code written, story points completed, tickets closed, velocity numbers. Forsgren and her co-authors argued — backed by structural equation modeling and psychometric validation across tens of thousands of survey responses — that these proxies measure effort and output, not outcomes. They proposed four specific metrics that, when measured together, predict both software quality and organizational performance.

The Accelerate thesis, stated directly: software delivery performance predicts organizational performance. Teams that ship software well — frequently, quickly, reliably, and with fast recovery — also outperform their peers on revenue growth, market share, employee satisfaction, and customer satisfaction. The research did not find that good software delivery causes business success, but the correlation held across every industry and company size studied, and after controlling for confounders like team tenure, technology stack, and company age.

The four metrics that define software delivery performance are deployment frequency, lead time for changes, change failure rate, and mean time to restore. Together they are now almost universally called the DORA metrics, after the DevOps Research and Assessment organization that produced the research.

The Research Foundation

The State of DevOps report is the largest longitudinal study of software delivery practices ever conducted. Run annually since 2014 — first independently, then under Google — it has surveyed over 32,000 technology professionals across thousands of organizations on six continents. The research methodology is unusually rigorous for an industry survey: it uses validated psychometric constructs, latent variable analysis, and structural equation modeling to distinguish predictive relationships from mere correlation.

The headline findings from eight years of data:

  • Elite software delivery performers have 4x higher revenue growth compared to low performers.
  • Elite performers have a 50% lower probability of employee burnout than low performers.
  • Elite performers are 2.6x more likely to exceed their profitability and productivity goals.
  • High deployment frequency correlates with high stability — not the inverse. Teams that ship more often do not have higher failure rates; they have lower ones.
  • Software delivery performance is independent of whether a team uses Agile, SAFe, Scrum, Kanban, or no methodology at all.

That last point deserves emphasis. The research found that process frameworks — including popular enterprise Agile frameworks — have no significant effect on software delivery performance. What predicts performance is not the meetings you hold or the ceremonies you follow. It is the technical practices, cultural norms, and architectural decisions your team operates under every day.

The finding that software delivery performance is a business predictor — not merely a technical one — is what transformed DORA from a developer community concept into an executive-level framework. Engineering leaders can now make the case that investing in deployment automation, test coverage, and incident response is not overhead; it is directly linked to revenue and retention outcomes.

The Four DORA Metrics in Detail

The four metrics are the operational definition of software delivery performance. They come in two pairs: throughput metrics (deployment frequency, lead time) and stability metrics (change failure rate, MTTR). Elite performance means excelling at both pairs simultaneously.

1. Deployment Frequency

Deployment frequency measures how often your team successfully deploys code to production. It is the most direct indicator of your team's release cadence and, when combined with low failure rates, the strongest predictor of business agility. High-frequency deployment means your team can respond faster to user feedback, correct mistakes sooner, and ship value continuously rather than in batches.

What it captures: The health of your entire delivery pipeline — from how code is branched and reviewed, to how CI runs, to how deployments are orchestrated. Low deployment frequency is almost never caused by a single bottleneck; it reflects accumulated friction across the pipeline.

Data source: GitHub Deployments API, GitHub Releases, or deployment webhook events from your CI/CD system. The key filter is counting only successful deployments to the production environment.

| Tier | Deployment Frequency (2026) | What it looks like |
| --- | --- | --- |
| Elite | Multiple times per day | Trunk-based dev, feature flags, automated CI on every commit |
| High | Once per day to once per week | Short-lived branches, automated deployment pipeline |
| Medium | Once per week to once per month | Sprint-based releases, manual QA gates before each deploy |
| Low | Once per month or less | Batch releases, heavyweight change approval process (CAB, etc.) |
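As a concrete sketch of the "count only successful production deployments" filter, here is a minimal calculation over deployment records. The dict shape (`environment`, `state`, `created_at`) is an assumption standing in for what you would assemble from the GitHub Deployments API plus each deployment's latest status; it is not a single GitHub response object.

```python
from collections import Counter
from datetime import datetime

def weekly_deploy_frequency(deployments):
    """Count successful production deployments per ISO week."""
    counts = Counter()
    for d in deployments:
        # The key filter: successful deploys to production only.
        if d["environment"] != "production" or d["state"] != "success":
            continue
        ts = datetime.fromisoformat(d["created_at"])
        year, week, _ = ts.isocalendar()
        counts[(year, week)] += 1
    return dict(counts)

deploys = [
    {"environment": "production", "state": "success", "created_at": "2026-03-02T10:00:00"},
    {"environment": "production", "state": "success", "created_at": "2026-03-03T16:30:00"},
    {"environment": "staging",    "state": "success", "created_at": "2026-03-03T09:00:00"},
    {"environment": "production", "state": "failure", "created_at": "2026-03-04T11:15:00"},
]
print(weekly_deploy_frequency(deploys))  # {(2026, 10): 2}
```

The staging deploy and the failed deploy are excluded, which is exactly the filter that most naive counts get wrong.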

2. Lead Time for Changes

Lead time for changes measures the elapsed time from a developer committing code to that code running in production. It is the primary measure of your team's throughput — how fast work flows through the delivery system. Short lead time means fast feedback loops: bugs get fixed sooner, experiments land sooner, and user requests turn into shipped features sooner.

Formula: Median time from PR merge (or first commit, depending on your definition) to successful production deployment. Most teams measure from PR merge timestamp to deployment timestamp, which captures the CI/CD pipeline. Use the median, not the mean — lead time distributions are right-skewed by occasional large refactors that inflate averages.
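The median-over-mean point is easy to demonstrate. A minimal sketch, assuming you already have (merged_at, deployed_at) timestamp pairs per change:

```python
from datetime import datetime
from statistics import median

def lead_time_hours(changes):
    """Median hours from PR merge to production deployment.

    `changes` is a list of (merged_at, deployed_at) ISO-8601 pairs.
    Median, not mean: one large refactor should not inflate the number.
    """
    deltas = [
        (datetime.fromisoformat(dep) - datetime.fromisoformat(merged)).total_seconds() / 3600
        for merged, dep in changes
    ]
    return median(deltas)

changes = [
    ("2026-03-02T09:00:00", "2026-03-02T09:45:00"),  # 0.75 h
    ("2026-03-02T11:00:00", "2026-03-02T12:30:00"),  # 1.5 h
    ("2026-03-01T08:00:00", "2026-03-04T08:00:00"),  # 72 h (a big refactor)
]
print(lead_time_hours(changes))  # 1.5 -- the mean would be 24.75
```

One outlier drags the mean from 1.5 hours to nearly a day; the median reports what a typical change actually experiences.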

The nuance: Lead time measures pipeline speed, not planning or design cycle. A team that grooms for two sprints but deploys in 15 minutes once coding begins looks elite on lead time. Whether that planning cycle is appropriate is a different question that lead time alone cannot answer.

| Tier | Lead Time for Changes (2026) |
| --- | --- |
| Elite | Less than one hour |
| High | One hour to one day |
| Medium | One day to one week |
| Low | One week to one month |

3. Change Failure Rate

Change failure rate (CFR) measures the percentage of deployments that result in a degraded service and require remediation — a rollback, hotfix, or patch. It is the primary stability metric and the number that most engineering leaders find most alarming when they first see it calculated accurately.

Formula: Failed deployments divided by total deployments, expressed as a percentage. A deployment counts as failed if it caused a service degradation, triggered a P0 or P1 incident, or required a remediating deployment within a defined window (usually 24 hours).
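The formula, including the remediation-window rule, can be sketched as below. The `kind` labels ("normal", "rollback", "hotfix") are illustrative, not GitHub fields; in practice you derive them from deployment statuses and your team's rollback convention.

```python
from datetime import datetime, timedelta

def change_failure_rate(deployments, window_hours=24):
    """CFR = failed deployments / total deployments.

    A deploy counts as failed if it is explicitly marked failed, or if a
    remediating deploy (rollback or hotfix) follows it within the window.
    Each deployment is (iso_timestamp, kind).
    """
    deploys = sorted((datetime.fromisoformat(ts), kind) for ts, kind in deployments)
    window = timedelta(hours=window_hours)
    failed = 0
    for i, (ts, kind) in enumerate(deploys):
        if kind == "failed":
            failed += 1
            continue
        # A remediation inside the window means the *earlier* deploy failed.
        if any(k2 in ("rollback", "hotfix") and ts < ts2 <= ts + window
               for ts2, k2 in deploys[i + 1:]):
            failed += 1
    return failed / len(deploys)

deployments = [
    ("2026-03-02T10:00:00", "normal"),
    ("2026-03-03T10:00:00", "normal"),
    ("2026-03-03T15:00:00", "rollback"),  # remediates the 10:00 deploy
    ("2026-03-04T10:00:00", "normal"),
]
print(change_failure_rate(deployments))  # 0.25
```

One of four deployments was remediated within 24 hours, so CFR is 25% — squarely in the Low tier despite only one bad deploy.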

The nuance: CFR is a trailing indicator. By the time a failure is recorded, the incident has already happened and users have already been affected. It is valuable for trend analysis but provides no warning before a specific deployment that is about to cause an incident. This is the gap that deploy risk prediction fills — see the DORA metrics guide for more on the distinction between outcome metrics and predictive signals.

| Tier | Change Failure Rate (2026) |
| --- | --- |
| Elite | 0–5% |
| High | 5–10% |
| Medium | 10–15% |
| Low | >15% |

4. Mean Time to Restore (MTTR)

Mean time to restore (MTTR) measures how long it takes your team to recover service after a production incident. It is the paired complement to change failure rate: CFR tells you how often you fail, MTTR tells you how badly you fail when you do. A team that keeps both low is the resilient one — users encounter fewer problems, and when problems do occur, they resolve quickly.

Formula: Mean of (incident resolved timestamp minus incident opened timestamp) across all incidents in the measurement period. Use the median for most reporting; use the mean for SLA compliance discussions.
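Both variants of the formula fit in a few lines. This sketch assumes your incident tool gives you (opened_at, resolved_at) pairs, with "resolved" applied consistently as service restored to SLO:

```python
from datetime import datetime
from statistics import mean, median

def time_to_restore_hours(incidents):
    """Return (median, mean) hours from incident opened to resolved.

    `incidents` is a list of (opened_at, resolved_at) ISO-8601 pairs.
    Median for routine reporting; mean for SLA compliance discussions.
    """
    hours = [
        (datetime.fromisoformat(done) - datetime.fromisoformat(opened)).total_seconds() / 3600
        for opened, done in incidents
    ]
    return median(hours), mean(hours)

incidents = [
    ("2026-03-01T10:00:00", "2026-03-01T10:30:00"),  # 0.5 h
    ("2026-03-05T22:00:00", "2026-03-06T01:00:00"),  # 3 h
    ("2026-03-09T14:00:00", "2026-03-09T15:30:00"),  # 1.5 h
]
med, avg = time_to_restore_hours(incidents)
print(med, avg)  # median 1.5; mean roughly 1.67
```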

Critical definition point: MTTR is profoundly sensitive to how your team defines "resolved." If engineers close incidents when the service is restored but before the postmortem is complete, MTTR looks excellent. Standardize "resolved" as service restored to SLO — and enforce it consistently — or your MTTR trend data will be meaningless.

| Tier | MTTR (2026) |
| --- | --- |
| Elite | Less than one hour |
| High | Less than one day |
| Medium | One day to one week |
| Low | More than one week |

What the Research Found Predicts Software Delivery Performance

The DORA research did not just measure outcomes — it investigated what causes them. Using structural equation modeling across multiple survey cycles, the research identified five categories of practices that consistently predict high software delivery performance. These are not opinions or best guesses; they are empirically validated antecedents.

Lean Product Management

Teams with high software delivery performance work in small batches. Small user stories, small PRs, small feature releases. They instrument their products with customer feedback mechanisms and run experiments — A/B tests, staged rollouts, feature flags — rather than making assumptions about what users want. The research found that this approach to product management predicts both faster delivery and higher software stability, because smaller changes are easier to reason about, test, and roll back.

Continuous Delivery Practices

The strongest technical predictor of software delivery performance is continuous delivery. Specifically: trunk-based development (working off a single main branch with short-lived feature branches rather than long-running parallel branches), automated testing at every stage of the pipeline, and deployment automation that allows code to be released at any time without manual steps.

Teams that practice all three of these simultaneously achieve deployment frequency and lead time improvements that are impossible through process changes alone. The automation removes the human gatekeeping that slows delivery and introduces variability.

Technical Practices

Beyond the CI/CD pipeline, the research identified several broader technical practices that predict performance: version controlling everything (infrastructure, configuration, not just application code), comprehensive monitoring and observability, and — most importantly — loosely coupled architecture. Teams building tightly coupled monoliths are fundamentally constrained in how fast they can deploy safely. Loosely coupled services allow teams to deploy independently, test in isolation, and roll back a single service without affecting others.

Organizational Culture

The research operationalized organizational culture using Westrum's typology: pathological (power-oriented), bureaucratic (rule-oriented), and generative (performance-oriented). Generative cultures — where information flows freely, failure leads to inquiry rather than blame, and cross-functional collaboration is the norm — predict high software delivery performance. Blameless postmortems are the most actionable implementation of generative culture in engineering contexts: they turn incidents into learning events rather than blame events, which makes teams willing to surface and address problems earlier.

Leadership

Transformational leadership — a style characterized by communicating a compelling vision, inspiring change, and intellectual stimulation — predicts software delivery performance. Servant leadership, where managers remove blockers and develop their teams rather than directing work, compounds the effect. The research found that leadership style has an indirect effect on performance through its influence on culture: leaders who model psychological safety create conditions where teams adopt the practices that predict performance.

The Four Performance Clusters

The State of DevOps report has consistently found that organizations cluster into four performance tiers rather than forming a smooth continuum. Understanding which tier your team occupies — and what separates you from the next one — is the starting point for any improvement initiative.

| Tier | Deploy Frequency | Lead Time | CFR | MTTR |
| --- | --- | --- | --- | --- |
| Elite (top 18%) | Multiple/day | <1 hour | <5% | <1 hour |
| High | Daily to weekly | 1 hour – 1 day | 5–10% | <24 hours |
| Medium | Weekly to monthly | 1 day – 1 week | 10–15% | 1 day – 1 week |
| Low | Monthly or less | 1 week – 1 month | >15% | >1 week |

Two observations from the cluster data are rarely discussed. First: the gap between Elite and High is larger than the gap between High and Medium. Moving from High to Elite requires a different category of change than moving from Medium to High. Second: the majority of organizations in the DORA data sit at Medium. Getting to High is achievable for most teams with focused effort. Getting to Elite requires conditions that most organizations have not yet created.

Why Most Teams Are Stuck at Medium

The Medium tier is where most engineering organizations live, and where they stay for years. It is not because they are poorly led or technically incompetent. Medium-tier teams are typically shipping features, keeping the product running, and meeting stakeholder expectations well enough. The problem is that Medium is stable — it does not feel broken, so it does not generate urgency to change.

Three root causes account for the majority of teams stuck at Medium.

Deployment Automation Deficit

The most common root cause of low deployment frequency is not technical — it is psychological. When deployments are manual, error-prone, and require coordination across multiple people, teams become risk-averse. They batch up changes to amortize the cost and risk of deploying, which makes each deployment larger and therefore riskier, which justifies more caution, which produces fewer and larger deployments. This is a self-reinforcing loop.

The technical fix is deployment automation. The cultural fix is making the process of deploying so boring and mechanical that no one thinks twice about doing it. But many Medium-tier teams have partial automation — some manual steps remain, usually around environment configuration, database migrations, or smoke testing — and those remaining steps are enough to preserve the risk-aversion dynamic.

Testing Debt

High change failure rates at the Medium tier are almost always caused by insufficient automated test coverage. But the problem is not simply low coverage percentages — it is that the automated tests that do exist do not catch the failure scenarios that actually occur in production. Integration tests that mock external dependencies, unit tests that test implementation details rather than behavior, and end-to-end tests that only cover happy paths all create a false sense of security.

The result: CFR stays elevated not because the team is writing bad code but because the test suite does not catch the class of bugs that escape to production. Manual QA gates are added to compensate, which reduces deployment frequency, which increases batch sizes, which makes each deployment riskier. Again, a self-reinforcing loop.

Incident Response Theater

The third root cause of Medium-tier stagnation is the most insidious: postmortems that do not produce lasting change. Most Medium-tier organizations hold postmortems after significant incidents. The postmortem identifies root causes. Action items are assigned. Then the action items get deprioritized when the next sprint planning happens, because there is always feature work that is more visible to stakeholders. Three months later, a nearly identical incident occurs.

MTTR stays stuck not because the team responds slowly, but because the same incidents keep recurring. Improving MTTR requires eliminating repeat incidents, which requires actually completing the corrective actions from postmortems, which requires leadership prioritizing reliability work with the same rigor as feature work.

Moving from Medium to High: A 90-Day Action Plan

The Medium-to-High transition is achievable for most teams within one to two quarters with focused effort. The key is targeting one root cause at a time rather than launching a broad improvement initiative that touches everything simultaneously.

Month 1: Establish Baseline Measurement

You cannot improve what you cannot measure. Before any changes, instrument all four DORA metrics and let them run for four weeks. Many teams discover that their intuitions about where they stand are wrong — deployment frequency is lower than assumed, lead time is longer than assumed, or CFR is higher than assumed because no one was counting rollbacks and hotfixes consistently.

At the end of Month 1, identify which of the four metrics is furthest from the High tier threshold. That is your primary target for Month 2. Do not try to improve all four simultaneously.

Month 2: Address the Biggest Gap

If deployment frequency is the gap: audit your deployment process step by step and identify every manual step. Automate one manual step per week. Start with the easiest ones — smoke test automation, automated environment configuration — to build confidence before tackling harder ones like database migration automation.

If CFR is the gap: instrument which types of changes cause the most failures. Large PRs? Changes to specific modules? Changes during high-traffic windows? Find the pattern and address it specifically — rather than broadly mandating more test coverage, which is too vague to act on.

If MTTR is the gap: audit your last five incidents. How long from incident open to first responder engaged? How long from first responder engaged to root cause identified? How long from root cause identified to service restored? The longest phase in the chain is your specific target, not MTTR in aggregate.

Month 3: Measure Progress and Set Next Targets

At the end of Month 2, re-measure the targeted metric. If it has moved toward the High tier threshold, continue the current initiative in Month 3 and set the next quarter target. If it has not moved, investigate whether you are addressing the right root cause — the absence of improvement is diagnostic information, not a reason to give up.

Set targets for all four metrics for the next quarter at this point. The goal is to be in the High tier on all four metrics within 12 months of starting this process.

Related reading

Deployment frequency specifically is often the highest-leverage lever for Medium-tier teams. See the detailed guide on how to instrument it and what the benchmarks look like by company type.

Moving from High to Elite: The Harder Transition

The High-to-Elite transition is qualitatively different from the Medium-to-High transition. Medium-to-High is largely about removing friction and automating manual steps — it is an operational improvement initiative. High-to-Elite requires architectural changes and cultural maturity that cannot be mandated top-down.

Architectural Prerequisites

Deploying multiple times per day with a lead time under one hour is only possible if your architecture supports it. Tightly coupled systems — where deploying one service requires coordinating with three other teams and testing the entire system — create a structural ceiling on how frequently and quickly you can deploy. Reaching Elite requires loosely coupled services, well-defined APIs between components, and the ability to deploy a single service independently of others.

Feature flags are the second architectural prerequisite. Continuous delivery at Elite speed means merging to trunk frequently, which means incomplete features sometimes land in production. Feature flags allow those features to be shipped but invisible until they are ready — decoupling deployment from release. Without feature flags, frequent deployment means frequent user-facing changes, which raises the stakes on every deploy.
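The deploy/release decoupling is simple in code. A minimal sketch of a percentage-rollout flag gate, assuming a hypothetical in-process flag store (real teams typically use a flag service such as LaunchDarkly or Unleash; the flag name and rollout logic here are invented for illustration):

```python
import hashlib

# Hypothetical flag configuration; in practice this comes from a flag service.
FLAGS = {"new-checkout": {"enabled": True, "rollout_pct": 10}}

def flag_on(name: str, user_id: str) -> bool:
    """Deterministic percentage rollout: hashing (flag, user) means the
    same user always gets the same answer, so a 10% rollout is a stable
    cohort rather than a coin flip per request."""
    flag = FLAGS.get(name)
    if not flag or not flag["enabled"]:
        return False
    bucket = int(hashlib.sha256(f"{name}:{user_id}".encode()).hexdigest(), 16) % 100
    return bucket < flag["rollout_pct"]

# The incomplete feature ships to production but stays dark for ~90% of users.
if flag_on("new-checkout", "user-42"):
    pass  # new checkout path goes here
```

The code for the new checkout path merges to trunk and deploys continuously; flipping `rollout_pct` is the release, independent of any deploy.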

Cultural Maturity

High-to-Elite also requires psychological safety at a depth that most organizations have not achieved. In Elite organizations, developers surface concerns about a change before merging it — not after it causes an incident. They raise architectural concerns without fear of being seen as blockers. They escalate uncertainty rather than ship through it.

This level of psychological safety cannot be installed by a leadership announcement. It develops through years of consistently modeling blameless behavior in postmortems, consistently acting on concerns raised by developers, and consistently prioritizing reliability work rather than deferring it. Teams that have experienced leadership punishing people for surfacing problems — even once — take a long time to rebuild the trust required for this kind of openness.

The Elite transition cannot be forced top-down. It must be team-driven, with leadership providing the conditions (architectural investment, protected time for reliability work, psychological safety modeling) and teams developing the practices organically. Attempts to mandate Elite metrics without creating these conditions produce metric gaming rather than genuine improvement.

For a detailed case study of what Elite teams look like in practice, see what distinguishes elite DORA performers from high performers.

How to Measure Software Delivery Performance

The good news: all four DORA metrics can be calculated from data that most engineering teams already collect. The instrumentation is straightforward once you know what to connect to what.

Deployment Frequency and Lead Time from GitHub

GitHub's Deployments API records every deployment event your CI/CD pipeline sends it. For deployment frequency: query successful deployments to the production environment and count by day, week, or month. For lead time: pull the merged_at timestamp from each merged PR and match it to the subsequent production deployment that includes that PR's merge commit SHA. The delta is lead time for that change.
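The merge-to-deploy matching step can be sketched as follows. The `shas` field on each deployment record is an assumption for illustration: the Deployments API gives you one `sha` per deployment, and determining which merge commits it includes requires comparing against repo history (for example with `git merge-base --is-ancestor`).

```python
from datetime import datetime

def match_lead_times(merged_prs, deployments):
    """Pair each merged PR with the first production deployment whose
    change set includes that PR's merge commit SHA; return lead times
    in hours, keyed by PR number."""
    deploys = sorted(deployments, key=lambda d: d["created_at"])
    out = {}
    for pr in merged_prs:
        for dep in deploys:
            if pr["merge_commit_sha"] in dep["shas"] and dep["created_at"] >= pr["merged_at"]:
                merged = datetime.fromisoformat(pr["merged_at"])
                deployed = datetime.fromisoformat(dep["created_at"])
                out[pr["number"]] = (deployed - merged).total_seconds() / 3600
                break  # first qualifying deployment wins
    return out

prs = [{"number": 101, "merge_commit_sha": "abc123", "merged_at": "2026-03-02T09:00:00"}]
deps = [{"created_at": "2026-03-02T10:30:00", "shas": ["abc123", "def456"]}]
print(match_lead_times(prs, deps))  # {101: 1.5}
```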

If your CI/CD system does not already emit deployment events to GitHub, a minimal GitHub Actions step can record them on every successful production deploy. Once deployment events exist in GitHub, any analytics platform — including Koalr — can pull them automatically.
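For reference, these are the two payloads a CI job would send to record a deploy and its outcome. The endpoints are real GitHub REST API routes; the helper functions themselves are a sketch, not an official client.

```python
def deployment_payload(ref: str) -> dict:
    """Body for POST /repos/{owner}/{repo}/deployments."""
    return {
        "ref": ref,               # the branch, tag, or SHA that was deployed
        "environment": "production",
        "auto_merge": False,      # don't let GitHub auto-merge the base branch
        "required_contexts": [],  # skip status-check gating; CI already passed
    }

def status_payload(success: bool) -> dict:
    """Body for POST /repos/{owner}/{repo}/deployments/{deployment_id}/statuses."""
    return {
        "state": "success" if success else "failure",
        "environment": "production",
    }

print(deployment_payload("main"))
print(status_payload(True))
```

Sending the status event is the step teams most often forget, and without it the "successful deployments only" filter has nothing to filter on.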

Change Failure Rate from Deployment Status

For each deployment in the period, check its latest status via the GitHub Deployments Statuses API. A deployment is failed if its status is failure, or if a subsequent deployment was created within 24 hours with a description matching your team's rollback convention. Divide failed count by total count for CFR.

The harder part of CFR is incident-to-deployment attribution: linking a specific incident to the deployment that caused it. This requires either manual tagging in your incident tool or automated attribution logic that finds the most recent deployment before an incident opened. Without this attribution, CFR only captures deployments that failed at the infrastructure level (rollbacks), not deployments that caused application-level incidents that required a hotfix.
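The automated version of that attribution logic is a one-liner over sorted timestamps. This is deliberately naive, a sketch of the heuristic rather than a production attribution engine; a real pipeline would bound the lookback window and let responders override the guess in the incident tool.

```python
from datetime import datetime

def attribute_incident(incident_opened_at, deployments):
    """Blame the most recent production deployment that finished before
    the incident opened. `deployments` is a list of ISO-8601 timestamps.
    Returns None when no deployment precedes the incident."""
    opened = datetime.fromisoformat(incident_opened_at)
    before = [datetime.fromisoformat(ts) for ts in deployments
              if datetime.fromisoformat(ts) < opened]
    return max(before).isoformat() if before else None

deploys = ["2026-03-02T10:00:00", "2026-03-02T14:00:00", "2026-03-03T09:00:00"]
print(attribute_incident("2026-03-02T15:20:00", deploys))  # 2026-03-02T14:00:00
```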

MTTR from Your Incident Platform

GitHub alone cannot give you accurate MTTR. You need an incident management tool that records incident open and resolve timestamps. PagerDuty, OpsGenie, and incident.io all expose this data via API. The calculation is straightforward: for each incident in the period, compute resolved timestamp minus created timestamp, then take the mean across all incidents.

| DORA Metric | Data Source | Minimum Viable Instrumentation |
| --- | --- | --- |
| Deployment Frequency | GitHub Deployments API | CI/CD emitting deployment events with environment=production and state=success |
| Lead Time | GitHub PRs + Deployments API | Merged PR timestamps correlated with deployment SHA |
| Change Failure Rate | GitHub Deployments + incident tool | Deployment status tracking + incident-to-deployment attribution |
| MTTR | PagerDuty / OpsGenie / incident.io | Incident created_at and resolved_at timestamps; consistent "resolved" definition |

The most common instrumentation gap is MTTR. Teams that use PagerDuty or OpsGenie for on-call routing often have the data they need but have not piped it into a calculation. Teams without a dedicated incident tool have no automated source for MTTR and often underestimate it because they calculate it from hotfix deployment timestamps rather than actual service restoration timestamps.

If you want to see where you currently stand across all four metrics without building the instrumentation yourself, Koalr connects to GitHub, PagerDuty, OpsGenie, and incident.io and calculates all four metrics automatically from your existing event data.

Key takeaways

  • Software delivery performance is a business predictor, not just a technical one — the DORA research validated this across 32,000+ technology professionals.
  • The four metrics (deployment frequency, lead time, CFR, MTTR) measure both throughput and stability. Elite teams excel at both simultaneously.
  • Most teams are stuck at Medium because of deployment automation gaps, testing debt, and incomplete incident follow-through — not capability deficits.
  • Medium-to-High is an operational improvement. High-to-Elite requires architectural and cultural change that cannot be mandated.
  • All four metrics can be instrumented from GitHub, PagerDuty/OpsGenie/incident.io data you likely already collect.

For a deep dive into the metrics themselves — including formulas, GitHub API queries, and common calculation mistakes — see the complete guide to DORA metrics. For the specific question of what separates Elite from High in practice, see what distinguishes elite DORA performers. And if deployment frequency is your biggest gap, see how to improve deployment frequency.

See your software delivery performance in one dashboard

Koalr connects to GitHub, PagerDuty, OpsGenie, and incident.io and calculates all four DORA metrics automatically — plus deployment risk scores on every open PR so you can act before CFR becomes a problem. No data pipelines to build.