Engineering Metrics · March 14, 2026 · 15 min read

The Complete Guide to DORA Metrics in 2026

DORA metrics have become the default language for engineering performance. But most teams implement them wrong, benchmark against the wrong cohort, or treat them as ends in themselves rather than proxies for something deeper. This guide covers what the four metrics actually measure, how to calculate them from raw GitHub and incident data, what the 2026 benchmarks look like by company size, and — critically — what DORA still can't tell you.

What this guide covers

The four DORA metrics and their formulas, how to instrument them from GitHub Deployments API and PagerDuty/OpsGenie data, 2026 benchmarks by company size, a comparison of DORA tooling, and why deploy risk prediction fills the gap DORA leaves open.

What is DORA?

DORA stands for DevOps Research and Assessment, a research program originally founded by Nicole Forsgren, Jez Humble, and Gene Kim before its acquisition by Google. Since 2014, the DORA team has surveyed over 30,000 engineering professionals across thousands of organizations, producing the annual State of DevOps Report — the largest longitudinal study of software delivery performance in existence.

The core finding, replicated across nine years of data, is that software delivery performance correlates with four specific metrics. These four metrics predict not just engineering outcomes but business outcomes: organizations in the elite performance tier are 2.6x more likely to exceed revenue and profitability goals than low performers. The research controlled for industry, team size, and technology stack — the four metrics held across all of them.

Why these four metrics specifically? They capture both throughput (how fast you ship) and stability (how reliably you ship). Optimizing for throughput alone produces fragile systems; optimizing for stability alone produces slow teams. Elite organizations achieve both simultaneously, and the four metrics are the instrument panel that shows whether you're doing it.

The Four DORA Metrics Explained

1. Deployment Frequency

Deployment frequency measures how often your team successfully deploys code to production. It is the most direct indicator of your release cadence and, when combined with stable failure rates, the strongest predictor of business agility.

Formula: Number of successful production deployments per time period (day / week / month). Typically normalized to deployments-per-day or deployments-per-week.

Data source: GitHub Deployments API (GET /repos/{owner}/{repo}/deployments), GitHub Releases, or deployment webhook events from your CI/CD pipeline (GitHub Actions, CircleCI, ArgoCD, Railway, Vercel deployment hooks).

The nuance: deployment frequency only counts successful deployments. A deployment that rolls back immediately does not count as a successful deployment — it counts against your change failure rate.

| Performance Band | Deployment Frequency | What it looks like |
| --- | --- | --- |
| Elite | Multiple times per day | Trunk-based dev, feature flags, CI on every commit |
| High | Once per day to once per week | Short-lived branches, automated deployment pipeline |
| Medium | Once per week to once per month | Sprint-based releases, manual QA gates |
| Low | Once per month or less | Batch releases, heavyweight change approval process |

2. Lead Time for Changes

Lead time for changes measures the elapsed time from a developer committing code to that code running in production. It is the primary measure of your team's throughput — how fast work flows through the system from idea to live.

Formula: Median time from first commit on a branch (or PR open, depending on your definition) to successful production deployment. Most teams use PR merge timestamp → production deployment timestamp, which captures the CI/CD portion of the pipeline.

How to calculate from GitHub: Pull the merged_at timestamp from the Pull Requests API and the created_at timestamp from the corresponding Deployment that follows it. The delta is your lead time for that change. Aggregate across all PRs merged in the period and take the median (not the mean — lead time distributions are heavily right-skewed by large refactors and infrastructure changes).
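To make the median-versus-mean point concrete, here is a minimal Python sketch that computes per-change lead time from (merged_at, deployed_at) timestamp pairs. The pair shape and sample data are illustrative, not a real API response; only the timestamp format matches what GitHub returns.

```python
from datetime import datetime
from statistics import median

def lead_times_hours(changes):
    """changes: list of (merged_at, deployed_at) ISO-8601 string pairs.
    Returns per-change lead times in hours."""
    fmt = "%Y-%m-%dT%H:%M:%SZ"  # GitHub API timestamp format
    return [
        (datetime.strptime(deployed, fmt) - datetime.strptime(merged, fmt)).total_seconds() / 3600
        for merged, deployed in changes
    ]

changes = [
    ("2026-03-01T10:00:00Z", "2026-03-01T10:30:00Z"),  # 0.5 h
    ("2026-03-02T09:00:00Z", "2026-03-02T10:00:00Z"),  # 1.0 h
    ("2026-03-03T08:00:00Z", "2026-03-05T08:00:00Z"),  # 48 h outlier (large refactor)
]
times = lead_times_hours(changes)
print(f"median: {median(times):.1f} h")  # the outlier barely moves the median
print(f"mean:   {sum(times) / len(times):.1f} h")  # the outlier dominates the mean
```

With one large refactor in the sample, the median stays at 1.0 hours while the mean jumps to 16.5 hours, which is exactly why the median is the right aggregate here.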

The nuance: lead time measures the delivery pipeline, not the planning or design cycle. A team that spends three sprints grooming a feature but deploys it in 20 minutes once coding starts looks like an elite performer on lead time. Whether that's good depends on your product context.

| Performance Band | Lead Time |
| --- | --- |
| Elite | Less than one hour |
| High | One hour to one day |
| Medium | One day to one week |
| Low | One week to one month |

3. Change Failure Rate

Change failure rate (CFR) measures the percentage of deployments that result in a degraded service or require remediation — a rollback, hotfix, or patch. It is the primary stability metric in the DORA framework.

Formula:

CFR = (Failed deployments / Total deployments) × 100%

What counts as a failure: A deployment counts as failed if it caused a service degradation that required a rollback, triggered a P0/P1 incident, or required a hotfix deployed within a defined window (typically 24 hours). The exact definition should be standardized across your organization before you start measuring — inconsistent definitions make CFR trends meaningless.
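A sketch of the classification and formula, assuming the three failure criteria above and a 24-hour hotfix window. The dict field names are illustrative, not any real API schema; the point is that the failure definition is encoded once and applied uniformly.

```python
from datetime import timedelta

HOTFIX_WINDOW = timedelta(hours=24)  # assumed org-standard remediation window

def change_failure_rate(deployments):
    """deployments: dicts with 'rolled_back' (bool), 'caused_p1' (bool), and
    'hotfix_after' (timedelta to a follow-up hotfix, or None).
    Illustrative field names, not a real API schema."""
    def failed(d):
        return (
            d["rolled_back"]
            or d["caused_p1"]
            or (d["hotfix_after"] is not None and d["hotfix_after"] <= HOTFIX_WINDOW)
        )
    failures = sum(1 for d in deployments if failed(d))
    return 100.0 * failures / len(deployments)

deploys = [
    {"rolled_back": False, "caused_p1": False, "hotfix_after": None},
    {"rolled_back": True,  "caused_p1": False, "hotfix_after": None},
    {"rolled_back": False, "caused_p1": False, "hotfix_after": timedelta(hours=3)},
    {"rolled_back": False, "caused_p1": False, "hotfix_after": timedelta(hours=30)},  # outside window
]
print(f"CFR: {change_failure_rate(deploys):.0f}%")  # 2 of 4 deploys failed: 50%
```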

Data source: GitHub Deployments API deployment status (failure or inactive after rollback), correlated with incident data from PagerDuty, OpsGenie, or incident.io. The correlation step is where most manual implementations break down — linking a specific deployment to a specific incident requires either manual tagging or automated deployment-to-incident attribution logic.

| Performance Band | Change Failure Rate |
| --- | --- |
| Elite | 0–5% |
| High | 5–10% |
| Medium | 10–15% |
| Low | >15% |

4. Mean Time to Restore (MTTR)

Mean time to restore (MTTR) — sometimes called mean time to recovery — measures how long it takes your team to recover service after a production incident. It is the paired complement to change failure rate: CFR tells you how often you fail, MTTR tells you how badly you fail when you do.

Formula:

MTTR = Mean(incident resolved_at − incident created_at)

Data source: Incident management platforms — PagerDuty, OpsGenie, incident.io. The incident opened timestamp and resolved timestamp are the two data points required. Teams without a dedicated incident tool often attempt to calculate MTTR from GitHub PR merged timestamps for hotfixes, which systematically underestimates true recovery time by ignoring detection lag.

The nuance: MTTR is profoundly sensitive to how your team defines "resolved." If on-call engineers close incidents before the postmortem, MTTR looks excellent. If they leave incidents open until the postmortem is written, it looks terrible. Define "resolved" as service restored to SLO, and enforce it consistently.
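Putting the formula and the "resolved" definition together, a minimal sketch over (created_at, resolved_at) pairs as an incident platform would export them; the sample data is invented.

```python
from datetime import datetime

def mttr_minutes(incidents):
    """incidents: (created_at, resolved_at) ISO-8601 string pairs.
    'resolved' should mean service restored to SLO, applied consistently."""
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    durations = [
        (datetime.strptime(resolved, fmt) - datetime.strptime(created, fmt)).total_seconds() / 60
        for created, resolved in incidents
    ]
    return sum(durations) / len(durations)

incidents = [
    ("2026-03-01T02:00:00Z", "2026-03-01T02:45:00Z"),  # 45 min
    ("2026-03-08T14:00:00Z", "2026-03-08T15:15:00Z"),  # 75 min
]
print(f"MTTR: {mttr_minutes(incidents):.0f} minutes")
```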

| Performance Band | MTTR |
| --- | --- |
| Elite | Less than one hour |
| High | Less than one day |
| Medium | Less than one week |
| Low | One week or more |

How to Calculate DORA from GitHub Data

Most teams have GitHub. Not all teams have a unified analytics platform yet. Here is how to instrument each metric from raw GitHub API data and your existing CI/CD pipeline.

Deployment Frequency from GitHub Deployments API

GitHub's Deployments API records every deployment event your CI/CD pipeline sends it. If you are using GitHub Actions, each successful workflow run that targets production can create a deployment record. If you use an external CI system (CircleCI, Jenkins, Railway, Vercel), configure it to POST a deployment event back to GitHub on successful production deploys.

Query the API at GET /repos/{owner}/{repo}/deployments?environment=production and filter for deployments with a statuses entry of success. Count them per day, week, or month.
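The counting step can be sketched as below. The records are shaped like abridged Deployments API responses; note that the real API reports statuses via a separate endpoint, so the inline `status` field here is an assumption standing in for each deployment's latest resolved status.

```python
from collections import Counter
from datetime import datetime

# Abridged, invented records; 'status' stands in for the latest entry
# from the per-deployment statuses endpoint.
deployments = [
    {"created_at": "2026-03-02T10:00:00Z", "status": "success"},
    {"created_at": "2026-03-02T16:00:00Z", "status": "success"},
    {"created_at": "2026-03-03T11:00:00Z", "status": "failure"},  # excluded from the count
    {"created_at": "2026-03-04T09:00:00Z", "status": "success"},
]

def deploys_per_day(records):
    """Count successful deployments per calendar day."""
    days = [
        datetime.strptime(d["created_at"], "%Y-%m-%dT%H:%M:%SZ").date().isoformat()
        for d in records
        if d["status"] == "success"
    ]
    return Counter(days)

print(dict(deploys_per_day(deployments)))
```

Aggregating to weeks or months is the same grouping with a coarser key.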

Lead Time from PR Merge → Deployment Correlation

Pull merged_at from every merged PR in the period from GET /repos/{owner}/{repo}/pulls?state=closed. Then pull the deployment that contains the merge commit SHA — each deployment has a sha field you can match against the PR's merge_commit_sha. The lead time for that change is deployment.created_at − pull_request.merged_at.
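The join can be sketched as follows, assuming both lists have already been fetched from the endpoints above; fields are abridged and the sample data invented.

```python
from datetime import datetime

# Abridged records: merged PRs and the deployments that shipped them.
prs = [
    {"number": 101, "merged_at": "2026-03-02T09:00:00Z", "merge_commit_sha": "abc123"},
    {"number": 102, "merged_at": "2026-03-02T11:00:00Z", "merge_commit_sha": "def456"},
]
deployments = [
    {"sha": "abc123", "created_at": "2026-03-02T09:40:00Z"},
    {"sha": "def456", "created_at": "2026-03-02T11:20:00Z"},
]

def lead_time_by_pr(prs, deployments):
    """Join PRs to deployments on the merge commit SHA; return minutes per PR."""
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    by_sha = {d["sha"]: d for d in deployments}
    out = {}
    for pr in prs:
        dep = by_sha.get(pr["merge_commit_sha"])
        if dep:  # skip PRs that have not been deployed yet
            delta = datetime.strptime(dep["created_at"], fmt) - datetime.strptime(pr["merged_at"], fmt)
            out[pr["number"]] = delta.total_seconds() / 60
    return out

print(lead_time_by_pr(prs, deployments))  # minutes from merge to deploy, per PR
```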

Change Failure Rate from Deployment Status

For each deployment, fetch its latest status from GET /repos/{owner}/{repo}/deployments/{deployment_id}/statuses. A deployment is failed if its latest status is failure or if a subsequent deployment was created within 24 hours with description matching a rollback convention your team defines. Divide failed deployment count by total deployment count for the period.
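A sketch of that classification logic. The `latest_status` field stands in for the newest entry from the statuses endpoint, and matching "rollback" in the description is one possible team convention; both are assumptions.

```python
from datetime import datetime, timedelta

ROLLBACK_WINDOW = timedelta(hours=24)
FMT = "%Y-%m-%dT%H:%M:%SZ"

def is_failed(deployment, all_deployments):
    """Failed if the latest status is 'failure', or a rollback deployment
    followed within the window. Abridged, assumed record shape."""
    if deployment["latest_status"] == "failure":
        return True
    created = datetime.strptime(deployment["created_at"], FMT)
    for other in all_deployments:
        other_created = datetime.strptime(other["created_at"], FMT)
        if (created < other_created <= created + ROLLBACK_WINDOW
                and "rollback" in other.get("description", "").lower()):
            return True  # a rollback deploy followed within the window
    return False

deploys = [
    {"created_at": "2026-03-02T10:00:00Z", "latest_status": "success", "description": "deploy"},
    {"created_at": "2026-03-02T12:00:00Z", "latest_status": "success", "description": "Rollback of abc123"},
]
failed = [is_failed(d, deploys) for d in deploys]
print(f"CFR: {100 * sum(failed) / len(deploys):.0f}%")  # only the first deploy counts as failed
```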

MTTR Requires an Incident Platform

GitHub alone cannot give you MTTR. You need an incident management tool — PagerDuty, OpsGenie, or incident.io — that records incident open and resolve timestamps. The basic calculation: for each incident in the period, compute resolved_at − created_at, then take the mean across all incidents.

The harder part is incident-to-deployment attribution: which deployment caused this incident? This requires either manual tagging in your incident tool or automated attribution logic that looks for the most recent deployment before the incident opened. Koalr automates this attribution across GitHub, PagerDuty, OpsGenie, and incident.io.
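The naive version of that automated heuristic, "most recent deployment before the incident opened", can be sketched in a few lines. This is an illustrative baseline, not Koalr's actual attribution logic, and the record shapes are invented.

```python
from datetime import datetime

FMT = "%Y-%m-%dT%H:%M:%SZ"

def attribute(incident, deployments):
    """Return the most recent deployment created before the incident opened,
    or None if nothing preceded it. A naive heuristic, not production logic."""
    opened = datetime.strptime(incident["created_at"], FMT)
    earlier = [
        d for d in deployments
        if datetime.strptime(d["created_at"], FMT) < opened
    ]
    # ISO-8601 strings in one format sort chronologically, so max() works directly.
    return max(earlier, key=lambda d: d["created_at"], default=None)

deployments = [
    {"sha": "abc123", "created_at": "2026-03-02T10:00:00Z"},
    {"sha": "def456", "created_at": "2026-03-02T13:00:00Z"},
]
incident = {"created_at": "2026-03-02T14:30:00Z"}
print(attribute(incident, deployments)["sha"])  # the 13:00 deploy is the suspect
```

The failure mode of this heuristic is obvious: if two services deploy independently, the most recent deploy is often not the culprit, which is why real attribution also needs service topology.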

Recording Deployments via GitHub Actions

If you want to start capturing deployment data without a full analytics platform, here is a minimal GitHub Actions step that records a deployment event after a successful production deploy:

- name: Record deployment
  if: success()  # only runs after the deploy steps above succeeded
  uses: actions/github-script@v7
  with:
    script: |
      // Create a deployment record pointing at the commit that just shipped
      const deployment = await github.rest.repos.createDeployment({
        owner: context.repo.owner,
        repo: context.repo.repo,
        ref: context.sha,
        environment: 'production',
        auto_merge: false,
        required_contexts: [],  // skip commit-status checks for this record
        description: 'Production deploy via CI',
      });
      // Mark it successful so it counts toward deployment frequency
      await github.rest.repos.createDeploymentStatus({
        owner: context.repo.owner,
        repo: context.repo.repo,
        deployment_id: deployment.data.id,
        state: 'success',
        environment_url: 'https://your-app.com',
      });

Once you have deployment events in GitHub, any analytics platform — including Koalr — can pull them to calculate deployment frequency and lead time automatically.

Industry Benchmarks by Company Size

DORA benchmarks published in the State of DevOps report are aggregate across all organizations. In practice, what "good" looks like depends heavily on your company stage, team size, and release model. A 10-person startup shipping a SaaS product should not benchmark against the same thresholds as a 2,000-person enterprise shipping an on-premises financial system.

| Metric | Startup (<50 eng) | Growth (50–500 eng) | Enterprise (>500 eng) |
| --- | --- | --- | --- |
| Deploy Frequency | Multiple/day or daily | Daily to weekly | Weekly to monthly |
| Lead Time | <4 hours | 4 hours – 1 day | 1 day – 1 week |
| Change Failure Rate | <10% | <7% | <5% |
| MTTR | <4 hours | <2 hours | <1 hour |

Two factors shift these ranges significantly. First: release model. Teams using feature flags and trunk-based development naturally achieve higher deployment frequency and lower lead time than teams using sprint-based branching — the instrumentation should capture that context rather than forcing everyone onto the same benchmark. Second: service criticality. A payments processing service at a growth company should have lower CFR targets than an internal admin tool at the same company.

The most useful benchmark is your own trajectory over time — are your metrics improving quarter-over-quarter? Comparing against external benchmarks gives direction, but your own trend is the real signal.

What DORA Doesn't Tell You

DORA metrics are outcome metrics. They measure what happened — not what is about to happen. This is the fundamental limitation that every engineering leader who has been using DORA for more than a year eventually encounters.

High deploy frequency ≠ low risk

A team can achieve elite deployment frequency while shipping bugs at scale. In fact, the pressure to improve deployment frequency as a metric (rather than as an outcome of better practices) often makes this worse — teams start shipping smaller changes more often without improving test coverage, review quality, or change risk assessment. The result: more deployments, more incidents, higher CFR. DORA metrics are correlated when they move together organically; they diverge when teams optimize metrics independently.

CFR is a trailing indicator — not a leading one

Change failure rate tells you that something went wrong. By the time a failure is recorded in your CFR, the incident has already happened, users have already been affected, and your on-call engineer has already been woken up. CFR is valuable for trend analysis — watching it improve or deteriorate over weeks — but it provides zero warning before a specific deployment that is about to cause an incident.

Lead time ignores what is inside the change

A PR that touches 50 files across three services, drops test coverage by 12%, and is authored by a developer who has never committed to the payment processing module before can have the exact same lead time as a one-line config change authored by the team lead who owns that file. Lead time measures pipeline speed, not change quality or risk.

MTTR measures response, not prevention

Improving MTTR is about making your incident response machinery faster: better runbooks, faster paging, better dashboards, faster rollbacks. These are all worth doing. But they are all downstream of the incident having already occurred. A team with a two-hour MTTR that ships 15 incidents per month has a worse user experience outcome than a team with a four-hour MTTR that ships two incidents per month.

This is why deploy risk prediction exists

DORA measures outcomes after the fact. Deploy risk prediction operates before the merge — scoring every PR 0–100 based on change size, author expertise, test coverage delta, review thoroughness, and historical failure patterns. It answers the question DORA cannot: is this specific change about to become a CFR data point?
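To make the idea tangible, here is a purely illustrative, hand-weighted scoring sketch over the signal categories just listed. Koalr's actual model and weights are not public; every field name, weight, and threshold below is an assumption for illustration only.

```python
def risk_score(change):
    """Illustrative 0-100 risk score. All weights and fields are invented
    for this sketch; a real model would be learned from historical failures."""
    score = 0
    score += min(change["files_touched"], 50)                     # change size, capped at 50 points
    score += 20 if change["coverage_delta"] < 0 else 0            # test coverage dropped
    score += 15 if change["author_commits_to_area"] == 0 else 0   # author new to this module
    score += 15 if change["reviewers"] == 0 else 0                # merged without review
    return min(score, 100)

risky = {"files_touched": 50, "coverage_delta": -12, "author_commits_to_area": 0, "reviewers": 1}
safe = {"files_touched": 1, "coverage_delta": 0, "author_commits_to_area": 40, "reviewers": 2}
print(risk_score(risky), risk_score(safe))  # 85 1
```

Even this toy version shows the shape of the answer DORA cannot give: a per-change number available before merge, not an aggregate available after the incident.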

DORA Tools Comparison

The DORA tooling market has matured substantially since 2022. Most major engineering analytics platforms now cover all four metrics. Where they differ is in what comes after DORA — the predictive, preventive layer that DORA cannot provide on its own.

| Capability | Koalr | Jellyfish | Swarmia | LinearB |
| --- | --- | --- | --- | --- |
| All four DORA metrics | ✓ | ✓ | ✓ | ✓ |
| Deploy risk prediction (0–100) | ✓ | ✗ | ✗ | ✗ |
| LLM chat on live eng data | ✓ | ✗ | ✗ | ✗ |
| CODEOWNERS sync & enforcement | ✓ | ✗ | ✗ | ✗ |
| Coverage–risk correlation | ✓ | ✗ | ✗ | ✗ |
| GitHub, Jira, Linear integrations | ✓ | ✓ | ✓ | ✓ |

The unique capabilities in Koalr's column — deploy risk prediction, LLM chat, CODEOWNERS sync, and coverage correlation — are the features that operate upstream of DORA metrics, helping teams prevent incidents rather than measure them after they happen. They are also the features with no equivalent in any other platform in the market today.

If you are evaluating DORA tools and all four core metrics are table stakes for the shortlist, the decision comes down to what else the platform does. Does it tell you something you could not compute yourself? Does it help you act, or only report?

Start tracking DORA for free

Connect GitHub and your incident tool in under 5 minutes. Koalr calculates all four DORA metrics automatically — plus deploy risk prediction on every open PR, so you can act before CFR becomes a problem.