Engineering Metrics · March 18, 2026 · 10 min read

Why Your DORA Metrics Are Lying to You (And How to Fix It)

Most engineering teams implementing DORA metrics for the first time make the same set of measurement mistakes. The numbers come out, they look reasonable, and the team starts tracking trends — not realizing that the underlying data has systematic errors that make the metrics meaningless or, worse, actively misleading. Here are the four most common mistakes and what to do about each one.

The stakes of bad DORA data

When DORA metrics are calculated incorrectly, engineering leaders make investment decisions — in tooling, headcount, process change — based on a signal that does not reflect reality. The cost is not just incorrect numbers on a dashboard. It is misallocated improvement effort.

Mistake 1: Counting PR Merges Instead of Deployments

This is the most common mistake, and it produces systematically optimistic deployment frequency numbers. A PR merged to main is not a production deployment. If your team merges 15 PRs per day but deploys to production once per day (after a CI pipeline finishes), your deployment frequency is once per day — not 15 times per day.

Teams make this mistake for understandable reasons: GitHub PR data is easy to query, and PR merges feel like meaningful delivery events. And for some teams — particularly those with continuous deployment where every merge to main triggers an immediate production deploy — PR merges and deployments are nearly equivalent. But for any team with a batch deploy step, a deployment environment pipeline (dev → staging → production), or a human approval gate before production, they are not the same.

How to fix it

Define a deployment as a production deployment specifically. Use the GitHub Deployments API filtered to environment=production, or parse GitHub Actions workflow runs for jobs named after your production deploy step. If you use Vercel, Railway, or ArgoCD, pull the deployment events from those platforms. The key is that "deployed to production" must be the event you count — not "merged to main."
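As a sketch, here is what the counting step can look like in Python, assuming you have already fetched deployment events from the GitHub Deployments API (GET /repos/{owner}/{repo}/deployments — the environment and created_at fields below match that response shape; deploys_per_day is an illustrative helper, not a library function):

```python
from collections import Counter
from datetime import datetime

def deploys_per_day(events):
    """Count production deployments per calendar day.

    `events` is a list of dicts shaped like the GitHub Deployments API
    response: each has an "environment" and an ISO-8601 "created_at".
    """
    days = Counter()
    for e in events:
        if e.get("environment") != "production":
            continue  # skip staging/preview deploys -- not delivery events
        # GitHub timestamps end in "Z"; normalize for fromisoformat on <3.11
        ts = datetime.fromisoformat(e["created_at"].replace("Z", "+00:00"))
        days[ts.date()] += 1
    return days
```

Filtering on environment is the whole point: with this data source, a day with 15 merged PRs and one production deploy correctly counts as one.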

Diagnostic: Compare your raw PR merge count per day to your deployment event count per day. If they match, you are either doing true continuous deployment (correct) or counting PRs as deployments (incorrect). Interview your on-call engineers — ask them when they would consider a change "live." If the answer is different from the event you are counting, fix your data source.

Mistake 2: Using UTC Timestamps Without Timezone Adjustment

This error is subtle but consequential, particularly for change failure rate analysis and any deployment timing work. GitHub stores all timestamps in UTC. A deployment at 4:30pm on a Friday in San Francisco is stored as 00:30 Saturday UTC. If you are doing day-of-week analysis — looking at which days have higher CFR, or whether Friday afternoon deploys fail more often — UTC timestamps will attribute Friday failures to Saturday and distort the pattern.

For global teams with engineers in multiple time zones, the problem compounds: a deployment at 9am IST (Indian Standard Time) is 3:30am UTC. Your "morning deployment cluster" in the data may actually span two calendar days when viewed by the team making the changes.

How to fix it

Always store and display timestamps in the local timezone of the team responsible for the deployment. For multi-timezone teams, use the primary office timezone or the timezone where the on-call rotation is based. When computing day-of-week or time-of-day distributions for risk analysis, convert to local time before bucketing.

In Python 3.9+, use the standard-library zoneinfo module: datetime.astimezone(ZoneInfo('America/Los_Angeles')). In JavaScript: new Date(ts).toLocaleString('en-US', {timeZone: 'America/Los_Angeles'}). Store the timezone alongside the timestamp in your metrics database so you can recompute if the team's primary timezone changes.
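A minimal sketch of the Friday-vs-Saturday distortion, using a January date so the offset is plain PST (UTC-8, no DST complications):

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # Python 3.9+ standard library

# GitHub stores the deploy as 00:30 Saturday UTC...
utc_ts = datetime(2026, 1, 17, 0, 30, tzinfo=timezone.utc)

# ...but the team that shipped it is in San Francisco.
local_ts = utc_ts.astimezone(ZoneInfo("America/Los_Angeles"))

print(utc_ts.strftime("%A"))    # Saturday -- what naive bucketing sees
print(local_ts.strftime("%A"))  # Friday   -- a 4:30pm Friday deploy
```

Bucket by the local timestamp before any day-of-week analysis; otherwise late-Friday deploys inflate your Saturday counts.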

Mistake 3: Measuring Repository-Level Frequency, Not Service-Level

A monorepo that deploys five microservices from a single repository will show wildly different deployment frequency numbers depending on whether you measure at the repo level or the service level. If the monorepo has one CI trigger per merge to main, but that trigger deploys three of five services (only the ones affected), measuring at the repo level overstates frequency for some services and understates it for others.

This matters because the DORA research correlates service-level deployment frequency with service-level stability. The correlation breaks down when you aggregate across unrelated services in a single measurement bucket. A high-frequency internal tooling service and a low-frequency payment processing service average out to a meaningless medium-frequency number.

How to fix it

Tag deployments with the service or component they affect. In the GitHub Deployments API, use the task or description field to specify which service the deployment is for. In GitHub Actions, create separate deployment job steps per service and record separate deployment events per service. Aggregate DORA metrics at the service level, then roll up to team and organization level with appropriate weighting.

Scenario                                  Repo-Level Frequency          Service-Level (Correct)
Monorepo, 5 services, 10 deploys/day      10/day                        2/day per service (avg)
Polyrepo, 5 repos, each deploys 2/day     2/day (per repo)              2/day per service (correct)
Single repo, mixed-frequency services     5/day (misleading average)    frontend: 10/day, payments: 0.5/day
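Once deployment events carry a service tag, the per-service aggregation is a one-pass count. A sketch, assuming each event dict has a "service" key (however you populate it — via the task field, a workflow input, or a platform label); per_service_frequency is an illustrative helper:

```python
from collections import defaultdict

def per_service_frequency(deployments, days):
    """Deploys/day per service from service-tagged deployment events.

    Aggregating per tag avoids the misleading repo-level average that
    mixes a high-frequency service with a low-frequency one.
    """
    counts = defaultdict(int)
    for d in deployments:
        counts[d["service"]] += 1
    return {svc: n / days for svc, n in counts.items()}
```

Over a two-day window, 20 frontend deploys and 1 payments deploy come out as frontend: 10/day and payments: 0.5/day — the repo-level view would have blended them into one number.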

Mistake 4: Cherry-Picking the Lookback Window

This is the measurement error that is hardest to detect from outside the team, because it is often unconscious rather than deliberate. DORA metrics are sensitive to the lookback window. A team that had a bad month in January but a good February will look excellent if you measure the last 30 days in March. The same team looks poor if you measure a rolling 90-day window.

The problem with selective windows is that trend analysis becomes meaningless. If your engineering leadership reviews DORA metrics monthly and the window always resets, you can have consistently poor performance masked by naturally occurring variance within each window. The metric appears to oscillate around a stable average while the underlying practices never change.

How to fix it

Commit to a fixed lookback window and never change it retroactively. The DORA recommendation is a rolling 30-day window for operational awareness and a rolling 90-day window for trend analysis. Track both. Display them side by side. Never use a window shorter than 14 days — the variance from a two-week sample is too high to draw meaningful conclusions from DORA metrics.

Additionally, display the metric trend rather than just the current value. A deployment frequency of 2.3 per day is meaningless without context. A deployment frequency that has improved from 0.8 per day six months ago to 2.3 per day today is a signal worth celebrating and investigating. A deployment frequency that dropped from 4.1 per day to 2.3 per day over the same period is a signal worth investigating for a different reason.

Bonus Mistake: Using Mean Instead of Median for Lead Time

Lead time distributions are heavily right-skewed. The vast majority of PRs — the small bug fixes, dependency updates, and incremental features — have lead times of 30 minutes to 4 hours. But every team has occasional large refactors, infrastructure migrations, or multi-week feature branches that merge with a lead time of days or weeks.

If you calculate mean lead time, these outliers dominate the result. A team with 95% of changes under 2 hours and 5% of changes over 5 days will have a mean lead time that looks alarming, even though the typical developer experience is excellent. Median lead time correctly shows the experience of the typical change.

Use median for central tendency and p90 (90th percentile) for tail behavior. Report them together: "Median lead time: 1.5 hours, p90: 18 hours." The p90 catches the chronic outliers — the kinds of changes that should probably be broken up — without letting them distort the median.
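The median/p90 pair is a few lines to compute; a sketch using a nearest-rank p90 (one of several percentile conventions — numpy and your BI tool may interpolate slightly differently):

```python
import math

def lead_time_summary(hours):
    """Median and p90 of lead times in hours (nearest-rank p90)."""
    s = sorted(hours)
    n = len(s)
    median = s[n // 2] if n % 2 else (s[n // 2 - 1] + s[n // 2]) / 2
    p90 = s[math.ceil(0.9 * n) - 1]
    return median, p90

# 4 quick changes and 1 five-day outlier:
lead_time_summary([0.5, 1, 1.5, 2, 120])  # -> (1.5, 120)
```

The median (1.5 hours) reflects the typical change; the p90 surfaces the 120-hour outlier without letting it drag the headline number. The mean of the same data is 25 hours — alarming, and representative of nothing.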

Koalr handles all of these correctly by default

Koalr measures deployment frequency from actual GitHub Deployment events (not PR merges), converts all timestamps to team timezone, segments metrics by service, uses rolling 30-day and 90-day windows, and reports median lead time with p90 tail. The instrumentation is correct by default so your team can focus on improving the metrics rather than debugging whether they are calculated right.

A Checklist for Validating Your DORA Implementation

  • Deployment frequency: Does the count match the number of times your on-call engineer would say "we deployed today" if asked? If not, you are counting the wrong event.
  • Lead time: Pick 5 recent PRs. Manually calculate lead time for each (merge timestamp to deploy timestamp). Do they match your tool? If not, the SHA correlation between your merge commits and deploy events is broken.
  • Change failure rate: Ask your on-call engineers to name the last three incidents caused by deployments. Do those incidents appear in your CFR data? If not, your incident attribution is incomplete.
  • MTTR: Pick 3 recent incidents from your incident tool. Do the durations in your MTTR calculation match the actual timeline? Confirm your "resolved" timestamp definition matches when the engineers actually closed the incident.
  • Timezones: If your team is in UTC-8 (PST), does your data show more Friday deployments or Saturday deployments? If Saturday, your timezone conversion is wrong.

Get DORA metrics you can trust

Koalr connects to GitHub, PagerDuty, OpsGenie, and incident.io and calculates DORA metrics with correct service-level segmentation, timezone handling, and deployment event attribution. No manual implementation required.