Engineering Metrics · March 16, 2026 · 7 min read

How to Measure Engineering Productivity in 2026 (Without Destroying It)

Every engineering leader wants to know if their team is productive. The problem is that most of the obvious ways to measure it — commits per day, lines of code, story points completed — actively make things worse. This guide covers what to measure instead, why it works, and how to build a measurement system your engineers will actually trust.

What this guide covers

Why productivity measurement backfires, the DORA + SPACE + DevEx framework that works, five GitHub-measurable signals, team vs. individual metrics, AI tool adjustments, DORA tier benchmarks, and a 90-day improvement plan.

Why Measuring Productivity Is Hard: The Observer Effect

In quantum mechanics, the observer effect describes how the act of measuring a system changes what you are measuring. Engineering productivity has its own version of this problem. The moment you make a metric visible and tie it to performance, engineers optimize for the metric rather than the underlying outcome you were trying to capture.

Track commits per day and engineers make smaller, more frequent commits — not because smaller commits are better, but because the number goes up. Track story points closed and point estimates inflate over time. Track lines of code and engineers stop refactoring, because deleting code makes the number worse. Every output metric has a corresponding gaming strategy, and engineers — who are excellent at optimizing systems — will find it quickly.

This does not mean productivity is unmeasurable. It means you need to measure outcomes and system health rather than individual outputs. The rest of this guide is about how to do that.

The Wrong Way: Commits, Lines of Code, Story Points

Before covering the right approach, it is worth understanding specifically why the common proxies fail — because they are still widely used, often by well-intentioned managers who inherited them.

Commits per day rewards commit frequency, not value delivered. An engineer who commits ten small, logically coherent chunks of work looks more productive than one who commits two large, high-impact changes — even if the latter delivered ten times the business value.

Lines of code is perhaps the most thoroughly debunked metric in software engineering. It penalizes simplicity, rewards verbosity, makes refactoring invisible, and varies wildly by language, framework, and problem type. A 50-line function that solves a genuinely hard problem is worth far more than 500 lines of boilerplate. Measuring lines of code is measuring the wrong thing by definition.

Story points closed is more defensible — at least it attempts to measure completed work. But story points are relative estimates made at planning time under uncertainty. They are not a consistent unit of value. A team that inflates estimates to look productive will close the same number of story points as a team that right-sizes estimates and ships more work. The metric is too easy to game and too hard to compare across teams.

All three of these metrics share a common failure mode: they measure activity, not outcomes. Productive teams deliver working software to users quickly and reliably. The metrics you track should connect as directly as possible to that definition.

The Right Framework: DORA + SPACE + DevEx Combined

Three research-backed frameworks have emerged as the foundation for modern engineering productivity measurement. Used together, they cover outcomes, human factors, and system health in a way that is much harder to game than output metrics.

DORA Metrics

The DevOps Research and Assessment (DORA) metrics, developed through six years of research into high-performing software organizations, identify four key metrics that predict organizational performance:

  • Deployment frequency: How often an organization successfully releases to production
  • Lead time for changes: The time it takes for a code commit to reach production
  • Change failure rate: The percentage of deployments that result in a service degradation requiring remediation
  • Time to restore service: How long it takes to recover from a failure in production

DORA metrics are outcome-focused: they measure whether code is flowing to production quickly and whether that flow is stable. They cannot be easily gamed — you cannot fake a deployment frequency improvement without actually deploying more often.

SPACE Framework

Developed by researchers at GitHub, Microsoft Research, and the University of Victoria, SPACE adds five dimensions that DORA alone cannot capture:

  • Satisfaction and wellbeing
  • Performance
  • Activity
  • Communication and collaboration
  • Efficiency and flow

SPACE explicitly includes developer satisfaction, which matters both as a leading indicator of productivity and as an outcome worth caring about in its own right. Dissatisfied engineers leave; attrition is one of the most expensive productivity problems a team can have.

DevEx (Developer Experience)

DevEx, formalized in a 2023 ACM Queue paper co-authored by Abi Noda, focuses on three core dimensions: feedback loops, cognitive load, and flow state. It operationalizes developer experience as something measurable — typically through regular surveys covering tool friction, interruption frequency, and ease of understanding the codebase.

The combination of all three frameworks gives you: system-level outcomes (DORA), human factors and satisfaction (SPACE), and day-to-day friction (DevEx). Together they cover the full picture of what a productive engineering organization actually looks like.

Five GitHub-Measurable Productivity Signals

Abstract frameworks are useful for orientation, but you need concrete, automated metrics you can track on a regular cadence. Here are five signals that can be measured entirely from GitHub data — no surveys, no manual collection — and that connect directly to DORA outcomes.

1. Cycle Time

Cycle time measures the elapsed time from a PR's first commit to the moment it is merged and deployed. It is the closest GitHub-measurable proxy for DORA's lead time for changes.

Decompose cycle time into its component stages: coding time (first commit to PR open), review wait time (PR open to first review), review time (first review to approval), and merge-to-deploy time. Each stage has a different fix. If coding time is high, the task was too large or the engineer lacked context. If review wait time is high, you have a reviewer availability or assignment problem. Decomposition turns a single number into an actionable diagnosis.
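The decomposition is simple arithmetic once you have the PR's event timestamps. A minimal sketch in Python — the function and field names are illustrative, and pulling the datetimes from the GitHub API (commits, PR timeline, reviews, deployments) is left to your pipeline:

```python
from datetime import datetime, timedelta

def decompose_cycle_time(first_commit, pr_opened, first_review, approved, deployed):
    """Split a PR's cycle time into the four stages described above.

    Each argument is a datetime extracted upstream from GitHub data.
    """
    return {
        "coding_time": pr_opened - first_commit,
        "review_wait_time": first_review - pr_opened,
        "review_time": approved - first_review,
        "merge_to_deploy": deployed - approved,
        "total": deployed - first_commit,
    }

stages = decompose_cycle_time(
    first_commit=datetime(2026, 3, 2, 9, 0),
    pr_opened=datetime(2026, 3, 2, 15, 0),
    first_review=datetime(2026, 3, 3, 10, 0),
    approved=datetime(2026, 3, 3, 14, 0),
    deployed=datetime(2026, 3, 3, 16, 30),
)
# Here review_wait_time (19h) dominates the 31.5h total:
# a reviewer availability problem, not a coding-speed problem.
```

In this worked example the single "cycle time" number (31.5 hours) would suggest a generic speed problem; the breakdown points squarely at review wait.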

2. Deployment Frequency

Deployment frequency is a direct DORA metric. Tracked from GitHub deployments or CI/CD pipeline events, it tells you how often working software actually reaches production. Elite teams deploy multiple times per day. High-performing teams deploy daily to weekly. Teams deploying less than once per month have a flow problem worth investigating.

Deployment frequency tends to be a lagging indicator of other improvements. When cycle time drops and PR size shrinks, deployment frequency rises naturally — because smaller changes flow faster through the pipeline.
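A simple way to track the trend is to bucket deployment events by ISO week. This sketch assumes you have already extracted production deployment dates from the GitHub Deployments API or your CI/CD events (extraction not shown):

```python
from collections import Counter
from datetime import date

def weekly_deploy_counts(deploy_dates):
    """Count production deployments per (ISO year, ISO week) bucket.

    deploy_dates: an iterable of date objects for successful deployments.
    """
    return Counter((d.isocalendar()[0], d.isocalendar()[1]) for d in deploy_dates)

counts = weekly_deploy_counts([
    date(2026, 3, 2), date(2026, 3, 4),   # two deploys in ISO week 10
    date(2026, 3, 11),                    # one deploy in ISO week 11
])
```

Plotting these weekly counts over a quarter makes the lagging-indicator effect visible: the deployment line typically rises a few weeks after cycle time and PR size improve.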

3. PR Review Time

Time from PR open to first substantive review comment. This is a narrower slice of cycle time but worth tracking separately because it is the most actionable bottleneck for most teams. When first-review time exceeds four hours, context-switching costs accumulate and engineers spend significant time re-engaging with stale work rather than making forward progress.
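Tracking this against a four-hour SLA can be sketched as follows, assuming you have already paired each PR's open time with its first substantive review time (the data shape here is illustrative):

```python
from datetime import datetime, timedelta
from statistics import median

REVIEW_SLA = timedelta(hours=4)

def review_sla_report(prs):
    """prs: list of (opened_at, first_review_at) datetime pairs.

    Returns the median wait and the share of PRs breaching the SLA.
    """
    waits = [review - opened for opened, review in prs]
    breaches = sum(w > REVIEW_SLA for w in waits)
    return {
        "median_wait": median(waits),
        "sla_breach_pct": round(100 * breaches / len(waits), 1),
    }

report = review_sla_report([
    (datetime(2026, 3, 2, 9, 0), datetime(2026, 3, 2, 11, 0)),  # 2h wait
    (datetime(2026, 3, 2, 9, 0), datetime(2026, 3, 2, 14, 0)),  # 5h wait
    (datetime(2026, 3, 2, 9, 0), datetime(2026, 3, 2, 15, 0)),  # 6h wait
])
```

The median matters more than the mean here: a few PRs opened overnight will inflate an average without reflecting the typical engineer's experience.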

4. Rework Rate

Rework rate measures the proportion of code changed within a short window after initial merge — typically 14 to 30 days. High rework rate is a signal that code is being shipped before it is ready, that requirements were unclear, or that test coverage is insufficient to catch regressions before they reach production.

Rework is expensive not just because the original work was wasted but because rework disrupts other work in progress. Every bug fix or rushed patch is a context switch that pulls an engineer away from planned work and creates new merge conflict risk.
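One way to approximate rework rate at the line level — assuming an upstream step that tracks line identity across commits (e.g. via git blame), which is not shown here:

```python
from datetime import datetime, timedelta

REWORK_WINDOW = timedelta(days=21)  # within the 14-30 day range above

def rework_rate(merged_lines, later_edits):
    """Fraction of merged lines edited again within the rework window.

    merged_lines: {(file, line_key): merged_at} for the period under study
    later_edits:  [((file, line_key), edited_at), ...] from later commits
    """
    reworked = {
        key
        for key, edited_at in later_edits
        if key in merged_lines
        and timedelta(0) < edited_at - merged_lines[key] <= REWORK_WINDOW
    }
    return len(reworked) / len(merged_lines)

merged = {("app.py", i): datetime(2026, 3, 1) for i in range(5)}
edits = [
    (("app.py", 0), datetime(2026, 3, 10)),  # inside the 21-day window
    (("app.py", 1), datetime(2026, 4, 15)),  # outside the window: not rework
]
rate = rework_rate(merged, edits)  # 1 of 5 merged lines reworked -> 0.2
```

The window length is a tuning decision: shorter windows catch only rushed patches, longer windows start to conflate rework with normal iteration.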

5. Test Coverage Delta

Rather than tracking absolute test coverage (which tells you very little on its own), track coverage delta per PR: is the PR adding, maintaining, or reducing coverage of the files it touches? Coverage delta is a leading indicator of quality debt accumulation. PRs that consistently reduce coverage are building a fragile codebase that will generate future rework. Coverage delta gates — requiring net-neutral or positive coverage — are one of the most cost-effective quality investments a team can make.
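A coverage delta gate can be sketched as a comparison of two coverage reports restricted to the files the PR touched. The report format and file names here are illustrative; in CI you would parse them from your coverage tool's output:

```python
def coverage_delta_gate(before, after, touched_files, tolerance=0.0):
    """Return (passed, delta) for a net-neutral-or-positive coverage gate.

    before/after: {filename: line coverage as a fraction 0.0-1.0},
    taken from coverage reports on the base branch and the PR branch.
    Only files the PR touched are considered, matching the per-PR framing.
    """
    def mean_cov(report):
        return sum(report.get(f, 0.0) for f in touched_files) / len(touched_files)

    delta = mean_cov(after) - mean_cov(before)
    return delta >= -tolerance, delta

passed, delta = coverage_delta_gate(
    before={"api.py": 0.80, "util.py": 0.60},
    after={"api.py": 0.80, "util.py": 0.70},
    touched_files=["api.py", "util.py"],
)
# This PR raised average touched-file coverage by 5 points, so the gate passes.
```

In a real pipeline the boolean result would set the CI job's exit code, blocking merges that push `delta` below `-tolerance`.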

Signal                 DORA Link                        Elite Target    At-Risk
Cycle time             Lead time for changes            <1 day          >7 days
Deployment frequency   Deployment frequency             Multiple/day    <1/month
PR review time         Lead time (review stage)         <4 hours        >72 hours
Rework rate            Change failure rate              <5%             >20%
Test coverage delta    Change failure rate (leading)    Net positive    Consistent decline

Team vs. Individual Metrics

This is one of the most important design decisions in any engineering measurement program, and the default answer should be: measure at the team level.

Individual productivity metrics create a set of well-documented problems. Engineers optimize for their individual numbers rather than team outcomes. Collaboration suffers because helping a colleague costs you time that could have gone into your own output metrics. Knowledge sharing slows because explaining things to others does not show up in any individual metric. Harder problems get avoided because they take longer and produce fewer measurable outputs.

Team-level metrics align incentives correctly. If the team's cycle time is high, everyone has an incentive to help the slowest PRs move through review. If the team's deployment frequency is low, everyone has an incentive to break work into smaller, independently deployable pieces. The team wins together or loses together.

There are legitimate use cases for individual-level visibility. Engineers should be able to see their own metrics — not as a performance evaluation tool but as a personal improvement feedback loop. Managers may need individual visibility to identify coaching opportunities or to recognize when someone is struggling. The key distinction is visibility for development versus visibility for evaluation. Individual metrics used for performance evaluation create the incentive problems above. Individual metrics used as a self-improvement tool, or shared privately between a manager and an engineer in a coaching context, can be genuinely useful.

A reasonable default policy: all metrics are visible at the team level by default. Individual breakdowns are visible to the individual and their manager. No individual metric is used directly in performance evaluations without significant context and human judgment.

How AI Tools (Copilot, Cursor) Change Productivity Measurement

AI coding assistants are now a significant variable in engineering productivity, and most existing measurement frameworks were not designed with them in mind. Several adjustments are worth making.

Output velocity will increase, but so will review burden. Engineers using AI tools write more code faster. This is generally positive for deployment frequency and cycle time at the coding stage. But it shifts the bottleneck: AI-generated code still needs review, and reviewers are not moving faster just because the author was. PR review time may actually increase on teams that adopt AI tools without adjusting their review process, because more code is being submitted without a corresponding increase in reviewer capacity.

Coverage delta becomes more important, not less. AI-generated code often lacks test coverage. Engineers using Copilot or Cursor to generate implementations may not be generating corresponding tests. Tracking coverage delta per PR catches this drift before it becomes a quality problem. If AI adoption on your team correlates with declining coverage delta, the fix is explicit coverage gates in CI, not reducing AI tool usage.

Rework rate is your AI quality signal. Code that passes review but is incorrect or incomplete will show up as rework — changes to AI-generated code shortly after merge. Tracking rework rate by author and by PR tag (for teams that tag AI-assisted PRs) lets you measure whether AI tool usage correlates with higher or lower downstream rework, which is the real question about AI code quality.

AI-authored code percentage is itself a metric. Knowing what fraction of merged code was AI-generated (estimated from Copilot telemetry or AI comment patterns) gives you context for interpreting other metrics. A team with 60% AI-generated code that has stable rework rates and improving coverage delta is using AI well. The same team with rising rework and declining coverage is accumulating risk.

DORA Tier Benchmarks: What Good Looks Like

The DORA research identifies four performance tiers: Elite, High, Medium, and Low. The thresholds are derived from the annual State of DevOps Report, which surveys thousands of organizations, giving them strong empirical backing as a benchmarking reference.

Metric                    Elite          High           Medium           Low
Deployment frequency      Multiple/day   Daily–weekly   Weekly–monthly   <1 per 6 months
Lead time for changes     <1 hour        1 day–1 week   1 week–1 month   >6 months
Change failure rate       0–5%           5–10%          10–15%           46–60%
Time to restore service   <1 hour        <1 day         1 day–1 week     >1 week

One important nuance: DORA tier membership is a team characteristic, not an organizational one. Large organizations frequently have elite-tier teams and low-tier teams coexisting. The benchmark is most useful as a team-by-team diagnostic rather than an organization-wide label.

If you are below High on any metric, it is worth understanding the constraint before setting a target. Teams with low deployment frequency are usually bottlenecked by release process, not development speed. Teams with high change failure rate usually have a testing or review quality problem, not a speed problem. Moving up DORA tiers requires addressing the right constraint, not just setting a higher target number.

A 90-Day Measurement Improvement Plan

Rolling out productivity measurement to an engineering organization is as much a cultural challenge as a technical one. Engineers who have been burned by measurement systems used punitively will be skeptical. The following phased approach builds trust while getting you to a working measurement system in 90 days.

Days 1–30: Baseline Without Judgment

The first month is about measurement only — no targets, no reviews, no sharing of individual data. Instrument your GitHub data to capture cycle time, deployment frequency, PR review time, rework rate, and coverage delta. Run DORA calculations for the past 90 days to establish a historical baseline. Share only team-level aggregates, and frame the entire exercise as learning where you are starting from, not evaluating anyone.

Communicate explicitly: these metrics are not being used in performance reviews. This communication needs to come from engineering leadership and be repeated multiple times to be believed. If engineers have reason to doubt it — because measurement has been used punitively in the past — the skepticism is rational and you should expect it.

Days 31–60: Identify the Constraint

With baseline data in hand, identify the single biggest constraint in your delivery pipeline. Is most of the cycle time in review wait time? Then first-review latency is your focus. Is change failure rate high? Then test coverage or review quality needs attention. Is deployment frequency low despite fast cycle times? Then release process is the bottleneck.

Pick one metric to improve. Set a team target — not an individual one — and design a specific intervention. If review wait time is the constraint, the intervention might be CODEOWNERS enforcement and a four-hour review SLA. If change failure rate is the constraint, it might be a coverage delta gate in CI. Run the intervention for 30 days and measure the result.

Days 61–90: Formalize and Expand

By day 60 you should have a baseline, a tested intervention, and some evidence of whether the intervention worked. Use this as the foundation for a more formal metrics review cadence: a monthly engineering metrics review where the team looks at all five signals together, celebrates improvements, and identifies the next constraint to address.

Expand the framework to include developer experience signals — either through a brief monthly survey (five questions, five minutes) or through a structured DevEx interview process. The combination of quantitative GitHub metrics and qualitative developer experience signals gives you the full picture: are the numbers improving and do engineers feel like things are improving? When those two diverge, the divergence is itself informative.

By day 90 you should have: a repeatable measurement system, one demonstrated improvement from an evidence-based intervention, and a team that has seen metrics used constructively rather than punitively. That foundation makes everything that follows easier.

Get your DORA metrics automatically from GitHub

Koalr connects to your GitHub repositories and calculates cycle time, deployment frequency, change failure rate, and all five productivity signals automatically — no manual data collection, no spreadsheets. Your team's DORA baseline is ready within minutes of connecting your first repository.