Engineering Metrics · March 16, 2026 · 12 min read

Engineering Efficiency Metrics: Beyond Story Points and Velocity

Story points. Commit counts. Lines of code. These are the metrics engineering leaders reach for when they need to show the business that the team is working hard. They are also three of the most misleading signals you can track. Here is how to replace them with a framework that actually reflects team health — without creating the gaming behavior that kills morale.

What this guide covers

  • Why traditional productivity metrics fail
  • The three-category framework (Output, Outcome, Process) and which specific metrics belong in each
  • Goodhart's Law in engineering
  • Building a balanced scorecard with DORA + SPACE
  • How to present efficiency data to leadership without incentivizing gaming

Why Traditional "Productivity" Metrics Fail

Story points were designed as a planning tool — a way for teams to estimate relative effort so they could fit work into a sprint. They were never designed as a performance metric. When organizations start tracking story points delivered per developer per sprint, two things happen reliably: inflation (teams calibrate estimates upward to hit targets) and quality degradation (hitting the number becomes more important than the number being meaningful).

Lines of code and commit counts have the same problem at a worse scale. A junior developer who writes 500 lines of tangled, untested code is not five times more productive than a senior who rewrites it as 100 lines. A developer with 30 commits per week who is splitting obvious changes into atomic micro-commits to look active is not three times more productive than a teammate with 10 thoughtful, well-described commits.

The fundamental failure of these metrics is that they measure activity, not value. A team can be extremely busy — high story point velocity, constant commits, full sprints — while delivering nothing customers care about, accumulating technical debt, and burning out their best engineers. The numbers look great in the board deck. The product does not.

The Right Framework: Output, Outcome, and Process

A useful engineering metrics framework separates signals into three categories, each answering a different question:

  • Output metrics — What did the team ship? Counts of work items completed. Necessary but not sufficient.
  • Outcome metrics — Did what the team shipped work reliably and create value? Quality and impact of the output.
  • Process metrics — How efficiently is work flowing through the system? Speed, bottlenecks, and waste in the delivery pipeline.

No single category is sufficient on its own. High output with poor outcomes means the team is shipping bugs fast. Good outcomes with poor process means the team is reliable but slow. Efficient process with low output means the pipeline is smooth but underloaded. You need signal from all three to understand whether an engineering organization is actually performing well.

Output Metrics: Throughput Without Inflation

Output metrics should count units of completed work that have a consistent, team-independent definition. The key word is consistent — if developers can change the definition by adjusting their behavior, the metric is gameable.

PRs merged per sprint is the most reliable output metric available from GitHub. A PR is either merged or it isn't — there is no inflation mechanism. Pair it with median PR size (lines changed) to distinguish between high-throughput teams making meaningful changes and teams artificially splitting work.
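As a sketch, both metrics can be computed from PR records pulled from the GitHub Pull Requests API. The dict shape and field names below are simplified assumptions for illustration, not the raw API schema:

```python
from datetime import date
from statistics import median

def output_metrics(prs, sprint_start, sprint_end):
    """PRs merged within the sprint window, plus median PR size.

    Each `pr` dict is assumed to carry `merged_at` (a date, or None if
    unmerged) and `additions` / `deletions` line counts.
    """
    merged = [
        pr for pr in prs
        if pr["merged_at"] and sprint_start <= pr["merged_at"] <= sprint_end
    ]
    sizes = [pr["additions"] + pr["deletions"] for pr in merged]
    return {
        "prs_merged": len(merged),          # no inflation mechanism: merged or not
        "median_pr_size": median(sizes) if sizes else 0,
    }
```

Reporting the pair together is the point: a spike in `prs_merged` with a collapse in `median_pr_size` is the signature of artificial splitting.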

Deployment frequency measures how often the team successfully ships to production. It is the DORA throughput metric and one of the strongest predictors of business performance in the research literature. Unlike story points, it requires an actual working deployment — there is no way to get credit for deploying without deploying.

Issues closed is useful as a supplement to PRs merged, particularly for teams that use their issue tracker rigorously. The caveat: issues must be reasonably sized before you start measuring closure rate, or teams will hit the number by splitting issues into trivial sub-items.

Outcome Metrics: Did It Actually Work?

Outcome metrics measure the quality and business impact of what was shipped. These are the metrics that connect engineering work to customer experience.

Change failure rate (CFR) is the percentage of deployments that caused a service degradation requiring rollback, hotfix, or incident response. It is the DORA stability metric and the clearest signal that output quality is deteriorating. A team pushing CFR above 15% is shipping so unreliably that increased output is actively counterproductive — more deploys means more incidents.

Mean time to restore (MTTR) measures how quickly the team recovers when failures do occur. It reflects the quality of observability, runbooks, on-call processes, and rollback capability. Elite teams recover in under an hour. Low-performing teams average over a week.
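The two DORA stability metrics reduce to simple arithmetic once deployments and incidents are linked. A minimal sketch, assuming your incident tool can tell you which deploy IDs were tied to a failure and when each incident started and resolved:

```python
from datetime import datetime

def change_failure_rate(deploy_ids, failed_deploy_ids):
    """Fraction of deployments linked to a rollback, hotfix, or
    incident response. `failed_deploy_ids` is the set of deploy IDs
    your incident tool associated with a failure."""
    if not deploy_ids:
        return 0.0
    return sum(1 for d in deploy_ids if d in failed_deploy_ids) / len(deploy_ids)

def mttr_hours(incidents):
    """Mean hours from incident start to resolution. Each incident
    dict is assumed to carry `started_at` and `resolved_at` datetimes."""
    if not incidents:
        return 0.0
    total = sum(
        (i["resolved_at"] - i["started_at"]).total_seconds() for i in incidents
    )
    return total / len(incidents) / 3600
```

The hard part in practice is the linking, not the math: deploy-to-incident attribution usually comes from your incident tool's integration, not from GitHub alone.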

Customer-reported bugs per release is a lagging indicator of quality that connects directly to customer satisfaction. It is harder to instrument than CFR (requires linking support tickets to releases) but captures the class of failures that do not trigger internal incidents — the bugs users hit silently and then churn over.

Rework rate — the percentage of PRs that fix defects in code merged within the last 30 days — is a leading indicator of quality debt accumulation. High rework rate means the team is shipping bugs fast and then fixing them, a pattern that is expensive and demoralizing.
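A sketch of the rework-rate calculation, under one loud assumption: each fix PR has already been linked to the merge date of the code it fixes (via issue labels, commit trailers, or blame — that linking is your tooling's job, and the `fixes_merged_at` field here is hypothetical):

```python
from datetime import date, timedelta

def rework_rate(prs, window_days=30):
    """Share of merged PRs that fix a defect in recently merged code.

    Assumes each PR dict carries `merged_at` (a date, or None) and
    `fixes_merged_at`: the merge date of the code it fixes, or None
    if the PR is not a defect fix.
    """
    merged = [pr for pr in prs if pr["merged_at"]]
    if not merged:
        return 0.0
    window = timedelta(days=window_days)
    rework = [
        pr for pr in merged
        if pr["fixes_merged_at"]
        and pr["merged_at"] - pr["fixes_merged_at"] <= window
    ]
    return len(rework) / len(merged)
```

Fixes to code older than the window are excluded deliberately: patching a year-old bug is maintenance, not churn on last sprint's output.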

Process Metrics: Where Is the Work Getting Stuck?

Process metrics reveal bottlenecks and waste in the delivery pipeline. They answer questions like: Why does it take three weeks for a PR to go from open to production? Where are engineers spending time that is not productive?

PR cycle time is the elapsed time from PR opened to PR merged. It decomposes into four sub-intervals that are diagnostically distinct: time to first review, review-to-approval duration, time from approval to merge, and merge-to-deployment lag. Each interval points to a different type of bottleneck — reviewer availability, review quality standards, merge anxiety, or CI pipeline slowness.
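The decomposition above can be sketched as a single function over one PR's timeline — the timestamp field names are assumptions, assembled from PR events and deployment records rather than a single API response:

```python
from datetime import datetime, timedelta

def cycle_time_breakdown(pr):
    """Split a PR's open-to-production time into the four diagnostic
    sub-intervals. Assumes `pr` carries `opened_at`, `first_review_at`,
    `approved_at`, `merged_at`, and `deployed_at` datetimes."""
    return {
        "time_to_first_review": pr["first_review_at"] - pr["opened_at"],
        "review_to_approval": pr["approved_at"] - pr["first_review_at"],
        "approval_to_merge": pr["merged_at"] - pr["approved_at"],
        "merge_to_deploy": pr["deployed_at"] - pr["merged_at"],
    }
```

Aggregating each interval separately (median per week, say) is what makes the metric diagnostic: a high total cycle time tells you something is wrong, but only the breakdown tells you which bottleneck to fix.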

Review queue depth measures how many open PRs are waiting for review at any given moment. A deep review queue is a compounding tax on the whole team — PRs grow stale, merge conflicts accumulate, and engineers switch context to start new work while they wait. Review queue depth above five per reviewer is typically where cycle time starts degrading measurably.

WIP count (work-in-progress) measures how many items each developer has in-flight simultaneously. WIP limits are a core principle of lean software development — multitasking has a compounding cost on throughput and quality. Teams with average WIP above three per developer typically have worse cycle times, higher rework rates, and lower output quality than teams with WIP closer to one.
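Average WIP per developer is a straightforward count over tracker items. A minimal sketch, assuming each item carries an assignee and a status (the status values are illustrative):

```python
from collections import Counter

def avg_wip_per_developer(items, done_statuses=("done", "closed")):
    """Average in-flight items per developer. Each item dict is
    assumed to carry an `assignee` and a `status`; anything outside
    `done_statuses` counts as work in progress."""
    in_flight = Counter(
        item["assignee"]
        for item in items
        if item["status"] not in done_statuses and item["assignee"]
    )
    return sum(in_flight.values()) / len(in_flight) if in_flight else 0.0
```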

Flow efficiency measures the ratio of active work time to total elapsed time for a work item. A cycle time of five days with two days of active work is 40% flow efficiency — the item spent 60% of its time waiting. Improving flow efficiency does not require working faster; it requires eliminating handoffs, queue waits, and context-switch overhead.
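Given a work item's status-transition history, flow efficiency falls out of summing the time spent in active states. A sketch — which statuses count as "active" is an assumption you calibrate to your own workflow:

```python
from datetime import datetime

# Assumption: these are the states where someone is actively working the item.
ACTIVE_STATES = {"in_progress", "in_review"}

def flow_efficiency(transitions):
    """Ratio of time in active states to total elapsed time.

    `transitions` is a chronological list of (timestamp, state)
    pairs; the final entry marks completion and its state is ignored.
    """
    elapsed = (transitions[-1][0] - transitions[0][0]).total_seconds()
    if elapsed <= 0:
        return 0.0
    active = sum(
        (transitions[i + 1][0] - ts).total_seconds()
        for i, (ts, state) in enumerate(transitions[:-1])
        if state in ACTIVE_STATES
    )
    return active / elapsed
```

This reproduces the example from the text: two active days inside a five-day cycle is 0.4, meaning the item spent 60% of its life waiting.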

Goodhart's Law in Engineering

British economist Charles Goodhart articulated the principle in 1975: when a measure becomes a target, it ceases to be a good measure. In engineering, this plays out reliably every time a single metric is elevated to a performance target.

Set a deployment frequency target and teams start shipping one-line no-op deployments to hit the number. Set a PR count target and developers split every change into ten micro-PRs. Set a cycle time target and engineers start bypassing code review. Set a story point velocity target and velocity inflation starts immediately.

The antidote is not to avoid metrics — it is to use a balanced portfolio of metrics across all three categories such that gaming any one metric is visible in the others. A developer splitting PRs to inflate count shows up as an anomalous PR size distribution. A team bypassing review to hit cycle time shows up in CFR and rework rate. The metrics cross-check each other.
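The cross-checking described above can even be automated as a coarse first pass. A sketch over weekly rollups — the thresholds here are illustrative assumptions, not research-backed constants, and any flag should prompt a conversation rather than a conclusion:

```python
def gaming_signals(weekly):
    """Flag weeks where one metric moved in a way its counterweights
    contradict. Each entry carries `prs_merged`, `median_pr_size`,
    `cfr`, and `rework_rate`. Thresholds are illustrative; tune per team."""
    flags = []
    for prev, cur in zip(weekly, weekly[1:]):
        # PR count jumped while median size collapsed: likely PR splitting.
        if (cur["prs_merged"] >= 1.5 * prev["prs_merged"]
                and cur["median_pr_size"] <= 0.5 * prev["median_pr_size"]):
            flags.append("possible_pr_splitting")
        # Both quality metrics worsened: speed bought at quality's expense.
        if cur["cfr"] > prev["cfr"] and cur["rework_rate"] > prev["rework_rate"]:
            flags.append("quality_tradeoff")
    return flags
```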

The single-metric trap

If you report only one metric to leadership, you will optimize only that metric — and every optimization will come at the expense of something unmeasured. Balance is not a nice-to-have; it is the only structural defense against Goodhart's Law.

Building a Balanced Scorecard with DORA + SPACE

The DORA framework (Deployment Frequency, Lead Time, Change Failure Rate, MTTR) covers the delivery pipeline well but says nothing about developer experience, satisfaction, or the human cost of your current process. The SPACE framework — developed by researchers at Microsoft, GitHub, and the University of Victoria — extends the picture with five dimensions: Satisfaction and wellbeing, Performance, Activity, Communication and collaboration, and Efficiency and flow.

A practical balanced scorecard for most engineering organizations combines the two:

| Category   | Metric                         | Source                             |
| ---------- | ------------------------------ | ---------------------------------- |
| Output     | Deployment frequency           | GitHub Deployments API             |
| Output     | PRs merged / sprint            | GitHub Pull Requests API           |
| Outcome    | Change failure rate            | GitHub + PagerDuty / incident.io   |
| Outcome    | MTTR                           | PagerDuty / OpsGenie / incident.io |
| Outcome    | Rework rate                    | GitHub Pull Requests API           |
| Process    | PR cycle time                  | GitHub Pull Requests API           |
| Process    | Lead time for changes          | GitHub Deployments API             |
| Process    | Review queue depth             | GitHub Pull Requests API           |
| Experience | Developer satisfaction (eNPS)  | Survey (quarterly)                 |

How to Present Efficiency Metrics to Leadership

The biggest risk of surfacing engineering metrics to non-technical leadership is that a single bad number triggers a simplistic response — "cycle time is too high, tell engineers to work faster" — that makes the underlying problem worse.

Three practices reduce this risk:

Present trends, not snapshots. A single data point has no context. A twelve-week trend shows whether things are improving, holding steady, or deteriorating — and gives leadership something actionable to discuss. Never show a single number without the trendline.

Frame metrics as system properties, not individual performance. PR cycle time is a property of the team's review process and WIP levels, not a measure of how hard individual developers are working. When metrics are presented as system health indicators rather than performance scores, leaders are less likely to use them punitively — and engineers are less likely to game them.

Pair every metric with its recommended lever. Do not just report that change failure rate increased 4 percentage points last quarter. Report that it increased, explain that the leading signal is PRs with high deployment risk scores bypassing review, and propose the specific change — adding a required review step for high-risk PRs — that addresses the root cause. Metrics without recommended actions create anxiety; metrics paired with actions create alignment.

Koalr identifies your bottlenecks automatically

Connect GitHub, Jira, and your incident tool. Koalr's recommendations page analyzes your Output, Outcome, and Process metrics together — and surfaces the specific bottlenecks that are costing your team the most throughput and reliability.