Engineering Metrics · March 16, 2026 · 12 min read

Engineering KPIs: The 15 Metrics Every VP of Engineering Should Track

Most engineering KPI frameworks fail the same way: they measure activity instead of outcomes. Commits per day, story points completed, lines of code written — these inputs feel measurable but predict nothing about whether engineering is actually delivering value. This guide covers the 15 metrics that do matter, organized into five tiers by strategic priority, with benchmarks and a board-ready executive summary template.

What this guide covers

The 15 engineering KPIs organized across five tiers — delivery velocity (DORA), code quality, team health, AI tool adoption, and operational excellence — with industry benchmarks, anti-patterns to avoid, a five-metric board summary template, and a cadence recommendation for weekly vs. monthly vs. quarterly reviews.

Why Engineering KPIs Are Hard to Get Right

Engineering is one of the last functions in a company where executives are expected to make major resource and strategy decisions with almost no quantitative grounding. Sales has pipeline metrics. Finance has variance reports. Marketing has CAC and ROAS. But ask most VPs of Engineering how they know whether their team is performing well, and the honest answer is: gut feel plus anecdote.

The reason is not that engineering is unmeasurable. It is that the obvious metrics measure the wrong things. Lines of code, commits per day, story points completed — these are activity metrics. They measure input, not output. A team can burn through story points shipping features that never get used. A developer can commit fifty times in a week while refactoring code that breaks the build. A high commit count from a junior engineer working on a new service can coexist with a senior engineer silently writing 20,000 lines of the most valuable infrastructure the company has ever had.

Outcome metrics fix this. They measure what the engineering function actually produces for the business: reliable software that ships on a predictable cadence and recovers quickly when things go wrong. The DORA research program spent a decade demonstrating that four specific outcome metrics — deployment frequency, lead time for changes, change failure rate, and mean time to restore — predict business performance better than any activity proxy.

The best KPI framework uses both. Activity metrics provide operational visibility into what engineers are spending time on. Outcome metrics tell you whether that time is translating into results. Track outcomes primarily; use activity metrics to explain why outcomes are moving.

Tier 1 — Delivery Velocity (DORA)

The DORA metrics are the most important engineering KPIs for VPs and CTOs because they are the only engineering metrics with a demonstrated link to business outcomes. The research shows that elite DORA performers are 2.6x more likely to exceed profitability and revenue goals than low performers — a finding that holds across industry, company size, and technology stack.

For a deeper treatment of how to instrument these from GitHub and incident tool data, see the complete DORA metrics guide.

KPI 1 — Deployment Frequency

How often does your team successfully deploy to production? Deployment frequency is the primary measure of your release cadence. High frequency — achieved safely, with low change failure rate — signals a team that has mastered continuous delivery: short-lived branches, automated testing, feature flags, and a deployment pipeline that does not require heroics.

What to track: Successful production deployments per day or per week, per service and in aggregate. Trend direction matters more than absolute number — a team moving from monthly to weekly releases is making meaningful progress even if it has not reached the elite tier.
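Counting deploys per week is straightforward once you have a feed of successful deploy timestamps. A minimal sketch, assuming the input is a list of `datetime` objects (one per successful production deploy — the real feed would come from your CI/CD system):

```python
from collections import Counter
from datetime import datetime

def deployments_per_week(deploy_timestamps):
    """Count successful production deployments per ISO week.

    deploy_timestamps: iterable of datetime objects, one per successful
    deploy (hypothetical input shape for illustration).
    Returns a dict mapping (iso_year, iso_week) -> deploy count.
    """
    counts = Counter(dt.isocalendar()[:2] for dt in deploy_timestamps)
    return dict(counts)
```

Plotting these weekly counts as a bar chart gives you the trend line that matters more than the absolute number.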

KPI 2 — Lead Time for Changes

How long does it take a code change to travel from first commit to production? Lead time measures the throughput of your delivery pipeline. Short lead time means the team can respond quickly to customer feedback, security vulnerabilities, and business requirements. Long lead time is a compounding organizational risk: slow feedback loops reinforce bad technical decisions.

What to track: Median time from PR merge to production deployment — a practical proxy for the full first-commit-to-production measure, since most teams' instrumentation starts at the merge event. Use the median, not the mean — lead time distributions are right-skewed by large infrastructure changes that would otherwise inflate the baseline reading. Give microservices with different release models their own cohorts.

For a deep dive into why lead time improvements often stall and how to unblock them, see engineering velocity tracking.

KPI 3 — Change Failure Rate

What percentage of your deployments cause a production incident or require a rollback? Change failure rate is the primary stability metric in the DORA framework and the one most directly correlated with engineering quality culture. A high CFR means your delivery pipeline is not catching problems before they reach users.

What to track: Failed deployments divided by total deployments, expressed as a percentage. A deployment counts as failed if it triggered a P0/P1 incident, required a rollback, or was followed by a hotfix within 24 hours. Standardize this definition across your organization before you start measuring — inconsistent definitions make CFR trends meaningless.

KPI 4 — Mean Time to Restore (MTTR)

When something breaks in production, how fast does your team restore service? MTTR is the paired complement to change failure rate. CFR tells you how often you fail; MTTR tells you how badly you fail when you do. A team with a 2% CFR and a 6-hour MTTR has a better reliability profile than a team with a 1% CFR and a 48-hour MTTR.

What to track: Time from incident open to incident resolved (service restored to SLO), calculated from your incident management platform — PagerDuty, OpsGenie, or incident.io. Despite the "mean" in the name, report the median (P50) as your baseline — a single marathon incident will distort a true mean — and track P90 to understand tail behavior.
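Computing P50 and P90 from a list of restore durations is a few lines with the standard library. A sketch, assuming durations in hours exported from your incident platform:

```python
import statistics

def mttr_percentiles(durations_hours):
    """P50 and P90 of incident restore times, in hours.

    durations_hours: restore durations (open -> resolved) pulled from
    your incident platform. Uses the inclusive quantile method so small
    samples behave predictably.
    """
    cuts = statistics.quantiles(sorted(durations_hours), n=10,
                                method="inclusive")
    return {"p50": statistics.median(durations_hours),
            "p90": cuts[8]}  # 9th of 9 cut points = 90th percentile
```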

| Metric | Elite | High | Medium | Low |
| --- | --- | --- | --- | --- |
| Deployment Frequency | Multiple/day | Daily–weekly | Weekly–monthly | <Monthly |
| Lead Time for Changes | <1 hour | 1 hour–1 day | 1 day–1 week | 1 week–1 month |
| Change Failure Rate | 0–5% | 5–10% | 10–15% | >15% |
| MTTR | <1 hour | <1 day | <1 week | 1 week+ |

Tier 2 — Code Quality

Code quality metrics are the leading indicators for DORA stability metrics. A rising change failure rate rarely appears out of nowhere — it is almost always preceded by weeks of degrading code quality signals: PR sizes growing, test coverage falling, review participation dropping. Track these metrics to catch quality problems before they become incident problems.

KPI 5 — PR Cycle Time

PR cycle time measures the total time from when a pull request is opened to when it is merged, broken down into three sub-components: coding time (PR opened to first substantive review comment), review wait time (first comment to approval), and merge time (approval to merge). The breakdown is more useful than the total — different bottlenecks require different interventions.

Benchmark: For most teams shipping production software, a healthy P50 cycle time is 4–8 hours for small PRs and 1–2 days for medium PRs. P50 cycle time exceeding 3 days consistently signals review process friction or a WIP problem upstream.

KPI 6 — PR Size Distribution

Smaller pull requests merge faster, get more thorough reviews, and cause fewer incidents. Research consistently shows that PR size is one of the strongest predictors of change failure rate — large PRs are harder to review accurately, touch more system boundaries, and are more likely to have unintended interactions between changes.

What to track: Median lines changed per PR (additions plus deletions). The target varies by codebase, but most high-performing teams aim for a median below 200 lines changed. Track the full distribution, not just an average — the median is robust to the handful of very large PRs (infrastructure migrations, major refactors) that would otherwise distort the reading for the rest of the team.
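The median-versus-mean distinction is easy to see in miniature. A sketch comparing the two on a batch of PR sizes that includes one large migration:

```python
import statistics

def pr_size_summary(lines_changed):
    """Median vs. mean lines changed per PR. The median stays stable
    when a few very large PRs (migrations, refactors) would drag the
    mean far above the team's typical PR size."""
    return {"median": statistics.median(lines_changed),
            "mean": statistics.mean(lines_changed)}
```

On `[50, 80, 120, 150, 5000]` the median reports a healthy 120 lines while the mean is pulled above 1,000 by the single migration PR — exactly the distortion the median avoids.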

See how PR size correlates with deployment failures for the data behind this metric.

KPI 7 — Test Coverage Trend

Test coverage is most useful as a directional signal, not an absolute target. A team at 40% coverage and trending upward is in better shape than a team at 75% coverage and trending down. Coverage gates in CI enforce a floor; what matters to track at the executive level is the monthly trend.

What to track: Percentage of code lines covered by automated tests, measured per service, trended monthly. Flag any service where coverage drops more than 3 percentage points in a single sprint — that is a leading indicator of technical debt accumulation that will surface as incidents within 60–90 days.

KPI 8 — Rework Rate

Rework rate measures the percentage of commits that revert, hotfix, or patch code that was committed within the previous 30 days. It is a direct measure of how much engineering capacity is being consumed by fixing recent mistakes rather than delivering new value.

Benchmark: Elite teams keep rework rate below 5% of total commits. A rework rate above 15% is a serious signal — it means roughly one in every seven commits is cleaning up a recent error. At that rate, the team is spending significant capacity on self-created technical debt rather than customer value.

For detailed instrumentation guidance, see measuring rework rate from GitHub commit data.

KPI 9 — Change Failure Root Cause Distribution

Once you are tracking change failure rate, the next question is: where are failures coming from? Infrastructure failures (server, network, cloud service), code bugs, configuration errors, and dependency problems each require different interventions. A team with a 10% CFR driven entirely by a flaky third-party dependency needs a different response than a team with the same CFR driven by insufficient test coverage.

What to track: Incident root cause tags from your incident management platform, aggregated monthly. Require engineers to tag every P0/P1 incident with a root cause category (infrastructure, code, config, dependency, human error) during the postmortem. The distribution over time shows where investment in prevention will have the highest return.

Tier 3 — Team Health and Throughput

Team health metrics are the leading indicators for delivery velocity. If engineers are spending more time in meetings than coding, if PR review is concentrated in two reviewers, or if response time for reviews is growing, the DORA metrics will follow downward within the next quarter. These metrics also surface individual contributors who are overloaded, blocked, or disconnected from the team — issues that typically go invisible until someone quits.

KPI 10 — Active Contributor Count

Active contributors are engineers who made at least one substantive commit in the measurement period. This metric surfaces the difference between organizational headcount and actual coding capacity — a team with 20 engineers but only 12 active contributors is a team with 8 people absorbed in meetings, on-call rotations, or non-coding work.

What to track: Number of engineers who committed code at least once per week, trended over rolling 4-week windows. A sustained decline in active contributors while headcount stays flat is one of the clearest early signals of organizational dysfunction.

For more on measuring individual contributor patterns, see developer experience metrics.

KPI 11 — PR Throughput Per Engineer

PR throughput measures the number of pull requests merged per engineer per week. It is the most reliable measure of individual and team delivery capacity — more reliable than commit count (which can be gamed), story points (which vary by estimation culture), or lines of code (which rewards verbosity).

Benchmark: For full-stack product engineers, 2–5 merged PRs per week is a healthy range. Below 1 per week consistently signals blocked work, large PR batching, or low coding activity. Above 8 per week often signals very small PRs that may not be getting substantive review. Track by team and by engineer to identify outliers in both directions.

KPI 12 — Review Participation Rate

Review participation rate measures the percentage of merged PRs that received at least one substantive review comment — as opposed to a rubber-stamp approval with no discussion. High review participation is associated with lower change failure rate, faster knowledge transfer, and stronger team ownership culture.

What to track: Percentage of merged PRs with one or more review comments (not just approvals). Target above 80%. Below 50% means roughly half your code is shipping without substantive peer review — an unacceptable quality risk for any team deploying to production more than once a week.
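The metric itself is a simple ratio once you distinguish review comments from bare approvals. A sketch, assuming hypothetical PR records carrying a `review_comments` count:

```python
def review_participation_rate(prs):
    """Percent of merged PRs with at least one substantive review
    comment. Approvals with no discussion do not count.

    prs: hypothetical list of dicts with a 'review_comments' count —
    in practice this would come from the GitHub review API.
    """
    reviewed = sum(1 for pr in prs if pr["review_comments"] >= 1)
    return 100.0 * reviewed / len(prs)
```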

KPI 13 — Reviewer Response Time

Reviewer response time is the median elapsed time from when a PR is opened (or moved to "ready for review") to when it receives a first substantive review comment. It is the bottleneck hiding inside most teams' lead time — a PR can sit unreviewed for two days without registering on deployment frequency at all, even as it quietly inflates lead time for changes.

Benchmark: P50 reviewer response time under 4 hours is achievable for most teams without sacrificing review quality. P50 above 24 hours is a workflow problem — engineers are context-switching away from reviews, or review assignments are unclear. P90 above 48 hours indicates systemic review bottlenecks that will eventually show up in lead time and developer satisfaction.

Tier 4 — AI Tool Adoption

AI coding assistants have moved from experimental to mainstream in 2025–2026. Teams that are not measuring AI tool adoption and effectiveness are operating blind in a domain that is now a meaningful driver of engineering productivity. More importantly: AI adoption rate has become a competitive signal for both recruiting and investor conversations. Teams with high AI adoption attract engineers who want to work with the best tools; teams with measurable AI productivity gains have a story to tell in board meetings.

KPI 14 — AI Coding Assistant Adoption Rate

AI adoption rate is the percentage of engineers on your team actively using a coding assistant — Copilot, Cursor, Codeium, or equivalent — as measured by API activity, not self-reported survey data. Self-reported adoption is typically 20–30 percentage points higher than actual usage because engineers conflate "I have it installed" with "I actively use it."

Track three sub-metrics: overall adoption rate (percentage of engineers with activity), AI code acceptance rate (accepted suggestions divided by total suggestions shown, benchmarked at 25–35% for Copilot in typical codebases), and estimated hours saved from AI assistance based on acceptance rate and average suggestion size.
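The three sub-metrics combine into one summary calculation. A sketch — note that the minutes-saved-per-acceptance figure is an assumption to calibrate against your own data, not a published constant:

```python
def ai_assist_summary(engineers_total, engineers_active,
                      suggestions_shown, suggestions_accepted,
                      avg_minutes_saved_per_acceptance=1.5):
    """Adoption rate, acceptance rate, and a rough hours-saved estimate.

    engineers_active: engineers with measured API activity, not
    self-reported usage. avg_minutes_saved_per_acceptance is a
    placeholder assumption — calibrate it from your own timing data.
    """
    adoption = 100.0 * engineers_active / engineers_total
    acceptance = 100.0 * suggestions_accepted / suggestions_shown
    hours_saved = suggestions_accepted * avg_minutes_saved_per_acceptance / 60
    return {"adoption_pct": adoption,
            "acceptance_pct": acceptance,
            "est_hours_saved": hours_saved}
```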

KPI 15 — AI-Attributed Code as a Percentage of Total Output

As AI acceptance rates rise, a meaningful portion of committed code at high-adoption teams is AI-generated or AI-assisted. Tracking AI-attributed lines as a percentage of total committed lines — per engineer and in aggregate — gives you the most granular view of where AI is adding productivity and whether AI-generated code is correlated with higher or lower change failure rate (the research is still emerging, but the question is strategically important).

Board-level framing: Present AI adoption as a competitive advantage metric. A team at 85% adoption with a 30% acceptance rate is meaningfully amplifying per-engineer output. That uplift translates directly into faster feature velocity or higher capacity headroom at the same cost basis.

Tier 5 — Operational Excellence

Operational excellence metrics measure how well your engineering organization runs its production systems. They are distinct from the DORA metrics in that they focus on the ongoing cost of production operations — on-call burden, error budget health, incident distribution — rather than the delivery pipeline itself. For teams with SLOs and significant on-call load, these metrics belong on the VP dashboard.

Deployment Success Rate

The complement of change failure rate (1 − CFR). A cleaner framing for executive audiences — "97% of our deployments succeed" lands better than "our change failure rate is 3%." Track per service and in aggregate.

On-Call Burden

Alerts per engineer per on-call shift. A sustained rate above 5 actionable alerts per shift indicates alert fatigue. Track by service to identify which systems are generating disproportionate on-call load.

Incident MTTR by Service

Disaggregate MTTR by service rather than viewing only the organizational aggregate. A single service with a 24-hour MTTR can drag the organizational median while hiding the fact that all other services recover in under an hour.

Error Budget Remaining

Percentage of monthly SLO error budget remaining as of the current date. A team burning through error budget in week 2 of 4 needs to reduce deployment frequency or incident rate — the budget remaining metric makes that conversation concrete.

What Not to Track — Engineering KPI Anti-Patterns

Every VP of Engineering who has been in the role for more than a year has a list of metrics they tried, regretted, and stopped tracking. The following are the most common anti-patterns, each with a brief explanation of why they fail.

Story Points Completed

Story points measure planned capacity consumption, not output value. They are negotiated numbers that reflect how a team estimates work at a point in time — and every team that tracks velocity as a KPI eventually discovers that velocity goes up because estimation inflation compensates for actual slowdowns. Story points are a useful planning tool within a team; they are a poor performance metric across teams and a terrible executive dashboard metric.

Lines of Code Written

Lines of code is the most consistently debunked engineering metric in the industry, yet it keeps appearing in naive KPI frameworks. More code means more complexity, more surface area to maintain, and more potential for bugs. The most valuable engineering work often removes lines of code — a refactor that simplifies a complex system, deletes dead code paths, or replaces hand-rolled infrastructure with a managed service. Measuring LOC output rewards verbosity and punishes quality.

Commits Per Day

Commit frequency without context is meaningless. A developer who commits twenty tiny WIP checkpoints per day looks better on this metric than a developer who commits once per day with carefully scoped, production-ready changes. Commit frequency also varies significantly based on branching strategy, editor tooling, and personal workflow preference. It predicts nothing about delivery speed, quality, or impact.

Velocity Trends Without Quality Signals

Tracking deployment frequency or lead time in isolation, without the paired quality signal (change failure rate), is dangerous. A team can dramatically improve deployment frequency by removing quality gates, reducing test coverage requirements, and skipping review steps. The resulting velocity number looks excellent until the change failure rate catches up with it — typically 4–8 weeks later in the form of elevated incident rates and emergency rollbacks. Fast and broken is worse than slow and stable. Always present throughput metrics alongside their quality counterparts.

Goodhart's Law applies to engineering

Any metric that becomes a target ceases to be a good metric. This is especially true for engineering KPIs that are tied to performance reviews or team bonuses. Use these metrics to understand system behavior and drive improvement conversations — not to rank engineers or set individual targets. The moment a metric is used for individual evaluation, the team will optimize the metric rather than the outcome it was intended to proxy.

How to Present Engineering KPIs to the Board

Board members are not engineering domain experts, and they should not need to be to understand whether your engineering function is performing well. The goal of a board-level engineering update is to answer three questions: Are we shipping reliably? Are we shipping faster over time? Are we managing risk? Five metrics answer all three.

The 5-Metric Executive Summary

| Metric | How to Present | What It Answers |
| --- | --- | --- |
| Deployment Frequency | Weekly bar chart, rolling 13-week trend | Are we shipping faster over time? |
| Change Failure Rate | Monthly %, vs. DORA industry benchmark band | Are we shipping reliably? |
| Lead Time (P50) | Hours or days, trend arrow (up/down/flat vs. prior quarter) | How fast can we respond to the business? |
| MTTR (P50) | Hours, plus incidents-per-month count | How fast do we recover when things break? |
| AI Adoption Rate | % of engineers active, trend vs. 90 days ago | Are we capturing the AI productivity advantage? |

Present these five metrics as a single slide or dashboard section. Annotate significant changes with the operational reason — a spike in MTTR in week 8 because of a database incident, a drop in deployment frequency in week 11 because of a planned infrastructure migration. Context prevents misinterpretation by board members who do not have the operational backdrop.

For the AI adoption rate metric specifically, benchmark against publicly available industry data where possible. Teams that can say "our AI adoption rate is 78% against an industry median of 45%" are making a competitive talent and productivity argument at the same time.

KPI Review Cadence — Weekly, Monthly, and Quarterly

Not every KPI should be reviewed at the same cadence. The right cadence depends on how actionable the metric is at each time horizon and how quickly it changes in response to interventions.

Weekly Reviews

Weekly reviews should focus on operational metrics where interventions can happen quickly: deployment frequency (is the pipeline healthy?), PR throughput (is anyone blocked?), reviewer response time (are reviews piling up?), and active contributor count (is anyone dark?). These are the metrics that, if they are moving in the wrong direction, require action this week — not next quarter.

Keep weekly reviews to 15 minutes and data-only. Flag anomalies; skip context that can wait for monthly.

Monthly Reviews

Monthly reviews are where quality and health trends become visible. Change failure rate, rework rate, test coverage trend, and MTTR all move on a monthly cadence — week-to-week variance is too noisy to draw conclusions from. Monthly is also the right cadence for AI adoption metrics, since behavior change in tooling adoption takes 3–4 weeks to show up in usage data.

Monthly reviews should include a brief retrospective: did last month's interventions move the intended metric? Did anything move that was unexpected? What is the single highest-priority improvement for next month?

Quarterly Reviews

Quarterly reviews are for strategic evaluation. Compare the current quarter against the prior quarter and prior year (if data is available) on all five board-level metrics. Evaluate DORA tier positioning: did the team move from medium to high on any metric? What is the plan for the next quarter? Quarterly is also the right cadence for evaluating KPI framework changes — adding new metrics, retiring metrics that are no longer providing signal, and recalibrating benchmarks.

For a detailed treatment of how engineering velocity trends relate to business outcomes at each cadence, see engineering velocity tracking.

Putting It All Together

The 15 KPIs described in this guide give VPs of Engineering and CTOs a complete view of engineering performance across five dimensions: how fast you ship (DORA), how well you ship (code quality), how sustainably you ship (team health), how efficiently you ship given modern tooling (AI adoption), and how reliably you operate what you have shipped (operational excellence).

No single metric tells the full story. Change failure rate without deployment frequency does not distinguish between a team that ships once a week carefully and a team that ships daily carelessly. PR cycle time without PR throughput does not distinguish between a team with small, fast PRs and a team with large, stuck PRs that occasionally squeak through. These metrics work as a system — look at them together, watch the relationships between them, and the picture of what your engineering organization is actually doing becomes clear.

For how all of these metrics relate to developer experience and attrition risk, see the guide on developer experience metrics.

All 15 KPIs in one dashboard — connected to GitHub, Jira, and PagerDuty

Koalr pulls from your existing tools — GitHub, Jira, Linear, PagerDuty, OpsGenie, Cursor — and surfaces all five tiers of engineering KPIs automatically. No spreadsheets. No manual data collection. DORA metrics, PR cycle time, team health, AI adoption rate, and error budget in a single executive dashboard updated daily.