Engineering Metrics · March 16, 2026 · 12 min read

Engineering Velocity: How to Measure It Without Gaming It

Engineering velocity is the most-discussed and most-misunderstood metric in software development. Every team wants more of it. Most teams measure it wrong. And the ones that measure it wrong often end up with something worse than no measurement at all: a metric that looks good while the team's actual throughput and output quality decline. Here is how to measure velocity in a way that reflects reality and resists gaming.

What this guide covers

Why story points fail as a velocity measure, the three better alternatives (throughput, deployment frequency, cycle time), the velocity trap and how teams fall into it, what healthy velocity signals look like, five common anti-patterns, and the leading indicators that predict velocity problems before they show up in your sprint reports.

Why Story Points Fail as a Velocity Measure

Story points were designed as a relative estimation tool — a way for a team to communicate the relative complexity of work items without committing to specific time estimates. They were never designed to be a velocity metric. The Agile community has been largely consistent on this point for over a decade. But in practice, story points became velocity numbers became targets, and once they became targets, they stopped being useful.

The failure mode is Goodhart's Law in action: when a measure becomes a target, it ceases to be a good measure. Story point velocity is gameable in ways that are entirely rational from the perspective of an individual or team being measured by that number. Teams facing pressure to increase velocity can increase it immediately — without shipping any more value — by re-estimating stories upward, splitting large stories into multiple smaller ones, or simply calibrating their estimates to the expected throughput rather than actual complexity.

Even without deliberate gaming, story points are inconsistent across teams, time periods, and estimation sessions. A team that has been working together for two years estimates differently than they did when they started. A team that just hired three new engineers estimates differently after the additions. Story point velocity is not comparable across teams, is not stable over time within a single team, and does not correlate reliably with actual business value delivered.

The story points problem

Ron Jeffries, one of the original authors of Extreme Programming and an inventor of story points, has repeatedly stated that they were not designed to be used the way most organizations use them. "Story points measure story points" is the common refrain: they do not measure value, complexity, or time in any consistent, cross-team-comparable way.

Better Velocity Measures

Throughput: PRs Merged per Week

Throughput — the count of pull requests merged per week, per engineer or per team — is a more honest velocity proxy than story points for a simple reason: it measures something that actually happened. A merged PR represents code that was written, reviewed, tested, and integrated. It is an observable, verifiable output.

Throughput is harder to inflate than story points. You cannot re-estimate a PR upward after the fact. You cannot split a merged PR into multiple smaller ones to boost your count. The number of PRs merged is the number of PRs merged.

Throughput is not perfect — a team that ships many small PRs will show higher throughput than a team that ships fewer large PRs, even if the latter is delivering more business value. This is why throughput is most useful when paired with PR size data: stable throughput with stable median PR size is a healthy signal. Growing throughput with declining PR size may indicate PR splitting behavior.
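As a sketch of what that pairing looks like in practice (the data shape here is hypothetical; real PR records would come from your Git provider's API):

```python
from collections import defaultdict
from datetime import date
from statistics import median

def weekly_throughput(merged_prs):
    """Group merged PRs by ISO week, reporting count and median size.

    merged_prs is a list of (merged_date, lines_changed) tuples, a
    hypothetical shape; adapt it to whatever your Git provider returns.
    """
    by_week = defaultdict(list)
    for merged_on, lines_changed in merged_prs:
        year, week, _ = merged_on.isocalendar()
        by_week[(year, week)].append(lines_changed)
    return {
        week: {"merged": len(sizes), "median_size": median(sizes)}
        for week, sizes in sorted(by_week.items())
    }

# Two weeks of sample data: the count rises while the median size
# falls, the pattern worth investigating for PR-splitting behavior.
sample = [
    (date(2026, 3, 2), 120), (date(2026, 3, 3), 80),
    (date(2026, 3, 9), 40), (date(2026, 3, 10), 35), (date(2026, 3, 11), 30),
]
for week, stats in weekly_throughput(sample).items():
    print(week, stats)
```

Watching the two numbers side by side is the point: either one alone is easy to misread.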

Deployment Frequency

Deployment frequency — the rate at which code reaches production — is the DORA metric most closely correlated with engineering throughput. It has an important advantage over PR throughput: it measures output at the business level, not the engineering process level. A PR that merges but does not deploy is not delivering value. A deployment is.

Deployment frequency is also harder to game than PR throughput because it requires actual production releases. It can be inflated by deploying trivial changes, but even that inflation enforces a discipline (small, frequent deployments) that DORA research correlates with better outcomes.

Issue Cycle Time

Issue cycle time — the time from a ticket entering "In Progress" to the corresponding code reaching production — is the most complete velocity signal of the three. It captures the end-to-end time from work starting to value being delivered, including all handoffs, review delays, and deployment gates.

Issue cycle time is equivalent to DORA's "lead time for changes" metric. Teams with short cycle times deliver value faster. Teams with long cycle times are accumulating work-in-progress and the associated coordination costs. Tracking cycle time at the P50 and P75 percentiles over rolling 30-day windows gives you a stable, hard-to-game velocity signal.
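A minimal sketch of that rolling-window calculation, assuming issues arrive as (started, deployed) date pairs joined from your tracker and deploy log (a hypothetical shape):

```python
from datetime import date, timedelta
from statistics import quantiles

def cycle_time_percentiles(issues, as_of, window_days=30):
    """P50/P75 issue cycle time, in days, over a rolling window.

    issues is a list of (started, deployed) date pairs; only issues
    deployed inside the window are counted.
    """
    cutoff = as_of - timedelta(days=window_days)
    durations = sorted(
        (deployed - started).days
        for started, deployed in issues
        if deployed >= cutoff
    )
    if not durations:
        return None
    # "inclusive" treats the data as the full population, which suits
    # a complete record of the window's issues.
    q1, q2, q3 = quantiles(durations, n=4, method="inclusive")
    return {"p50": q2, "p75": q3}

issues = [(date(2026, 3, 1), date(2026, 3, 1) + timedelta(days=d))
          for d in (1, 2, 3, 4, 5)]
print(cycle_time_percentiles(issues, as_of=date(2026, 3, 15)))
```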

Metric                  Gameable?         Cross-team comparable?   Stable over time?
Story points            Yes — easily      No                       No
PR throughput           Difficult         With caveats             Yes
Deployment frequency    Very difficult    Yes                      Yes
Issue cycle time        Difficult         Yes                      Yes

The Velocity Trap

The velocity trap is the failure mode where a team successfully optimizes for velocity metrics while simultaneously degrading the outcomes those metrics were meant to proxy. It is distinct from gaming — it does not require bad intent. It happens naturally when teams are measured on throughput without also measuring quality, rework rate, and technical debt accumulation.

A team in the velocity trap ships more PRs per week, deploys more frequently, and hits its cycle time targets — while increasing change failure rate, increasing post-release bug reports, and building features that do not move the business metrics they were designed to move. The velocity metrics look good. The business outcomes do not.

The antidote is pairing velocity metrics with outcome metrics. Throughput should be paired with change failure rate. Deployment frequency should be paired with MTTR. Issue cycle time should be paired with rework rate (the percentage of issues that return to engineering post-release). A team that is shipping fast and breaking things is not a high-velocity team — it is a team that has optimized for the leading indicator while the lagging indicators have not caught up yet.
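One way to encode that pairing as a routine check, as a sketch; the metric names and trend labels here are illustrative, not a fixed schema:

```python
# Each velocity metric paired with the outcome metric that keeps it honest.
PAIRS = {
    "pr_throughput": "change_failure_rate",
    "deployment_frequency": "mttr",
    "issue_cycle_time": "rework_rate",
}

def velocity_trap_flags(trend):
    """Flag the velocity-trap signature: a velocity metric improving
    while its paired outcome metric degrades.

    trend maps metric name to "improving", "stable", or "degrading";
    deriving those labels from raw numbers is left to your pipeline.
    """
    return [
        (velocity, outcome)
        for velocity, outcome in PAIRS.items()
        if trend.get(velocity) == "improving"
        and trend.get(outcome) == "degrading"
    ]

trend = {
    "pr_throughput": "improving", "change_failure_rate": "degrading",
    "deployment_frequency": "stable", "mttr": "stable",
    "issue_cycle_time": "improving", "rework_rate": "stable",
}
print(velocity_trap_flags(trend))  # the throughput/CFR pair trips the flag
```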

Healthy Velocity Signals

A team with healthy velocity shows a specific combination of signals:

Consistent throughput. Week-over-week PR merge counts and deployment counts are stable within a normal range, without large spikes or crashes. Spikes often indicate pre-review rush behavior (shipping everything before a deadline). Crashes indicate blocked work, large-PR accumulation, or team disruption.

Low rework rate. A small percentage of completed issues return to engineering within 30 days for defect fixes. Rework is the clearest signal that velocity is being achieved at the expense of quality: the same work shows up in your throughput twice while delivering its value at most once.

Stable P75 cycle time. The 75th percentile of issue cycle time should be stable or declining over time. A P75 that is slowly growing — even if P50 is stable — indicates that the slower quartile of work is getting slower, which often means certain categories of work (large features, cross-team dependencies, high-risk changes) are accumulating delays that the median does not capture.

PR size stability. Median PR size (lines changed) should be stable. Growing PR size is a leading indicator of future cycle time increase, because larger PRs take longer to review and are more likely to require revision cycles.
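Of these four signals, rework rate is the one teams are least likely to already track. A sketch, assuming tracker exports of completion and defect-reopen dates keyed by issue id (a hypothetical shape):

```python
from datetime import date, timedelta

def rework_rate(completed, defect_reopens, window_days=30):
    """Fraction of completed issues that returned to engineering for a
    defect fix within the window.

    completed maps issue id to completion date; defect_reopens maps
    issue id to the date it came back. Both shapes are hypothetical.
    """
    if not completed:
        return 0.0
    window = timedelta(days=window_days)
    reworked = sum(
        1
        for issue, done_on in completed.items()
        if issue in defect_reopens
        and timedelta(0) <= defect_reopens[issue] - done_on <= window
    )
    return reworked / len(completed)

completed = {"ENG-1": date(2026, 1, 1), "ENG-2": date(2026, 1, 5),
             "ENG-3": date(2026, 1, 10)}
reopens = {"ENG-2": date(2026, 1, 20),   # 15 days later: counts
           "ENG-3": date(2026, 3, 1)}    # 50 days later: does not
print(rework_rate(completed, reopens))
```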

Anti-Patterns to Watch For

Velocity Spikes Before Reviews

If your team shows consistent throughput spikes in the week before a sprint review, quarterly business review, or executive demo, you have a sprint-end rush pattern. This is a velocity anti-pattern: PRs are being held and batch-released rather than being merged continuously. The spike looks like high velocity; the pattern indicates a batched release process that will not scale.

Velocity Drops After Team Changes

A temporary velocity dip after adding new team members is expected — onboarding tax is real. But a dip that persists for more than 6–8 weeks without recovering suggests that the team is not effectively integrating new members. Watch for cycle time increasing as new engineers' PRs spend longer in review than the team average, indicating that knowledge transfer is the bottleneck.

Large PR Sizes

PRs above 500 lines of changed code are a structural velocity risk. They take longer to review, generate more review cycles, are harder to test, and have higher change failure rates than small PRs. A team with a growing median PR size is accumulating future cycle time debt even if current throughput looks stable.

Leading Indicators of Velocity Problems

The most valuable use of engineering metrics is not measuring current velocity — it is detecting future velocity problems before they show up in sprint output. Two leading indicators are particularly reliable:

Growing PR age. The average age of open PRs (time since opened, for PRs that have not yet merged) is a leading indicator of cycle time. When PR age is growing, it means work is accumulating in the pipeline without completing. Cycle time will increase 2–4 weeks after PR age begins to grow — a window in which the engineering organization can intervene before the problem becomes visible in output.
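Computing the indicator itself is simple; a sketch, assuming you can list open PRs with their opened-at timestamps (a hypothetical shape):

```python
from datetime import datetime, timezone

def mean_open_pr_age_days(opened_at_times, now=None):
    """Average age, in days, of currently open PRs.

    opened_at_times is a list of timezone-aware datetimes for PRs that
    have not yet merged; the week-over-week trend of this number is
    what matters, not any single reading.
    """
    if not opened_at_times:
        return 0.0
    now = now or datetime.now(timezone.utc)
    total_seconds = sum((now - opened).total_seconds()
                        for opened in opened_at_times)
    return total_seconds / len(opened_at_times) / 86400

now = datetime(2026, 3, 16, tzinfo=timezone.utc)
open_prs = [datetime(2026, 3, 14, tzinfo=timezone.utc),   # 2 days old
            datetime(2026, 3, 12, tzinfo=timezone.utc)]   # 4 days old
print(mean_open_pr_age_days(open_prs, now=now))
```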

Increasing review queue depth. The count of PRs that have been open for more than 24 hours without a review event is a leading indicator of both cycle time and team frustration. When review queue depth grows, engineers wait longer for feedback, lose context on their work, and begin to treat PR review as a blocking step rather than a collaborative process. Teams that track review queue depth in real time can address it before it compounds.
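The queue-depth count follows the same pattern, under a similarly hypothetical data shape in which first_review_at stays None until the first review event lands:

```python
from datetime import datetime, timedelta, timezone

def review_queue_depth(open_prs, now=None, sla_hours=24):
    """Count open PRs past the review SLA with no review event yet.

    open_prs is a list of dicts with "opened_at" and "first_review_at"
    keys; adapt the shape to whatever your provider returns.
    """
    now = now or datetime.now(timezone.utc)
    sla = timedelta(hours=sla_hours)
    return sum(
        1
        for pr in open_prs
        if pr["first_review_at"] is None and now - pr["opened_at"] > sla
    )

now = datetime(2026, 3, 16, 12, tzinfo=timezone.utc)
open_prs = [
    {"opened_at": datetime(2026, 3, 15, 0, tzinfo=timezone.utc),
     "first_review_at": None},                          # 36h, unreviewed
    {"opened_at": datetime(2026, 3, 15, 0, tzinfo=timezone.utc),
     "first_review_at": datetime(2026, 3, 15, 8, tzinfo=timezone.utc)},
    {"opened_at": datetime(2026, 3, 16, 8, tzinfo=timezone.utc),
     "first_review_at": None},                          # 4h, inside SLA
]
print(review_queue_depth(open_prs, now=now))
```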

The 2–4 week lag

PR age and review queue depth are early signals with a 2–4 week lag before they appear in throughput and cycle time metrics. Teams that monitor these leading indicators can address velocity problems during the lag window, before they show up in the metrics executives are watching.

The Right Mental Model for Engineering Velocity

Engineering velocity is not a number to maximize. It is a system to understand. A healthy velocity system has consistent throughput, low rework, stable cycle time percentiles, and leading indicators that are stable or improving. A team in the velocity trap has high throughput and degrading outcome metrics. A team with a developing velocity problem has stable throughput but growing PR age and review queue depth.

The goal of measuring engineering velocity is not to create a leaderboard. It is to give engineering leadership the signal needed to remove blockers, allocate attention, and catch problems while they are still small enough to fix without disruption. The teams that do this well do not talk about velocity as a number — they talk about the health of their delivery pipeline, and velocity is the emergent result.

Measure engineering velocity through throughput and DORA, not story points

Koalr tracks PR throughput, deployment frequency, and issue cycle time automatically — with P50, P75, and P95 breakdowns so you can see where velocity is healthy and where it is degrading. No story points, no manual data entry, no spreadsheets.