Engineering Metrics · March 15, 2026 · 11 min read

Engineering Velocity Tracking: The Metrics That Actually Matter (Beyond Story Points)

Story points have become the default proxy for engineering velocity — but they measure estimation accuracy, not delivery throughput. This guide covers the five engineering velocity tracking metrics that actually predict how fast your team ships, with exact GitHub API calls and benchmark targets for each.

What this guide covers

Why story points corrupt velocity measurement, the five metrics that actually track delivery speed, a velocity-vs-quality diagnostic table, how to calculate cycle time from GitHub API data, the deployment frequency trap that inflates numbers by 10–20×, how WIP limits halve cycle time without writing faster code, and how Koalr automates all of it.

Why Story Points Are a Broken Velocity Metric

The appeal of story points is understandable: they seem to quantify work in a planning-friendly unit that abstracts away time estimates. In practice, they measure something far narrower than delivery velocity — they measure estimation accuracy within a single team. A story point is not a unit of output. It is a unit of perceived effort that exists only within the context of one team's relative estimation history.

The damage comes from what happens when teams are evaluated on story point velocity. Rational engineers do what any rational person does when measured on a metric: they optimize for it. Teams inflate estimates on new work to protect their velocity numbers. They split tickets to make the sprint look fuller. They defer work that is hard to point but genuinely important — refactoring, documentation, test coverage — because it does not generate enough points per hour of effort. The story point velocity number goes up. Actual throughput stays flat or declines.

McKinsey's engineering effectiveness research found that teams tracking output-based metrics — pull requests merged, features shipped to production — had 30% higher actual throughput than teams tracking story points alone. The difference is not that output-tracking teams work harder. It is that output-based metrics create incentives that are aligned with delivery, not with estimation. When your velocity metric is "PRs merged per developer per week," you cannot inflate it by splitting tickets. The work either shipped or it did not.

This does not mean story points are useless for sprint planning. They are a reasonable capacity planning tool inside a team. They are a poor engineering velocity tracking metric across teams, over time, or as a performance signal to leadership.

The 5 Velocity Metrics That Actually Measure Delivery Speed

The following five metrics are derived from code activity and deployment data — not from estimation records. Each one is calculable from your GitHub or GitLab API with no human input required.

1. Cycle Time

Cycle time is the elapsed time from the first commit on a branch to the pull request being merged. It splits into two meaningful sub-components: coding time (first commit on the branch to PR opened) and review time (PR opened to PR merged). The split matters because they diagnose different problems. Long coding time indicates scope creep or blocked work. Long review time indicates a review process bottleneck — too few reviewers, slow feedback cycles, or PRs that are too large to review efficiently.

PR review speed is the single biggest lever most teams have on cycle time. Teams that reduce median review time from 24 hours to 4 hours see overall cycle time drop by 30–50% with no change to how they write code. Elite performers (DORA Elite tier) have a median cycle time under 2 days and a median review time under 4 hours.

2. Throughput

Throughput is the number of pull requests merged per developer per week, normalized by team size. Raw PR count is not useful as a cross-team comparison — a team of 10 naturally merges more PRs than a team of 3. Normalize by dividing total PRs merged in a period by the number of active contributors (developers with at least one commit in that period). Throughput is your highest-signal measure of sustained delivery rate.

Track throughput on a rolling 4-week basis to smooth out sprint-boundary artifacts (most teams have low throughput on Monday and high throughput on Friday as sprint cycles close). Healthy throughput for a senior engineer working on a mature codebase is 3–6 PRs per week. Throughput below 1 PR per developer per week is a consistent signal of excessive WIP or blocked work.
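As a sketch, normalized throughput can be computed from a list of merged-PR records. The record shape here is a hypothetical pre-fetched form, not a raw GitHub API response, and it simplifies "active contributor" from "at least one commit" to "at least one merged PR":

```python
def weekly_throughput(merged_prs, weeks=4):
    """Merged PRs per active contributor per week over a rolling window.

    merged_prs: records with an 'author' key, already filtered to the
    window. Simplification: 'active' here means authored >= 1 merged PR,
    where the text above uses >= 1 commit.
    """
    if not merged_prs:
        return 0.0
    active = {pr["author"] for pr in merged_prs}
    return len(merged_prs) / len(active) / weeks

# 3 active developers merging 36 PRs over a 4-week window
sample = ([{"author": "alice"}] * 16
          + [{"author": "bob"}] * 12
          + [{"author": "carol"}] * 8)
print(weekly_throughput(sample))  # 36 / 3 / 4 = 3.0 PRs per dev per week
```

The example lands inside the healthy 3–6 range; a result below 1.0 is the excessive-WIP signal described above.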

3. Deployment Frequency

Deployment frequency measures how often working software reaches production — not how often it reaches staging, not how often CI runs. DORA's research consistently shows that deployment frequency is the fastest-moving of the four key metrics for teams improving their delivery process, because it responds quickly to process changes (moving from manual to automated deployments, enabling feature flags for safer releases).

Elite teams deploy to production multiple times per day. High performers deploy daily to weekly. If your team deploys less than once per week, deployment frequency is your primary velocity constraint — not cycle time, not throughput.

4. PR Age at Merge

PR age at merge is the elapsed time from PR opened to PR merged, tracked at P50 and P75 (median and 75th percentile). The percentile split is important: a healthy P50 with a bloated P75 indicates a tail of complex or contentious PRs that are getting stuck while all other work moves normally. A P50 above 5 days is a consistent bottleneck signal — it means the typical PR sits in review for a full work week.

Track P75 separately because it catches the worst-case review experience that averages hide. A P75 above 10 days means 25% of your PRs are sitting open for two work weeks — a significant context-switching cost for both authors and reviewers.
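A minimal sketch of the P50/P75 computation and the threshold checks described above. The quantile method and the alert strings are illustrative choices, not part of any standard:

```python
import statistics

def pr_age_percentiles(ages_days):
    """P50 and P75 of PR age at merge, in days."""
    p50 = statistics.median(ages_days)
    # quantiles(n=4) returns [Q1, Q2, Q3]; Q3 is the 75th percentile
    p75 = statistics.quantiles(ages_days, n=4, method="inclusive")[2]
    return p50, p75

def review_health(ages_days, p50_limit=5.0, p75_limit=10.0):
    """Flags from the two threshold rules in the text above."""
    p50, p75 = pr_age_percentiles(ages_days)
    flags = []
    if p50 > p50_limit:
        flags.append("typical PR is stuck")  # bottleneck signal
    if p75 > p75_limit:
        flags.append("tail of stuck PRs")    # worst-case signal averages hide
    return flags

print(pr_age_percentiles([1, 2, 3, 4, 5, 6, 7, 8, 9]))   # (5, 7.0)
print(review_health([2, 3, 4, 5, 6, 7, 11, 12, 14]))
```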

5. Work in Progress (WIP)

WIP is the number of open pull requests per developer at any given time. High WIP is a velocity killer because context switching between multiple open PRs degrades the quality of focus on each one, review requests pile up across multiple threads simultaneously, and the probability that a PR becomes stale (merge conflicts, outdated approach) increases with age.

Track this as a point-in-time snapshot and as a daily average per developer over a sprint. Developers with more than 5 simultaneous open PRs have measurably longer cycle times — the research on this is covered in the WIP limits section below.

The Velocity vs. Quality Tradeoff

Velocity metrics in isolation can be gamed or misinterpreted. A team can increase PR throughput by shipping smaller, less-tested changes — boosting velocity while degrading quality. The antidote is to track velocity alongside your change failure rate (CFR), so you always see both the speed signal and the quality signal together.

The four-quadrant diagnostic below maps team velocity patterns to recommended actions:

| Team pattern | Velocity signal | Quality signal | Risk level | Action |
| --- | --- | --- | --- | --- |
| High throughput + low CFR | Healthy | Healthy | Low | Scale the team |
| High throughput + high CFR | Good velocity | Bad quality | Medium | Add review rigor |
| Low throughput + low CFR | Slow | Careful | Medium | Reduce WIP |
| Low throughput + high CFR | Slow | Bad | High | Stop and diagnose |

The high-throughput/high-CFR quadrant is the most dangerous because it looks healthy on velocity dashboards while quality is silently degrading. Teams in this pattern are typically shipping PRs that are too large, skipping review steps under time pressure, or deploying without sufficient test coverage. Add required reviewers, PR size limits, and automated test gates before celebrating the throughput number.
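The quadrant table can be encoded as a small diagnostic helper. This is a sketch with illustrative thresholds: the 1 PR/dev/week floor comes from the throughput section above, while the 15% CFR ceiling is an assumed cutoff, not a DORA-prescribed one:

```python
def velocity_quadrant(throughput_per_dev_week, cfr,
                      throughput_floor=1.0, cfr_ceiling=0.15):
    """Map (throughput, change failure rate) to the four-quadrant action."""
    fast = throughput_per_dev_week >= throughput_floor
    clean = cfr <= cfr_ceiling
    if fast and clean:
        return ("low", "scale the team")
    if fast and not clean:
        return ("medium", "add review rigor")  # the most dangerous quadrant
    if clean:
        return ("medium", "reduce WIP")
    return ("high", "stop and diagnose")

print(velocity_quadrant(4.0, 0.05))  # ('low', 'scale the team')
print(velocity_quadrant(4.0, 0.30))  # ('medium', 'add review rigor')
```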

How to Calculate Cycle Time from GitHub Data

Cycle time from GitHub requires two API calls per PR: one for the PR metadata (created, merged timestamps) and one for the commits to find the first commit timestamp. The breakdown gives you both coding time and review time as separate metrics.

The full cycle time chain is: first commit on branch → PR opened → PR approved → PR merged. The coding time component is the delta from first commit to PR opened. The review time component is the delta from PR opened to PR merged. For most teams, reducing review time is the faster win — coding time is constrained by the complexity of the work, while review time is constrained by process.

Start by fetching recently merged PRs from your target repository:

# Step 1: Fetch merged PRs for the lookback window
GET /repos/{owner}/{repo}/pulls?state=closed&per_page=100&sort=updated&direction=desc

# Key fields from the response:
# pull.number          → PR ID
# pull.created_at      → PR opened timestamp
# pull.merged_at       → PR merged timestamp (null if closed without merge)
# pull.head.sha        → head commit SHA

# Filter: only include records where merged_at is not null
# Review time = merged_at − created_at

To get the first commit timestamp (coding time start), use the PR commits endpoint:

# Step 2: Fetch commits for each PR to find the first commit
GET /repos/{owner}/{repo}/pulls/{pull_number}/commits

# Response is ordered oldest-first by default
# commits[0].commit.committer.date → first commit timestamp on the branch

# Coding time = pull.created_at − commits[0].commit.committer.date
# Full cycle time = pull.merged_at − commits[0].commit.committer.date

# Aggregate across all PRs in the window:
# Median cycle time  → P50 of (merged_at − first_commit_date)
# Median review time → P50 of (merged_at − created_at)

# Elite team benchmarks:
# Median cycle time  < 2 days
# Median review time < 4 hours

Use the median (P50) rather than the mean for both numbers. Cycle time distributions are right-skewed — one large refactor PR with a slow review will inflate the mean significantly while the median stays representative of typical developer experience. Track P75 separately to detect the tail of stuck PRs that averages conceal.
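Putting the two API calls together, here is a sketch of the aggregation step in Python. The input shape is a hypothetical joined record (created_at and merged_at from the PR call, first_commit_at from the commits call) — not a live API client:

```python
from datetime import datetime, timedelta
from statistics import median

def parse_ts(ts):
    # GitHub timestamps are ISO 8601 with a trailing Z, e.g. "2026-03-01T09:00:00Z"
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))

def cycle_time_medians(prs):
    """P50 full cycle time and P50 review time across merged PRs.

    prs: records with created_at / merged_at / first_commit_at ISO strings,
    joined from the two calls above. Closed-but-unmerged PRs are skipped.
    """
    cycle, review = [], []
    for pr in prs:
        if pr["merged_at"] is None:
            continue
        merged = parse_ts(pr["merged_at"])
        cycle.append(merged - parse_ts(pr["first_commit_at"]))   # full cycle time
        review.append(merged - parse_ts(pr["created_at"]))       # review time only
    return median(cycle), median(review)

sample = [
    {"first_commit_at": "2026-03-01T00:00:00Z",
     "created_at": "2026-03-01T12:00:00Z", "merged_at": "2026-03-02T00:00:00Z"},
    {"first_commit_at": "2026-03-01T00:00:00Z",
     "created_at": "2026-03-02T00:00:00Z", "merged_at": "2026-03-03T00:00:00Z"},
    {"first_commit_at": "2026-03-01T00:00:00Z",
     "created_at": "2026-03-03T00:00:00Z", "merged_at": "2026-03-04T00:00:00Z"},
]
p50_cycle, p50_review = cycle_time_medians(sample)
print(p50_cycle, p50_review)  # timedelta(days=2), timedelta(days=1)
```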

The Deployment Frequency Trap

The single most common mistake in engineering velocity tracking is measuring deployment frequency against the wrong environment. Many teams deploy to staging or development environments continuously — sometimes dozens of times per day — while deploying to production once per week or less. If your deployment frequency metric counts all environment deployments, your number can be inflated by 10–20× relative to what DORA actually measures.

DORA measures production deployments only. A staging deployment does not move the DORA needle. This matters because the risks and quality signals that deployment frequency predicts — system stability, rollback rates, MTTR — all operate in production. A high staging deployment frequency with a low production deployment frequency is not a velocity signal; it is a release process bottleneck signal.

When using the GitHub Deployments API, filter explicitly to the production environment tag:

# Correct: filter to production environment only
GET /repos/{owner}/{repo}/deployments?environment=production&per_page=100

# Incorrect: unfiltered — includes all environments
GET /repos/{owner}/{repo}/deployments?per_page=100

# Also filter by deployment status (only count successful deploys)
GET /repos/{owner}/{repo}/deployments/{deployment_id}/statuses
# → look for statuses[].state === 'success'

# Trap detection heuristic:
# production_frequency = count(deployments where environment = 'production')
# total_frequency      = count(all deployments)
# ratio = production_frequency / total_frequency

# If ratio < 0.2, you are almost certainly measuring the wrong environment.
# A ratio below 0.05 means less than 1 in 20 deployments goes to production —
# a near-certain sign that dev/staging deploys are inflating your numbers.
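The trap-detection heuristic above can be sketched directly. The record shape mirrors the environment field in the Deployments API response, and the thresholds are the ones from the comments above:

```python
def production_ratio(deployments):
    """Share of deployments that target production.

    deployments: records with an 'environment' field, as in the response
    from GET /repos/{owner}/{repo}/deployments (unfiltered).
    """
    if not deployments:
        return None
    prod = sum(1 for d in deployments if d["environment"] == "production")
    return prod / len(deployments)

sample = [{"environment": "staging"}] * 18 + [{"environment": "production"}] * 2
ratio = production_ratio(sample)
print(ratio)  # 0.1 -> below the 0.2 threshold: likely measuring the wrong environment
```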

For teams using Vercel, Railway, or AWS CodeDeploy as their CD platform rather than GitHub Deployments, pull deployment events from that platform directly and filter to your production target. Vercel exposes deployments via the Vercel API with a target field set to production for production deploys. Railway exposes deployments per service and environment through the Railway GraphQL API. Always filter to the production surface before computing frequency.

The environment filter is not optional

Teams that discover this mistake typically find their real deployment frequency is 5–10× lower than the number they have been reporting. The corrected number is more useful — it exposes a real bottleneck — but prepare for the conversation before surfacing it to leadership.

WIP Limits: The Velocity Multiplier No One Implements

Little's Law is a mathematical relationship from queuing theory that applies directly to software delivery: cycle time = WIP / throughput. For a constant throughput rate, doubling WIP doubles cycle time, and halving WIP halves it. This relationship holds empirically in engineering teams, which means that reducing the number of concurrent open PRs per developer is one of the most reliable interventions available — it requires zero new tooling, zero additional headcount, and no change to how anyone writes code.

A team with 8 concurrent open PRs per developer and a throughput of 4 PRs merged per week per developer has a cycle time of 2 weeks (8 / 4). The same team, limiting WIP to 4 concurrent open PRs, has a cycle time of 1 week at the same throughput rate. No code was written faster. No engineers were hired. The velocity doubled from a process change.
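The arithmetic in the example above, as a one-line sketch:

```python
def cycle_time_weeks(wip_per_dev, throughput_per_dev_week):
    """Little's Law applied to PRs: cycle time = WIP / throughput."""
    return wip_per_dev / throughput_per_dev_week

print(cycle_time_weeks(8, 4))  # 2.0 weeks at 8 open PRs, 4 merges/week
print(cycle_time_weeks(4, 4))  # 1.0 week -- halving WIP halves cycle time
```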

The Accelerate research (the book underlying DORA) found that developers with more than 5 simultaneous open PRs have 2.3× longer cycle times than developers with 3 or fewer open PRs. The mechanism is context switching: every time a developer opens a sixth PR, they fragment their attention across more review threads, more sets of comments to address, and more merge conflict risk as branches diverge.

Measure WIP per developer using the GitHub REST API:

# Fetch all open PRs for a repository
GET /repos/{owner}/{repo}/pulls?state=open&per_page=100

# Group by PR author login:
# wip_per_developer = {
#   'alice': 2,
#   'bob': 6,   ← high WIP signal
#   'carol': 3,
# }

# Team-level WIP = average open PRs per active developer
# team_wip = sum(open_prs) / count(unique authors with open PRs)

# WIP alert thresholds:
# > 5 open PRs per developer → high WIP, 2.3× longer cycle time expected
# > 3 open PRs per developer → moderate WIP, begin monitoring
# ≤ 3 open PRs per developer → healthy WIP range
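The grouping sketched in the comments above, in Python. The user.login path matches the pulls API response shape, and the alert threshold is the one from the comments:

```python
from collections import Counter

def wip_by_developer(open_prs, alert_threshold=5):
    """Open PRs per author, team-level WIP, and high-WIP alerts.

    open_prs: records with user.login, as in the response from
    GET /repos/{owner}/{repo}/pulls?state=open.
    """
    counts = Counter(pr["user"]["login"] for pr in open_prs)
    # Team-level WIP = average open PRs per developer with open PRs
    team_wip = sum(counts.values()) / len(counts) if counts else 0.0
    alerts = {dev: n for dev, n in counts.items() if n > alert_threshold}
    return counts, team_wip, alerts

sample = ([{"user": {"login": "alice"}}] * 2
          + [{"user": {"login": "bob"}}] * 6
          + [{"user": {"login": "carol"}}] * 3)
counts, team_wip, alerts = wip_by_developer(sample)
print(alerts)  # {'bob': 6} -> expect ~2.3x longer cycle times
```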

Track this as a daily snapshot rather than a point-in-time reading — sprint cycles cause WIP to spike at sprint start (many PRs opened) and dip at sprint end (many PRs merged). A 7-day rolling average of WIP per developer smooths the sprint-boundary artifact and gives you a clean trend line.
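A sketch of that 7-day rolling average over daily WIP snapshots:

```python
def rolling_average(daily_wip, window=7):
    """window-day rolling average of daily WIP-per-developer snapshots."""
    return [sum(daily_wip[i - window + 1 : i + 1]) / window
            for i in range(window - 1, len(daily_wip))]

snapshots = [6, 6, 5, 5, 4, 4, 4, 3, 3, 3]  # WIP tapering as a sprint closes
print(rolling_average(snapshots))  # smooth downward trend, no sprint spikes
```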

Implementing WIP limits in practice means having a team agreement: finish before you start. Before opening a new PR, address review comments on your existing open PRs first. This is a cultural shift as much as a process one, and it works better when the WIP metric is visible to the team in a shared dashboard rather than buried in an engineering metrics tool that only managers see.

How Koalr Tracks Engineering Velocity Automatically

Building and maintaining the GitHub API polling, the cycle time correlation logic, and the WIP calculation infrastructure described above is a significant engineering investment that does not ship product features. Koalr handles all of it automatically from a single GitHub connection.

Koalr's engineering velocity tracking covers:

  • Cycle time per developer and per team — coding time and review time broken out separately, tracked at P50 and P75, with trend lines over the trailing 90 days so you can see whether a team is improving or regressing.
  • PR age distribution — histogram of PR ages at merge, with P50/P75/P90 breakpoints and flagging of PRs that have been open longer than your configured threshold (default: 5 days).
  • WIP heatmap — daily open PR count per developer, visualized as a heatmap over the sprint, with alerts when any developer exceeds 5 concurrent open PRs.
  • Deployment frequency per service — filtered to production environment only, with the staging/production ratio displayed so you can detect the environment filter trap before it corrupts reporting.
  • Velocity trend over 90 days — throughput, cycle time, and deployment frequency tracked week-over-week so sprint-to-sprint variation does not obscure genuine trends.
  • DORA tier classification — automatic Elite / High / Medium / Low classification based on your deployment frequency, lead time, CFR, and MTTR, updated daily.

Koalr's AI chat panel lets engineering managers ask natural-language questions against live data. Queries like "which team has the slowest cycle time this sprint?" or "show me developers with more than 5 open PRs right now" return answers in under 3 seconds — without opening a BI tool, writing a SQL query, or waiting for a weekly report.

See how Koalr fits into an engineering manager's workflow on the engineering managers use case page, or start tracking your team's velocity metrics today.

Track cycle time, WIP, and deployment frequency automatically

Koalr connects to GitHub in one OAuth step and starts tracking all five velocity metrics instantly — cycle time, throughput, PR age, WIP, and deployment frequency — with no spreadsheets, no API wiring, and no manual data collection.