Agile Metrics · March 16, 2026 · 12 min read

Agile Metrics That Actually Matter in 2026 (And Ones to Ditch)

Velocity is the most tracked agile metric in software engineering and, by a considerable margin, the most abused. Story points were never designed as a performance measurement tool. Teams game what gets measured, and when the metric is velocity, the game is easy: inflate estimates, split stories, lower the definition of done. This guide cuts through the noise to identify which agile, sprint, and Scrum metrics give you real signal in 2026 — and which ones you should stop tracking immediately.

What this guide covers

The problem with traditional velocity-based agile metrics, flow metrics as a better alternative, how DORA maps to agile ceremonies, anti-pattern metrics to drop, Scrum and Kanban-specific signals, metrics for remote teams, and a recommended agile metrics stack for 2026.

The Problem with Traditional Agile Metrics

The velocity problem is well documented but stubbornly persistent. Velocity — story points completed per sprint — was introduced by the XP and Scrum communities as a capacity planning tool. The idea was simple: measure how much a team historically delivers so you can make better commitments for the next sprint. Used this way, velocity is a reasonable input to a planning conversation.

What happened instead: velocity became a performance metric. Engineering managers began comparing velocities across teams. Executives started tracking velocity trends in quarterly business reviews. Scrum Masters were asked why Team A was at 48 points and Team B was at 31. And the moment velocity became a performance signal, it stopped being an honest capacity signal.

The mechanism of gaming is well understood. Story points are a relative measure of effort, complexity, and uncertainty — not a unit of output. Teams can inflate their velocity by: using larger estimates for the same work, splitting stories to hit the sprint boundary, reducing story scope mid-sprint without adjusting estimates, or marking stories done before they meet a rigorous definition of done. None of these activities require dishonesty; they emerge naturally from rational actors responding to the incentives the metric creates.

The deeper problem is what velocity cannot measure: quality, sustainability, and customer impact. A team that ships 60 story points of technical debt in one sprint and spends the next three sprints dealing with its consequences has a higher velocity on paper than a team that ships 40 points of high-quality, well-tested, incident-free features every sprint. Traditional agile metrics reward the first team and penalize the second.

This is not an argument to abandon agile measurement. It is an argument to measure the right things — and to be precise about what each metric actually captures.

Traditional Agile Metrics: Keep, but Use Carefully

Not all traditional agile metrics are worthless. Several remain useful when applied with appropriate context constraints. The key discipline is using them as inputs to a conversation, not as conclusions in themselves.

Velocity (Story Points per Sprint)

Velocity is legitimately useful for capacity planning within a single team over time. If Team A has delivered between 38 and 44 points in each of the last eight sprints, you can plan their next sprint with reasonable confidence in that range. That is the correct use.

The uses to avoid: comparing velocity across teams, using velocity as a performance review input, setting velocity targets, and publishing velocity trends to stakeholders outside the team. Each of these creates the gaming incentives described above. Velocity should be a team-internal planning input, not an organizational performance signal.

Sprint Burndown

The sprint burndown chart — showing remaining work over time within a sprint — is useful for the team as a within-sprint signal. An abnormal burndown (flat for four days, then a sudden drop at day nine) is a useful prompt for a conversation about whether scope changed, work got blocked, or estimates were wrong.

What the burndown is not: a quality signal. A sprint that burns down perfectly can still ship bugs, incur technical debt, and require rework in subsequent sprints. A sprint that burns down messily can still ship excellent, well-tested features. The burndown measures the planning-to-execution fit, not the quality of execution itself.

Sprint Goal Achievement Rate

This metric is substantially better than raw velocity. Instead of measuring points completed, it measures the percentage of sprints in which the team achieved its primary sprint goal — the single most important outcome the team committed to for that sprint.

A healthy team should achieve its sprint goal in roughly 80% of sprints. Lower than 60% suggests systemic planning problems, scope creep, or interruption load that is undermining the team's ability to make and keep commitments. Higher than 95% may indicate goals are being set too conservatively.

Sprint goal achievement rate is harder to game than velocity because it requires agreeing on a meaningful sprint goal before the sprint starts — something teams with bloated backlogs and weak product ownership often cannot do.

Escaped Defects per Sprint

Escaped defects — bugs discovered in production that originated from work completed in a given sprint — are a trailing quality signal. Tracking them per sprint, and attributing them to the sprint where the defective code was shipped, creates a feedback loop between sprint execution quality and production outcomes.

Escaped defects should be tracked in aggregate and trended, not used to penalize individuals. A team that ships two escaped defects per sprint and reduces that to 0.5 over two quarters has made a meaningful quality improvement. A team that ships zero escaped defects for three sprints and then ships six in sprint four has a different story — likely a systemic risk that accumulated and then released.

Flow Metrics: The Better Alternative

Flow metrics, derived from the work of Don Reinertsen and popularized through Kanban and modern agile practice, measure how work flows through a system rather than how much work a team estimates it will complete. They are harder to game, more predictive of delivery outcomes, and provide more actionable signals for improvement.

The four core flow metrics are cycle time, throughput, work in progress (WIP), and flow efficiency. A fifth, aging work items, rounds out the practical toolkit.

Cycle Time

Cycle time measures the elapsed time from when work transitions to "in progress" to when it transitions to "done." It should be measured per issue type — bugs have a different cycle time distribution than features, and infrastructure work is different again.

The most useful cycle time metric is not the mean but the percentile distribution. Track P50 (median cycle time — half of items complete faster than this), P75, and P95. The gap between P50 and P95 tells you about the predictability of your system: a narrow gap means most work flows similarly, a wide gap means you have a significant tail of unpredictable items that need investigation.

In Jira, cycle time can be derived from issue history — the timestamp when the status changed to your "In Progress" equivalent and the timestamp when it changed to "Done." In Linear, the same timestamps exist on issue state changes.
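Once those timestamps are extracted (the exact field names and API calls vary by tracker), the percentile computation itself is small. A minimal sketch, using made-up transition dates:

```python
from datetime import datetime

# Hypothetical extract: (in_progress_at, done_at) pairs pulled from
# issue status-change history; values below are illustrative only.
transitions = [
    (datetime(2026, 1, 5), datetime(2026, 1, 8)),
    (datetime(2026, 1, 6), datetime(2026, 1, 7)),
    (datetime(2026, 1, 6), datetime(2026, 1, 20)),
    (datetime(2026, 1, 9), datetime(2026, 1, 12)),
]

cycle_days = sorted((done - start).days for start, done in transitions)

def percentile(sorted_values, p):
    """Nearest-rank percentile: smallest value >= p% of the sample."""
    k = max(0, -(-len(sorted_values) * p // 100) - 1)  # ceiling division
    return sorted_values[k]

for p in (50, 75, 95):
    print(f"P{p} cycle time: {percentile(cycle_days, p)} days")
```

With real data, run this per issue type; the P50-to-P95 gap per type is the predictability signal described above.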

Throughput

Throughput measures the number of items completed per week. Not points — items. Raw item count is more honest than point count because it does not depend on the accuracy or consistency of estimates.

A team that completes 12 items per week and a team that completes 8 items per week are meaningfully comparable if their item sizes are similar. The right comparison is throughput alongside average cycle time — a team with high throughput and high cycle time is batching large items; a team with low throughput and low cycle time is completing small items but not enough of them.

Throughput is also the correct input for probabilistic forecasting: given a throughput distribution of 8–14 items per week over the last 12 weeks, how many sprints will it take to complete the 40-item backlog? Monte Carlo simulation using throughput history gives dramatically better forecasts than velocity-based point estimates.
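The simulation itself is a few lines: resample weekly throughput from history until the backlog drains, repeat many times, and read off percentiles. A sketch with illustrative throughput numbers, not real team data:

```python
import random

# Weekly throughput history (items completed per week) - illustrative.
history = [8, 11, 9, 14, 10, 8, 12, 13, 9, 11, 10, 14]
backlog = 40

def weeks_to_finish(history, backlog, trials=10_000, seed=42):
    """Monte Carlo forecast: distribution of weeks to drain the backlog."""
    rng = random.Random(seed)
    outcomes = []
    for _ in range(trials):
        remaining, weeks = backlog, 0
        while remaining > 0:
            remaining -= rng.choice(history)  # resample a historical week
            weeks += 1
        outcomes.append(weeks)
    outcomes.sort()
    # Report percentiles, not a single point estimate.
    return outcomes[trials // 2], outcomes[int(trials * 0.85)]

p50, p85 = weeks_to_finish(history, backlog)
print(f"50% confident by week {p50}, 85% confident by week {p85}")
```

The 85th-percentile answer is the honest commitment; the 50th is what optimistic point estimates implicitly promise.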

Work in Progress (WIP)

WIP — the number of items actively being worked on at any given moment — is the most leveraged flow metric because reducing it directly improves cycle time. This is the core of Little's Law: in a stable system, cycle time equals WIP divided by throughput. If throughput is constant and you reduce WIP by half, cycle time halves.
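The arithmetic is worth making concrete (numbers are illustrative):

```python
# Little's Law for a stable system: average cycle time = WIP / throughput.
wip = 18             # items currently in progress
throughput = 1.5     # items completed per day
cycle_time = wip / throughput
print(cycle_time)    # 12.0 (days)

# Halving WIP at constant throughput halves average cycle time.
assert (wip / 2) / throughput == cycle_time / 2
```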

In practice, most teams carry far too much WIP. A team of six engineers with 18 items in progress has an average of three simultaneous items per person. Context switching at that level imposes a productivity tax of 20–40% — time spent mentally switching between problems, re-reading context, and re-entering flow state rather than making progress on any single item.

Setting explicit WIP limits is the most direct mechanism for cycle time improvement available to a team. A reasonable starting point: WIP limit equals team size plus one. Items cannot enter "In Progress" if the limit is already reached — the team must finish something before starting something new.

Flow Efficiency

Flow efficiency is the ratio of active work time to total elapsed time:

Flow Efficiency = Active Time / (Active Time + Wait Time) × 100%

Most teams, when they measure this for the first time, are shocked by the result. Flow efficiency of 15–40% is typical. This means that for an item with a 10-day cycle time, the work was actively progressed for 1.5 to 4 days. The remainder was waiting: waiting for review, waiting for QA, waiting for a decision, waiting in a queue.

Flow efficiency exposes that cycle time improvements are not primarily an execution problem — they are a queue and handoff problem. The biggest gains come from reducing the time items spend waiting, not from getting engineers to work faster.
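As a trivial worked instance of the formula above:

```python
def flow_efficiency(active_days, total_days):
    """Share of elapsed cycle time the item was actively worked on."""
    return round(100 * active_days / total_days)

# A 10-day cycle-time item that was actually touched for 2.5 days:
print(flow_efficiency(2.5, 10))  # 25 (percent) - 7.5 days were waiting
```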

Aging Work Items

Aging work items are issues that have been in an active status for longer than two times the average cycle time for their type. A bug that typically closes in 3 days and has been in progress for 8 days is an aging item — it is a signal that something unusual is happening: blocked waiting for information, underestimated complexity, or simply forgotten amid competing priorities.

Reviewing aging items in standup or weekly team meetings is more valuable than reviewing burndown charts. Aging items are where hidden risk accumulates. They are also the source of the long tail in your cycle time distribution — improving P95 cycle time almost always starts with addressing aging items.
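The aging check is straightforward to automate. A sketch, assuming you can list in-progress items with their current age and historical cycle times per type (all field names and values below are hypothetical):

```python
from statistics import mean

# Historical cycle times in days, per issue type - illustrative.
completed = {
    "bug":     [2, 3, 4, 3],
    "feature": [5, 8, 6, 9],
}
in_progress = [
    {"id": "BUG-101", "type": "bug", "age_days": 8},
    {"id": "FEA-207", "type": "feature", "age_days": 6},
]

def aging_items(in_progress, completed, factor=2.0):
    """Flag items older than `factor` times their type's average cycle time."""
    thresholds = {t: factor * mean(times) for t, times in completed.items()}
    return [i for i in in_progress if i["age_days"] > thresholds[i["type"]]]

for item in aging_items(in_progress, completed):
    print(item["id"], "is aging - raise it in standup")
```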

DORA Metrics for Agile Teams

DORA metrics — deployment frequency, lead time for changes, change failure rate, and MTTR — are typically discussed in a DevOps context, but they map directly onto agile ceremonies and planning artifacts in ways that are useful for Scrum Masters and engineering managers.

| DORA Metric | Agile Mapping | What to act on |
| --- | --- | --- |
| Deployment Frequency | Sprint cadence or continuous delivery rate | Batching too many items per deploy? Reduce batch size and decouple deploy from sprint end. |
| Lead Time for Changes | Time from backlog refinement to production — the full value cycle | Long lead time reveals whether the bottleneck is in backlog, sprint execution, review, or deployment pipeline. |
| Change Failure Rate | Escaped defects plus production incidents caused by sprint work | Rising CFR during a sprint theme points to quality problems in that domain. |
| MTTR | Incident response time — measures team's operational readiness | High MTTR often indicates poor runbook coverage or insufficient observability investment in sprint work. |

The most useful DORA-agile connection is lead time. Traditional agile counts lead time as "first commit to PR merge," which captures only the coding phase. True lead time for changes — from the moment an item enters refinement to the moment the code is running in production — reveals the full cost of the planning-to-delivery cycle. Most teams, when they measure this end-to-end, find that coding takes 20–30% of total lead time. The remainder is queue time before refinement, waiting for sprint planning, review latency, and deployment lag.

For a deeper treatment of how lead time works end-to-end and where the common traps are, see our guide on the lead time trap.

The full DORA framework and how to instrument it from GitHub and incident data is covered in our complete DORA metrics guide.

Anti-Pattern Metrics to Drop

Several metrics appear in agile dashboards regularly but provide noise rather than signal. Tracking them wastes analytical bandwidth and, more importantly, creates perverse incentives that degrade the behaviors you actually want.

Lines of Code per Sprint

Lines of code is the most thoroughly discredited metric in software engineering — and yet it persists in certain organizations, particularly those with engineering leaders who come from manufacturing backgrounds where output volume is a legitimate measure.

The problem: lines of code measures output, not outcome. Deleting 500 lines of duplicated code and replacing them with a 20-line abstraction is a significant quality improvement. On a LOC-per-sprint dashboard, it looks like negative productivity. Code that does nothing — dead code, commented-out logic, redundant test setup — adds to LOC without adding value. LOC has no relationship with feature quality, user impact, or system reliability.

Commits per Developer

Commit count per developer is a GitHub activity metric masquerading as a productivity metric. It is trivially gamed — engineers who commit more frequently (a generally good practice) look more productive than engineers who commit larger, well-composed changes. It penalizes collaborative work (pair programming produces fewer commits per person), misses the dimension of change quality entirely, and provides no information about whether the work advanced any meaningful goal.

Story Points per Developer

If velocity at the team level is dangerous, velocity per developer is catastrophic. Disaggregating story points to the individual level breaks the relative nature of the estimation entirely — points are team estimates of team complexity, not an individual output unit. Reporting story points per developer creates competition within teams, discourages collaboration (helping a colleague with a blocker does not add to your point count), incentivizes over-claiming on estimates, and destroys the psychological safety needed for honest sprint planning.

This is the agile metric most likely to cause active harm to team culture. Stop tracking it.

Percentage Code Complete

"We are 70% done" is one of the most dangerous statements in software development. Progress estimates are systematically optimistic, and percentage complete reporting creates false confidence about delivery timelines. The last 30% routinely takes longer than the first 70% because it is when integration complexity, edge cases, and the gap between working and shippable emerge.

Replace percentage complete with throughput-based probabilistic forecasting: given current throughput, what is the probability of completing the remaining backlog by the target date? This gives stakeholders a distribution of likely outcomes rather than a false point estimate.

Scrum-Specific Metrics

For teams operating within the Scrum framework, several metrics extend the standard agile toolkit with signals specific to the sprint-based cadence.

Sprint Predictability

Sprint predictability measures the percentage of committed story points actually delivered at the end of the sprint. Unlike velocity, predictability is not about how much work a team does — it is about whether a team can accurately anticipate how much work it will do.

A healthy predictability range is 80–110%. Below 80% consistently indicates planning problems: overcommitment, underestimated complexity, or external interrupt load not accounted for in sprint capacity. Above 110% consistently (the team delivers more than committed) can indicate undercommitment — pulling in less work than capacity allows to create buffer, which reduces the value delivered per sprint.

Sprint predictability trends more usefully than the raw number. A team improving from 60% to 85% over six sprints is learning to plan better. A team oscillating between 50% and 120% has a noise problem — external variability or estimation inconsistency preventing reliable planning.
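A small sketch of the trend calculation, with made-up sprint numbers:

```python
def predictability(committed_points, delivered_points):
    """Delivered as a percentage of committed, for one sprint."""
    return round(100 * delivered_points / committed_points)

# Six sprints of (committed, delivered) story points - illustrative.
sprints = [(40, 24), (42, 28), (38, 30), (40, 34), (44, 37), (40, 34)]
trend = [predictability(c, d) for c, d in sprints]
print(trend)  # [60, 67, 79, 85, 84, 85] - a team learning to plan better
```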

Defect Lead Time

Defect lead time measures the elapsed time from when a bug is filed to when it is resolved and deployed to production. Tracked by severity, it reveals whether the team's triage and response process is functioning correctly.

Critical defects should have a defined SLA — typically same-day or next-business-day resolution. Medium defects should close within one to two sprints. Low-priority defects should not be allowed to age indefinitely — a review of defects open for more than 90 days often reveals items that should be closed as "won't fix" rather than consuming ongoing triage overhead.

Review Cycle Count

Review cycle count measures how many rounds of code review a PR requires before merge. A PR that goes through four review cycles — changes requested, revised, more changes requested, revised again — took substantially more calendar time and cognitive bandwidth than a PR that merged on the first or second review.

High review cycle counts at the team level indicate one of several root causes: unclear acceptance criteria leading to rework, insufficient design discussion before coding starts, very large PRs that are difficult to review coherently in a single pass, or a code review culture that uses review as a quality gate rather than a collaborative improvement process.

Retrospective Action Item Completion Rate

This metric is rarely tracked and highly valuable: what percentage of action items committed to in sprint retrospectives are actually completed before the next retrospective?

A team with a 20% action item completion rate is not actually improving its process — it is documenting what is wrong and then doing nothing about it. The retrospective becomes a ritual rather than a mechanism for real change. Tracking completion rate and reviewing open action items at the top of each retrospective creates accountability and makes the improvement process observable.

Kanban-Specific Metrics

Teams operating with a Kanban or continuous-flow model use a set of metrics specifically suited to visualizing and managing a pull-based system without sprint-boundary constraints.

Cumulative Flow Diagram

The cumulative flow diagram (CFD) plots the count of items in each workflow state over time. When the bands are stable and parallel, work is flowing smoothly through each stage at a consistent rate. When a band widens, items are accumulating in that state — a visible bottleneck. When a band narrows, items are leaving that state faster than they are entering it.

The CFD is the most information-dense single chart available to a Kanban team. The gap between the top edge of the chart (items started) and the bottom edge (items completed) at any horizontal slice is the WIP level at that time. The horizontal distance between the same item's entry into started and its exit into completed is its cycle time. Both measurements emerge from a single visualization.

Service Classes by Lead Time SLA

Not all work should be treated equally in a Kanban system. Service classes allow a team to apply different lead time policies to different types of work:

  • Expedite: highest priority, bypasses WIP limits, reserved for true production incidents and customer-critical fixes. Typically less than 5% of total throughput.
  • Fixed date: work with an immovable external deadline — regulatory commitments, conference demos, contractual deliverables. Tracked against a countdown to deadline.
  • Standard: the majority of work. Subject to WIP limits and first-in-first-out ordering within the class.
  • Intangible: technical debt, refactoring, developer experience improvements. Low urgency, but needs a guaranteed minimum throughput to prevent indefinite deferral.

Managing service classes prevents the common failure mode where expedite items permanently displace standard work, destroying lead time predictability for the majority of the backlog.

Lead Time Distribution Histogram

The single most important thing to know about your lead time is not the average — it is the shape of the distribution. Most lead time distributions are right-skewed: a cluster of items that complete quickly and a long tail of items that take much longer.

Report P50, P75, and P95 cycle times. The P50 is what most items experience. The P95 is what your unluckiest customers and commitments experience. The gap between P75 and P95 quantifies the tail risk in your system. Making commitments based on P50 will cause you to miss roughly 50% of them. Making commitments based on P85 will cause you to miss about 15%. Choose your percentile based on the consequences of missing.


Metrics for Remote and Async Agile Teams

The shift to remote-first and async engineering culture that accelerated after 2020 has not reversed. By 2026, the majority of engineering teams at growth companies operate across multiple time zones with significant portions of daily collaboration happening asynchronously. Traditional agile metrics were designed for co-located, synchronous teams. Several additional signals have become essential for async team health.

PR First-Response Time

The elapsed time from PR opening to first substantive review comment is one of the strongest indicators of async collaboration health. In a co-located team, a developer can walk over and ask for a review. In an async team, the PR sits in a queue until someone picks it up.

Long PR first-response times — measured in days rather than hours — create two compounding problems. First, the PR author loses context and momentum while waiting. Second, the review cycle count tends to increase because the reviewer is reviewing older, less-remembered work without the benefit of a quick conversation. Track PR first-response time at the team level and investigate systematically when P75 exceeds 24 hours. For more on the full PR cycle time picture, see our guide on developer experience metrics.
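A sketch of the P75 check, assuming you have already paired each PR's opened timestamp with its first substantive review timestamp (in practice derived from the GitHub API, using the earliest review or comment from a non-author; the values below are illustrative):

```python
from datetime import datetime, timedelta

# Hypothetical (pr_opened_at, first_substantive_review_at) pairs.
prs = [
    (datetime(2026, 3, 2, 9), datetime(2026, 3, 2, 13)),
    (datetime(2026, 3, 2, 15), datetime(2026, 3, 4, 10)),
    (datetime(2026, 3, 3, 11), datetime(2026, 3, 3, 12)),
    (datetime(2026, 3, 4, 9), datetime(2026, 3, 5, 16)),
]

waits = sorted(first - opened for opened, first in prs)
p75 = waits[int(len(waits) * 0.75) - 1]  # nearest-rank P75 on a small sample
if p75 > timedelta(hours=24):
    print(f"P75 first-response time is {p75}; investigate the review queue")
```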

Off-Hours Commit Rate

The percentage of commits made outside of each engineer's expected working hours is a leading indicator of overwork and unsustainable pace. Chronic off-hours committing typically precedes burnout, attrition, and quality degradation.

This metric should be used carefully and with explicit team awareness. It should never be presented as a performance signal. Its correct use is as a team-health signal reviewed in aggregate — if 60% of the team is consistently committing between 9pm and midnight, the retrospective conversation is about workload and expectations, not about individual behavior.

Meeting-Free Coding Time per Engineer

Focus time — contiguous blocks of two or more hours without meetings — is the primary input to complex coding work. Engineering managers can track focus time availability from calendar data and correlate it with throughput and cycle time.

Teams with less than 3 hours of daily meeting-free time per engineer consistently show higher cycle times and lower throughput than teams with 5 or more hours. The lever is meeting audit and consolidation — combining standup, planning, and review meetings into fewer, longer blocks concentrated in defined windows, leaving remaining time as protected focus time.
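Finding qualifying focus blocks from calendar data is a simple gap computation. A sketch with illustrative meeting times (real data would come from a calendar API):

```python
def focus_blocks(meetings, day_start=9, day_end=17, min_hours=2):
    """Return meeting-free gaps of >= min_hours within the working day.

    meetings: list of (start_hour, end_hour) tuples, possibly unsorted.
    """
    blocks, cursor = [], day_start
    for start, end in sorted(meetings):
        if start - cursor >= min_hours:   # gap before this meeting qualifies
            blocks.append((cursor, start))
        cursor = max(cursor, end)         # handles overlapping meetings
    if day_end - cursor >= min_hours:     # trailing gap after last meeting
        blocks.append((cursor, day_end))
    return blocks

meetings = [(10, 10.5), (13, 14), (14.5, 15)]
print(focus_blocks(meetings))  # [(10.5, 13), (15, 17)] - 4.5 focus hours
```

Consolidating the three short meetings into one block would turn the afternoon into a single uninterrupted stretch, which is exactly the audit-and-consolidate lever described above.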

Async Feedback Loop Time

Async feedback loop time measures the average time between a question or decision request in an async channel (Slack thread, GitHub comment, Linear discussion) and a substantive response. In distributed teams, slow async feedback is the primary driver of the waiting time that degrades flow efficiency.

Teams can improve async feedback loop time through explicit response time agreements — not promises to respond immediately, but commitments to respond within a defined window — and through escalation paths for time-sensitive unblocking needs.

The Right Agile Metrics Stack for 2026

Most engineering teams track too many metrics rather than too few. A large dashboard of agile indicators creates analysis paralysis and diffuses attention. The goal is a small set of metrics that together give you signal on throughput, quality, delivery health, and team sustainability — and that you act on regularly rather than reviewing passively.

Here is the recommended stack:

| Metric | What it tells you | Review cadence |
| --- | --- | --- |
| Throughput (items/week) | Capacity planning; probabilistic forecasting input | Weekly rolling 6-week average |
| Cycle time P50/P75 by issue type | Flow health; delivery predictability | Sprint retrospective |
| DORA 4 metrics | Delivery quality; pipeline health | Monthly trend review |
| PR cycle time | Code review health; async collaboration signal | Weekly |
| Sprint goal achievement rate | Planning quality; commitment reliability | Per sprint |
| Escaped defects per sprint | Trailing quality signal; regression risk | Per sprint |
| Well-being score | Team sustainability; burnout early warning | Quarterly pulse survey |

Notice what is absent from this stack: velocity, lines of code, commits per developer, story points per individual, and percentage complete. These are the metrics that create the most organizational harm relative to their signal value.

Notice also that the stack combines flow metrics (throughput, cycle time), delivery metrics (DORA), collaboration metrics (PR cycle time), and a human sustainability signal (well-being). No single dimension is sufficient. A team with excellent throughput and cycle time but deteriorating well-being scores is heading toward an attrition event. A team with strong well-being and DORA metrics but widening cycle time has an emerging flow problem.

The metrics you track shape the team's understanding of what "doing well" means. Track outcomes — throughput, cycle time, deployment health, quality — and let the practices that achieve those outcomes emerge from the team rather than from the measurement system. Stop tracking activity proxies — commits, points, lines — that measure effort instead of impact.

From agile metrics to delivery intelligence

The next layer beyond agile metrics is predictive: knowing before a sprint ends which items are at risk of slipping, which PRs are at risk of causing incidents, and which teams are accumulating technical debt that will slow future sprints. That is where flow metrics, DORA data, and code change signals intersect — and where the most actionable engineering intelligence lives.

Get real agile metrics from GitHub and Jira

Koalr connects to GitHub and Jira or Linear to automatically calculate cycle time, throughput, PR cycle time, DORA metrics, and flow efficiency — without spreadsheets or manual data exports. See your actual agile metrics in under 5 minutes.