How AI Coding Assistants Are Changing Engineering Metrics (And What to Measure)
GitHub Copilot crossed 2 million paid users in 2024. Cursor has become the default IDE for a growing share of AI-native engineering teams. These tools are materially changing how fast engineers write code — and they are breaking several of the assumptions baked into traditional engineering metrics. Here is what changes, what to measure instead, and how to calculate whether your AI tooling investment is actually paying off.
What this guide covers
How AI coding assistants distort traditional metrics, what new metrics to add for Copilot and Cursor, an illustrative correlation between AI adoption and shipping speed, and a framework for calculating AI ROI in engineering terms.
The Metrics That AI Coding Assistants Break
Before discussing what to measure, it is worth understanding which established metrics AI tooling distorts — and why.
Lines of code per engineer
Lines of code was always a weak proxy for productivity, and most teams had already stopped using it by 2020. AI coding assistants make it definitively useless. An engineer using Cursor with a modern LLM can generate 10x the raw code volume of an engineer writing manually — but that raw volume tells you nothing about the quality of decisions being made, the complexity being tackled, or the architectural soundness of the output.
If your team is still tracking lines of code, remove it from your dashboard now. What replaces it: PR velocity (PRs merged per week) combined with change failure rate to distinguish productive throughput from volume without value.
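The replacement pair can be computed from ordinary PR records. A minimal sketch, using invented merged-PR data — the `(merge_date, caused_failure)` shape is an assumption, not any particular API's output:

```python
from datetime import date

# Hypothetical merged-PR records: (merge_date, caused_failure)
merged_prs = [
    (date(2024, 6, 3), False),
    (date(2024, 6, 4), True),
    (date(2024, 6, 5), False),
    (date(2024, 6, 6), False),
]
weeks_observed = 1

# PR velocity: merged PRs per week
pr_velocity = len(merged_prs) / weeks_observed

# Change failure rate: share of merged PRs that caused a failure
change_failure_rate = sum(1 for _, failed in merged_prs if failed) / len(merged_prs)

print(pr_velocity, change_failure_rate)  # 4.0 0.25
```

Reading the two numbers together is the point: high velocity with a rising failure rate is volume without value.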
Story points per sprint
Story points measure estimated effort. AI coding assistants compress the time required to implement a given story — which means that if your estimation process has not been recalibrated for AI, your story points are now systematically too high. Teams using Copilot or Cursor heavily find that stories estimated at five points get completed in the time previously allocated to two-point stories.
This creates an apparent productivity increase that is actually estimation drift. Teams whose velocity appears to spike after Copilot adoption should check whether their points have been recalibrated, or whether they are simply burning through stories faster than planned and pulling in unplanned work.
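One way to detect this drift is to compare actual hours spent per story point before and after adoption. A sketch with invented numbers — the story data and the pre/post split are illustrative assumptions:

```python
# Hypothetical completed stories: (estimated_points, actual_hours)
before_ai = [(5, 20), (3, 12), (2, 8)]   # pre-adoption period
after_ai = [(5, 9), (3, 6), (2, 5)]      # post-adoption period

def hours_per_point(stories):
    total_points = sum(points for points, _ in stories)
    total_hours = sum(hours for _, hours in stories)
    return total_hours / total_points

drift_ratio = hours_per_point(after_ai) / hours_per_point(before_ai)

# A ratio well below 1.0 means a "point" now costs far less time than
# it used to: recalibrate before comparing velocity across the boundary.
print(drift_ratio)  # 0.5
```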
Time to first commit
Time to first commit on a new feature was historically a reasonable proxy for onboarding speed and initial design clarity. With AI assistants, engineers can generate a scaffold for a new feature in minutes — but that first commit may represent less genuine design thinking than it did when every line was typed manually. Use it with caution.
What GitHub Copilot Actually Changes
Copilot's impact is concentrated in three observable places: suggestion acceptance rate, code churn, and PR velocity.
Acceptance rate: the core Copilot metric
GitHub exposes Copilot usage telemetry through the Copilot Metrics API. The most important signal is acceptance rate — the percentage of inline suggestions that engineers accept rather than dismiss or overwrite. This is meaningful because it measures whether the tool is generating code that fits the engineer's intent, not just whether the tool is being used.
Acceptance rate varies significantly by codebase type. Repetitive, pattern-heavy code (React components, CRUD controllers, test boilerplate) sees acceptance rates of 40–60%. Novel algorithmic work or deeply context-dependent business logic typically sees 15–25%. A team-level average of 30–35% is healthy for a mixed product engineering codebase.
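When rolling per-engineer counts up to a team rate, weight by suggestion volume rather than averaging per-engineer percentages, so heavy users count proportionally. A sketch with invented telemetry counts:

```python
# Hypothetical per-engineer counts: (suggestions_shown, suggestions_accepted)
daily_counts = [
    (120, 54),   # pattern-heavy component work
    (200, 48),   # mixed product code
    (80, 14),    # novel algorithmic work
]

shown = sum(s for s, _ in daily_counts)      # 400
accepted = sum(a for _, a in daily_counts)   # 116

# Volume-weighted team rate, not a mean of per-engineer rates
team_acceptance_rate = accepted / shown
print(f"{team_acceptance_rate:.0%}")  # 29%
```

A result like this sits just below the healthy 30–35% band for a mixed codebase, which would prompt a look at where suggestions are being dismissed.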
Code churn and the quality question
There is a reasonable concern that AI-generated code introduces more churn — code that gets written and then quickly rewritten or deleted. Early adopter teams reported mixed results here. The pattern that emerges consistently: churn is higher in teams that accept suggestions without review, and lower in teams with strong review culture that treat AI suggestions as drafts rather than finished code.
Track code churn (lines modified or deleted within 14 days of being written) as a companion metric to acceptance rate. If acceptance rate is high and churn is also high, your team is accepting suggestions that are not fit for purpose. If acceptance rate is moderate and churn is low, the tool is being used well.
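The 14-day churn window described above can be computed from line-level history. A minimal sketch, assuming you can derive per-line write and rewrite dates (e.g. from git blame data — the record shape here is invented):

```python
from datetime import date, timedelta

CHURN_WINDOW = timedelta(days=14)

# Hypothetical line history: (written_on, modified_or_deleted_on);
# the second field is None if the line survived untouched.
lines = [
    (date(2024, 6, 1), date(2024, 6, 5)),    # churned (4 days)
    (date(2024, 6, 1), None),                # survived
    (date(2024, 6, 2), date(2024, 6, 20)),   # changed, but outside window
    (date(2024, 6, 3), date(2024, 6, 10)),   # churned (7 days)
]

churned = sum(
    1 for written, changed in lines
    if changed is not None and changed - written <= CHURN_WINDOW
)
churn_rate = churned / len(lines)
print(churn_rate)  # 0.5
```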
PR velocity correlation
The most concrete business case for Copilot is PR velocity. Teams with sustained Copilot acceptance rates above 40% ship approximately 23% more PRs per engineer per week compared to their pre-Copilot baseline, controlling for headcount changes. (This figure is illustrative, based on internal tracking patterns seen across engineering teams that have shared data; individual results will vary by codebase type, team size, and review process.)
The effect is most pronounced in implementation-heavy sprints and least pronounced in design-heavy or architecture-heavy periods — which makes intuitive sense. Copilot accelerates the mechanical parts of software development; it does not accelerate thinking.
What Cursor Changes
Cursor operates differently from Copilot. Where Copilot is primarily a suggestion engine integrated into your existing IDE, Cursor is an AI-native editor that allows engineers to have multi-turn conversations about their codebase, generate entire files from natural language descriptions, and run agentic tasks across multiple files simultaneously.
This creates a different set of measurable signals.
Request volume and model usage
Cursor's API exposes request volume per user per day and the model being used per request (fast vs. premium model tiers). Request volume is a usage intensity signal — engineers who are using Cursor effectively tend to generate 50–150 requests per active coding day. Below 20 requests per day often indicates the tool is being used only for simple autocomplete, not for the more powerful multi-file and agentic features.
Model selection matters for cost management. Premium models (e.g. Claude Sonnet, GPT-4o) cost more per request than fast models (e.g. Claude Haiku, GPT-4o-mini). Teams should track the split between model tiers to understand their actual cost per active developer day and whether engineers are using premium models for tasks that would be adequately served by fast models.
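The tier split and cost per developer day fall out of the raw request counts. A sketch — the per-request prices and request mix below are invented placeholders, not Cursor's actual pricing:

```python
# Assumed per-request prices in USD (placeholders, not real pricing)
PRICE = {"premium": 0.04, "fast": 0.005}

# Hypothetical request mix for one developer on one active day
requests_today = {"premium": 30, "fast": 90}

cost = sum(PRICE[tier] * n for tier, n in requests_today.items())
premium_share = requests_today["premium"] / sum(requests_today.values())

print(f"${cost:.2f} per dev day, {premium_share:.0%} premium")
```

If the premium share climbs without a matching change in task mix, that is the signal to nudge engineers toward fast models for routine work.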
Spend vs. time saved
Cursor pricing is per seat per month. The ROI question is whether the license cost is less than the time value of hours saved. This requires tracking time-to-completion on comparable tasks before and after Cursor adoption — which most teams do not do systematically. A reasonable proxy: if PR velocity per engineer increases by 15% or more after Cursor adoption, the tool is paying for itself at any reasonable engineering hourly rate.
| Tool | Primary metric | Secondary metrics | Data source |
|---|---|---|---|
| GitHub Copilot | Acceptance rate | Code churn, AI-assisted PR % | GitHub Copilot Metrics API |
| Cursor | Request volume / day | Model tier split, spend per dev | Cursor admin dashboard |
Old Metrics vs. New Metrics: A Direct Comparison
| Old metric (pre-AI era) | Why it breaks with AI tooling | New / complementary metric |
|---|---|---|
| Lines of code | Volume inflated by AI generation | PRs merged + change failure rate |
| Story points / sprint | Estimation drift as AI compresses impl time | Cycle time (actual elapsed, not estimated) |
| Time to first commit | AI scaffolding makes it near-instant | Time from PR open to first substantive review |
| Code review thoroughness | Reviewers less likely to scrutinize AI-generated code | Code churn rate (post-merge rewrites) |
How to Measure AI ROI
The standard framework for measuring AI tooling ROI in engineering:
AI ROI = (Hours saved × Hourly rate) − License cost per period

The hard part is estimating hours saved. The most defensible approach: run a controlled comparison. Track PR velocity and cycle time for a cohort of engineers before Copilot or Cursor adoption, then for the same cohort after 60 days of active use. The velocity increase, translated into engineer-hours, is your hours-saved estimate.
Example: a team of 10 engineers with a $19 per seat per month Copilot license costs $190/month. If each engineer ships 20% more PRs per week from a baseline of roughly four PRs per week, and each PR represents roughly 3 hours of work, the time value gained per engineer per month is approximately 10 hours. At a $100/hour fully-loaded engineering cost, 10 engineers gaining 10 hours each generates $10,000 in time value against a $190 license cost. The ROI is clear — but only if you are measuring velocity before and after, not just assuming the tool is working.
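The arithmetic generalizes into a small function. A sketch using the example's own inputs — the baseline of four PRs per week is an assumption the worked example implies:

```python
def monthly_ai_roi(engineers, license_per_seat, baseline_prs_per_week,
                   velocity_uplift, hours_per_pr, hourly_rate):
    """Net monthly value of AI tooling in dollars (all inputs assumed)."""
    extra_prs = baseline_prs_per_week * velocity_uplift * 4  # ~4 weeks/month
    hours_gained = extra_prs * hours_per_pr
    value = engineers * hours_gained * hourly_rate
    cost = engineers * license_per_seat
    return value - cost

# The example from the text: 10 engineers, $19/seat, ~4 PRs/week baseline,
# 20% uplift, ~3 hours per PR, $100/hour fully loaded.
print(round(monthly_ai_roi(10, 19, 4, 0.20, 3, 100)))  # 9410
```

The result lands slightly under the rounded $10,000 figure in the text because it uses four weeks per month rather than 4.33; the conclusion is the same at any reasonable hourly rate.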
The Koalr Approach: AI Adoption Alongside DORA
The insight that drives Koalr's approach to AI metrics is that AI adoption and traditional DORA metrics need to be tracked in the same view, not in separate dashboards. A team can have excellent Copilot adoption metrics while their change failure rate is climbing — which means AI-generated code is shipping fast but not reliably. Or a team can have low AI adoption while their DORA metrics are stagnant — which suggests an AI tooling investment would yield immediate throughput gains.
The correlation that matters most: teams with Copilot acceptance rates above 40% consistently show improvement in both PR velocity and cycle time relative to their pre-adoption baseline. Teams that adopt Copilot but see acceptance rates plateau below 20% typically do not see meaningful DORA improvement — the tool is installed but not being used in ways that affect delivery.
The 40% threshold
Teams with Copilot acceptance rates above 40% show approximately 23% faster PR velocity compared to their pre-Copilot baseline. Below 20% acceptance, the velocity improvement is negligible — the license cost is not being justified by actual usage. Acceptance rate is the leading indicator; PR velocity is the lagging outcome.
What to Instrument Right Now
If your team has GitHub Copilot, enable the Copilot Metrics API (available via the GitHub REST API at GET /orgs/{org}/copilot/metrics) to get per-team and per-user acceptance rates, suggestion counts, and active user counts. Pipe these into your engineering analytics alongside your DORA metrics.
If your team uses Cursor, export usage data from the Cursor admin dashboard monthly and track: active users per month, requests per active user per day, and premium model usage percentage. Correlate these with your sprint-level PR velocity to build an ROI picture over time.
The goal is a single view where an engineering manager can see DORA metrics, PR velocity, cycle time, and AI adoption signals side by side — and ask questions like "why did cycle time improve last sprint?" with enough context to distinguish between AI adoption effects and other causes.
Track Copilot + Cursor adoption alongside DORA
Koalr integrates with GitHub Copilot telemetry and surfaces AI adoption metrics alongside your DORA data in a single dashboard. Ask questions like "which teams have the highest Copilot acceptance rate and how does that correlate with cycle time?" — in plain English, answered by AI.