The Rework Rate Problem: Measuring AI Code Quality at Scale
As AI coding tools generate a larger share of your codebase, the question of AI code quality becomes urgent — and vague. "Is our AI code good?" is not a measurable question. "What is our rework rate for AI-assisted changes versus human-written changes?" is. This guide walks through the practical work of defining, measuring, and monitoring rework rate as a quality signal for AI-generated code.
Why rework rate beats code review scores
PR approval rates, review times, and comment counts are process metrics. Rework rate is an outcome metric — it measures whether code that shipped actually worked. Outcome metrics are dramatically harder to game and much more useful for evaluating AI tool ROI.
Step 1: Define What Counts as Rework
Before measuring rework rate, your team needs to agree on a consistent definition. This is the step most teams skip, and skipping it makes every downstream comparison meaningless.
There are three categories of rework events to capture:
Category A: Revert commits
A revert commit is a git revert of a specific commit or PR that was deployed to production. git generates these commit messages as Revert "[original subject]", and GitHub's Revert button opens a PR titled Revert "[original PR title]" — making both relatively easy to identify by title pattern matching.
Revert commits are the clearest signal of rework: the team explicitly undid something that was shipped. The ambiguity is whether the revert was motivated by a production issue (true rework) or by a decision to change direction (planned rollback, not rework). The simplest way to disambiguate: a revert within 48 hours of the original deploy is rework. A revert after 48 hours is more likely to be a planned change.
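The 48-hour heuristic can be expressed directly in code. A minimal sketch, assuming you can obtain the deploy timestamp from your deploy log and the revert timestamp from the revert PR (`classify_revert` and its threshold are illustrative, not an existing API):

```python
from datetime import datetime, timedelta

def classify_revert(deployed_at: datetime, reverted_at: datetime,
                    threshold_hours: int = 48) -> str:
    """Apply the 48-hour heuristic: a revert shortly after deploy is rework;
    a revert long after is more likely a planned change of direction."""
    lag = reverted_at - deployed_at
    return "rework" if lag <= timedelta(hours=threshold_hours) else "planned"
```

The threshold is a team-level knob: teams with slow-surfacing bugs may want to widen it, at the cost of counting some planned rollbacks as rework.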
Category B: Hotfix branches and PRs
Hotfix PRs are changes merged outside the normal release cycle specifically to address a production issue. They are identifiable by branch name patterns (hotfix/*, fix/prod-*, emergency/*) or PR labels that your team applies to track them.
The challenge with hotfixes is that they are easy to misclassify in both directions. Not all "hotfix" branches are actual production fixes — some teams use the convention for any urgent change, including non-production-impacting ones. And some actual hotfixes do not follow the naming convention because the engineer was moving fast. Establish a team norm: hotfix branch naming is required for any change that directly addresses a production incident.
Category C: Rollback deployments
A rollback deployment is a deployment that re-deploys a previous artifact to production. In GitHub Deployments API terms, this appears as a new deployment with a SHA that matches a prior deployment's SHA, or a deployment where the description indicates it is a rollback.
Rollbacks are distinct from reverts: a rollback returns the running system to a previous state without changing the codebase; a revert changes the codebase to undo a previous change. Both count as rework.
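Detecting a rollback from deployment history reduces to spotting a deployment whose SHA already appeared earlier in the same environment. A sketch under the assumption that each deployment record carries `sha` and `created_at` fields (mirroring the GitHub Deployments API) and the list is ordered oldest-first:

```python
def find_rollbacks(deployments):
    """Flag any deployment that re-deploys a SHA seen in an earlier deployment.

    `deployments` is an oldest-first list of dicts with at least
    'sha' and 'created_at' keys.
    """
    seen = set()
    rollbacks = []
    for dep in deployments:
        if dep["sha"] in seen:
            rollbacks.append(dep)  # re-deploy of a prior artifact = rollback
        seen.add(dep["sha"])
    return rollbacks
```

In practice you would run this per environment (production only), since re-deploying an old SHA to staging is routine rather than rework.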
Step 2: Tag AI-Generated Changes at Merge Time
To compute rework rate separately for AI vs. human changes, you need to know at merge time which PRs contained significant AI-generated code. The most reliable method is a label or tag applied by the tool or the developer.
```yaml
# GitHub Actions: auto-apply label when Copilot is used
name: Label AI PRs
on:
  pull_request:
    types: [opened, synchronize]
jobs:
  detect-ai:
    runs-on: ubuntu-latest
    permissions:
      pull-requests: write
    steps:
      - name: Check for AI co-authorship
        run: |
          # Scan every commit message in the PR for a Copilot co-author trailer
          if gh api "repos/$REPO/pulls/$PR_NUMBER/commits" --jq '.[].commit.message' \
            | grep -qi "co-authored-by:.*copilot"; then
            gh pr edit "$PR_NUMBER" --repo "$REPO" --add-label "ai-assisted"
          fi
        env:
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
          PR_NUMBER: ${{ github.event.pull_request.number }}
          REPO: ${{ github.repository }}
```
Note that on a pull_request event the checked-out ref is a synthetic merge commit, so inspect the PR's commit list via the API rather than git log -1 on the workspace — otherwise the co-author trailers are not visible.
Alternative approaches: require developers to self-report AI assistance in a PR template checkbox, or use the GitHub API to identify PRs opened by bot accounts (for autonomous agents like Devin).
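The bot-account approach is a simple check on the PR author. A sketch assuming the PR payload follows the GitHub REST shape (`user.login`, `user.type`); the agent login names below are illustrative placeholders, not a definitive list:

```python
# Placeholder logins — replace with the bot accounts your org actually uses.
KNOWN_AGENT_LOGINS = {"devin-ai-integration[bot]", "copilot-swe-agent[bot]"}

def is_agent_authored(pr) -> bool:
    """True if the PR was opened by a known autonomous-agent bot account."""
    user = pr.get("user", {})
    return user.get("type") == "Bot" and user.get("login") in KNOWN_AGENT_LOGINS
```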
Step 3: Calculate Rework Rate Segmented by AI vs. Human
With a consistent rework definition (Step 1) and AI labels applied at merge time (Step 2), the calculation itself is a straightforward pass over merged PRs:
```python
import re

def is_rework_pr(pr):
    """Classify a merged PR as rework; returns (is_rework, category)."""
    title = pr.get("title", "").lower()
    branch = pr.get("head", {}).get("ref", "").lower()
    # Pattern A: revert commit / revert PR
    if title.startswith("revert"):
        return True, "revert"
    # Pattern B: hotfix branch naming
    hotfix_patterns = [r"hotfix/", r"fix/prod", r"emergency/", r"fix/incident"]
    if any(re.search(p, branch) for p in hotfix_patterns):
        return True, "hotfix"
    # Pattern C: rollback label
    labels = [label["name"] for label in pr.get("labels", [])]
    if "rollback" in labels or "incident-fix" in labels:
        return True, "rollback"
    return False, None

def compute_rework_rates(merged_prs):
    ai_total = ai_rework = 0
    human_total = human_rework = 0
    for pr in merged_prs:
        labels = [label["name"] for label in pr.get("labels", [])]
        is_ai = "ai-assisted" in labels or "ai-generated" in labels
        is_rw, _ = is_rework_pr(pr)
        if is_ai:
            ai_total += 1
            ai_rework += int(is_rw)
        else:
            human_total += 1
            human_rework += int(is_rw)
    return {
        "ai_rework_rate": (ai_rework / ai_total * 100) if ai_total else 0,
        "human_rework_rate": (human_rework / human_total * 100) if human_total else 0,
        "ai_prs": ai_total,
        "human_prs": human_total,
    }
```
Step 4: Build the Dashboard
A useful rework rate dashboard for AI quality tracking shows at minimum:
- Rework rate trend (30-day rolling): Separate lines for AI-assisted and human-written PRs. The primary comparison you want to watch.
- Rework rate by service or module: Which services have the highest AI-assisted rework? This identifies where AI tools are performing worst and where additional review gates are needed.
- Rework lag distribution: How many hours between the original PR and the rework event? Short lag (under 4 hours) indicates rapid detection — good incident response. Long lag (over 48 hours) may indicate subtle bugs that take time to surface.
- AI adoption percentage (same chart): Show AI adoption rate on the same time axis as rework rate. If rework rate climbs in lockstep with AI adoption, the tools are a likely contributor; if the two move independently, the cause probably lies elsewhere.
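The trend line in the first chart can be derived from the same PR records used in Step 3. A sketch that buckets merged PRs by ISO week and computes the two rates per bucket — it assumes each PR dict carries a `merged_at` ISO timestamp plus the labels from Step 2, and takes the rework check as a boolean predicate so you can plug in whatever classifier you settled on:

```python
from collections import defaultdict
from datetime import datetime

def weekly_rework_trend(merged_prs, is_rework):
    """Return {iso_week: {"ai": rate_pct, "human": rate_pct}}.

    `is_rework` is a boolean predicate over a PR dict.
    """
    # Per week and segment, track [rework_count, total_count]
    buckets = defaultdict(lambda: {"ai": [0, 0], "human": [0, 0]})
    for pr in merged_prs:
        week = datetime.fromisoformat(pr["merged_at"]).strftime("%G-W%V")
        labels = {label["name"] for label in pr.get("labels", [])}
        segment = "ai" if labels & {"ai-assisted", "ai-generated"} else "human"
        counts = buckets[week][segment]
        counts[0] += int(is_rework(pr))
        counts[1] += 1
    return {
        week: {seg: (c[0] / c[1] * 100) if c[1] else 0 for seg, c in segs.items()}
        for week, segs in buckets.items()
    }
```

Weekly buckets smooth out day-to-day noise; for a 30-day rolling window, replace the week key with a sliding date range over the same counts.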
What a Good Result Looks Like
Teams that implement AI-aware rework rate tracking typically find one of three patterns:
Pattern 1: AI rework rate ≈ human rework rate. The AI tools are being reviewed effectively. The review process has adapted to account for AI-generated code. This is the target state.
Pattern 2: AI rework rate is 2–3× human rework rate. The review process has not adapted: the team is approving AI code too quickly, without sufficient coverage review or integration checking. Intervention: implement a coverage-first review process and CODEOWNERS enforcement for AI PRs.
Pattern 3: AI rework rate is >5× human rework rate. The AI tools are being used for changes that are beyond their capability — high-complexity, cross-cutting changes that require deep domain expertise. Intervention: restrict AI tool use to lower-risk change categories (isolated functions, test generation, documentation) and implement scope gates on AI PR submissions.
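The CODEOWNERS enforcement mentioned under Pattern 2 is path-based, so one workable sketch is to route the services with the highest AI-assisted rework to designated reviewers. Paths and team names below are illustrative placeholders:

```
# CODEOWNERS sketch — paths and teams are placeholders
/services/payments/   @org/payments-senior-reviewers
/services/auth/       @org/security-reviewers
```

Combined with a branch-protection rule requiring code-owner review, no PR touching these paths can merge without an owner's approval. Note that CODEOWNERS applies to every PR on those paths regardless of label; a gate conditioned on the ai-assisted label needs a separate workflow check.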
Koalr tracks rework rate by AI vs. human automatically
Koalr detects AI-assisted PRs, tracks rework events (reverts, hotfixes, rollbacks), and shows you the rework rate comparison side-by-side so you can measure whether your AI tools are helping or hurting delivery quality.
Track AI code quality with rework rate
Koalr automatically segments rework rate by AI-assisted vs. human-written changes, giving you the data to evaluate your AI tool ROI with outcome metrics rather than process proxies. Connect GitHub in 5 minutes.