Why Large PRs Are More Than an Annoyance: The Data on PR Size and Incidents
The engineering community has long advocated for small PRs on productivity grounds — faster reviews, less context switching, quicker iteration cycles. But the relationship between PR size and production incidents is the more compelling argument, and it is quantifiable. At a certain PR size, the incident rate increases nonlinearly. At another size, even the best reviewers stop actually reviewing the change.
The Nonlinear Threshold
PRs under 200 lines changed have a 5% incident rate. PRs between 200–500 lines have an 8% incident rate. PRs over 500 lines have a 14% incident rate. The jump from 200–500 to 500+ is not a linear extrapolation — it represents a phase change in reviewer comprehension and blast radius.
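These buckets can be sketched as a simple lookup. The function name and bucket boundaries below are illustrative, using the rates quoted above:

```python
# Illustrative only: bucketed incident rates taken from the figures above.
SIZE_BUCKETS = [
    (200, 0.05),   # under 200 changed lines
    (500, 0.08),   # 200-500 changed lines
]
LARGE_PR_RATE = 0.14  # over 500 changed lines

def incident_rate_for_size(lines_changed: int) -> float:
    """Look up the approximate historical incident rate for a PR of this size."""
    for upper_bound, rate in SIZE_BUCKETS:
        if lines_changed < upper_bound:
            return rate
    return LARGE_PR_RATE
```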
Why PR Size Increases Incident Risk
The mechanism is dual: larger PRs are both harder to review effectively and carry a larger blast radius when something goes wrong.
On the review side, there is a well-documented phenomenon in code review research: reviewer defect detection rate drops sharply after approximately 400 lines of changed code in a single review session. A review of a 500-line PR typically catches fewer bugs per 100 lines than a review of a 100-line PR, even when both are reviewed by the same person with the same amount of time. Reviewers switch from deep analysis to pattern matching — they look for obviously wrong things rather than carefully tracing control flow through the entire change.
On the blast radius side, a 500-line PR touches more code paths, more potential integration points, and more edge cases than a 100-line PR. When a bug is present, it has more vectors through which it can manifest in production. And when an incident occurs, the debugging radius is correspondingly larger — the on-call engineer needs to understand which of 500 lines of changes caused the problem.
The Review Quality Cliff
A classic study at Cisco Systems found that the optimal code review session is 60–90 minutes covering 200–400 lines of code. Beyond that, the number of defects found per line reviewed drops dramatically: reviewers catch roughly 70–80% of defects under optimal conditions, but only 20–30% in a 1000-line review.
This creates a perverse incentive structure in engineering teams: the PRs that most need careful review (large, complex changes) are also the PRs that are least effectively reviewed. A 1000-line PR that takes 3 hours to review has already exceeded the effective review window for most human reviewers.
| PR Size Range | Avg Review Time | Defect Detection | Incident Rate |
|---|---|---|---|
| 1–50 lines | 8 min | 85%+ | 3% |
| 51–200 lines | 22 min | 75–85% | 5% |
| 201–500 lines | 48 min | 55–70% | 8% |
| 501–1000 lines | 87 min | 35–50% | 12% |
| 1000+ lines | 2+ hours | 20–35% | 18%+ |
How Change Entropy Interacts with Size
Raw line count is an incomplete signal. A 500-line PR that adds a new feature to a single well-defined module is different from a 500-line PR that touches 40 different files across 8 services. Change entropy — the dispersion of changed lines across the file tree — amplifies size risk.
A large PR with low entropy (500 lines in 2–3 files) is reviewable: the reviewer can focus on a defined area. A large PR with high entropy (500 lines across 30 files) is effectively unreviewable as a whole — the reviewer is context-switching between different parts of the codebase with different ownership, different test requirements, and different risk profiles.
When scoring deployment risk, size and entropy should be combined multiplicatively rather than additively. A PR with line count above 500 AND normalized entropy above 0.7 carries substantially more risk than either threshold alone would suggest.
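One way to sketch this combination, assuming a Shannon-entropy definition of dispersion (the scaling constants below are illustrative, not a prescribed formula; only the 500-line and 0.7-entropy thresholds come from the text):

```python
import math

def normalized_change_entropy(lines_per_file: list[int]) -> float:
    """
    Shannon entropy of the distribution of changed lines across files,
    normalized to [0, 1] by the maximum possible entropy log2(n_files).
    A PR concentrated in one file scores 0; an even spread scores 1.
    """
    total = sum(lines_per_file)
    if total == 0 or len(lines_per_file) < 2:
        return 0.0
    entropy = 0.0
    for n in lines_per_file:
        if n > 0:
            p = n / total
            entropy -= p * math.log2(p)
    return entropy / math.log2(len(lines_per_file))

def size_entropy_risk(total_lines: int, entropy: float) -> float:
    """
    Multiplicative combination: high entropy amplifies size risk
    rather than merely adding to it.
    """
    size_factor = min(total_lines / 500, 2.0)  # saturates at 1000 lines
    entropy_factor = 0.5 + entropy             # illustrative scaling
    return size_factor * entropy_factor
```

With this shape, a 600-line PR spread evenly across 30 files scores well above the same 600 lines concentrated in two files, matching the multiplicative intuition.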
The Excluded Lines Problem
A common objection to PR size limits is that large line counts are misleading when they include generated code, lockfiles, or test data. This is valid. A PR that adds 400 lines of generated protobuf definitions should not be treated the same as a PR that adds 400 lines of handwritten application logic.
When calculating PR size for risk scoring, exclude or weight separately:
- Generated files (matching patterns like `*.generated.ts`, `*.pb.go`)
- Lockfiles (`package-lock.json`, `yarn.lock`, `go.sum`)
- Test fixtures and mock data files
- Asset files (SVGs, images embedded as code)
- Configuration files with many similar repeated entries
The meaningful line count for risk purposes is the "handwritten logic" line count — the lines that require actual human understanding to review correctly.
Enforcing PR Size Without Creating Friction
The goal is not to block all large PRs — it is to make the cost of a large PR visible at the moment it is created, so engineers make the decision consciously rather than by default.
The most effective implementation: a check run that posts a size warning when a PR exceeds a threshold (e.g., 400 meaningful lines) with a specific message explaining what the size means for review quality and incident risk. The check run should not block merge by default for size alone — blocking creates resentment and workarounds. Instead, it should elevate the risk score, which may cause the PR to require an additional reviewer or CODEOWNERS sign-off.
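A minimal sketch of the non-blocking check payload. The field names follow the GitHub Checks API; the check name, threshold, and message text are illustrative assumptions, and the actual POST to `/repos/{owner}/{repo}/check-runs` (which requires GitHub App credentials and a `head_sha`) is omitted:

```python
WARN_THRESHOLD = 400  # meaningful lines, per the threshold above

def build_size_check(meaningful_lines: int) -> dict:
    """
    Build a GitHub check-run payload for the PR size signal.
    The conclusion is always "neutral" so the check never blocks
    merge on size alone; it only makes the cost visible.
    """
    oversized = meaningful_lines > WARN_THRESHOLD
    summary = f"{meaningful_lines} meaningful lines changed. " + (
        "Reviews of this size catch markedly fewer defects per line; "
        "consider splitting, or expect an elevated risk score."
        if oversized
        else "Within the recommended review size."
    )
    return {
        "name": "pr-size",
        "conclusion": "neutral",  # informational either way
        "output": {
            "title": "PR too large for effective review" if oversized else "PR size OK",
            "summary": summary,
        },
    }
```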
```python
import re

# Filename patterns treated as non-meaningful for risk scoring
EXCLUDED_PATTERNS = [
    r"\.lock$",         # Lockfiles
    r"\.generated\.",   # Generated files
    r"\.pb\.go$",       # Protobuf Go
    r"_pb2\.py$",       # Protobuf Python
    r"fixtures/",       # Test fixtures directory
    r"__snapshots__/",  # Jest snapshots
    r"\.snap$",         # Snapshot files
]

def calculate_meaningful_lines(pr_files: list[dict]) -> int:
    """
    Calculate meaningful (non-generated) lines changed in a PR.
    pr_files: list of file objects from the GitHub API.
    """
    total_meaningful = 0
    for file in pr_files:
        filename = file["filename"]
        is_excluded = any(re.search(p, filename) for p in EXCLUDED_PATTERNS)
        if not is_excluded:
            # Count both additions and deletions as changed lines
            total_meaningful += file.get("additions", 0) + file.get("deletions", 0)
    return total_meaningful
```

When Large PRs Are Unavoidable
Some changes are inherently large: major dependency upgrades, cross-cutting refactors, database schema migrations paired with application changes. For these PRs, the right response is not to block them but to route them through a more rigorous review process.
For large unavoidable PRs, the practices that reduce incident rate most are:
- Stacked PRs where possible: Split the change into a sequence of smaller PRs, each of which leaves the codebase in a deployable intermediate state. Even if the overall change is large, reviewers can understand the progression.
- Designated domain reviewers: For a high-entropy large PR, each CODEOWNERS team reviews only their section — rather than one reviewer trying to understand the entire change.
- Canary deployment: Deploy to 5–10% of production traffic first, watch error rates for 30 minutes, then complete the rollout. This limits blast radius even if the review process missed something.
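The canary step in the list above can be sketched as a simple promote/rollback gate run at the end of the watch window. The metric source, function name, and tolerance are assumptions:

```python
def canary_decision(canary_error_rate: float,
                    baseline_error_rate: float,
                    tolerance: float = 1.5) -> str:
    """
    Decide whether to complete the rollout after the canary window
    (e.g., 5-10% of traffic for 30 minutes). Promote only if the
    canary error rate is within tolerance of the baseline.
    """
    if baseline_error_rate == 0:
        return "promote" if canary_error_rate == 0 else "rollback"
    ratio = canary_error_rate / baseline_error_rate
    return "promote" if ratio <= tolerance else "rollback"
```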
Koalr Scores PR Size and Entropy as Combined Risk Signals
Koalr calculates meaningful (non-generated) line counts and change entropy for every PR, weights the combined signal in the deploy risk score, and posts that score to GitHub before reviewers spend time on PRs that should have been split. Reviewers get a quantified risk signal rather than a raw line count they have to interpret. Connect GitHub in 5 minutes.