Why Large PRs Are More Than an Annoyance: The Data on PR Size and Incidents
The engineering community has long advocated for small PRs on productivity grounds — faster reviews, less context switching, quicker iteration cycles. But the relationship between PR size and production incidents is the more compelling argument, and it is quantifiable. At a certain PR size, the incident rate increases nonlinearly. At another size, even the best reviewers stop actually reviewing the change.
The Nonlinear Threshold
PRs under 200 lines changed have a 5% incident rate. PRs between 200–500 lines have an 8% incident rate. PRs over 500 lines have a 14% incident rate. The jump from 200–500 to 500+ is not a linear extrapolation — it represents a phase change in reviewer comprehension and blast radius.
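These buckets can be sketched as a simple lookup. The function name and bucket boundaries below are illustrative, using the rates quoted above:

```python
# Illustrative only: bucketed incident rates taken from the figures above.
SIZE_BUCKETS = [
    (200, 0.05),   # under 200 changed lines
    (500, 0.08),   # 200-500 changed lines
]
LARGE_PR_RATE = 0.14  # over 500 changed lines

def incident_rate_for_size(lines_changed: int) -> float:
    """Look up the approximate historical incident rate for a PR of this size."""
    for upper_bound, rate in SIZE_BUCKETS:
        if lines_changed < upper_bound:
            return rate
    return LARGE_PR_RATE
```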
Why PR Size Increases Incident Risk
The mechanism is dual: larger PRs are both harder to review effectively and carry a larger blast radius when something goes wrong.
On the review side, there is a well-documented phenomenon in code review research: reviewer defect detection rate drops sharply after approximately 400 lines of changed code in a single review session. A review of a 500-line PR typically catches fewer bugs per 100 lines than a review of a 100-line PR, even when both are reviewed by the same person with the same amount of time. Reviewers switch from deep analysis to pattern matching — they look for obviously wrong things rather than carefully tracing control flow through the entire change.
On the blast radius side, a 500-line PR touches more code paths, more potential integration points, and more edge cases than a 100-line PR. When a bug is present, it has more vectors through which it can manifest in production. And when an incident occurs, the debugging radius is correspondingly larger — the on-call engineer needs to understand which of 500 lines of changes caused the problem.
The Review Quality Cliff
A classic study at Cisco Systems found that the optimal code review session is 60–90 minutes covering 200–400 lines of code. Beyond that, the number of defects found per line reviewed drops dramatically: reviewers catch roughly 70–80% of defects under optimal conditions, but only 20–30% in a 1000-line review.
This creates a perverse incentive structure in engineering teams: the PRs that most need careful review (large, complex changes) are also the PRs that are least effectively reviewed. A 1000-line PR that takes 3 hours to review has already exceeded the effective review window for most human reviewers.
| PR Size Range | Avg Review Time | Defect Detection | Incident Rate |
|---|---|---|---|
| 1–50 lines | 8 min | 85%+ | 3% |
| 51–200 lines | 22 min | 75–85% | 5% |
| 201–500 lines | 48 min | 55–70% | 8% |
| 501–1000 lines | 87 min | 35–50% | 12% |
| 1000+ lines | 2+ hours | 20–35% | 18%+ |
How Change Entropy Interacts with Size
Raw line count is an incomplete signal. A 500-line PR that adds a new feature to a single well-defined module is different from a 500-line PR that touches 40 different files across 8 services. Change entropy — the dispersion of changed lines across the file tree — amplifies size risk.
A large PR with low entropy (500 lines in 2–3 files) is reviewable: the reviewer can focus on a defined area. A large PR with high entropy (500 lines across 30 files) is effectively unreviewable as a whole — the reviewer is context-switching between different parts of the codebase with different ownership, different test requirements, and different risk profiles.
When scoring deployment risk, size and entropy should be combined multiplicatively rather than additively. A PR with line count above 500 AND normalized entropy above 0.7 carries substantially more risk than either threshold alone would suggest.
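One way to sketch this combination, assuming a Shannon-entropy definition of dispersion (the scaling constants below are illustrative, not a prescribed formula; only the 500-line and 0.7-entropy thresholds come from the text):

```python
import math

def normalized_change_entropy(lines_per_file: list[int]) -> float:
    """
    Shannon entropy of the distribution of changed lines across files,
    normalized to [0, 1] by the maximum possible entropy log2(n_files).
    A PR concentrated in one file scores 0; an even spread scores 1.
    """
    total = sum(lines_per_file)
    if total == 0 or len(lines_per_file) < 2:
        return 0.0
    entropy = 0.0
    for n in lines_per_file:
        if n > 0:
            p = n / total
            entropy -= p * math.log2(p)
    return entropy / math.log2(len(lines_per_file))

def size_entropy_risk(total_lines: int, entropy: float) -> float:
    """
    Multiplicative combination: high entropy amplifies size risk
    rather than merely adding to it.
    """
    size_factor = min(total_lines / 500, 2.0)  # saturates at 1000 lines
    entropy_factor = 0.5 + entropy             # illustrative scaling
    return size_factor * entropy_factor
```

With this shape, a 600-line PR spread evenly across 30 files scores well above the same 600 lines concentrated in two files, matching the multiplicative intuition.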
The Excluded Lines Problem
A common objection to PR size limits is that large line counts are misleading when they include generated code, lockfiles, or test data. This is valid. A PR that adds 400 lines of generated protobuf definitions should not be treated the same as a PR that adds 400 lines of handwritten application logic.
When calculating PR size for risk scoring, exclude or weight separately:
- Generated files (matching patterns like `*.generated.ts`, `*.pb.go`)
- Lockfiles (`package-lock.json`, `yarn.lock`, `go.sum`)
- Test fixtures and mock data files
- Asset files (SVGs, images embedded as code)
- Configuration files with many similar repeated entries
The meaningful line count for risk purposes is the "handwritten logic" line count — the lines that require actual human understanding to review correctly.
Enforcing PR Size Without Creating Friction
The goal is not to block all large PRs — it is to make the cost of a large PR visible at the moment it is created, so engineers make the decision consciously rather than by default.
The most effective implementation: a check run that posts a size warning when a PR exceeds a threshold (e.g., 400 meaningful lines) with a specific message explaining what the size means for review quality and incident risk. The check run should not block merge by default for size alone — blocking creates resentment and workarounds. Instead, it should elevate the risk score, which may cause the PR to require an additional reviewer or CODEOWNERS sign-off.
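A minimal sketch of the non-blocking check payload. The field names follow the GitHub Checks API; the check name, threshold, and message text are illustrative assumptions, and the actual POST to `/repos/{owner}/{repo}/check-runs` (which requires GitHub App credentials and a `head_sha`) is omitted:

```python
WARN_THRESHOLD = 400  # meaningful lines, per the threshold above

def build_size_check(meaningful_lines: int) -> dict:
    """
    Build a GitHub check-run payload for the PR size signal.
    The conclusion is always "neutral" so the check never blocks
    merge on size alone; it only makes the cost visible.
    """
    oversized = meaningful_lines > WARN_THRESHOLD
    summary = f"{meaningful_lines} meaningful lines changed. " + (
        "Reviews of this size catch markedly fewer defects per line; "
        "consider splitting, or expect an elevated risk score."
        if oversized
        else "Within the recommended review size."
    )
    return {
        "name": "pr-size",
        "conclusion": "neutral",  # informational either way
        "output": {
            "title": "PR too large for effective review" if oversized else "PR size OK",
            "summary": summary,
        },
    }
```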
```python
import re

# Filename patterns treated as non-meaningful for risk scoring
EXCLUDED_PATTERNS = [
    r"\.lock$",         # Lockfiles
    r"\.generated\.",   # Generated files
    r"\.pb\.go$",       # Protobuf Go
    r"_pb2\.py$",       # Protobuf Python
    r"fixtures/",       # Test fixtures directory
    r"__snapshots__/",  # Jest snapshots
    r"\.snap$",         # Snapshot files
]

def calculate_meaningful_lines(pr_files: list[dict]) -> int:
    """
    Calculate meaningful (non-generated) lines changed in a PR.
    pr_files: list of file objects from the GitHub API.
    """
    total_meaningful = 0
    for file in pr_files:
        filename = file["filename"]
        is_excluded = any(re.search(p, filename) for p in EXCLUDED_PATTERNS)
        if not is_excluded:
            # Count both additions and deletions as changed lines
            total_meaningful += file.get("additions", 0) + file.get("deletions", 0)
    return total_meaningful
```

When Large PRs Are Unavoidable
Some changes are inherently large: major dependency upgrades, cross-cutting refactors, database schema migrations paired with application changes. For these PRs, the right response is not to block them but to route them through a more rigorous review process.
For large unavoidable PRs, the practices that reduce incident rate most are:
- Stacked PRs where possible: Split the change into a sequence of smaller PRs, each of which leaves the codebase in a deployable intermediate state. Even if the overall change is large, reviewers can understand the progression.
- Designated domain reviewers: For a high-entropy large PR, each CODEOWNERS team reviews only their section — rather than one reviewer trying to understand the entire change.
- Canary deployment: Deploy to 5–10% of production traffic first, watch error rates for 30 minutes, then complete the rollout. This limits blast radius even if the review process missed something.
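The canary step in the list above can be sketched as a simple promote/rollback gate run at the end of the watch window. The metric source, function name, and tolerance are assumptions:

```python
def canary_decision(canary_error_rate: float,
                    baseline_error_rate: float,
                    tolerance: float = 1.5) -> str:
    """
    Decide whether to complete the rollout after the canary window
    (e.g., 5-10% of traffic for 30 minutes). Promote only if the
    canary error rate is within tolerance of the baseline.
    """
    if baseline_error_rate == 0:
        return "promote" if canary_error_rate == 0 else "rollback"
    ratio = canary_error_rate / baseline_error_rate
    return "promote" if ratio <= tolerance else "rollback"
```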
Koalr Scores PR Size and Entropy as Combined Risk Signals
Koalr calculates meaningful (non-generated) line counts and change entropy for every PR, weights the combined signal in the deploy risk score, and posts that score to GitHub before reviewers spend time on PRs that should have been split. Reviewers get a quantified risk signal rather than a raw line count they have to interpret. Connect GitHub in 5 minutes.