Code Quality · March 14, 2026 · 10 min read

Why your code coverage % is lying to you (and what to measure instead)

80% coverage sounds good. 80% coverage with all tests on non-critical paths is meaningless. Here's what to actually measure.

The problem with line coverage as a metric

Line coverage — the percentage of lines of code that are executed by at least one test — is the most widely reported code quality metric in software engineering. It is also among the least useful. The number tells you how much of your code a test suite touches. It tells you nothing about whether the tests are meaningful, whether the critical paths are protected, or whether the covered code is actually correct.

100% line coverage is achievable while missing entire business-logic branches. A function that processes a payment and checks five error conditions in single-line guards or ternary expressions can show 100% line coverage from one test with a valid card — every line executes, but none of the five error-handling branches is ever taken. The coverage report shows green.

The most common form of gaming is tests that call code without meaningful assertions. Coverage tooling records that a line was executed; it does not record whether the test verified the output. A test that calls processPayment() and asserts nothing — or asserts only that no exception was thrown — contributes to the line coverage percentage as much as a test that verifies every output, side effect, and error case. From the coverage report's perspective, they are identical.

The benchmark trap compounds the problem. Engineering teams set a coverage threshold — 80% is the most common — and optimize for reaching it. Once the threshold is hit, the number is treated as proof of test quality rather than a proxy for it. Teams celebrate crossing 80% and then stop asking whether the tests that produce that number are testing anything meaningful. The threshold becomes the goal rather than the starting point.

The five coverage metrics that actually matter

1. Coverage delta on changed files in a PR. Org-wide coverage percentage is an aggregate that hides more than it reveals. A single team adding 500 well-tested lines can mask another team shipping 200 untested lines in the same week — the org-wide number goes up, the risk goes up with it, and neither is visible. The metric that is actually actionable is the coverage change on the specific files being modified in a pull request. “This PR dropped coverage by 3.2% on src/billing/stripe.ts” is information a developer can act on before merge. “Org-wide coverage is 79.4%” is not.

Coverage delta at the PR level is what Codecov and SonarCloud both surface natively — and it is the signal that correlates with deploy outcomes rather than the aggregate number.
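The computation behind the per-file signal is simple. A sketch, assuming coverage fractions keyed by path — the `CoverageMap` shape is illustrative; real numbers come from lcov reports or the Codecov API:

```typescript
// Per-file coverage delta between a PR's base and head.
// The map shape is a hypothetical simplification for illustration.
type CoverageMap = Record<string, number>; // path -> covered fraction (0..1)

function coverageDeltas(base: CoverageMap, head: CoverageMap): CoverageMap {
  const deltas: CoverageMap = {};
  for (const path of Object.keys(head)) {
    // Only compare files present in both revisions.
    if (path in base) {
      deltas[path] = head[path] - base[path];
    }
  }
  return deltas;
}
```

A negative delta on a file a developer just touched is the actionable signal; the org-wide aggregate never surfaces it.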

2. Branch coverage vs line coverage. Line coverage records whether a line executed; branch coverage records whether each condition was evaluated as both true and false. The two diverge whenever branching happens within a line: a guard like if (card.expired || card.blocked) shows full line coverage after a single call with an expired card, even though the short-circuit means the blocked check was never evaluated and the happy path through the condition was never taken. Branch coverage is harder to achieve, takes longer to write tests for, and is proportionally more meaningful for complex business logic. If you are only measuring one coverage type, branch coverage is the right choice for logic-heavy modules.

3. New code coverage. SonarCloud surfaces this as the new_coverage metric: the coverage percentage calculated only on code that was added or modified in the current pull request. This is the quality gate question in its clearest form: “Is the code we are shipping today tested?” It strips away the historical debt of legacy untested code and focuses enforcement on the code that is actually changing. A repository with 40% overall coverage can still require that new code meets an 80% threshold — holding teams accountable for what they are adding without requiring them to fix all existing gaps first.

4. Test execution time trend. Coverage that takes two hours to generate is coverage that gets skipped. When a CI suite grows to the point where running it becomes painful, developers begin selectively skipping tests or marking them as non-blocking. The coverage data stops reflecting reality. Tracking test execution time as a trend — week over week, not just in absolute terms — catches suite bloat before teams start working around it. A sudden jump in suite duration after a merge is usually the fingerprint of a newly added inefficient test, and it is cheapest to address before it slows the whole team's feedback loop.
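Detecting that sudden jump is a small computation. A sketch — the function name, the 1.5× ratio, and the input shape are illustrative; real durations would come from your CI provider's API:

```typescript
// Flag a suite-duration regression: compare the latest run against the
// median of the prior runs. `ratio` is an arbitrary illustrative default.
function durationJump(durationsSec: number[], ratio = 1.5): boolean {
  if (durationsSec.length < 2) return false;
  // slice() copies, so sorting does not mutate the caller's array.
  const prior = durationsSec.slice(0, -1).sort((a, b) => a - b);
  const median = prior[Math.floor(prior.length / 2)];
  return durationsSec[durationsSec.length - 1] > median * ratio;
}
```

Comparing against a trailing median rather than the single previous run keeps one slow historical run from masking a genuine regression.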

5. Flaky test rate. A test that intermittently passes and fails is worse than no test. It trains developers to re-run failing builds rather than investigating failures, which means real failures get ignored alongside false ones. A 5% flaky test rate across a 1,000-test suite means 50 tests that cannot be trusted — and real bugs that slip through under cover of “it's probably flaky.” Datadog CI Visibility, Cypress Cloud, and BuildPulse surface flaky rate natively by tracking test pass/fail history across runs. Treating flaky rate as a first-class metric — with a threshold that triggers investigation — is more impactful than increasing overall coverage percentage for most mature test suites.
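The rate itself falls out of pass/fail history. A sketch under a simplifying assumption — a test counts as flaky if the same suite has produced both outcomes for it; the data shape is hypothetical, while tools like BuildPulse derive the equivalent from real run history:

```typescript
// Pass/fail history per test name across recent CI runs
// (hypothetical shape for illustration).
type RunHistory = Record<string, boolean[]>;

// Fraction of tests that have produced both a pass and a failure.
function flakyRate(history: RunHistory): number {
  const names = Object.keys(history);
  if (names.length === 0) return 0;
  const flaky = names.filter(
    (name) => history[name].includes(true) && history[name].includes(false)
  );
  return flaky.length / names.length;
}
```

At a 5% rate on 1,000 tests, this function identifies the 50 untrustworthy tests by name — the list that actually needs investigation.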

How to integrate Codecov with GitHub Actions

Codecov is the most widely used coverage aggregation tool for GitHub-hosted repositories. Setup requires uploading a coverage report from your CI run, which a single workflow step handles:

- name: Upload coverage to Codecov
  uses: codecov/codecov-action@v4
  with:
    token: ${{ secrets.CODECOV_TOKEN }}
    files: ./coverage/lcov.info
    flags: payments          # team-level segmentation
    fail_ci_if_error: true
    verbose: true

The flags system is Codecov's most useful feature for multi-team organizations. Each CI upload can be tagged with one or more flag names — typically team names or service names — and Codecov will aggregate coverage separately per flag. A payments team can see their service's coverage trend independently of the platform team's coverage, and both roll up into the org-wide view. This is how you get per-team accountability without requiring separate repositories.

When a pull request is opened, Codecov posts a comment showing the coverage delta for every changed file — not the aggregate, but file by file. This is the PR-level signal described above, surfaced automatically without any additional configuration. A developer can see at a glance which files they modified lost coverage and by how much, before the PR is reviewed.

Codecov also integrates as a GitHub status check, allowing you to fail PRs when coverage drops below a threshold; the target and allowed drop are configured in the coverage.status section of codecov.yml. The status check blocks merge in the same way branch protection does — though as with CODEOWNERS enforcement, tiering the threshold by path criticality produces better outcomes than a single org-wide number.

SonarCloud: the enterprise alternative

SonarCloud takes a different approach than Codecov. Where Codecov specializes in coverage aggregation and per-team segmentation, SonarCloud combines coverage with code smell detection, security hotspot identification, and technical debt estimation in a single platform. For teams that want a single quality gate covering multiple signal types, SonarCloud is the stronger choice.

The key metric for quality gates is new_coverage — coverage calculated on changed code only, as described above. Access it via the SonarCloud API:

GET https://sonarcloud.io/api/measures/search_history
  ?component=your-org_your-repo
  &metrics=new_coverage,coverage,new_violations
  &from=2026-01-01

Configuring a quality gate that fails when new_coverage drops below 80% enforces coverage on new code without requiring teams to fix all existing gaps first. This is the most practical way to introduce coverage requirements into a repository that has historically had low coverage — start enforcing for new code, track the aggregate separately, and let the legacy coverage improve organically over time.
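Wiring that number into your own tooling means parsing the API response. A sketch — the response shape below is an assumption based on the Sonar web API (an array of measures, each with a dated value history); verify it against your instance before relying on it:

```typescript
// Assumed measures/search_history response shape — confirm against
// your SonarCloud instance.
interface HistoryPoint { date: string; value?: string }
interface Measure { metric: string; history: HistoryPoint[] }
interface SearchHistoryResponse { measures: Measure[] }

// Latest new_coverage value as a number, or null if unavailable.
function latestNewCoverage(resp: SearchHistoryResponse): number | null {
  const measure = resp.measures.find((m) => m.metric === "new_coverage");
  if (!measure || measure.history.length === 0) return null;
  const last = measure.history[measure.history.length - 1];
  return last.value != null ? Number(last.value) : null;
}
```

Returning null rather than 0 matters here: "no coverage data" and "0% coverage" call for different responses from a quality gate.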

The comparison with Codecov is worth being direct about: SonarCloud adds code smell and security hotspot detection that Codecov does not have. Codecov has better per-flag team segmentation and more granular per-file PR comment detail. For pure coverage-as-deploy-signal use cases, either works. For teams that want to correlate coverage with security posture and code quality in a single dashboard, SonarCloud is the stronger platform.

Coverage as a deploy risk signal

The correlation between low test coverage and production incidents is empirically documented. Repositories with less than 40% line coverage show approximately three times the change failure rate of repositories above 70% coverage, holding other variables constant. But the aggregate repository coverage number is a weak signal — the file-level combination of coverage and churn is much stronger.

The hotspot formula is straightforward: low coverage multiplied by high churn equals highest-risk files. A file that is modified frequently (high churn over the last 30 days) and has low test coverage is the file most likely to introduce a production regression when changed. The coverage data tells you the safety net is thin; the churn data tells you the file is being changed often enough that the thin safety net matters.
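The formula is small enough to sketch directly. The names and the exact scoring expression here are illustrative, not Koalr's actual model:

```typescript
// Illustrative hotspot score: churn amplifies the cost of missing coverage.
interface FileStats {
  path: string;
  coverage: number; // covered fraction, 0..1
  churn30d: number; // commits touching the file in the last 30 days
}

// commits × uncovered fraction — a stable, well-tested file scores near zero.
const hotspotScore = (f: FileStats): number => f.churn30d * (1 - f.coverage);

function topHotspots(files: FileStats[], n: number): FileStats[] {
  return [...files]
    .sort((a, b) => hotspotScore(b) - hotspotScore(a))
    .slice(0, n);
}
```

With 28% coverage and 14 changes in 30 days, src/billing/stripe.ts scores 14 × 0.72 ≈ 10.1, far above a well-tested utility file — exactly the ranking the workflow below relies on.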

The practical deployment workflow: before merging a pull request, check whether the changed files are in the high-churn, low-coverage zone. A PR that touches src/billing/stripe.ts — which has 28% coverage and has been modified 14 times in the last 30 days — warrants scrutiny that a PR touching a stable, well-tested utility file does not. This is coverage as a contextual signal, not as an aggregate number.

Coverage delta on changed files is the seventh factor in Koalr's deploy risk score. A PR that drops coverage by more than a configured threshold on a high-churn file triggers an elevated risk score, which surfaces in the deploy risk panel before the PR is merged. The signal is not “org coverage is 78%” — it is “this specific PR, touching these specific files, is a risk because coverage just dropped and this code changes frequently.”

Setting meaningful coverage thresholds

Not all code needs equal coverage. Applying a single org-wide threshold treats the payment processing module identically to an auto-generated GraphQL schema file — which makes coverage requirements feel arbitrary and produces the wrong incentives. A more principled approach tiers thresholds by code criticality:

  • Payments, auth, and security-critical paths — 90%+ branch coverage, enforced as a hard block on merge. These are the paths where production bugs have the highest consequence and where the cost of fixing a coverage gap is lowest relative to the cost of a production incident.
  • Core business logic — 80%+ line coverage, enforced as a block on merge. The default Codecov quality gate threshold is reasonable here.
  • UI components and presentation layer — 60%+, warn on drop. Visual components are harder to test meaningfully and lower risk for most failure modes.
  • Generated code, migrations, and config — excluded from coverage requirements entirely. These paths are either deterministically correct (generated code) or tested differently (migrations run against a real database).

Codecov supports per-directory threshold configuration through its codecov.yml file, allowing you to specify different thresholds for different path patterns. SonarCloud supports per-project quality gate configuration at the organization level.
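A tiered codecov.yml might look like the following — a sketch, with illustrative paths to adjust for your repo layout; note that Codecov targets apply to its reported coverage metric, so the 90% branch-coverage requirement for critical paths still needs branch coverage enabled in your test runner's report:

```yaml
# codecov.yml — tiered status checks by path (illustrative paths)
coverage:
  status:
    project:
      payments:
        target: 90%
        paths:
          - "src/billing/"
          - "src/auth/"
      default:
        target: 80%
        threshold: 1%        # allow a 1% drop before failing the check
    patch:
      default:
        target: 80%          # new code in the PR must hit 80%
ignore:
  - "**/*.generated.ts"      # generated code: excluded entirely
  - "migrations/"            # tested against a real database instead
```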

For teams starting from zero — with no existing coverage infrastructure and low aggregate coverage — the right approach is to ignore the aggregate number initially. Identify the five to ten files that have the highest churn and the highest consequence on failure. Write tests for those files first. The aggregate coverage will be low; the risk profile will improve immediately. Blanket percentage targets, approached from zero, produce tests written to satisfy the number rather than to protect the code. Critical-path-first produces tests that protect deployments from the start.

Koalr integrates Codecov and SonarCloud — see coverage risk in your deploy scores

Connect your Codecov or SonarCloud account and Koalr will surface coverage delta on every pull request, flag low-coverage high-churn hotspots, and incorporate coverage signals into your deploy risk score — before the PR is merged.