7 signals that predict deployment failures before they happen

Why deployment risk prediction is different from DORA

The DORA framework, for all its value, is a rearview mirror. It tells you the aggregate story of your delivery performance over the last 30, 60, or 90 days. That story is genuinely useful — you need baseline measurements before you can improve. But no amount of DORA dashboard scrutiny will prevent the Friday afternoon deploy that takes down payments for three hours.

Deploy risk prediction is a different class of tool. Instead of asking "how did we do last month?" it asks "is this specific change, right now, likely to cause an incident?" The research basis for this question is substantial: academic work on software defect prediction spans two decades and dozens of peer-reviewed papers. The Microsoft Research team (Nagappan, Zimmermann, Bird, and others) demonstrated repeatedly that properties measurable from the version control graph — change size, author history, file churn rates — predict post-release defects with statistically significant accuracy.

The SRE community arrived at similar conclusions empirically. Talk to any experienced SRE and they will describe the same pattern: every major incident that crosses their postmortem desk had traceable warning signals that existed before the merge. The deployment that caused the incident was usually large, touched files outside the author's normal territory, was merged with minimal review, and often happened on a Friday afternoon. None of that was unobservable — it just was not being observed systematically.

Seven signals, specifically, have proven most predictive. They are all computable from data your team already generates: GitHub diffs, commit history, PR review records, test coverage reports, and deployment timestamps.

The seven signals

1. Change size (lines added + deleted)

Change size is the oldest and most studied signal in defect prediction research. The relationship is intuitive but the strength of it surprises most teams: larger diffs contain more surface area for bugs, more opportunities for merge conflicts to introduce subtle errors, and more cognitive load for reviewers — which means reviews of large diffs are systematically less thorough even when they take longer.

The raw line count is a coarse measure, but it is a robust one. Nagappan and Ball's foundational 2005 Microsoft study found that lines changed per module was among the top predictors of post-release defect density. Subsequent work at Google and Facebook has consistently replicated this finding across different codebases and languages.

The practical threshold: changes under 200 lines of non-test code carry low structural risk from size alone. Changes in the 200–500 line range warrant attention. Above 500 lines, the probability of introducing a defect that passes code review starts rising meaningfully. Above 1,000 lines, reviewer effectiveness research suggests the review is likely to miss at least one significant issue regardless of reviewer quality.

Data source: GitHub Pull Requests API — additions + deletions fields on the PR object. Koalr excludes auto-generated files (lockfiles, generated code, migrations) from the signal calculation to avoid false positives on dependency updates.

2. Files changed and directory spread

The number of distinct files changed, and specifically the number of distinct top-level directories or packages touched, is a proxy for blast radius. A change that modifies 50 files across three services introduces coupling risk — the deployed change affects multiple independent systems, any of which can fail in ways that interact unexpectedly.

This signal has a counter-intuitive property that many teams miss: a single-file change with 5,000 lines modified is often lower risk than a 50-file change with 100 lines each. The large single-file change is probably a refactor of one well-understood component. The 50-file change almost certainly crosses service or module boundaries, and each boundary crossing is an opportunity for a contract violation that tests may not catch.

In monorepos specifically, this signal is highly predictive. Cross-package changes in a monorepo fail more frequently than same-package changes of equivalent line count — because the package boundary is where interface contracts live, and contract violations are notoriously hard to catch before integration. Koalr uses the repository's package structure (package.json locations, CODEOWNERS team boundaries) to determine whether a change is cross-boundary.

3. Author expertise in changed files

Every file in your codebase has an implicit ownership history: the set of developers who have committed to it, how many times, and how recently. A developer making their first-ever commit to the payment processing service's core billing logic is in unfamiliar territory — even if they are a senior engineer with strong overall credentials.

The research on this is unambiguous. Mockus and Herbsleb (2002) found that "experience with the changed code" — measured by prior commits to the same files — was a significant predictor of post-integration defects. Bird et al.'s subsequent work at Microsoft confirmed the finding: developers modifying files outside their regular commit territory introduced defects at a substantially higher rate than developers modifying files they had touched recently and frequently.

Koalr calculates an expertise score per (author, file path) pair based on commit count, recency decay, and relative ownership share. A developer with 40 commits to a file in the last 90 days scores near 1.0 on that file. A developer making their first commit to that file scores 0.0. The PR-level expertise signal is the mean expertise score across all files the PR touches — weighted by the proportion of changes in each file.

This signal is particularly valuable for identifying "unfamiliar territory" risk in cases where the author is experienced overall. A principal engineer touching a service they have never worked in is higher risk on this signal than a junior developer modifying code they own.

4. Deployment timing (day-of-week and hour)

Friday deploys have become a cultural meme in software engineering — but the data behind the joke is real. Post-deployment incident rates follow clear temporal patterns that compound with other risk signals.

The mechanism is not mysterious: late-week, late-day deployments reduce the available response window. An incident that starts at 4:45pm on Friday has two strikes against it before a single alert fires — reduced on-call coverage and a weekend ahead. The engineers who authored the change are less likely to be available. Incident resolution time (MTTR) for Friday afternoon deployments is empirically longer than for equivalent incidents at other times.

The data from Koalr's incident attribution analysis across production deployments shows that failure rates for deployments after 4pm on Fridays run approximately 3x higher than the baseline rate for Tuesday and Wednesday morning deployments. The Thursday-afternoon window is a secondary peak — end-of-sprint pressure produces a cluster of large, last-minute PRs that accumulate risk from multiple signals simultaneously.

Data source: GitHub deployment event timestamps, correlated with the repository's timezone (inferrable from committer timezone data or organization settings). Koalr normalizes deployment timestamps to the engineering team's primary timezone for accurate timing signal calculation.

The timing multiplier effect

Timing is not a standalone risk signal — it is a multiplier. A medium-risk deployment (score 45) at 10am Tuesday is manageable. The same deployment at 5pm Friday becomes high-risk (score 71) purely from the timing penalty, because the consequences of failure are more severe and recovery is slower. Koalr applies timing as a weighted multiplier on the combined score from the other six signals.

5. Review coverage and thoroughness

Code review is one of the strongest quality gates in a software delivery pipeline — but the protective effect varies dramatically with how review is conducted. Not all approvals are equal. A PR approved by the author's direct teammate who glanced at the diff for 90 seconds is a materially different artifact than a PR approved after substantive back-and-forth between reviewers from different functional teams.

The review coverage signal has three components. First: was the PR reviewed at all before merge? PRs merged without any approvals are the clearest single risk indicator in the dataset — they skip the primary human quality gate entirely. Second: how many reviewers approved? Single-reviewer approvals catch fewer issues than multi-reviewer approvals; the marginal benefit of a second reviewer is larger than the marginal benefit of a third. Third: are the reviewers diverse? A PR reviewed and approved only by members of the same sub-team that authored it provides weaker guarantees than one reviewed by someone from outside that ownership boundary.

Koalr computes review coverage as a composite of these three factors, weighted by change size — a one-line fix merged without review is a different risk profile from a 500-line feature merged without review. CODEOWNERS data provides the team-boundary context for the reviewer diversity component: if all approvers own the same files as the author, the cross-team review signal is absent.

6. Historical failure rate of changed files

Files that have caused incidents before are more likely to cause incidents again. This is one of the most practically powerful signals in the model — and one that pure code metrics approaches miss entirely, because it requires connecting deployment history to incident history at the file path level.

The mechanism behind this signal has multiple explanations. Some files are inherently complex — they implement critical business logic, handle edge cases in external integrations, or sit at architectural decision boundaries where changes are disproportionately likely to surface latent assumptions. Some files are complex because of accumulated technical debt — years of "just add another case here" have produced code that is difficult to reason about and easy to break accidentally. Others are high-failure because the team lacks clear ownership, leading to many different developers making changes without deep context.

Whatever the underlying cause, the empirical signal is clear: if your team has had three incidents in the last 12 months that were attributed to deployments touching src/payments/charge.ts, that file carries elevated risk for the next change. Koalr attributes incidents to specific files by correlating the incident timestamp with the deployment that preceded it, then fingerprinting which files were changed in that deployment.

Anti-pattern this signal frequently reveals: teams repeatedly deploying to the same handful of files that keep failing. The deploy risk score surfaces this pattern explicitly — "5 of the files in this PR have been in incident-causing deployments in the last 90 days" — making the risk visible rather than tacit.

7. Test coverage delta

The seventh signal is also the one most commonly absent from other risk models, which is part of what makes it valuable for Koalr: a PR that significantly reduces test coverage on the files it changes is a meaningful warning signal for post-deployment incidents.

The intuition is straightforward — if you are modifying code and removing tests in the same operation, you are reducing your confidence in the correctness of the modified code at exactly the moment when you should be increasing it. A coverage drop of more than 5% on changed files is associated with meaningfully higher incident rates in subsequent deployments.

The data source is the per-PR coverage report from your coverage tool: Codecov, SonarCloud, Coveralls, or a custom lcov pipeline. Most coverage platforms provide a PR-level diff that shows coverage change on changed files specifically — not aggregate repository coverage, which can mask regressions in critical paths. Koalr ingests the PR-level coverage delta from Codecov's status checks API to compute this signal.

The signal requires coverage data to be present — a PR with no coverage report gets a neutral score on this dimension rather than a negative one, since many valid deployment contexts (infrastructure changes, documentation, config updates) do not have meaningful coverage metrics. But for teams with coverage pipelines in place, this signal adds substantial predictive power that the other six signals cannot provide.

How Koalr combines these signals into a 0–100 score

The seven signals are not simply summed. Each signal contributes differently depending on the context, and the signals interact in ways that matter for accurate prediction. A large change by an experienced author with strong review coverage and no coverage regression is a different risk profile from the same large change by a first-time contributor merged without review on a Friday afternoon.

Koalr computes a weighted composite score using a logistic regression model trained on historical deployment outcomes — specifically, the binary outcome of whether a deployment resulted in an incident within a defined attribution window. The model learns signal weights from the organization's own deployment history as it accumulates, which means the score becomes more accurate over time and increasingly reflects the specific risk patterns of your codebase rather than generic industry averages.

The output is a 0–100 integer score, displayed on every open PR in Koalr's PR dashboard and available as a GitHub status check. Score interpretation:

0–30

Low risk

Proceed with normal process. Small, well-reviewed changes by experienced authors in owned files.

31–65

Elevated risk

Worth a second look. Review the specific signals flagged — additional reviewer, coverage check, or timing adjustment may reduce the score.

66–100

High risk

Strong signals present. Consider decomposing the PR, adding tests, or deferring to a lower-risk deploy window.

The score deliberately avoids binary pass/fail semantics. A score of 72 is not a deploy block — it is a conversation starter. It surfaces the specific signals driving the score ("large diff, first-time author in 4 files, Friday 4pm") and leaves the merge decision with the engineer and their team.

Putting it in practice: what to do with a high risk score

A high deploy risk score surfaced before merge gives you three productive options. The right one depends on the nature of the signals driving the score.

Option 1: Decompose the PR. If the score is driven by change size and file spread, the most effective intervention is splitting the PR into smaller, more focused units. A 900-line PR touching authentication, billing, and the email service simultaneously almost always has a natural decomposition — the authentication changes can ship independently of the billing changes. Smaller PRs score lower, merge faster, and are easier to roll back if something goes wrong.

Option 2: Add reviewers. If the score is driven by review coverage — single reviewer, no cross-team review, or a reviewer with no ownership history in the changed files — the intervention is targeted: add a reviewer who owns the affected code. Koalr surfaces the suggested reviewers based on ownership history in the changed files, drawn from your CODEOWNERS file and commit history.

Option 3: Increase test coverage first. If the coverage delta signal is a primary contributor — the PR drops coverage by 8% on changed files — the cleanest resolution is writing the tests before merging. This is the intervention with the longest lead time but the most durable risk reduction. A PR that restores coverage to the pre-change level is removing a signal entirely rather than accepting the risk.

Don't gate all high-risk deploys automatically

Blocking every PR above a score threshold is tempting but counterproductive. It creates friction that engineers route around — by splitting PRs in ways that preserve risk while gaming the score, or by disabling the check entirely. The right model is friction-in-the-right-places: surface the score prominently, explain the signals, and let the team decide. Teams that use risk scores transparently improve their actual change failure rate faster than teams that use them as hard gates.

The goal is not zero high-risk deploys. Some high-risk changes are necessary — emergency security patches, critical infrastructure updates, time-sensitive product launches. The goal is that when a high-risk change ships, it does so with eyes open: the team has seen the score, understood the signals, made a conscious decision to proceed, and ideally has a rollback plan ready.

Teams that adopt this practice consistently report that the risk score changes the pre-merge conversation. Instead of "LGTM" or silence, there is a moment of shared situational awareness — "this one scores 78, the big driver is that we're touching five files where we had incidents last quarter, let's add Sarah as a reviewer." That is the behavior the score is designed to enable.

For a deeper look at how deploy risk scoring works in the Koalr platform — including the PR-level signal breakdown and GitHub status check integration — see the deploy risk feature page or try the deploy risk calculator to see how your current open PRs would score.