Two Questions, One Confused Dashboard
Engineering metrics platforms typically offer two types of risk views. The first shows portfolio-level health: are sprints slipping? Is cycle time trending up? Is throughput declining? The second — far less common — shows per-deploy safety: is this specific pull request likely to cause an incident when it ships?
These are not the same question. The first is delivery risk. The second is deployment risk. Treating them as interchangeable leaves engineering leaders with a dashboard full of lagging indicators and no early warning system for the next outage.
What Delivery Risk Measures
Delivery risk is portfolio-level and time-lagged. It answers: are we on track to deliver what we committed to, by when? The signals it tracks — cycle time, sprint velocity, throughput, WIP limits, PR merge rate — are aggregated across dozens of engineers and weeks of work. They are excellent for capacity planning, roadmap confidence, and quarterly planning conversations with product.
But delivery risk metrics are inherently retrospective. By the time cycle time trends upward enough to register as a risk signal, the structural problems causing it — growing code ownership gaps, accumulating technical debt, review bottlenecks — have already been present for weeks. Delivery risk tells you the org is slowing down. It cannot tell you which deploy tonight will take down production.
What Deployment Risk Measures
Deployment risk is per-deploy and predictive. It answers: given everything we know about this specific pull request — who wrote it, what they changed, how it was reviewed, what the service's current incident history looks like — what is the probability this deploy causes a failure?
The signals that predict deployment failures are not the same as the signals that predict delivery failures. Academic research across Google, Microsoft, and Mozilla codebases has identified more than 35 per-deploy factors with documented predictive power. The strongest include:
| Signal | What It Captures | Source |
|---|---|---|
| DDL migrations in PR | Schema changes that interact with live traffic | Kim et al., MSR 2008 |
| Historical failure rate | Prior incident frequency for this service | Kim & Whitehead, MSR 2008 |
| Change entropy | Dispersion of changes across subsystems | Hassan, ICSE 2009 |
| SLO error budget burn rate | System health at deploy time | Google SRE Book, 2016 |
| Author file expertise | Author familiarity with changed files | Bird et al., FSE 2011 |
Notice what these signals have in common: they are all per-deploy. Change entropy measures how spread across subsystems a single PR is. Author file expertise scores how familiar the committer is with the specific files they touched. DDL detection flags schema migrations in this PR. None of these can be measured at the portfolio level — they only exist at the individual deploy level.
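To make the per-deploy nature of these signals concrete, here is a minimal sketch of one of them: change entropy, computed as normalized Shannon entropy over how a single PR's changes are distributed across subsystems (following the idea in Hassan, ICSE 2009; the exact normalization and the subsystem granularity here are illustrative choices, not a reference implementation).

```python
import math

def change_entropy(changes_by_subsystem: dict) -> float:
    """Normalized Shannon entropy of one PR's changes across subsystems.

    `changes_by_subsystem` maps subsystem name -> lines changed in that
    subsystem for this PR (a hypothetical input shape). Returns 0.0 when
    all changes sit in one subsystem, 1.0 when changes are spread evenly
    across every touched subsystem.
    """
    total = sum(changes_by_subsystem.values())
    if total == 0:
        return 0.0
    probs = [n / total for n in changes_by_subsystem.values() if n > 0]
    if len(probs) < 2:
        return 0.0  # a focused PR: no dispersion
    entropy = -sum(p * math.log2(p) for p in probs)
    # Divide by the maximum possible entropy so scores compare across PRs
    return entropy / math.log2(len(probs))

# A focused PR: every changed line in one subsystem -> 0.0
print(change_entropy({"billing": 120}))
# A scattered PR: changes spread evenly over four subsystems -> 1.0
print(change_entropy({"billing": 30, "auth": 30, "api": 30, "ui": 30}))
```

Note that the input is a single PR's diff: there is no way to aggregate this to a team or quarter without destroying the signal.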
The 4 AM Problem
Here is the operational difference between the two concepts: delivery risk signals are useful in sprint planning; deployment risk signals are useful at 4 AM.
When an on-call engineer gets paged at 4 AM, the questions on their mind are not “is our cycle time trending up?” They want to know: what changed? Which deploy is most likely to have caused this? Was this deploy flagged as high-risk before it shipped?
Delivery risk dashboards are silent at 4 AM. A deployment risk score, computed before the deploy went out, is the first artifact an engineer should reach for during incident triage. If a high-risk deploy happened 20 minutes before the incident started, that is not a coincidence — it is a hypothesis.
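The triage step described above can be sketched as a simple query over a deploy log: pull every deploy inside a lookback window before the incident and rank by the risk score that was computed before it shipped. The field names (`sha`, `deployed_at`, `risk_score`) and the 60-minute default are assumptions; adapt them to your deploy log schema.

```python
from datetime import datetime, timedelta

def triage_candidates(incident_start, deploys, window_minutes=60):
    """Return deploys in the lookback window before an incident,
    highest pre-computed risk score first.

    `deploys` is a list of dicts with hypothetical keys 'sha',
    'deployed_at' (datetime), and 'risk_score' (float in [0, 1]).
    """
    window_start = incident_start - timedelta(minutes=window_minutes)
    candidates = [
        d for d in deploys
        if window_start <= d["deployed_at"] <= incident_start
    ]
    return sorted(candidates, key=lambda d: d["risk_score"], reverse=True)

# Example: an incident at 4:00 AM; two deploys in the prior hour,
# one older deploy outside the window.
incident = datetime(2024, 1, 1, 4, 0)
deploys = [
    {"sha": "a1", "deployed_at": datetime(2024, 1, 1, 3, 40), "risk_score": 0.8},
    {"sha": "b2", "deployed_at": datetime(2024, 1, 1, 3, 50), "risk_score": 0.2},
    {"sha": "c3", "deployed_at": datetime(2024, 1, 1, 1, 0), "risk_score": 0.9},
]
for d in triage_candidates(incident, deploys):
    print(d["sha"], d["risk_score"])  # a1 first: in-window and highest risk
```

The point is that the ranking is only possible because the risk score already existed at deploy time; nothing in a delivery dashboard can be joined against an incident timestamp this way.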
The CFR Connection
DORA's Change Failure Rate (CFR) is often discussed as a delivery metric. But it sits at the intersection of both concepts: it is measured per-deploy (which deploys caused incidents?), but it is reported as a portfolio trend (what percentage of deploys failed this quarter?).
The difference between delivery risk tooling and deployment risk tooling is this: delivery risk platforms can measure CFR after the fact. Deployment risk platforms can predict which individual deploys will contribute to CFR before they ship. The goal is to intervene at the PR level — route high-risk deploys to senior reviewers, require additional sign-off, notify the on-call — rather than accept CFR as a lagging outcome and count incidents afterward.
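The measure-versus-predict distinction fits in a few lines of code. The first function is the retrospective portfolio view; the second is a pre-ship intervention hook. The `caused_incident` field, the 0.7 threshold, and the policy names are all illustrative assumptions, not a prescribed configuration.

```python
def change_failure_rate(deploys):
    """Retrospective CFR: fraction of shipped deploys later linked to an
    incident. `deploys` is a list of dicts with a hypothetical boolean
    'caused_incident' key. This can only be computed after the fact."""
    if not deploys:
        return 0.0
    return sum(d["caused_incident"] for d in deploys) / len(deploys)

def pre_ship_gate(pr_risk_score, threshold=0.7):
    """Predictive intervention: act on a PR's risk score before it merges,
    rather than counting its failure afterward. Threshold is illustrative."""
    if pr_risk_score >= threshold:
        return "require-senior-review"
    return "auto-approve"

history = [
    {"caused_incident": True},
    {"caused_incident": False},
    {"caused_incident": False},
    {"caused_incident": False},
]
print(change_failure_rate(history))  # 0.25 — a lagging quarterly number
print(pre_ship_gate(0.85))           # intervenes before the deploy exists in CFR
```

Both functions consume per-deploy records, but only the second runs early enough to change the outcome.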
Why Delivery Risk Metrics Cannot Predict Incidents
The fundamental problem with using delivery metrics to predict incidents is resolution mismatch. A deploy happens in minutes. Delivery metrics aggregate over days or weeks. By the time cycle time signals a problem, hundreds of deploys have already gone out.
There is also a causation issue. High cycle time correlates with risk, but it does not tell you which specific deploy to worry about today. Two PRs with identical cycle time numbers can have radically different incident probabilities: one touches three files in a stable service with 95% test coverage; the other touches a payment-critical path that has had three incidents in 60 days, was written by someone who has never touched that subsystem before, and includes an ALTER TABLE migration.
Delivery metrics cannot distinguish these two PRs. Deployment risk signals can.
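A toy scoring function makes the contrast between those two PRs explicit. The signal names mirror the table above; the weights are invented for illustration and are in no way research-calibrated — a real model would be fit against your own incident history.

```python
def deploy_risk_score(signals, weights=None):
    """Toy weighted sum over per-deploy signals, each normalized to [0, 1].
    Weights are illustrative placeholders, not calibrated values."""
    weights = weights or {
        "has_ddl_migration": 0.30,     # schema change in this PR
        "recent_incidents_norm": 0.30, # service incidents in last 60d, capped
        "author_unfamiliarity": 0.25,  # 1 - author file expertise
        "low_test_coverage": 0.15,     # 1 - test coverage fraction
    }
    return sum(w * signals.get(k, 0.0) for k, w in weights.items())

# The two PRs from the scenario above: identical cycle time, very
# different per-deploy signals.
stable_pr = {
    "has_ddl_migration": 0.0, "recent_incidents_norm": 0.0,
    "author_unfamiliarity": 0.1, "low_test_coverage": 0.05,
}
payments_pr = {
    "has_ddl_migration": 1.0, "recent_incidents_norm": 1.0,
    "author_unfamiliarity": 1.0, "low_test_coverage": 0.4,
}
print(round(deploy_risk_score(stable_pr), 3))
print(round(deploy_risk_score(payments_pr), 3))
```

Cycle time appears nowhere in the inputs, which is exactly the point: the signals that separate these two PRs do not exist at the portfolio level.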
What to Look for in a Deployment Risk Tool
When evaluating engineering platforms that claim to measure deployment risk, there are four questions to ask:
- Is the score per-deploy or per-team? A team-level risk score is a delivery metric with different branding. True deployment risk scoring must produce a distinct score for each PR before it merges.
- Which signals power the score? Any tool can label a number “risk score.” Ask for the specific signals and their documented predictive power. If the answer is “our proprietary algorithm,” ask for the underlying research backing it.
- Does it integrate incident data? Historical failure rate is one of the strongest deployment risk signals. A deployment risk tool that does not ingest PagerDuty, OpsGenie, Sentry, or Rollbar data is missing its most predictive input.
- Can engineers act on it before the deploy? A score surfaced post-deploy in a dashboard is better than nothing but still lagging. The highest-value implementation surfaces risk scores in the PR itself — where engineers can slow down, add reviewers, or split the PR before anything reaches production.
You Need Both
This is not an argument that delivery risk is unimportant. Engineering leaders need both views. Portfolio-level delivery metrics tell you whether the org is healthy and whether commitments are at risk. Per-deploy deployment risk scoring tells you whether tonight's release is safe to ship.
The problem is that most platforms optimize heavily for delivery risk — the metrics are easier to compute, easier to explain to executives, and do not require integrating deeply with the deploy pipeline. Deployment risk requires GitHub integration, incident data, test coverage data, and a scoring model backed by actual research. It is harder to build, which is why most platforms skip it.
But the value is asymmetric. A well-calibrated deployment risk score, surfaced to the right engineer 30 minutes before a high-risk deploy, can prevent an incident entirely. A delivery risk dashboard, no matter how polished, can only tell you the incident happened too many times last quarter.