What Is Blast Radius in Software Deployments (And How to Model It)
In a microservices architecture, no service is an island. When payments-service goes down, the checkout flow breaks. When auth-service degrades, every service behind it stalls. Blast radius is the measure of how far that failure travels — and most teams only find out after the incident. Here is how to model it before you merge.
What blast radius means in software
The term comes from military engineering, where it describes the area of effect of an explosion. In software, it describes the set of services, users, or systems that are affected when a specific component fails.
In a monolith, blast radius is simple: one deployment, one failure surface. In a microservices architecture, it is a graph problem. Every service has callers that depend on it and dependencies that it calls. A failure propagates from a service to its callers along those edges, and the question is how far and how severely.
Blast radius modeling answers two questions before you deploy:
- If this service fails, which downstream services are at risk?
- How severe is the risk at each hop in the dependency chain?
The service dependency graph
To model blast radius, you first need a service dependency graph: a directed graph where nodes are services and edges represent call relationships. If Service A calls Service B, the graph has a directed edge from A to B. Failure travels against that edge: if B fails, A is affected.
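As a minimal sketch of that structure, the snippet below stores call edges as (caller, callee) pairs and builds a reverse index, so that the services affected by a failure can be looked up directly. The service names and the `build_callers_index` helper are illustrative assumptions, not part of any particular tool.

```python
from collections import defaultdict

# Hypothetical call edges, written as (caller, callee).
CALLS = [
    ("checkout-service", "payments-service"),
    ("checkout-service", "auth-service"),
    ("payments-service", "auth-service"),
]

def build_callers_index(calls):
    """Map each service to the set of services that call it directly."""
    callers = defaultdict(set)
    for caller, callee in calls:
        callers[callee].add(caller)
    return callers

callers = build_callers_index(CALLS)
# If auth-service fails, these direct callers are affected first.
print(sorted(callers["auth-service"]))
```

Indexing by callee rather than caller is a deliberate choice here: blast radius questions start from the failing service and ask who depends on it.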
In practice, this graph comes from three sources:
1. Declared dependencies: service manifests, API gateway configurations, or service mesh topology
2. Observed traffic: actual call patterns from your observability stack (Datadog APM, Jaeger, OpenTelemetry traces)
3. Incident history: which services actually went down together in past incidents, capturing implicit dependencies that never make it into manifests
The incident history source is underused. Two services that consistently co-fail even without a declared dependency often share a database, a Redis cluster, or a third-party API. Historical co-failure is a more honest dependency map than any documentation.
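One way to surface such hidden dependencies is to count how often pairs of services appear in the same incident and flag frequent pairs that have no declared edge. The sketch below assumes incidents are available as plain lists of affected service names. The `min_count` cutoff and the sample data are hypothetical.

```python
from collections import Counter
from itertools import combinations

def co_failure_counts(incidents):
    """Count how many incidents each unordered pair of services shared."""
    counts = Counter()
    for services in incidents:
        for pair in combinations(sorted(set(services)), 2):
            counts[pair] += 1
    return counts

def candidate_edges(incidents, declared, min_count=2):
    """Return co-failing pairs seen at least min_count times with no declared edge."""
    counts = co_failure_counts(incidents)
    return sorted(
        pair for pair, n in counts.items()
        if n >= min_count and pair not in declared
    )

# Hypothetical incident history and declared (undirected, sorted) pairs.
incidents = [
    ["orders-service", "inventory-service"],
    ["orders-service", "inventory-service", "auth-service"],
    ["auth-service", "checkout-service"],
]
declared = {("auth-service", "checkout-service")}
print(candidate_edges(incidents, declared))
```

Co-failure only suggests a shared dependency; the flagged pairs are candidates for a human to investigate, not confirmed edges.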
How risk propagates hop-by-hop
Once you have the graph, you can propagate risk through it. The model Koalr uses works like this:
The propagation formula
propagated_score = source_score × decay_factor^hop_count × coupling_weight
- source_score — the deploy risk score of the service being deployed (0–100)
- decay_factor — typically 0.6–0.75, meaning risk attenuates with each hop
- hop_count — number of edges from the failing service to the affected service
- coupling_weight — 0.0–1.0 representing how tightly coupled the services are (call frequency, shared data stores, SLA dependency)
So if payments-service has a risk score of 80, a Hop 1 neighbor with tight coupling (0.9) and a decay factor of 0.7 would receive a propagated score of 80 × 0.7¹ × 0.9 = 50.4. A Hop 2 neighbor with loose coupling (0.3) would receive 80 × 0.7² × 0.3 = 11.76, or about 11.8.
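The formula translates directly into a small function. This is a sketch of the formula as stated above rather than Koalr's implementation; the input validation is an added assumption based on the ranges listed for each term.

```python
def propagated_score(source_score, decay_factor, hop_count, coupling_weight):
    """Apply source_score * decay_factor ** hop_count * coupling_weight."""
    if not 0 <= source_score <= 100:
        raise ValueError("source_score must be in [0, 100]")
    if not 0.0 <= coupling_weight <= 1.0:
        raise ValueError("coupling_weight must be in [0.0, 1.0]")
    if hop_count < 0:
        raise ValueError("hop_count must be non-negative")
    return source_score * decay_factor ** hop_count * coupling_weight

# The two worked examples from the text, rounded to one decimal place.
hop1 = round(propagated_score(80, 0.7, 1, 0.9), 1)
hop2 = round(propagated_score(80, 0.7, 2, 0.3), 1)
print(hop1, hop2)
```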
The exponential decay matters because in real systems, failures do attenuate. Circuit breakers, retries, graceful degradation, and feature flags all reduce propagation. Linear propagation models overstate downstream risk and produce alert fatigue.
Why Hop 1 neighbors matter most
In practice, the engineering insight from blast radius modeling comes almost entirely from Hop 1: the direct callers of the service being deployed. These are the services with no buffer: if your service returns 500s, they see 500s. If your service adds 200ms of latency, every synchronous call they make to it takes 200ms longer.
What makes this useful pre-merge is that Hop 1 callers are often owned by different teams. The payments-service team may not know that recommendation-service has added a synchronous call into their checkout path in the last sprint. Blast radius visualization surfaces that dependency, along with the on-call contact for the downstream team, before the deploy window opens.
At Hop 2 and beyond, the signal-to-noise ratio drops. By the time risk has propagated two or three hops through circuit breakers and independent retry budgets, the propagated score is usually low enough to be informational rather than actionable.
Blast radius as a deploy gate
The most direct use of blast radius modeling is as a pre-merge gate. If the source service has a risk score above a threshold and the Hop 1 propagated score for any downstream service also exceeds a threshold, the deploy is flagged for human review.
This is more useful than gating on source risk alone because it surfaces the systemic exposure. A moderate-risk change to a service with 12 tight downstream callers is more dangerous than a high-risk change to an isolated service with no callers. The source risk score tells you one thing; the blast radius tells you the system-level consequence.
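The gate condition described above can be sketched as a single predicate. The threshold values, the default decay factor, and the `needs_review` name are all hypothetical choices for illustration; real thresholds would be tuned per organization.

```python
def needs_review(source_score, hop1_couplings, decay_factor=0.7,
                 source_threshold=60, propagated_threshold=40):
    """Flag a deploy when both the source score and any Hop 1 score exceed thresholds.

    hop1_couplings maps each direct caller to its coupling weight (0.0 to 1.0).
    """
    if source_score <= source_threshold:
        return False
    return any(
        source_score * decay_factor * weight > propagated_threshold
        for weight in hop1_couplings.values()
    )

# A risky change with one tightly coupled caller is flagged.
print(needs_review(80, {"checkout-service": 0.9}))
# The same change with only a loosely coupled caller is not.
print(needs_review(80, {"reporting-service": 0.3}))
```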
Concretely, the workflow looks like this:
1. PR opens. Koalr calculates the deploy risk score for the source service (0–100, 36-signal model).
2. Blast radius is computed by walking the dependency graph from the source service, calculating propagated scores at each hop.
3. If any downstream service exceeds the propagated risk threshold, the PR is flagged and the downstream team owner is notified.
4. The PR author and downstream owners see the full hop-by-hop breakdown before approving the merge.
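Step 2 of the workflow above can be sketched as a breadth-first walk from the source service to its callers. Two assumptions are made here that the text does not specify: each service is scored at its shortest hop distance, and the coupling weight applied is that of the edge through which the service is first reached. The graph contents and the `max_hops` limit are hypothetical.

```python
from collections import deque

def blast_radius(callers, source, source_score, decay_factor=0.7, max_hops=3):
    """Walk callers breadth-first and return {service: (hop, propagated_score)}.

    callers maps a service to {caller_name: coupling_weight}.
    """
    results = {}
    seen = {source}  # guards against cycles in the graph
    queue = deque([(source, 0)])
    while queue:
        service, hops = queue.popleft()
        if hops == max_hops:
            continue
        for caller, weight in callers.get(service, {}).items():
            if caller in seen:
                continue
            seen.add(caller)
            hop = hops + 1
            score = source_score * decay_factor ** hop * weight
            results[caller] = (hop, round(score, 1))
            queue.append((caller, hop))
    return results

# Hypothetical graph: checkout calls payments, web-frontend calls checkout.
graph = {
    "payments-service": {"checkout-service": 0.9},
    "checkout-service": {"web-frontend": 0.3},
}
radius = blast_radius(graph, "payments-service", 80)
print(radius)
```

The per-service results can then be compared against the propagated risk threshold from step 3 to decide who gets notified.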
The dependency graph maintenance problem
The hardest part of blast radius modeling is not the algorithm — it is keeping the dependency graph current. In fast-moving teams, service dependencies change weekly. New internal clients appear. Deprecated call paths linger in the graph. The coupling weight of a relationship changes when teams switch from synchronous HTTP calls to async event queues.
Three practices reduce graph staleness:
- Continuous trace-based discovery. Ingest OpenTelemetry spans and update edge weights from actual traffic patterns weekly. Declared dependencies become a fallback, not the primary source.
- Co-failure mining. When an incident resolves, analyze which services went down together and add or weight edges accordingly.
- Team-owned graph segments. Service owners attest to their callers and callees as part of the onboarding process, and the graph UI makes ownership visible so staleness is noticed.
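For the first practice, call edges can be derived from trace data by looking for spans whose parent span belongs to a different service. The sketch below works on simplified span records, plain dicts with hypothetical `span_id`, `parent_id`, and `service` keys, rather than the OpenTelemetry SDK types, so the field names would need mapping in a real pipeline.

```python
from collections import Counter

def edges_from_spans(spans):
    """Count (caller, callee) calls where a span's parent is in another service."""
    service_of = {s["span_id"]: s["service"] for s in spans}
    counts = Counter()
    for span in spans:
        parent_service = service_of.get(span["parent_id"])
        # Skip root spans and spans that stay within one service.
        if parent_service and parent_service != span["service"]:
            counts[(parent_service, span["service"])] += 1
    return counts

# Hypothetical trace: checkout calls payments, which calls auth.
spans = [
    {"span_id": "a", "parent_id": None, "service": "checkout-service"},
    {"span_id": "b", "parent_id": "a", "service": "payments-service"},
    {"span_id": "c", "parent_id": "b", "service": "payments-service"},
    {"span_id": "d", "parent_id": "b", "service": "auth-service"},
]
edge_counts = edges_from_spans(spans)
print(dict(edge_counts))
```

Call counts like these are one possible input to a coupling weight; how to normalize them into the 0.0 to 1.0 range is left open here.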
Blast Radius in Koalr
Koalr's blast radius tool lets you input any service name and risk score and see the full hop-by-hop propagation tree with propagated scores and risk badges, running against your actual service dependency graph.