SLO Burn Rate: The Signal Nobody Watches Until It's Too Late
Most deployment risk conversations focus on the change being deployed — the PR's size, author expertise, coverage quality. But risk is a product of both the change and the environment the change is deployed into. Deploying a low-risk change into a system that is already consuming its error budget at 3× the normal rate is not a low-risk deployment. SLO burn rate at deploy time is one of the most actionable and least-used signals in deployment risk management.
The finding
Deployments made when the service's SLO burn rate is above 2× have a 2.8× higher incident rate than deployments made at normal burn rates — even after controlling for the risk profile of the change itself.
SLO Basics: Error Budgets and Burn Rate
A Service Level Objective (SLO) defines the target reliability for a service — for example, 99.9% availability per month. The error budget is the inverse: the amount of downtime the SLO allows within the measurement period. For 99.9% availability over a 30-day month, the error budget is 43.2 minutes (0.1% of the month's 43,200 minutes).
Burn rate measures how quickly you are consuming the error budget relative to the expected rate. If you have 43.2 minutes of budget for the month and are burning it at 10 minutes per day, your burn rate is roughly 7× (you would exhaust the budget in about 4.3 days instead of 30).
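The budget-and-exhaustion arithmetic can be sketched directly (function and variable names here are illustrative):

```python
def error_budget_minutes(slo_target: float, period_days: int = 30) -> float:
    """Minutes of allowed downtime in the period for a given SLO target."""
    return (1 - slo_target) * period_days * 24 * 60

def days_to_exhaustion(budget_minutes: float, burn_minutes_per_day: float) -> float:
    """Days until the error budget is fully consumed at the current burn."""
    return budget_minutes / burn_minutes_per_day

budget = error_budget_minutes(0.999)      # 43.2 minutes for a 30-day month
days = days_to_exhaustion(budget, 10.0)   # ~4.3 days at 10 minutes/day
print(f"Budget: {budget:.1f} min, exhausted in {days:.1f} days")
```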
The formal burn rate calculation:

```python
# Burn rate = (actual error rate / allowed error rate)
# Where allowed error rate = 1 - SLO target
def burn_rate(current_error_rate: float, slo_target: float) -> float:
    """
    current_error_rate: fraction of requests failing (e.g., 0.002 = 0.2%)
    slo_target: SLO target (e.g., 0.999 = 99.9%)
    Returns: burn rate multiplier (1.0 = normal, >1 = burning faster than budget)
    """
    allowed_error_rate = 1 - slo_target
    return current_error_rate / allowed_error_rate

# Example: 99.9% SLO, currently seeing 0.5% error rate
# allowed = 0.001, current = 0.005
# burn_rate = 0.005 / 0.001 = 5.0x — consuming budget 5x faster than expected
rate = burn_rate(current_error_rate=0.005, slo_target=0.999)
print(f"Burn rate: {rate:.1f}x")  # 5.0x
```

Why High Burn Rate Increases Deployment Risk
When a system is already consuming its error budget faster than expected, it is telling you something is wrong — even if no one has paged yet. The underlying causes might be: increased traffic that is exposing a latent scaling issue, a subtle bug introduced in a recent deployment, degraded infrastructure, or an upstream dependency that is intermittently failing.
Deploying a new change into this environment creates multiple compounding risks:
Attribution confusion. When a deployment happens during elevated burn rate and the error rate gets worse afterward, it is extremely difficult to determine causation. Did the deployment cause the degradation, or was the degradation already occurring and the deployment is unrelated? This confusion slows incident response because the team is investigating the wrong thing.
SLO breach risk. If the deployment causes any additional errors — even a small amount — it may push an already-stressed system over the SLO threshold. A system burning budget at 3× is one small incident away from an SLO breach. Deploying into that state is accepting unnecessary risk.
Rollback complications. Rolling back a deployment in a system that is already degraded is significantly more complex than rolling back in a healthy system. Operators are managing two problems simultaneously — the pre-existing degradation and the rollback — which increases MTTR.
The Burn Rate Decision Matrix
| Burn Rate | Meaning | Deploy Recommendation |
|---|---|---|
| < 1× | Below expected error rate | Safe to deploy |
| 1–2× | Normal — within expected variance | Safe to deploy |
| 2–5× | Elevated — something may be wrong | Investigate before deploying non-critical changes |
| 5–10× | High — active degradation likely | Block non-emergency deployments. Require explicit override. |
| > 10× | Critical — active incident in progress | Block all deployments except emergency hotfixes |
Pulling Burn Rate Data in Practice
Burn rate data comes from your observability stack. The two most common sources:
Prometheus / Grafana
```promql
# PromQL: Calculate 1-hour burn rate for an error SLO
# Assumes you track http_requests_total with label status="5xx"
# Divides the observed error ratio by the allowed error rate
(
  rate(http_requests_total{status=~"5xx"}[1h])
  /
  rate(http_requests_total[1h])
) / 0.001  # 0.001 = 1 - 0.999 SLO target
```

Datadog
```python
# Datadog API: Fetch SLO status
import time

import requests

DD_API_KEY = "your-api-key"
DD_APP_KEY = "your-app-key"
SLO_ID = "your-slo-id"

resp = requests.get(
    f"https://api.datadoghq.com/api/v1/slo/{SLO_ID}/history",
    headers={
        "DD-API-KEY": DD_API_KEY,
        "DD-APPLICATION-KEY": DD_APP_KEY,
    },
    params={
        "from_ts": int(time.time()) - 3600,  # Last 1 hour
        "to_ts": int(time.time()),
    },
)
slo_data = resp.json()
# Parse burn_rate from slo_data["data"]["series"]["burn_rate"]
```

Implementing Burn Rate Deployment Gating
The implementation pattern: before a deployment proceeds to production, check the current burn rate of the affected service. If it exceeds your threshold, fail a required status check (using the GitHub Check Runs API described in the Check Runs tutorial) and require explicit manual override with documented justification.
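The gate needs a live error rate for the service. A minimal sketch of fetching one from the Prometheus HTTP API — the Prometheus address, metric labels, and helper names are assumptions, not part of the pattern itself:

```python
import json
import urllib.parse
import urllib.request

PROMETHEUS_URL = "http://prometheus:9090"  # Assumed address; adjust for your stack

def error_rate_query(service: str, window: str = "5m") -> str:
    """Build a PromQL expression for the service's current error ratio."""
    return (
        f'sum(rate(http_requests_total{{service="{service}",status=~"5xx"}}[{window}]))'
        f' / sum(rate(http_requests_total{{service="{service}"}}[{window}]))'
    )

def get_current_error_rate(service: str) -> float:
    """Run an instant query and return the error ratio as a float (0.0 if no data)."""
    params = urllib.parse.urlencode({"query": error_rate_query(service)})
    with urllib.request.urlopen(f"{PROMETHEUS_URL}/api/v1/query?{params}") as resp:
        body = json.load(resp)
    result = body["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0
```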
```python
def deployment_burn_rate_gate(service_name: str, slo_target: float) -> dict:
    """
    Check current burn rate and return deployment recommendation.
    """
    current_error_rate = get_current_error_rate(service_name)  # From your observability
    rate = burn_rate(current_error_rate, slo_target)
    if rate > 10:
        return {
            "allow_deploy": False,
            "require_override": True,
            "reason": f"Critical: burn rate {rate:.1f}x — likely active incident",
            "github_conclusion": "failure",
        }
    elif rate > 5:
        return {
            "allow_deploy": False,
            "require_override": True,
            "reason": f"High burn rate {rate:.1f}x — investigate before deploying",
            "github_conclusion": "failure",
        }
    elif rate > 2:
        return {
            "allow_deploy": True,
            "require_override": False,
            "reason": f"Elevated burn rate {rate:.1f}x — monitor closely post-deploy",
            "github_conclusion": "neutral",
        }
    else:
        return {
            "allow_deploy": True,
            "require_override": False,
            "reason": f"Normal burn rate {rate:.1f}x — safe to deploy",
            "github_conclusion": "success",
        }
```

The Operational Conversation This Creates
Beyond the automated gate, burn rate visibility creates a productive operational conversation. When engineers see that the deployment is blocked because burn rate is 6×, the natural next question is "why is burn rate elevated?" — which may surface an underlying issue that nobody had formally identified as an active incident.
This is the secondary value of burn rate gating: it forces a conversation about system health before adding more change. Many teams discover pre-existing degradation they were not tracking through the burn rate check on an unrelated deployment.
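For that conversation to start, the gate's reason has to be visible where engineers look. A hedged sketch of turning a gate result into a GitHub Check Runs API payload — the check name and field values are illustrative; the real call is a POST to /repos/{owner}/{repo}/check-runs:

```python
def build_check_run_payload(gate_result: dict, head_sha: str) -> dict:
    """Translate a burn-rate gate result into a GitHub Check Runs API payload."""
    return {
        "name": "slo-burn-rate-gate",       # Illustrative check name
        "head_sha": head_sha,
        "status": "completed",
        "conclusion": gate_result["github_conclusion"],
        "output": {
            "title": "SLO burn rate check",
            "summary": gate_result["reason"],  # Surfaces the "why" to reviewers
        },
    }

# POST this payload to https://api.github.com/repos/{owner}/{repo}/check-runs
# with "Authorization: Bearer <token>" and "Accept: application/vnd.github+json".
```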
Koalr integrates burn rate into deploy risk scores
Koalr pulls burn rate data from your observability platform and incorporates it as a contextual signal in the deploy risk score. High-burn-rate environments automatically elevate the risk score for any pending deployment, giving your team the full risk picture before they click merge.
Gate deployments on SLO health automatically
Koalr connects to your observability platform and adds real-time SLO burn rate to the deploy risk score — blocking high-risk deployments into stressed systems before they compound existing incidents. Connect GitHub in 5 minutes.