SLO Burn Rate: The Signal Nobody Watches Until It's Too Late
Most deployment risk conversations focus on the change being deployed — the PR's size, author expertise, coverage quality. But risk is a product of both the change and the environment the change is deployed into. Deploying a low-risk change into a system that is already consuming its error budget at 3× the normal rate is not a low-risk deployment. SLO burn rate at deploy time is one of the most actionable and least-used signals in deployment risk management.
The finding
Deployments made when the service's SLO burn rate is above 2× have a 2.8× higher incident rate than deployments made at normal burn rates — even after controlling for the risk profile of the change itself.
SLO Basics: Error Budgets and Burn Rate
A Service Level Objective (SLO) defines the target reliability for a service — for example, 99.9% availability per month. The error budget is the inverse: the amount of downtime the SLO allows within the measurement period. For 99.9% availability over a 30-day month, the error budget is 43.2 minutes (0.1% of the month's 43,200 minutes).
Burn rate measures how quickly you are consuming the error budget relative to the expected rate. If you have 43.2 minutes of budget for the month and are burning it at 10 minutes per day, your burn rate is roughly 7× (you would exhaust the budget in about 4.3 days instead of 30).
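The budget-and-exhaustion arithmetic can be sketched directly (function and variable names here are illustrative):

```python
def error_budget_minutes(slo_target: float, period_days: int = 30) -> float:
    """Minutes of allowed downtime in the period for a given SLO target."""
    return (1 - slo_target) * period_days * 24 * 60

def days_to_exhaustion(budget_minutes: float, burn_minutes_per_day: float) -> float:
    """Days until the error budget is fully consumed at the current burn."""
    return budget_minutes / burn_minutes_per_day

budget = error_budget_minutes(0.999)      # 43.2 minutes for a 30-day month
days = days_to_exhaustion(budget, 10.0)   # ~4.3 days at 10 minutes/day
print(f"Budget: {budget:.1f} min, exhausted in {days:.1f} days")
```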
The formal burn rate calculation:

```python
# Burn rate = (actual error rate / allowed error rate)
# Where allowed error rate = 1 - SLO target
def burn_rate(current_error_rate: float, slo_target: float) -> float:
    """
    current_error_rate: fraction of requests failing (e.g., 0.002 = 0.2%)
    slo_target: SLO target (e.g., 0.999 = 99.9%)
    Returns: burn rate multiplier (1.0 = normal, >1 = burning faster than budget)
    """
    allowed_error_rate = 1 - slo_target
    return current_error_rate / allowed_error_rate

# Example: 99.9% SLO, currently seeing 0.5% error rate
# allowed = 0.001, current = 0.005
# burn_rate = 0.005 / 0.001 = 5.0x — consuming budget 5x faster than expected
rate = burn_rate(current_error_rate=0.005, slo_target=0.999)
print(f"Burn rate: {rate:.1f}x")  # 5.0x
```

Why High Burn Rate Increases Deployment Risk
When a system is already consuming its error budget faster than expected, it is telling you something is wrong — even if no one has paged yet. The underlying causes might be: increased traffic that is exposing a latent scaling issue, a subtle bug introduced in a recent deployment, degraded infrastructure, or an upstream dependency that is intermittently failing.
Deploying a new change into this environment creates multiple compounding risks:
Attribution confusion. When a deployment happens during elevated burn rate and the error rate gets worse afterward, it is extremely difficult to determine causation. Did the deployment cause the degradation, or was the degradation already occurring and the deployment is unrelated? This confusion slows incident response because the team is investigating the wrong thing.
SLO breach risk. If the deployment causes any additional errors — even a small amount — it may push an already-stressed system over the SLO threshold. A system burning budget at 3× is one small incident away from an SLO breach. Deploying into that state is accepting unnecessary risk.
Rollback complications. Rolling back a deployment in a system that is already degraded is significantly more complex than rolling back in a healthy system. Operators are managing two problems simultaneously — the pre-existing degradation and the rollback — which increases MTTR.
The Burn Rate Decision Matrix
| Burn Rate | Meaning | Deploy Recommendation |
|---|---|---|
| < 1× | Below expected error rate | Safe to deploy |
| 1–2× | Normal — within expected variance | Safe to deploy |
| 2–5× | Elevated — something may be wrong | Investigate before deploying non-critical changes |
| 5–10× | High — active degradation likely | Block non-emergency deployments. Require explicit override. |
| > 10× | Critical — active incident in progress | Block all deployments except emergency hotfixes |
Pulling Burn Rate Data in Practice
Burn rate data comes from your observability stack. The two most common sources:
Prometheus / Grafana
```promql
# PromQL: Calculate 1-hour burn rate for an error SLO
# Assumes you track http_requests_total with label status="5xx"
# Divides the observed error ratio by the allowed error rate
(
  rate(http_requests_total{status=~"5xx"}[1h])
  /
  rate(http_requests_total[1h])
) / 0.001  # 0.001 = 1 - 0.999 SLO target
```

Datadog
```python
# Datadog API: Fetch SLO status
import time

import requests

DD_API_KEY = "your-api-key"
DD_APP_KEY = "your-app-key"
SLO_ID = "your-slo-id"

resp = requests.get(
    f"https://api.datadoghq.com/api/v1/slo/{SLO_ID}/history",
    headers={
        "DD-API-KEY": DD_API_KEY,
        "DD-APPLICATION-KEY": DD_APP_KEY,
    },
    params={
        "from_ts": int(time.time()) - 3600,  # Last 1 hour
        "to_ts": int(time.time()),
    },
)
slo_data = resp.json()
# Parse burn_rate from slo_data["data"]["series"]["burn_rate"]
```

Implementing Burn Rate Deployment Gating
The implementation pattern: before a deployment proceeds to production, check the current burn rate of the affected service. If it exceeds your threshold, fail a required status check (using the GitHub Check Runs API described in the Check Runs tutorial) and require explicit manual override with documented justification.
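The gate needs a live error rate for the service. A minimal sketch of fetching one from the Prometheus HTTP API — the Prometheus address, metric labels, and helper names are assumptions, not part of the pattern itself:

```python
import json
import urllib.parse
import urllib.request

PROMETHEUS_URL = "http://prometheus:9090"  # Assumed address; adjust for your stack

def error_rate_query(service: str, window: str = "5m") -> str:
    """Build a PromQL expression for the service's current error ratio."""
    return (
        f'sum(rate(http_requests_total{{service="{service}",status=~"5xx"}}[{window}]))'
        f' / sum(rate(http_requests_total{{service="{service}"}}[{window}]))'
    )

def get_current_error_rate(service: str) -> float:
    """Run an instant query and return the error ratio as a float (0.0 if no data)."""
    params = urllib.parse.urlencode({"query": error_rate_query(service)})
    with urllib.request.urlopen(f"{PROMETHEUS_URL}/api/v1/query?{params}") as resp:
        body = json.load(resp)
    result = body["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0
```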
```python
def deployment_burn_rate_gate(service_name: str, slo_target: float) -> dict:
    """
    Check current burn rate and return deployment recommendation.
    """
    current_error_rate = get_current_error_rate(service_name)  # From your observability
    rate = burn_rate(current_error_rate, slo_target)
    if rate > 10:
        return {
            "allow_deploy": False,
            "require_override": True,
            "reason": f"Critical: burn rate {rate:.1f}x — likely active incident",
            "github_conclusion": "failure",
        }
    elif rate > 5:
        return {
            "allow_deploy": False,
            "require_override": True,
            "reason": f"High burn rate {rate:.1f}x — investigate before deploying",
            "github_conclusion": "failure",
        }
    elif rate > 2:
        return {
            "allow_deploy": True,
            "require_override": False,
            "reason": f"Elevated burn rate {rate:.1f}x — monitor closely post-deploy",
            "github_conclusion": "neutral",
        }
    else:
        return {
            "allow_deploy": True,
            "require_override": False,
            "reason": f"Normal burn rate {rate:.1f}x — safe to deploy",
            "github_conclusion": "success",
        }
```

The Operational Conversation This Creates
Beyond the automated gate, burn rate visibility creates a productive operational conversation. When engineers see that the deployment is blocked because burn rate is 6×, the natural next question is "why is burn rate elevated?" — which may surface an underlying issue that nobody had formally identified as an active incident.
This is the secondary value of burn rate gating: it forces a conversation about system health before adding more change. Many teams discover pre-existing degradation they were not tracking through the burn rate check on an unrelated deployment.
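For that conversation to start, the gate's reason has to be visible where engineers look. A hedged sketch of turning a gate result into a GitHub Check Runs API payload — the check name and field values are illustrative; the real call is a POST to /repos/{owner}/{repo}/check-runs:

```python
def build_check_run_payload(gate_result: dict, head_sha: str) -> dict:
    """Translate a burn-rate gate result into a GitHub Check Runs API payload."""
    return {
        "name": "slo-burn-rate-gate",       # Illustrative check name
        "head_sha": head_sha,
        "status": "completed",
        "conclusion": gate_result["github_conclusion"],
        "output": {
            "title": "SLO burn rate check",
            "summary": gate_result["reason"],  # Surfaces the "why" to reviewers
        },
    }

# POST this payload to https://api.github.com/repos/{owner}/{repo}/check-runs
# with "Authorization: Bearer <token>" and "Accept: application/vnd.github+json".
```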
Koalr integrates burn rate into deploy risk scores
Koalr pulls burn rate data from your observability platform and incorporates it as a contextual signal in the deploy risk score. High-burn-rate environments automatically elevate the risk score for any pending deployment, giving your team the full risk picture before they click merge.
Gate deployments on SLO health automatically
Koalr connects to your observability platform and adds real-time SLO burn rate to the deploy risk score — blocking high-risk deployments into stressed systems before they compound existing incidents. Connect GitHub in 5 minutes.