Site Reliability Engineering · March 16, 2026 · 12 min read

SRE Metrics Guide: The Complete Reference for Site Reliability Engineers

SRE metrics fall into two distinct categories: service health (SLIs, SLOs, error budgets) and delivery reliability (DORA). Most teams measure one or the other. Elite SRE teams measure both — and understand how they connect. This guide covers every metric category an SRE team needs, with formulas, benchmarks, and the instrumentation strategy to tie it all together.

What this guide covers

The five core SLIs and how to calculate them, how to set SLOs and error budgets that actually hold up, burn rate alerting formulas, DORA metrics from an SRE perspective, incident metrics (MTTD, MTTF, cascading failures), alert quality scoring, capacity planning metrics, and how delivery reliability and service health connect in practice.

SRE Metrics: Two Domains, One Platform

Site reliability engineering sits at the intersection of software engineering and operations. As a result, SRE teams are responsible for two fundamentally different categories of metrics — and confusing them leads to bad decisions.

Service health metrics (SLIs, SLOs, error budgets) answer the question: is the system working right now? They are real-time, continuous measurements of live production traffic. They tell you whether users are experiencing the service you promised them.

Delivery reliability metrics (DORA) answer the question: how reliably is the team changing the system? They are historical aggregations over deployments and incidents. They tell you whether your engineering process is creating risk or managing it.

Both matter. A team with perfect SLO compliance but a 30% change failure rate is one bad deployment away from an SLO breach they cannot explain. A team with elite DORA numbers but no SLO discipline has no idea what reliability level they are actually delivering. The most effective SRE practices instrument both and understand how they interact.

Service Level Indicators (SLIs)

An SLI is a direct measurement of a service behavior that matters to users. It is a ratio: events that went well divided by all events. The Google SRE book defines five core SLIs. Every service needs at least the first three.

1. Availability

Availability is the most fundamental SLI — the fraction of requests the service successfully handled.

availability = (total_requests - error_requests) / total_requests

Where "error" means a response that failed to serve the user's intent — typically HTTP 5xx responses, but also requests that timed out, returned corrupt data, or were dropped by the load balancer before reaching your service. The exact definition of "error" must be standardized before you set SLOs. If your SLO says 99.9% availability but you only count 500 errors and ignore 503s, the SLO is meaningless.

Target for most services: 99.9% or higher. At 99.9%, you have 43.8 minutes of allowed downtime per month. At 99.95%, that drops to 21.9 minutes. At 99.99% ("four nines"), it is 4.4 minutes per month — which requires automated failover to sustain.

| SLO target | Monthly downtime allowed | Weekly downtime allowed | Typical use case |
|---|---|---|---|
| 99.0% | 7.3 hours | 1.68 hours | Internal tools, batch processing |
| 99.9% | 43.8 min | 10.1 min | Most SaaS services, APIs |
| 99.95% | 21.9 min | 5.0 min | Business-critical SaaS, fintech |
| 99.99% | 4.4 min | 1.0 min | Payments, healthcare, enterprise SLAs |
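These figures follow directly from the formula: the downtime budget is (1 − SLO) × period length. A minimal sketch (the helper name is illustrative):

```python
def allowed_downtime_minutes(slo: float, period_minutes: float = 43_800) -> float:
    """Downtime budget for an SLO over a period (default ~1 month, 30.4 days)."""
    return (1 - slo) * period_minutes

# Reproduce the monthly column of the table above
for slo in (0.99, 0.999, 0.9995, 0.9999):
    print(f"{slo:.2%}: {allowed_downtime_minutes(slo):.1f} min/month")
```

Swap in 10,080 minutes for `period_minutes` to get the weekly column.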

2. Latency

Latency SLIs measure how fast your service responds to requests. Never use average (mean) latency as your SLI — it masks the worst user experiences. Use percentiles.

latency_sli = requests_served_within_threshold / total_requests

The three percentiles that matter most for SRE:

  • P50 (median): The "typical" user experience. Half of requests are faster than this, half slower. A high P50 means your system is generally slow.
  • P95: 95% of requests are faster than this. Where degradation first shows up under moderate load. A good early warning signal.
  • P99: The "slow" tail — 1% of requests are slower than this. These are the users most likely to churn. P99 latency often spikes before P50 shows any movement, making it a leading indicator of saturation.

A typical latency SLO: "99% of requests will complete in under 300ms." This is a P99 latency target. Set it based on user research — what latency causes users to perceive the service as slow? For interactive web applications, the threshold is usually 200–500ms for P99.
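The percentile and SLI calculations above can be sketched in a few lines. This uses the nearest-rank method for percentiles; the sample latencies are illustrative:

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

def latency_sli(samples_ms, threshold_ms):
    """Fraction of requests served within the latency threshold."""
    return sum(1 for s in samples_ms if s <= threshold_ms) / len(samples_ms)

latencies = [120, 140, 150, 180, 210, 250, 290, 310, 480, 900]  # ms, illustrative
print("P50:", percentile(latencies, 50))            # 210
print("P99:", percentile(latencies, 99))            # 900 (the slow tail)
print("SLI @ 300ms:", latency_sli(latencies, 300))  # 0.7
```

In production you would compute these from histogram buckets in your metrics backend rather than raw samples, but the definitions are the same.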

3. Error Rate

Error rate measures the fraction of requests that fail. It is closely related to availability but is often instrumented separately because "error" and "unavailability" can diverge — a service can be up but returning errors on a subset of request types.

error_rate = error_responses / total_responses

Errors should be classified by severity: HTTP 5xx (server errors your service caused), HTTP 4xx (client errors, usually not your fault unless your API is confusing), and application-level errors (business logic failures that return 200 but include an error payload). SLOs should be based on 5xx rates and application-level errors, not 4xx.
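The classification rule above can be sketched as a filter over responses. The `(http_status, app_error)` pair is a simplified shape for illustration, where `app_error` marks a 200 that carried an error payload:

```python
def slo_error_rate(responses):
    """Error rate for SLO purposes: counts 5xx and application-level
    errors, but excludes 4xx client errors."""
    errors = sum(1 for status, app_error in responses
                 if status >= 500 or app_error)
    return errors / len(responses)

sample = [(200, False), (200, True), (404, False), (503, False), (200, False)]
print(slo_error_rate(sample))  # 0.4 -- the 404 is excluded
```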

4. Throughput

Throughput measures requests per second (RPS) or transactions per second (TPS). It is primarily a capacity planning signal rather than a service health signal — you track it to understand current demand relative to your capacity ceiling.

throughput = successful_requests / measurement_window_seconds

Throughput is most useful when trended over time and correlated with business events (launches, campaigns, end-of-month billing runs) to predict when you will need to scale. A sudden drop in throughput when demand should be high is also an incident signal — it often means the service is failing requests before they are counted.

5. Saturation

Saturation measures how close your system resources are to their operational limits. It is expressed as a fraction of maximum capacity.

saturation = current_utilization / maximum_capacity

Resource dimensions to track: CPU utilization, memory utilization, disk I/O, network bandwidth, database connection pool utilization, queue depth (for async systems), and thread pool utilization. Saturation above 80% on any critical resource is a leading indicator of latency degradation — queuing theory shows that response time rises nonlinearly as utilization approaches 100%.
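The nonlinearity is worth seeing numerically. The classic first-order illustration is the M/M/1 queue (single server, Poisson arrivals) — a teaching model, not a description of any particular system:

```python
def mm1_response_time(service_time_ms, utilization):
    """M/M/1 mean response time: R = S / (1 - rho). Rises nonlinearly
    as utilization approaches 1."""
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1)")
    return service_time_ms / (1 - utilization)

for rho in (0.5, 0.8, 0.9, 0.95):
    print(f"{rho:.0%} utilized -> {mm1_response_time(20, rho):.0f} ms")
```

Going from 80% to 95% utilization quadruples response time in this model, which is why 80% saturation is a sensible alerting line even though the system is not yet "full".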

SLOs and Error Budgets

An SLO (Service Level Objective) is a target value for an SLI. An SLA (Service Level Agreement) is the contractual commitment made to a customer; it is typically weaker than your internal SLO. The relationship:

SLA ≤ SLO ≤ actual performance

Set your internal SLO tighter than your SLA by enough margin to catch and fix issues before they breach the contractual commitment. If your SLA promises 99.9% availability, your internal SLO should target 99.95%.

Error Budgets

The error budget is the complement of your SLO — the amount of unreliability you are allowed to "spend" before breaching your target.

error_budget = 1 - SLO

# Example: 99.9% SLO
error_budget = 1 - 0.999 = 0.001 = 0.1% of requests can fail
# Monthly: 0.001 × 43,800 minutes = 43.8 minutes of downtime allowed

The error budget is not just an accounting mechanism — it is a policy tool. When the error budget is healthy (above 50% remaining), teams can deploy aggressively and experiment. When the error budget is nearly exhausted, teams should freeze non-critical changes, focus on reliability improvements, and deprioritize feature work until the budget resets. This is the core SRE discipline: trading off reliability and velocity using error budgets as the shared language between product and engineering.
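The budget calculation and the policy it drives can be sketched together. The policy thresholds here are the hypothetical ones described above (50% remaining, budget exhausted):

```python
def error_budget_remaining(slo, good_events, total_events):
    """Fraction of the error budget still unspent in the current window."""
    budget = 1 - slo                           # allowed failure fraction
    observed = 1 - good_events / total_events  # actual failure fraction
    return 1 - observed / budget

def deploy_policy(budget_remaining):
    """Illustrative policy matching the discipline described above."""
    if budget_remaining > 0.5:
        return "deploy aggressively"
    if budget_remaining > 0:
        return "deploy with caution"
    return "freeze non-critical changes"

remaining = error_budget_remaining(0.999, good_events=9_995_500,
                                   total_events=10_000_000)
print(f"{remaining:.0%} of budget left -> {deploy_policy(remaining)}")
```

With 4,500 failures against a 10,000-failure budget, 55% of the budget remains and the team can still ship freely.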

Burn Rate Alerting

Burn rate tells you how fast you are consuming your error budget relative to the allowed rate. A burn rate of 1.0 means you are consuming budget at exactly the rate that would exhaust it by the end of the period. A burn rate of 2.0 means you will exhaust your monthly budget in half a month.

burn_rate = (1 - current_availability) / (1 - SLO)

# Example: SLO = 99.9%, current availability over 1hr window = 99.7%
burn_rate = (1 - 0.997) / (1 - 0.999) = 0.003 / 0.001 = 3.0

# Burn rate 3.0 means: at this rate, you exhaust your monthly budget
# in 10 days instead of 30

The Google SRE Workbook recommends a multi-window burn rate alert strategy:

  • Page immediately (burn rate > 14.4 over 1 hour): At this rate, a single hour consumes 2% of your monthly error budget, and the whole budget is gone in about two days. This is a "drop everything" alert.
  • Page with urgency (burn rate > 6 over 6 hours): You will exhaust your monthly budget in 5 days. Needs attention within hours.
  • Ticket / Slack alert (burn rate > 3 over 3 days): A sustained elevated error rate that would exhaust the monthly budget in 10 days. No immediate crisis, but trending toward one.
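The multi-window policy above can be wired into a small evaluator. A sketch with illustrative names; the thresholds mirror the bullets above:

```python
def burn_rate(availability, slo):
    """Budget consumption speed relative to the allowed rate (1.0 = on pace)."""
    return (1 - availability) / (1 - slo)

# (window, burn-rate threshold, action) -- mirrors the policy above
POLICY = [
    ("1h", 14.4, "page immediately"),
    ("6h", 6.0, "page with urgency"),
    ("3d", 3.0, "open a ticket"),
]

def triggered_actions(availability_by_window, slo):
    """Which alerts fire, given per-window availability measurements."""
    return [action for window, threshold, action in POLICY
            if burn_rate(availability_by_window[window], slo) > threshold]

# SLO 99.9%: a spike severe enough, and sustained enough, to fire all three
print(triggered_actions({"1h": 0.985, "6h": 0.992, "3d": 0.995}, slo=0.999))
```

In practice each long window is paired with a short confirmation window (for example 1 hour with 5 minutes) so that alerts also reset quickly once the burn stops.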

For more on how SLO burn rate connects to deployment timing, see SLO burn rate and deployment windows.

DORA Metrics for SREs

DORA metrics are delivery reliability metrics — they measure how well your team manages risk during the process of changing the system. Most SRE teams know about DORA but treat it as a separate engineering metrics concern. That is a mistake. DORA and SLO health are directly connected.

For a complete treatment of DORA calculation and benchmarks, see the complete DORA metrics guide. Here is how each metric reads from an SRE perspective.

Deployment Frequency

For SREs, deployment frequency is a risk-per-unit-time signal. If your team deploys once per week and your change failure rate is 10%, you expect one failed deployment every 10 weeks. If deployment frequency increases to daily without CFR improvements, you now expect one failed deployment every 10 days — a 7x increase in incident exposure.

Elite benchmark: multiple deployments per day. But elite deployment frequency without elite CFR is a reliability liability, not an asset. Track them together.

Change Failure Rate

Change failure rate (CFR) is the primary link between delivery process and SLO burn rate. Every failed deployment burns error budget. The relationship is direct:

expected_monthly_SLO_burn_from_deploys =
  deployment_frequency_per_month × CFR × avg_incident_duration_minutes
  ÷ total_minutes_per_month (43,800)

A team deploying 20 times per month with a 10% CFR and average 2-hour incidents accumulates roughly 240 minutes of deployment-caused incident time — about 0.55% of the month, more than five times the entire error budget at a 99.9% SLO. Understanding this math is what transforms SREs from reactive incident responders into proactive reliability engineers.

Elite benchmark: below 5% CFR.
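The formula above is a one-liner worth keeping next to your DORA dashboard:

```python
def deploy_caused_slo_burn(deploys_per_month, cfr, avg_incident_minutes,
                           minutes_per_month=43_800):
    """Expected fraction of the month lost to deployment-caused incidents."""
    return deploys_per_month * cfr * avg_incident_minutes / minutes_per_month

burn = deploy_caused_slo_burn(deploys_per_month=20, cfr=0.10,
                              avg_incident_minutes=120)
budget = 1 - 0.999  # error budget at a 99.9% SLO
print(f"expected burn {burn:.2%} vs budget {budget:.2%}")
```

Note the simplifying assumption: this treats every deployment-caused incident as a full outage. Partial degradations burn budget more slowly, so the real number sits somewhere below this ceiling.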

MTTR (Mean Time to Restore)

For SREs, MTTR is the rate at which you recover error budget after a failure. A fast MTTR limits the blast radius of each incident. The connection to SLO math is direct: a 2-hour MTTR for the same incident consumes twice the error budget of a 1-hour MTTR.

Elite benchmark: under one hour. Getting MTTR below one hour requires automated alerts (no manual detection lag), runbooks that enable fast diagnosis, and rollback capabilities that take minutes, not hours. For more on improving this metric, see how to improve MTTR.

Lead Time for Changes

Lead time affects SRE operations in two ways. Short lead time enables faster hotfix deployment when incidents occur — the same pipeline that ships features also ships fixes. Paradoxically, very short lead time with insufficient review can increase CFR by reducing the signal a team has before a change reaches production. Elite benchmark: under one day (most SRE teams target under four hours for critical paths).

DORA and SLOs: the research connection

DORA report data shows that teams in the elite DORA performance tier have approximately 50% lower SLO breach rates than teams in the medium tier. The mechanism is direct: elite DORA teams deploy more frequently with lower CFR and faster MTTR, which means less error budget consumed per unit time. DORA is not separate from SRE — it is the upstream lever that determines how fast you drain your error budget.

Incident Metrics

Service health metrics tell you when the system is failing. Incident metrics tell you how well your team detects, contains, and learns from failures. They are the operational layer between your SLIs and your SLOs.

MTTD — Mean Time to Detect

MTTD measures the lag between when a failure starts and when your team becomes aware of it. It is the gap between a system failing and an alert firing — or, in the worst case, between a system failing and a user complaint arriving.

MTTD = mean(alert_fired_at - failure_started_at)

MTTD is often underreported because "failure started" is hard to determine retrospectively. Use your SLO dashboards to identify when the SLI first crossed the threshold — that is your failure start time. Compare it to when the first alert was acknowledged.

High MTTD (>5 minutes for P0 incidents) means your alerting is too slow, your thresholds are too lenient, or your burn rate alerts are not configured. The error budget burned during MTTD is pure waste — you are failing users without even knowing it.
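Once you have failure-start and alert timestamps, the calculation itself is simple. A sketch with hypothetical incident records:

```python
from datetime import datetime, timedelta

def mttd_minutes(incidents):
    """Mean lag between SLI breach start and first alert, in minutes.

    Each incident is a (failure_started_at, alert_fired_at) pair.
    """
    lags = [(alert - start).total_seconds() / 60 for start, alert in incidents]
    return sum(lags) / len(lags)

t0 = datetime(2026, 3, 1, 12, 0)
incidents = [(t0, t0 + timedelta(minutes=3)),   # caught by a burn rate alert
             (t0, t0 + timedelta(minutes=7))]   # caught by a dashboard check
print(mttd_minutes(incidents))  # 5.0
```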

MTTF — Mean Time to Failure

MTTF measures system reliability between incidents — the average time your system runs without a failure. It is the inverse of incident frequency.

MTTF = total_uptime / number_of_failures

MTTF is most useful for capacity planning and reliability targeting. If your service has a 30-day MTTF for P1 incidents and your SLO allows 43.8 minutes of downtime per month, you have exactly 43.8 minutes of MTTR budget per incident before you breach your SLO.

Incident Frequency by Severity

Track incident frequency at each severity level (P0, P1, P2, P3) separately. Aggregate incident counts hide severity distribution shifts — a team that converts P1 incidents into P2 incidents is improving even if total incident count stays flat.

| Severity | Typical definition | Target MTTR | SLO budget impact |
|---|---|---|---|
| P0 | Complete service outage, all users affected | <30 min | Massive — every minute counts |
| P1 | Major degradation, significant users affected | <1 hour | High — triggers burn rate alerts |
| P2 | Partial degradation, workaround available | <4 hours | Moderate — slow budget drain |
| P3 | Minor issue, small user subset | <24 hours | Low — within noise floor |

Cascading Failure Rate

A cascading failure is an incident that triggers one or more secondary failures — a database overload causing API timeouts, which causes a retry storm, which overloads the queue. Track cascading failures as a percentage of all incidents.

cascading_failure_rate = incidents_with_secondary_failures / total_incidents

High cascading failure rates indicate insufficient circuit breakers, missing bulkheads, or load shedding that is not properly configured. Cascading failures are disproportionate error budget consumers because they turn a contained problem into a multi-service incident. A 10% cascading failure rate means 10% of your incidents are consuming 2-5x the error budget they would in isolation.

Alert Quality Metrics

Alert quality is one of the most neglected categories of SRE measurement. Poor alerting causes alert fatigue, which causes missed real incidents, which causes SLO breaches that nobody detected. Measuring your alerting system is as important as measuring the services it monitors.

Alert-to-Incident Ratio

The fraction of alerts that correspond to real, actionable incidents. An alert that fires but resolves without human intervention — or fires but turns out to be a false positive — is noise.

alert_to_incident_ratio = real_incidents / total_alerts_fired

Target: above 80%. If fewer than 80% of your alerts represent real problems, your on-call engineers are spending more than 20% of their alert-handling time on noise. Below 60% is alert fatigue territory — on-call engineers begin to treat alerts as background noise, increasing the risk of a real incident being missed.

Actionable Alert Rate

The fraction of alerts that require human action versus those that resolve automatically. An alert that pages a human but resolves before they can act is waste. These alerts should either be auto-remediated or suppressed until a threshold that actually requires human judgment.

actionable_alert_rate = alerts_requiring_human_action / total_alerts_fired

Track this separately from alert-to-incident ratio. An alert can represent a real incident but still be non-actionable if it resolves before a human could intervene — that is a signal to invest in automated remediation, not just to suppress the alert.

Alert Fatigue Score

Alert fatigue score measures the volume of alert noise on-call engineers are absorbing per shift. More than 10 alerts per on-call shift is the threshold above which cognitive load becomes dangerous.

alert_fatigue_score = total_alerts_per_shift / on_call_engineers_on_shift

Track this by team and by time window (business hours vs. nights/weekends). Alert volume that is manageable during the day becomes dangerous at 3am when the on-call engineer is already sleep-deprived. Teams with alert fatigue scores above 15 per shift consistently show higher MTTD and lower postmortem completion rates.

| Alert metric | Healthy | Warning zone | Alert fatigue |
|---|---|---|---|
| Alert-to-incident ratio | >80% | 60–80% | <60% |
| Actionable alert rate | >70% | 50–70% | <50% |
| Alerts per on-call shift | <10 | 10–20 | >20 |

Capacity Planning Metrics

Capacity planning is where SRE intersects with finance and infrastructure. The goal is to maintain enough headroom to absorb traffic spikes without SLO impact, without over-provisioning to the point of wasting significant budget. The metrics that enable this:

Growth Rate vs. Infrastructure Scaling Rate

The ratio between how fast traffic is growing and how fast your infrastructure can be provisioned or scaled. If traffic is growing at 15% per month and your infrastructure team can provision capacity in 3 weeks, you need runway forecasts that account for that lag. The metric:

capacity_gap_days =
  (capacity_ceiling - current_utilization) / daily_growth_rate

If this drops below your provisioning lead time (in days), you have a capacity crisis in progress. Track it per critical resource (compute, database connections, storage) and alert when headroom drops below 30 days.
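A sketch of the headroom calculation, using a hypothetical database connection pool as the tracked resource:

```python
def capacity_gap_days(capacity_ceiling, current_utilization, daily_growth):
    """Days of headroom left, given absolute growth per day in the same units."""
    if daily_growth <= 0:
        return float("inf")  # flat or shrinking demand: no forecastable crunch
    return (capacity_ceiling - current_utilization) / daily_growth

# Hypothetical: pool capped at 10,000 connections, using 7,000, growing 60/day
headroom = capacity_gap_days(10_000, 7_000, 60)
print(f"{headroom:.0f} days of headroom")  # 50 days
```

With a 21-day provisioning lead time and a 30-day alert threshold, 50 days is comfortable but close enough to start planning.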

P99 Latency Under Load

Run load tests against your services at 100%, 150%, and 200% of current peak traffic. Record P99 latency at each load level. This tells you your "cliff edge" — the traffic level at which latency degrades past your SLO threshold.

latency_headroom_ratio = SLO_latency_threshold / P99_at_current_peak_traffic

A ratio above 2.0 means you have substantial headroom before load causes SLO breaches. Below 1.2 means a traffic spike 20% above your current peak could breach your latency SLO. Use this to prioritize performance optimization work.
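The headroom ratio is a single division, with the interpretation bands above baked into the docstring:

```python
def latency_headroom_ratio(slo_threshold_ms, p99_at_peak_ms):
    """>2.0 means ample headroom; <1.2 means a ~20% spike may breach the SLO."""
    return slo_threshold_ms / p99_at_peak_ms

# Hypothetical: 300ms P99 SLO, measured 260ms P99 at current peak traffic
ratio = latency_headroom_ratio(slo_threshold_ms=300, p99_at_peak_ms=260)
print(round(ratio, 2))  # 1.15
```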

Resource Utilization Variance

Track actual resource utilization against your provisioned plan. Variance in either direction is a signal: consistently running below 40% means you are over-provisioned and paying for unused capacity; consistently running above 70% means your capacity plans are too aggressive and you are running close to the cliff.

How SRE Metrics Connect to Engineering Metrics

The integration between SRE metrics and engineering delivery metrics is not optional — it is the point. Most SRE teams have excellent SLO dashboards and poor visibility into the delivery process that creates their incidents. Most engineering teams have DORA dashboards and no connection to the SLO impact of their CFR numbers.

The connections that matter most:

  • CFR → SLO burn rate: Every deployment failure consumes error budget. Track CFR and error budget consumption on the same dashboard to understand the reliability cost of your deployment process.
  • MTTR → error budget recovery rate: How fast can you stop the bleeding after a failed deploy? Fast MTTR limits the budget consumed per incident. Slow MTTR turns a recoverable situation into an SLO breach.
  • Deployment frequency × CFR → expected incident rate: The product of these two numbers tells you how many deployment-caused incidents you should expect per month. If that number is too high relative to your error budget, you need to cut CFR before increasing deployment frequency.
  • Deploy risk score → future SLO impact: A high-risk deployment (large change, low coverage, unfamiliar author, no reviewer expertise) is likely to fail. Catching it before merge means catching it before it becomes an SLO event.

Teams with elite DORA metrics have approximately 50% lower SLO breach rates than medium-tier teams — this correlation holds across company sizes and industries. The mechanism is exactly these connections: better delivery practices mean fewer failed deployments, which means slower error budget consumption, which means more headroom to ship features without risking SLO breaches.

Koalr: DORA + incident metrics in one platform

Koalr connects GitHub, PagerDuty, and Incident.io to give you DORA metrics and MTTR trending in one view. Connect your incident tool and your first MTTR trend chart is ready in minutes. No custom instrumentation required.

Building Your SRE Metrics Stack

The implementation order matters. Teams that try to instrument everything at once end up with dashboards nobody maintains. A staged approach:

Stage 1 — Core SLIs (week 1–2): Availability and error rate from your existing monitoring (Datadog, Prometheus, CloudWatch). Define "error" consistently. Set your first SLO. Calculate your error budget. These three steps alone will change how your team talks about reliability.

Stage 2 — Latency and burn rate (week 2–4): Add P50/P95/P99 latency dashboards. Configure burn rate alerts at the 1-hour, 6-hour, and 3-day windows. This is when your alerting starts to reflect real user impact instead of arbitrary thresholds.

Stage 3 — DORA metrics (month 1–2): Connect GitHub for deployment frequency and lead time. Connect PagerDuty or Incident.io for MTTR and incident frequency. Set CFR thresholds. Koalr automates this entire stage if you have those integrations connected.

Stage 4 — Alert quality and capacity (month 2–3): Instrument alert-to-incident ratio and alerts-per-shift. Run baseline load tests to establish capacity headroom ratios. Build capacity forecasts from growth rate data.

Stage 5 — Integration and action (ongoing): Connect error budget burn to deployment freeze policies. Gate deployment windows on remaining error budget. Use deploy risk scores to gate high-risk changes during low-budget periods. This is where SRE metrics stop being dashboards and start being operational policy.

Common SRE Metrics Mistakes

The implementation mistakes that show up most consistently across SRE teams of all sizes:

  • Using mean latency instead of percentiles. Averaging the slow 1% of requests together with the fast 99% produces a number that looks fine while your tail users suffer — the mean hides your worst user experiences. Always use P95 and P99 for SLO definition.
  • Setting SLOs that are already met with no effort. A 99.5% SLO when you are running at 99.97% is not a target — it is a rubber stamp. SLOs should require engineering attention to maintain.
  • Not defining "resolved" for MTTR. If different on-call engineers close incidents at different stages, your MTTR data is worthless. Define "resolved" as "service restored to within SLO" and enforce it.
  • Tracking incident count instead of error budget impact. Ten 5-minute incidents are less damaging than one 3-hour incident. Count incidents by severity and by error budget consumed, not just by count.
  • No connection between DORA and SLO metrics. Engineering teams optimizing deployment frequency without visibility into CFR and SLO impact are making reliability decisions without the data they need.

Connect PagerDuty or Incident.io — get MTTR trending in minutes

Koalr pulls incident data from PagerDuty and Incident.io automatically, correlates it with your GitHub deployments, and gives you DORA metrics plus MTTR trends in one platform. No custom instrumentation. No data pipelines to maintain.