On-Call Health · March 16, 2026 · 11 min read

On-Call Best Practices: How to Build a Healthy On-Call Rotation Without Burning Out Your Engineers

On-call duty is one of the most demanding commitments an engineer makes to their team. Done well, it is a shared responsibility that builds system ownership and resilience. Done poorly, it becomes a chronic source of sleep deprivation, anxiety, and attrition. This guide covers the practices that separate teams where on-call is a manageable burden from teams where it is quietly destroying morale.

The on-call burnout numbers are alarming

62% of engineers report that on-call burden negatively affects their well-being. Alert fatigue — being paged by alerts that do not require action — is the single most frequently cited cause of on-call burnout on infrastructure teams. The problem is not that on-call is inherently harmful. It is that most on-call programs have never been deliberately designed to be sustainable.

The On-Call Burnout Crisis

On-call burnout does not announce itself with a single dramatic event. It accumulates through repeated sleep disruptions, through pages that fire and auto-resolve before the engineer can even open their laptop, through shifts where the volume of alerts makes each individual alert feel meaningless. Over time, engineers learn to delay acknowledgment, to silence notifications, and — eventually — to leave for teams with quieter on-call rotations.

The 62% figure is a floor estimate. Engineers who have already left noisy rotations are not counted. Engineers who have adapted by simply routing pages to silent channels are not counted. The real proportion of engineers whose well-being is affected by on-call burden is almost certainly higher.

Alert fatigue is the primary mechanism. When an on-call engineer receives 40 pages in a shift and 35 of them require no action — they self-resolve, they are informational, they fire on thresholds that do not reflect real user impact — the 5 that do require action receive degraded attention. The cognitive cost of evaluating each alert is paid regardless of whether the alert needs a response. Alert fatigue is not a discipline problem. It is a system design problem, and it has system design solutions.

This guide covers those solutions, and everything else that goes into a sustainable on-call program: how to measure rotation health, how to structure the rotation itself, how to compensate engineers fairly, and how to build the psychological safety that makes engineers willing to engage honestly with post-mortems rather than deflecting blame.

The 5 On-Call Health Metrics

You cannot improve what you do not measure. Most teams measure incident count and MTTR — and stop there. Those metrics tell you whether your systems are reliable. They do not tell you whether your on-call program is sustainable. These five metrics measure the human side of on-call operations.

| Metric | Definition | Healthy Target |
| --- | --- | --- |
| Alerts per shift | Total pages received by one on-call engineer during a shift period | <10 actionable alerts |
| P1 incidents per week | Count of highest-severity incidents requiring immediate human response | <2 per week |
| Alert-to-incident ratio | Pages fired divided by pages that became declared incidents | <3:1 (aim for 1.5:1) |
| Time to acknowledge | Median time from alert fire to engineer acknowledgment | <5 min for P1 |
| Sleep disruption rate | Percentage of on-call shifts with at least one page between midnight and 6am | <20% of shifts |

The alert-to-incident ratio is the most diagnostic of the five. A ratio above 3:1 means more than two-thirds of your pages are noise. This is the clearest signal that alert hygiene work is overdue. A ratio above 5:1 is a crisis — engineers are spending most of their on-call cognitive energy evaluating alerts that should never have fired.

Sleep disruption rate requires honest self-reporting or integration with your on-call tooling to identify pages that fire during sleep hours. PagerDuty and Incident.io both surface this in their analytics. Track it per engineer, per rotation, and per service — you will find that most sleep disruptions come from a small number of high-noise services, which makes prioritization straightforward.
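As a concrete sketch, here is how several of these metrics can be computed from raw page records exported from your on-call tool. The field names (`shift_id`, `fired_at`, `became_incident`) are illustrative assumptions, not any vendor's actual schema:

```python
from datetime import datetime

def oncall_health(pages, shifts):
    """Compute on-call health metrics from raw page records.

    Each page is a dict with illustrative fields: `shift_id`,
    `fired_at` (a datetime), and `became_incident` (bool).
    `shifts` is the number of shifts in the reporting window.
    """
    total = len(pages)
    incidents = sum(1 for p in pages if p["became_incident"])
    # Shifts that saw at least one page between midnight and 6am.
    night_shifts = {p["shift_id"] for p in pages if 0 <= p["fired_at"].hour < 6}
    return {
        "alerts_per_shift": total / shifts if shifts else 0.0,
        "alert_to_incident_ratio": total / incidents if incidents else float("inf"),
        "sleep_disruption_rate": len(night_shifts) / shifts if shifts else 0.0,
    }
```

Run it monthly over the same export and the trend, not the point value, is what you report.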

For a detailed breakdown of how these metrics connect to your broader reliability picture, see the SRE metrics guide.

Designing a Healthy On-Call Rotation

Rotation design is the structural foundation of on-call health. Even with perfect alert hygiene, a poorly designed rotation will exhaust engineers. These are the design principles that reduce burnout at the structure level.

Minimum 4-Person Rotation

A two- or three-person rotation means each engineer is on call a third to half of the time. Even with low alert volume, the psychological burden of being on call — always reachable, never fully off — accumulates. Four people means a maximum of one week in four, which is the threshold below which most engineers describe on-call as manageable rather than oppressive. Six to eight people is the target for high-traffic services.

If your team does not have four engineers who can reasonably cover a service, you have a staffing problem, not an on-call design problem. Spreading two engineers across a 24/7 rotation is not a solution — it is a slow-motion attrition risk that will leave you with zero engineers who know that service.

Follow-the-Sun for Global Teams

For teams distributed across multiple time zones, follow-the-sun rotation assigns on-call coverage to the region where it is currently business hours. An engineer in London covers European mornings; an engineer in San Francisco covers Pacific afternoons; an engineer in Singapore covers the Asian window. When implemented correctly, every engineer on the rotation handles on-call only during their own waking hours.

Follow-the-sun requires explicit handoff procedures (covered below) and enough engineers in each region to avoid single points of coverage. It is not viable with one engineer per timezone — but for teams of 8 or more distributed across three regions, it effectively eliminates routine sleep disruption, because pages land in whichever region is awake.

Week-On, Not Night-On-Night

Nightly rotation — where on-call responsibility shifts every 24 hours — sounds fair because it distributes burden evenly. In practice, it is more exhausting than weekly rotation because it eliminates the ability to plan around on-call. Weekly rotation gives engineers a predictable block: they know which week is their on-call week and can schedule travel, appointments, and personal commitments around it.

The objection to weekly rotation is that a bad week is very bad. This is true — but it is addressed by alert hygiene and rotation size, not by fragmenting the rotation further. With those in place, a quiet week on a weekly schedule is far less burdensome than the same alert load scattered across alternating nights.

Dedicated Recovery Time Post-Shift

Engineers coming off a high-volume on-call shift should not be expected to immediately re-engage at full cognitive capacity. A post-shift recovery day — or at minimum a recovery morning — where the engineer has no meetings, no sprint commitments, and explicit permission to catch up on sleep or low-intensity tasks, produces measurably better outcomes than expecting immediate full productivity return.

This is not a luxury accommodation. It is a reliability investment. An exhausted engineer making post-incident architecture decisions or reviewing high-risk PRs on the day after a grueling on-call shift is a risk to your systems. Recovery time pays for itself in prevented errors.

Alert Hygiene — Taming the Noise

Alert hygiene is the highest-leverage on-call improvement available to most teams. It requires no new tooling, no new headcount, and no architectural changes. It requires one thing: a commitment to the principle that every alert that pages an engineer must require that engineer to take action.

Actionable vs. Informational Alerts

The first classification every alert needs is: is this actionable or informational? An actionable alert requires the on-call engineer to do something — acknowledge, diagnose, escalate, or remediate. An informational alert communicates a system state that is worth knowing but does not require immediate human response.

Informational alerts do not belong in your paging system. They belong in a dashboard, a Slack channel that nobody monitors at 3am, or a daily digest. The moment an informational alert is routed through PagerDuty, it begins training engineers to ignore PagerDuty. That is the root of alert fatigue, and it is entirely preventable.

Conduct a quarterly alert audit: for each alert in your paging system, answer the question "what does the on-call engineer do when this fires?" If the answer is "check a dashboard and if everything looks fine, close it" — that is an informational alert. Remove it from the paging rotation.

Auto-Resolved Alerts Do Not Count as Actionable

An alert that fires and auto-resolves before the engineer acknowledges it is not alerting — it is alarming. It provides no opportunity for remediation and exists only to interrupt the engineer's sleep. If an alert regularly auto-resolves, either the threshold is wrong (raise it until the alert only fires when the condition persists long enough to require action) or the condition it is monitoring self-heals and does not require human intervention (in which case it is informational and should be removed from the paging system entirely).

Track your auto-resolution rate. Any alert with an auto-resolution rate above 30% is a candidate for immediate removal or significant threshold revision.
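A minimal audit script for this rule might look like the following. The `(alert_name, auto_resolved)` event shape is an assumption; in practice you would pull these fields from your pager's analytics export:

```python
from collections import defaultdict

def flag_noisy_alerts(events, threshold=0.30):
    """Flag alerts whose auto-resolution rate exceeds `threshold`.

    `events` is one (alert_name, auto_resolved) pair per firing --
    the shape is illustrative; feed it from your pager's export.
    """
    fired = defaultdict(int)
    auto = defaultdict(int)
    for name, auto_resolved in events:
        fired[name] += 1
        if auto_resolved:
            auto[name] += 1
    # Return (name, rate) for every alert over the threshold.
    return sorted(
        (name, auto[name] / fired[name])
        for name in fired
        if auto[name] / fired[name] > threshold
    )
```

Anything this flags goes on the next alert-audit agenda for removal or threshold revision.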

SLO Burn Rate Alerting vs. Threshold Alerting

Threshold alerting — "page me when latency exceeds 500ms" — generates false positives (brief spikes that self-resolve) and false negatives (gradual degradation that stays below threshold while burning the error budget). SLO burn rate alerting fires based on how fast you are consuming your error budget, not on whether a point-in-time metric crossed a line.

A 14.4x burn rate means you will exhaust your monthly error budget in 50 hours — worth an immediate page. A 6x burn rate means you will exhaust it in 5 days — worth a ticket and a next-business-day investigation. A 1x burn rate means you are on track — not worth waking anyone up. Burn rate alerts are self-calibrating to real user impact and dramatically reduce the false positive rate that drives alert fatigue.
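The arithmetic is simple enough to sketch directly. This is a minimal illustration of the thresholds above, not a production alerting rule — real burn rate alerts typically evaluate multiple time windows to suppress brief spikes:

```python
def burn_rate(observed_error_rate, slo_target):
    """Rate of error-budget consumption: 1.0 exhausts the budget exactly
    at the end of the SLO window; 14.4 drains a 30-day budget in 50 hours."""
    error_budget = 1.0 - slo_target  # e.g. 0.001 for a 99.9% SLO
    return observed_error_rate / error_budget

def route_burn_alert(rate, window_hours=720):
    """Map a burn rate to a response tier using the thresholds above."""
    hours_left = window_hours / rate if rate > 0 else float("inf")
    if rate >= 14.4:
        return "page", hours_left       # budget gone within ~2 days
    if rate >= 6.0:
        return "ticket", hours_left     # budget gone within ~5 days
    return "none", hours_left           # on track, let people sleep
```

For example, a 1.44% error rate against a 99.9% SLO is a 14.4x burn: an immediate page, with roughly 50 hours of budget remaining.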

For a deeper treatment of SLO alerting theory and implementation, see the SRE metrics guide.

Runbook-First Alerting

Every alert that pages an engineer must have a runbook — a documented set of diagnostic steps and remediation options that the on-call engineer follows when the alert fires. No runbook means the engineer must reconstruct the diagnostic process from memory under stress. This adds minutes to every incident, and those minutes compound across a year of incidents.

The runbook does not need to be comprehensive. It needs three things: what the alert means, what to check first, and what to do if the standard fix does not work (which is usually "escalate to the service owner"). A 10-line runbook that exists is infinitely more valuable than a 50-page runbook that is still being written.

If an alert does not have a runbook, do not add it to the paging system until it does. This one rule, applied consistently, will force you to write runbooks for your most important alerts and will prevent the accumulation of mystery alerts that nobody understands anymore.
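The rule is easy to enforce mechanically, for example as a CI check over your alert definitions. The `pages` and `runbook_url` field names here are assumptions, not any vendor's actual alert schema:

```python
def pages_without_runbooks(alert_defs):
    """Return names of alerts that page a human but carry no runbook link.

    `pages` and `runbook_url` are assumed field names; adapt them to
    however your team stores alert definitions.
    """
    return [
        a["name"]
        for a in alert_defs
        if a.get("pages") and not a.get("runbook_url")
    ]
```

Wire this into CI so a pull request adding a paging alert without a runbook link fails review automatically.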

The On-Call Handoff Checklist

The handoff between outgoing and incoming on-call engineers is one of the highest-risk moments in the on-call cycle. An engineer who begins a shift without knowledge of the current system state starts from zero — which means the first 30 minutes of the shift are spent reconstructing context rather than responding to incidents. A structured handoff eliminates this cold-start cost.

On-call handoff checklist

  • ☐ Open incidents: severity, current status, owner, expected resolution
  • ☐ Recent deployments in the last 24 hours: service, SHA, deploying engineer
  • ☐ Pending tasks from current shift: follow-ups, postmortem drafts, tickets opened
  • ☐ Known flaky alerts: alerts that are firing but are known non-actionable
  • ☐ Current SLO burn rates: any services burning above 2x baseline
  • ☐ Scheduled maintenance or planned changes in the next shift window
  • ☐ Any on-call tooling issues or escalation path changes

The handoff should be written, not verbal. A verbal handoff is forgotten within an hour and cannot be referenced during a 3am incident. A written handoff — in a shared document, a dedicated Slack channel, or your on-call tool's notes field — is searchable, referenceable, and creates an audit trail for post-incident analysis.

Make the handoff a mandatory step in your on-call process, not an optional courtesy. Outgoing on-call engineers who skip the handoff should be reminded by their manager that the 15 minutes spent writing it will be repaid the next time they are the incoming engineer starting a shift cold.
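One way to make the written handoff cheap enough that nobody skips it is to generate the note from structured shift state. A minimal sketch, with illustrative field names covering a subset of the checklist:

```python
def handoff_note(shift):
    """Render a written handoff note from structured shift state.

    The field names (open_incidents, recent_deploys, flaky_alerts,
    pending_tasks) are illustrative, not any tool's schema.
    """
    sections = [
        ("Open incidents", "open_incidents"),
        ("Deploys in the last 24h", "recent_deploys"),
        ("Known flaky alerts", "flaky_alerts"),
        ("Pending follow-ups", "pending_tasks"),
    ]
    lines = [f"On-call handoff: {shift['date']}"]
    for title, key in sections:
        lines.append(f"{title}:")
        items = shift.get(key, [])
        if items:
            lines.extend(f"  - {item}" for item in items)
        else:
            lines.append("  (none)")
    return "\n".join(lines)
```

Post the output to a dedicated channel or your on-call tool's notes field so it stays searchable at 3am.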

Escalation Paths: Who to Call When

A well-defined escalation path answers three questions before an incident starts: who gets paged at each severity level, how long before the next escalation fires if there is no acknowledgment, and when does management get involved? Without clear answers to these questions, incident triage time is wasted on escalation decisions rather than remediation.

Escalation Matrix Template

A minimal escalation matrix for a team of 15–30 engineers covers three tiers:

  • P1 (Service down, all users affected): Primary on-call engineer (immediate) → Secondary on-call or service owner (5 min if unacknowledged) → Engineering manager (10 min if no triage started) → VP Engineering or CTO (30 min if P1 is unresolved)
  • P2 (Partial degradation, significant impact): Primary on-call engineer (immediate) → Engineering manager (15 min if unacknowledged). No executive escalation unless P2 is not resolved within 2 hours.
  • P3 (Minor impact, workaround available): Ticket routed to owning team. No page, no escalation. Next business day response.

The time limits before escalation are as important as the escalation targets themselves. Without explicit time limits, escalations are often delayed out of reluctance to wake senior engineers. That reluctance is understandable but counterproductive — a P1 that is silently burning while the on-call engineer works alone for 45 minutes without progress is worse than the awkwardness of a 2am page to the engineering manager.
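The matrix is simple enough to encode directly, which also makes the time limits testable. One possible encoding, a sketch whose tier names and timings mirror the matrix above (it simplifies by keying every escalation off time-unacknowledged):

```python
# Minutes of no acknowledgment before each target is paged; the tiers
# mirror the matrix above, but this structure is illustrative only.
ESCALATION_POLICY = {
    "P1": [(0, "primary"), (5, "secondary"), (10, "eng_manager"), (30, "vp_eng")],
    "P2": [(0, "primary"), (15, "eng_manager")],
    # P3 never pages: ticket to the owning team, next business day.
}

def who_is_paged(severity, minutes_unacknowledged):
    """Everyone who should have been paged by now for an open incident."""
    return [
        target
        for minutes, target in ESCALATION_POLICY.get(severity, [])
        if minutes_unacknowledged >= minutes
    ]
```

Encoding the policy as data rather than prose means the on-call tool's configuration and the documented matrix cannot silently drift apart.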

Management Escalation for P1 Beyond 30 Minutes

Engineering managers should be in the escalation path for any P1 incident that is not resolved within 30 minutes. This is not punitive — it is operational. An engineering manager on a P1 can coordinate communication to customer success, update the status page, shield the engineering team from stakeholder interruptions, and make staffing decisions (who else needs to be called in) without those tasks falling to the on-call engineer who should be focused on remediation.

For more detail on structuring the incident response process end-to-end, see the guide to improving MTTR.

Compensating On-Call Engineers Fairly

On-call compensation is one of the most underinvested areas of engineering management. Many organizations treat on-call as an implicit part of the engineering role without explicit acknowledgment that it represents real labor — labor that occurs at nights and on weekends, that disrupts personal time, and that carries cognitive costs well beyond the hours directly spent responding to incidents.

On-Call Pay

The clearest signal that an organization values sustainable on-call is direct financial compensation for on-call shifts. Rates vary by organization and market, but a common structure is a flat per-shift stipend (typically $100–300 per week-long shift) plus an additional per-incident payment for P1 and P2 incidents that occur during the shift. The per-incident payment acknowledges that not all shifts are equal — a shift with three P1 incidents is categorically more demanding than a quiet shift, and flat compensation does not reflect that.

Compensatory Time Off

For teams that cannot offer direct on-call pay (common at early-stage startups), the next best compensation is compensatory time off. An engineer who handles a 3am P1 on a Sunday should not be expected at their desk at 9am Monday. A half-day or full-day comp time for each incident that disrupts sleep is the minimum acknowledgment that on-call labor is real and has recovery costs.

On-Call Bank for Extra Shifts

When engineers cover extra shifts — due to vacation, illness, or roster gaps — they should accrue on-call credits that can be redeemed for future time off or additional compensation. This prevents the resentment that builds when some engineers consistently absorb extra shifts without formal acknowledgment. Tracking extra shifts in a shared on-call bank, visible to the team, also creates social accountability: teammates can see who has covered extra burden and can volunteer to equalize it.

Building Psychological Safety Around On-Call

Psychological safety in on-call operations means engineers feel safe acknowledging mistakes, reporting near-misses, and engaging honestly in post-mortems — without fear that doing so will affect their performance reviews, reputation, or standing on the team. It is the cultural foundation that makes every other on-call improvement possible.

No Blame for Incidents

Every incident has a root cause at the system level, not the individual level. The engineer who merged the PR that caused the outage did so because: the review process did not catch the risk, the deployment pipeline did not flag it, the alerting did not detect it quickly enough, and the rollback tooling did not make recovery fast enough. These are system failures. Holding the individual engineer responsible for the system's failure to catch the issue is both factually wrong and practically harmful — it ensures that future engineers will be less transparent about risky changes, not more.

The no-blame principle must be modeled explicitly by engineering leadership. If managers or senior engineers assign blame in post-mortems, the cultural signal is clear regardless of what the written policy says. Blameless culture is a leadership behavior, not a policy document.

Post-Mortem as Learning Tool

The post-mortem is the most valuable artifact an incident can produce. Conducted well, it surfaces systemic weaknesses that would not be visible from dashboard metrics alone — alert coverage gaps, deployment process gaps, runbook gaps, knowledge silos. Conducted poorly (or not at all), it leaves those weaknesses invisible until the next incident.

Every P1 and P2 incident should produce a post-mortem within 48 hours of resolution. The post-mortem is not a report card — it is a structured retrospective with a timeline, a root cause analysis using 5 Whys, and concrete action items with owners and deadlines. Action items that do not get assigned an owner and a deadline are not action items — they are aspirations, and aspirations do not prevent the next incident.

Celebrating Good Incident Response

Recognition for handling an incident well is as important as the post-mortem analysis. Engineers who managed a difficult P1 calmly, communicated clearly, coordinated the response effectively, and drove rapid resolution should hear that explicitly from their manager. Public recognition in team channels for strong incident handling signals that incident response skill is valued — which makes engineers more willing to invest in developing it and more willing to take on on-call shifts.

Tools Comparison: PagerDuty vs. Incident.io vs. OpsGenie vs. Squadcast

Choosing the right on-call management tool has a meaningful impact on rotation health. The tools differ significantly in their rotation scheduling UI, escalation policy flexibility, alert grouping, and analytics depth. Here is how the four leading platforms compare for teams focused on on-call health.

| Tool | Best For | Strengths | Limitations | Pricing Tier |
| --- | --- | --- | --- | --- |
| PagerDuty | Enterprise, complex escalation trees | Deep analytics, AI noise reduction, mature integrations ecosystem | Expensive at scale, UI complexity, steep learning curve | $$$ (per user/month) |
| Incident.io | Teams that prioritize incident workflow and postmortems | Best-in-class incident workflow, Slack-native, strong postmortem tooling | On-call scheduling less mature than PagerDuty, fewer alert integrations | $$ (per user/month) |
| OpsGenie | Atlassian shops (Jira, Confluence, Statuspage) | Tight Atlassian integration, solid rotation UI, competitive pricing | Analytics less deep than PagerDuty, incident workflow is secondary | $ (affordable at scale) |
| Squadcast | Cost-conscious teams, startups, SMBs | Generous free tier, clean UI, good core on-call and escalation features | Smaller integrations library, less enterprise-grade analytics | Free tier available |

For teams choosing between PagerDuty and Incident.io specifically, the decision usually comes down to whether the primary pain point is alert routing complexity (PagerDuty wins) or incident workflow and communication quality (Incident.io wins). For teams already deep in the Atlassian stack, OpsGenie's integration depth makes it the path of least resistance. For teams with tight budgets and fewer than 15 engineers on rotation, Squadcast is the most cost-effective option that still covers the core use cases.

For a detailed breakdown of PagerDuty alternatives and how they compare on specific features, see the PagerDuty alternatives comparison.

Measuring On-Call Improvement Over Time

Improvement in on-call health is only visible if you track the right metrics consistently over time. One-time measurements tell you where you are. Trend data tells you whether the investments you are making are working.

Alert Count Per Week — Trend

The total number of alerts fired per week is the broadest indicator of alert hygiene health. A declining trend means your alert audit work is reducing noise. A flat trend means alert hygiene work is keeping pace with growth but not improving the absolute level. A rising trend is a warning signal that needs immediate investigation — alert volume should not grow linearly with system complexity if alert hygiene practices are in place.

Track alert count separately from incident count. A team that reduces alerts per week from 200 to 80 while incident count stays flat has dramatically improved signal-to-noise without reducing reliability — that is a genuine win that should be visible in the data.

MTTD Trend

Mean Time to Detect measures how quickly your monitoring identifies incidents before they compound into larger outages. A falling MTTD trend indicates that your SLO-based alerting improvements are catching problems faster, and it frequently correlates with falling MTTR — earlier detection gives engineers more time to remediate before user impact compounds.

Track MTTD separately for deploy-correlated incidents (where a deployment preceded the incident) and non-deploy incidents (infrastructure failures, external dependency issues). Deploy-correlated MTTD is most directly improved by deployment-triggered alerting watchdogs. Non-deploy MTTD is most improved by SLO burn rate alert calibration.

Sleep Disruption Events

Sleep disruption events — pages that fire between midnight and 6am local time — require either self-reporting from engineers or integration with your on-call tooling to extract automatically. PagerDuty and Incident.io both surface out-of-hours alert data in their analytics dashboards. Track the count of sleep disruption events per engineer per month and the trend over time.

A sustained downward trend in sleep disruption events is one of the clearest leading indicators of on-call retention improvement. Engineers who are not being woken up are engineers who are less likely to quietly plan their exit from the rotation. This metric is worth reporting to leadership as a concrete retention investment outcome.
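If your tooling does not surface out-of-hours pages directly, they are straightforward to extract from raw page data, provided you convert timestamps to each engineer's local time. A sketch assuming tz-aware UTC timestamps and IANA zone names (the data shapes are illustrative):

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

def sleep_disruptions(pages, zone_by_engineer):
    """Count pages firing between 00:00 and 06:00 local time, per engineer.

    `pages` is a list of (engineer, fired_at) pairs with tz-aware UTC
    datetimes; `zone_by_engineer` maps engineer -> IANA zone name.
    Both shapes are assumptions about how you export your page data.
    """
    counts = {}
    for engineer, fired_at in pages:
        # Convert the UTC firing time into the engineer's local time.
        local = fired_at.astimezone(ZoneInfo(zone_by_engineer[engineer]))
        if 0 <= local.hour < 6:
            counts[engineer] = counts.get(engineer, 0) + 1
    return counts
```

The timezone conversion matters: for a distributed team, the same UTC page is a sleep disruption for one engineer and a mid-morning interruption for another.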

Monthly on-call health reviews — 30 minutes, the five health metrics, trend charts, and one action item for the coming month — are more effective than quarterly reviews at driving sustained improvement. The feedback loop is short enough that the connection between the action and the outcome is visible.

Koalr surfaces MTTR trends and incident frequency from your on-call data

Connect PagerDuty or Incident.io and Koalr automatically pulls your incident history, calculates MTTR by severity tier, trends incident frequency over time, and correlates incidents with the deployments that preceded them. See whether your on-call improvement efforts are showing up in the metrics — without building the analytics yourself.

Where to Start

The full picture painted in this guide — SLO burn rate alerting, follow-the-sun rotations, blameless post-mortems, on-call pay programs, sleep disruption tracking — can feel overwhelming if your current on-call program consists of one PagerDuty schedule and a lot of hope. The good news is that the highest-leverage changes are the simplest ones.

Start with the alert-to-incident ratio audit. In one afternoon, review the alerts that fired in the last 30 days. Remove every alert with an auto-resolution rate above 30%. Remove every alert that is informational. Add a runbook link to every alert that remains. This single pass will typically reduce alert volume by 40–60% on teams that have not done this work before.

Then address rotation size. If you have fewer than four people on a rotation, build the case to expand it. The math is simple: one engineer lost to burnout costs more in recruiting, onboarding, and knowledge transfer than a full year of reduced sprint velocity from adding a rotation member.

Then start tracking the five health metrics. You cannot improve what you do not measure, and you cannot convince leadership to invest in on-call health without data showing the current state and the direction of travel.

The teams with the healthiest on-call programs did not get there by solving everything at once. They got there by treating on-call health as a first-class engineering problem — one worth the same investment, measurement, and iteration that they apply to system reliability. That framing shift is the most important thing in this guide.

See your MTTR trend and incident frequency in minutes

Connect PagerDuty or Incident.io and Koalr automatically tracks your MTTR by severity tier, incident frequency trend, and deployment-to-incident correlations. No configuration required — connect your on-call tool and the data appears. See whether your on-call health improvements are moving the metrics.