How to Design a Fair On-Call Rotation
The best on-call rotations balance four things: fairness across the team, a sustainable individual workload, predictable schedules, and clear handoffs. Get these wrong and you get burnout, missed pages, and turnover.
Rotation Frequency: Daily, Weekly, or Monthly?
- Weekly (most common) — long enough to get context, short enough to avoid burnout. Use Monday or Wednesday handoff, never Friday.
- Daily — works for high-volume incident teams. Each engineer gets one bad day per N days. Requires excellent handoff docs.
- Monthly — usually a poor choice. A month is too long to sustain without burnout. Avoid it unless the team is large (10+) and incident volume is very low.
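As a rough illustration of the weekly option, the sketch below builds a round-robin schedule with a Monday handoff. The function name `weekly_rotation`, the engineer names, and the start date are all hypothetical; a real scheduler would also need to handle swaps, holidays, and time zones.

```python
from datetime import date, timedelta

def weekly_rotation(engineers, start, weeks):
    """Return (handoff_date, engineer) pairs, one per week, in round-robin order."""
    # Move forward to the first Monday on or after `start` (Monday is weekday 0).
    first_monday = start + timedelta(days=(7 - start.weekday()) % 7)
    return [
        (first_monday + timedelta(weeks=i), engineers[i % len(engineers)])
        for i in range(weeks)
    ]

# Hypothetical three-person team, starting from a Wednesday.
schedule = weekly_rotation(["ana", "ben", "chen"], date(2024, 1, 3), 4)
for handoff, engineer in schedule:
    print(handoff.isoformat(), engineer)
```

With three engineers, each person is on call one week in three, and every handoff lands on a Monday.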
Primary + Secondary Coverage
A robust on-call rotation has two layers: primary (gets paged first) and secondary (gets paged if primary doesn't ack within N minutes, typically 5-15). Secondary should rotate independently from primary so no one is "always on" both. Add a manager escalation as third tier for major incidents.
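One simple way to keep the two layers apart is to run the secondary rotation at a fixed offset from the primary, so the same person never holds both roles in one shift. The sketch below assumes that approach; the names `Shift`, `build_shifts`, and `who_to_page` are hypothetical, as are the 10-minute ack timeout and the choice to page the manager after a second timeout on major incidents only.

```python
from dataclasses import dataclass

@dataclass
class Shift:
    primary: str
    secondary: str
    manager: str

def build_shifts(engineers, manager, weeks, offset=1):
    """Pair each week's primary with a secondary `offset` places further along the list."""
    n = len(engineers)
    if n < 2 or offset % n == 0:
        raise ValueError("need at least two engineers and a non-zero offset")
    return [
        Shift(engineers[i % n], engineers[(i + offset) % n], manager)
        for i in range(weeks)
    ]

def who_to_page(shift, minutes_unacked, major=False, ack_timeout=10):
    """Return everyone who should have been paged after `minutes_unacked` minutes."""
    paged = [shift.primary]
    if minutes_unacked >= ack_timeout:
        paged.append(shift.secondary)
    # Assumption: the manager tier applies only to major incidents,
    # and only after a second timeout has passed without an ack.
    if major and minutes_unacked >= 2 * ack_timeout:
        paged.append(shift.manager)
    return paged

shifts = build_shifts(["ana", "ben", "chen"], "dana", weeks=6)
print(who_to_page(shifts[0], minutes_unacked=12))
```

Most paging tools express this as an escalation policy in configuration rather than code; the sketch is only meant to make the tiering logic concrete.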
On-Call Best Practices
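- Comp time, not money — compensate on-call with time off rather than cash. Cash rewards people for tolerating noisy alerts; time off costs the team capacity, which creates pressure to tune them.
- No on-call during PTO — swap shifts ahead of time; don't expect anyone to answer pages while on vacation.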
- Runbooks for every alert — if an alert can fire, it needs a runbook. Otherwise it's just noise.
- Page budget per shift — if a shift consistently sees more than 2 pages per night, the alerts need tuning. Treat alert noise as a P1 bug.
- Handoff sync — 15-minute call at shift change to brief incoming on-call. Cover open incidents, suspicious metrics, planned changes.
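The page budget above can be checked mechanically at the end of each shift. This is a minimal sketch: the function name `over_page_budget` is hypothetical, and interpreting "consistently" as at least three nights over budget in a week is an assumption to adjust for your team.

```python
def over_page_budget(pages_per_night, budget=2, min_bad_nights=3):
    """Flag a shift when enough nights exceeded the per-night page budget."""
    # Assumption: "consistently" means at least `min_bad_nights` nights over budget.
    bad_nights = sum(1 for pages in pages_per_night if pages > budget)
    return bad_nights >= min_bad_nights

# Hypothetical page counts for two week-long shifts.
quiet_week = [0, 1, 0, 2, 0, 0, 1]
noisy_week = [3, 4, 0, 5, 1, 3, 0]

print(over_page_budget(quiet_week))  # False
print(over_page_budget(noisy_week))  # True
```

A flagged shift is a natural agenda item for the handoff sync, since the outgoing engineer knows which alerts caused the noise.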
Reducing On-Call Load with Better Monitoring
Most on-call burnout comes from alert fatigue, not actual incidents. The fix is fewer, higher-signal alerts: use confirmation thresholds (alert only after N consecutive failures), flap detection, and SLO-based alerting tied to error budget burn rate rather than raw error counts.
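Two of these techniques are easy to sketch. The first function applies a confirmation threshold and reports a failure only after N consecutive failed checks. The second computes an error budget burn rate: the observed error rate divided by the error rate the SLO allows. The function names, the 99.9% SLO target, and the paging threshold of 10 are illustrative assumptions, and flap detection is not covered here.

```python
def confirmed_failure(check_results, n=3):
    """True if the most recent n checks all failed (each result: True = healthy)."""
    recent = check_results[-n:]
    return len(recent) == n and not any(recent)

def burn_rate(errors, requests, slo_target=0.999):
    """Observed error rate divided by the error rate the SLO allows.

    A value of 1.0 spends the error budget exactly on schedule;
    higher values spend it proportionally faster.
    """
    if requests == 0:
        return 0.0
    allowed_error_rate = 1.0 - slo_target
    return (errors / requests) / allowed_error_rate

def should_page(errors, requests, slo_target=0.999, threshold=10.0):
    # Assumption: the threshold of 10 is arbitrary; choose it from your own SLO window.
    return burn_rate(errors, requests, slo_target) >= threshold

# One blip followed by recovery does not alert; three failures in a row do.
print(confirmed_failure([True, False, True, True]))
print(confirmed_failure([True, False, False, False]))

# 50 errors in 10,000 requests against a 99.9% SLO burns budget at roughly 5x.
print(should_page(errors=50, requests=10_000))
```

The burn-rate check scales with traffic, so 50 errors in 10,000 requests and 500 errors in 100,000 requests are treated the same, which a raw error count cannot do.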