Alerting That Doesn't Burn Out Oncall

Alert fatigue is the single biggest signal that an SRE practice is broken. The fix is structural, not 'try harder'. Symptom-based alerts, multi-window burn rates, and aggressive deletion of low-signal alerts move teams from 'paged 3x/night' to 'paged when it actually matters'.

Symptoms, not causes

Alert on 'checkout error rate >2%', not 'Postgres connection count >800'. The cause-based alert pages whether or not users are affected; the symptom-based alert pages only when users are actually hurting. Causes belong on dashboards, where they help diagnose a page, not in the pager itself.
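A minimal sketch of that symptom check, assuming you can already pull request and error counts for the checkout path over some evaluation window. The names here (WindowStats, should_page_on_symptom) are made up for illustration; only the 2% threshold comes from the example above.

```python
from dataclasses import dataclass


@dataclass
class WindowStats:
    """Request counts for one user-facing path over one evaluation window."""
    total: int
    errors: int


def error_rate(stats: WindowStats) -> float:
    # With no traffic there are no failing users, so report 0.0
    # instead of dividing by zero.
    if stats.total == 0:
        return 0.0
    return stats.errors / stats.total


def should_page_on_symptom(stats: WindowStats, threshold: float = 0.02) -> bool:
    """Page only when the user-visible error rate is above the threshold."""
    return error_rate(stats) > threshold
```

Note what is missing: nothing in the paging decision looks at connection counts, CPU, or queue depth. Those can still be collected, but they only show up once someone is already investigating.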

Multi-window burn rate

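Single-threshold alerts are noisy. Use Google's multi-window burn-rate pattern instead. Burn rate is how fast you are spending error budget relative to the rate that would exactly use it up over the SLO period. For fast burn, page only when both the 1-hour window and the 5-minute window are burning at more than about 14× (14.4× in Google's SRE Workbook); run a separate rule with longer windows and a lower multiplier for slow burn. The long windows catch sustained spikes and steady leaks; requiring the short window to agree ignores transient blips and lets the alert clear soon after the problem stops.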
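Here is a rough sketch of the paging logic, assuming a 99.9% SLO and error ratios already computed per window. The 14.4× (1 h and 5 min) and 6× (6 h and 30 min) pairs follow the example in Google's SRE Workbook for a 30-day SLO; the function names and the dictionary keys are hypothetical, and you should tune the numbers to your own budget.

```python
from typing import Mapping


def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are accruing."""
    budget = 1.0 - slo_target  # e.g. 0.001 for a 99.9% SLO
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / budget


def window_pair_fires(long_ratio: float, short_ratio: float,
                      slo_target: float, factor: float) -> bool:
    # Both windows must exceed the factor: the long one proves the burn
    # is significant, the short one proves it is still happening now.
    return (burn_rate(long_ratio, slo_target) > factor
            and burn_rate(short_ratio, slo_target) > factor)


def should_page_on_burn(ratios: Mapping[str, float],
                        slo_target: float = 0.999) -> bool:
    """ratios maps window name ('5m', '30m', '1h', '6h') to error ratio."""
    fast = window_pair_fires(ratios["1h"], ratios["5m"], slo_target, 14.4)
    slow = window_pair_fires(ratios["6h"], ratios["30m"], slo_target, 6.0)
    return fast or slow
```

A five-minute blip that never moves the 1-hour ratio does not page, while a steady 0.8% error rate (burn rate 8 against a 0.1% budget) trips the slow-burn pair even though it never reaches the fast-burn threshold.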

Auto-resolve, not auto-page

If the condition self-recovers within 10 minutes, don't page. Aggregate into a 'this happened N times today' summary email. Most transient alerts shouldn't have woken anyone up.
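One way to wire this up, sketched under the assumption that each alert firing is recorded with a start time and an optional resolve time. Incident, should_page_after_grace, and daily_summary are illustrative names, not part of any particular alerting tool.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, List, Optional

GRACE = timedelta(minutes=10)  # the self-recovery window from the text


@dataclass
class Incident:
    alert_name: str
    started: datetime
    resolved: Optional[datetime] = None  # None while still firing


def should_page_after_grace(incident: Incident, now: datetime) -> bool:
    """Page only if the condition is still firing once the grace period is up."""
    if incident.resolved is not None:
        return False
    return now - incident.started >= GRACE


def daily_summary(incidents: List[Incident]) -> Dict[str, int]:
    """Count, per alert, the incidents that recovered inside the grace period."""
    counts = Counter(
        i.alert_name
        for i in incidents
        if i.resolved is not None and i.resolved - i.started < GRACE
    )
    return dict(counts)
```

The summary is what goes in the daily email. An alert that self-recovers twenty times a day is still worth fixing, just not at 3 a.m.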

Prune ruthlessly

Monthly review: every alert that fired. Why did it fire? What action did oncall take? If no action, delete or downgrade. 'It was useful once last quarter' doesn't justify ongoing pages.
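The review itself can start as a small script over last month's page log. This sketch assumes each firing was recorded with a flag for whether oncall took any action beyond acknowledging it; the Firing record and monthly_review function are hypothetical, so adapt them to whatever your paging tool exports.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class Firing:
    alert_name: str
    action_taken: bool  # did oncall do anything beyond acknowledging?


def monthly_review(firings: Iterable[Firing]) -> Dict[str, str]:
    """Return a verdict for every alert that fired during the month."""
    fired = defaultdict(int)
    acted = defaultdict(int)
    for f in firings:
        fired[f.alert_name] += 1
        if f.action_taken:
            acted[f.alert_name] += 1

    verdicts = {}
    for name, count in fired.items():
        if acted[name] == 0:
            # Fired, nobody had to do anything: a candidate for removal.
            verdicts[name] = f"delete or downgrade (fired {count}x, 0 actions)"
        else:
            verdicts[name] = f"keep (fired {count}x, {acted[name]} actions)"
    return verdicts
```

The script only produces candidates. The decision still belongs in the review meeting, where someone can ask why an alert fired nine times without anyone needing to act.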

Pager culture matters

Engineers must trust the pager. If oncall ignores half their pages, the bad alerts are training them to ignore the good ones. The right number of pages per shift is 0-2; more than 5 means the system is broken, not that oncall is lazy.

Page on user-visible symptoms. Use multi-window burn rates. Don't page for conditions that recover on their own. Delete low-signal alerts monthly. Protect oncall's trust in the pager.