Alert fatigue is the #1 cause of on-call burnout and missed incidents. When every dashboard pages someone every hour, real incidents drown in noise. Three layers fix it: better alert design, automated correlation, on-call rotation hygiene.
Symptom-based, not cause-based
Bad: 'CPU > 80% on web-04'. Good: 'p95 latency > 500ms on /checkout for 5 minutes'. Symptoms map to user pain; causes generate noise. Alert on what users feel, not on what the box does internally.
Multi-window multi-burn-rate
Single threshold alerts (>1% error rate for 5min) fire too often. Google SRE pattern: alert if 5min burn-rate AND 1hr burn-rate both exceed thresholds proportional to your SLO. Fires only on real budget burns, not spikes.
Alert correlation
One root cause (DB outage) fires 50 downstream alerts. Tools: PagerDuty Event Intelligence, BigPanda, OpsRamp — cluster related alerts into one incident. Reduces page volume 5-10x without losing information.
Auto-resolve
Alert fired because of a transient — by the time the human looks, it's fine. The alert should auto-resolve when the condition clears for N minutes. Saves the human a 'check and ack' interaction per false-positive.
On-call rotation health metrics
Per-person per-week: number of pages, % paged outside business hours, time-to-respond. If anyone gets >5 pages/week or >1 night page/week, the rotation is unhealthy. Track these and treat them as engineering bugs, not as 'on-call life'.