Alert fatigue is the single biggest signal that an SRE practice is broken. The fix is structural, not 'try harder'. Symptom-based alerts, multi-window burn rates, and aggressive deletion of low-signal alerts move teams from 'paged 3x/night' to 'paged when it actually matters'.
Symptoms, not causes
Alert on 'checkout error rate >2%', not 'Postgres connection count >800'. The cause-based alert pages whether or not users are affected; the symptom-based alert pages only when users are hurting. Causes belong on dashboards, not in pages.
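As a minimal sketch of a symptom-based check, the function below pages on the user-facing error rate and nothing else. The names `WindowCounts` and `should_page_on_symptom` are hypothetical, and the `min_requests` floor is an added assumption so that a handful of failures during a quiet period does not count as a 2% breach.

```python
from dataclasses import dataclass


@dataclass
class WindowCounts:
    """Request and error totals for one evaluation window (hypothetical shape)."""
    requests: int
    errors: int


def should_page_on_symptom(
    counts: WindowCounts,
    max_error_rate: float = 0.02,  # the 2% checkout threshold from the text
    min_requests: int = 100,       # assumption: ignore windows with too little traffic
) -> bool:
    """Page only when the user-visible error rate exceeds the threshold."""
    if counts.requests < min_requests:
        return False
    return counts.errors / counts.requests > max_error_rate
```

Note what is absent: no connection counts, CPU, or queue depth. Those stay on the dashboard that oncall opens after the page fires.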
Multi-window burn rate
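Single-threshold alerts are noisy. Use the multi-window burn-rate pattern from Google's SRE Workbook instead. Burn rate is how fast the error budget is being spent relative to the rate that would exactly exhaust it over the SLO period, so 1× means the budget runs out right at the end of the period. Page for fast burn only when the 5-minute window AND the 1-hour window are both burning above roughly 14× (the Workbook's example uses 14.4× for a 30-day SLO). Define a separate slow-burn rule with longer windows and a lower threshold (the Workbook uses 30 minutes and 6 hours at 6×). Together the two rules catch both spikes and steady leaks while ignoring transient blips.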
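The burn-rate rule can be sketched in a few lines. This is an illustration under stated assumptions, not a drop-in alerting engine: the thresholds 14.4 and 6 and the window pairs are the ones the Google SRE Workbook gives for a 30-day SLO, while `BurnPolicy`, `firing_policies`, and the window labels are hypothetical names. Error ratios per window are assumed to come from your metrics system.

```python
from typing import Mapping, NamedTuple


class BurnPolicy(NamedTuple):
    name: str
    short_window: str
    long_window: str
    threshold: float  # burn-rate multiple that both windows must exceed


# Assumption: values from the Google SRE Workbook example for a 30-day SLO.
POLICIES = (
    BurnPolicy("fast-burn", "5m", "1h", 14.4),
    BurnPolicy("slow-burn", "30m", "6h", 6.0),
)


def burn_rate(error_ratio: float, slo_target: float) -> float:
    """Observed error ratio divided by the error budget (1 - SLO target)."""
    error_budget = 1.0 - slo_target
    if error_budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / error_budget


def firing_policies(error_ratios: Mapping[str, float], slo_target: float) -> list[str]:
    """Return the policies whose short AND long windows both exceed the threshold."""
    fired = []
    for policy in POLICIES:
        short = burn_rate(error_ratios[policy.short_window], slo_target)
        long = burn_rate(error_ratios[policy.long_window], slo_target)
        if short > policy.threshold and long > policy.threshold:
            fired.append(policy.name)
    return fired
```

With a 99.9% SLO the error budget is 0.1%, so a 2% error ratio is a burn rate of about 20×. A 5-minute spike with a quiet hour behind it fires nothing, because the long window stays under the threshold; the long window confirms the problem is real and the short window confirms it is still happening.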
Auto-resolve, not auto-page
If the condition self-recovers within 10 minutes, don't page. Aggregate into a 'this happened N times today' summary email. Most transient alerts shouldn't have woken anyone up.
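One way to express the hold-off is a small routing function: an alert that clears inside the window goes to the daily summary, and only one that is still firing after the window pages. The function names, the three return labels, and the summary line format below are all hypothetical; the 10-minute value is the one from the text.

```python
from collections import Counter
from datetime import datetime, timedelta
from typing import Iterable, Optional

HOLD_OFF = timedelta(minutes=10)  # the 10-minute window from the text


def route_alert(
    started_at: datetime,
    resolved_at: Optional[datetime],
    now: datetime,
    hold_off: timedelta = HOLD_OFF,
) -> str:
    """Return 'summary', 'page', or 'wait' for a single alert."""
    if resolved_at is not None and resolved_at - started_at < hold_off:
        return "summary"  # self-recovered in time: nobody gets woken up
    if now - started_at >= hold_off:
        return "page"     # outlasted the hold-off: a human is needed
    return "wait"         # still inside the hold-off: keep watching


def daily_summary(alert_names: Iterable[str]) -> list[str]:
    """Collapse self-resolved alerts into one line per alert name."""
    counts = Counter(alert_names)
    return [f"{name}: {n}x today" for name, n in sorted(counts.items())]
```

The summary still matters: an alert that self-resolves twenty times a day is a pattern worth fixing during working hours, even though no single occurrence deserved a page.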
Prune ruthlessly
Once a month, review every alert that fired. Why did it fire? What action did oncall take? If the answer is 'none', delete the alert or downgrade it. 'It was useful once last quarter' doesn't justify ongoing pages.
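The review is easy to mechanize if each firing is logged with whether oncall did anything. A rough sketch, assuming a hypothetical `Firing` record exported from your paging tool; the verdict strings are illustrative labels, and the final call on each alert still belongs to the people in the review.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Firing:
    """One alert firing from the past month (hypothetical log record)."""
    alert: str
    action_taken: bool  # did oncall do anything in response?


def review(firings: Iterable[Firing]) -> dict[str, str]:
    """Flag every alert that fired but never led to an action."""
    fired = defaultdict(int)
    acted = defaultdict(int)
    for firing in firings:
        fired[firing.alert] += 1
        if firing.action_taken:
            acted[firing.alert] += 1
    return {
        alert: "keep" if acted[alert] > 0 else "delete or downgrade"
        for alert in sorted(fired)
    }
```

For example, an alert that fired three times with no action taken comes back as 'delete or downgrade', which gives the review meeting a concrete list to argue about instead of a general feeling that paging is noisy.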
Pager culture matters
Engineers must trust the pager. If oncall ignores half their pages, the bad alerts are training them to ignore the good ones. The right number of pages per shift is 0-2; more than 5 means the system is broken, not that oncall is lazy.