A good runbook is the difference between a 2-minute fix and a 30-minute escalation. A bad runbook says 'check the logs and figure it out'. A great runbook lets any on-call engineer resolve the alert without first having to ramp up on the system.

Structure

1. What this alert means (user impact).
2. How to verify (dashboard links, commands).
3. Quick mitigations (try these first).
4. Deeper diagnosis (decision tree).
5. Escalation (who to page if you're stuck after 20 min).
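
One way to keep this structure honest is to treat it as data and check for empty sections before a runbook is published. The sketch below is only an illustration: the `Runbook` class, its field names, the alert name, and the example text are all hypothetical, not part of any existing tool.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Runbook:
    """Hypothetical container for the five sections listed above."""

    alert_name: str
    user_impact: str = ""                                        # 1. what the alert means
    verification: List[str] = field(default_factory=list)       # 2. how to verify
    quick_mitigations: List[str] = field(default_factory=list)  # 3. try these first
    deeper_diagnosis: List[str] = field(default_factory=list)   # 4. decision tree steps
    escalation: str = ""                                         # 5. who to page

    def missing_sections(self) -> List[str]:
        """Return the names of sections that are still empty."""
        sections = {
            "user_impact": self.user_impact,
            "verification": self.verification,
            "quick_mitigations": self.quick_mitigations,
            "deeper_diagnosis": self.deeper_diagnosis,
            "escalation": self.escalation,
        }
        return [name for name, value in sections.items() if not value]


# Example draft with made-up content; two sections are not written yet.
draft = Runbook(
    alert_name="PostgresConnectionsHigh",  # hypothetical alert name
    user_impact="New requests may fail to get a database connection.",
    verification=["SELECT count(*) FROM pg_stat_activity;"],
    escalation="Page the database on-call if stuck after 20 min.",
)
print(draft.missing_sections())
```

A check like this could run in review or CI so that a runbook with no mitigations or no escalation path never reaches the person holding the pager.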

Be specific

'Check Postgres connection count' is bad. 'Run SELECT count(*) FROM pg_stat_activity; if the result is above 800, scale up the connection pooler' is good. Use concrete commands and concrete thresholds.
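
The same step can be written as a small script so that the command and the threshold live in one place. This is a minimal sketch under assumptions: the function names are hypothetical, `cursor` is any Python DB-API 2.0 cursor connected to the Postgres instance in question, and the wording of the returned messages is illustrative. Only the query and the 800 threshold come from the example above.

```python
CONNECTION_THRESHOLD = 800  # threshold quoted in the example above


def connection_count(cursor) -> int:
    """Run the count query on a DB-API 2.0 cursor and return the single number."""
    cursor.execute("SELECT count(*) FROM pg_stat_activity;")
    return cursor.fetchone()[0]


def next_step(count: int, threshold: int = CONNECTION_THRESHOLD) -> str:
    """Turn the measured count into the concrete action the runbook prescribes."""
    if count > threshold:
        return f"{count} connections (> {threshold}): scale up the connection pooler"
    return f"{count} connections (<= {threshold}): move on to the next diagnosis step"


# Usage with a fixed number, so the decision is visible without a database.
print(next_step(950))
print(next_step(120))
```

Keeping the threshold as a named constant means that changing it after an incident is a one-line edit that shows up clearly in review.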

Update on every incident

After each incident, update the runbook with whatever was new. No update = no learning. A runbook has no single author; it is the team's accumulated playbook.
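
A team that tracks dates could enforce this rule with a simple staleness check. The sketch below assumes you can obtain the runbook's last-edit date and the date of the most recent incident for its alert; the function name and the example dates are hypothetical.

```python
from datetime import date


def needs_update(runbook_last_updated: date, last_incident: date) -> bool:
    """True when the most recent incident is newer than the runbook's last edit."""
    return last_incident > runbook_last_updated


# Made-up dates: the runbook was last edited before the latest incident.
print(needs_update(date(2024, 1, 10), date(2024, 3, 2)))
```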

In short: cover user impact, verification, quick mitigations, deeper diagnosis, and escalation. Use specific commands. Update after every incident.