How to Run an Incident — Belgavi.AI Lab

Incident response is mostly drilled-in process, not heroics. The teams that recover fastest have practiced roles, clear channels, and a ruthless focus on user impact over root cause. Here's the playbook.

Roles

There are four roles. The Incident Commander decides but doesn't fix. Comms updates stakeholders. Ops does the fixing. The Scribe keeps the timeline log. In a small org the same person can wear multiple hats, but each role should still be assigned explicitly.
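
To make "explicit" concrete, here is a minimal sketch of how a team might record role assignments and spot any role nobody holds. The `Role` and `Incident` names, the example people, and the incident title are all hypothetical; this is one way to model it, not a prescribed tool.

```python
from dataclasses import dataclass, field
from enum import Enum


class Role(Enum):
    COMMANDER = "Incident Commander"  # decides, doesn't fix
    COMMS = "Comms"                   # updates stakeholders
    OPS = "Ops"                       # the fixer
    SCRIBE = "Scribe"                 # keeps the timeline log


@dataclass
class Incident:
    title: str
    # Maps each role to the name of the person holding it (hypothetical model).
    assignments: dict[Role, str] = field(default_factory=dict)

    def assign(self, role: Role, person: str) -> None:
        self.assignments[role] = person

    def unassigned_roles(self) -> list[Role]:
        # Iterates in declaration order, so the result is stable.
        return [role for role in Role if role not in self.assignments]


incident = Incident("checkout errors")
incident.assign(Role.COMMANDER, "asha")
incident.assign(Role.OPS, "ravi")
incident.assign(Role.COMMS, "asha")  # one person, two hats, both explicit
missing = incident.unassigned_roles()
```

In this example `missing` contains only the Scribe role, which is the cue to name a scribe before the incident goes any further.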

Channels

Keep one ops channel for technical discussion and one status channel for broadcasts to the org. If the incident is user-facing, also post updates to the external status page. Don't conflate the channels: discussion noise drowns out important updates.
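
The separation can be sketched as a small routing rule. The channel names (`ops-channel`, `status-channel`, `status-page`) and the `route` function are assumptions made for illustration; substitute whatever your chat and status tooling actually uses.

```python
from enum import Enum


class Kind(Enum):
    DISCUSSION = "discussion"  # technical back-and-forth
    STATUS = "status"          # broadcast update


def route(kind: Kind, user_facing: bool) -> list[str]:
    """Return the destinations a message of this kind should go to."""
    if kind is Kind.DISCUSSION:
        # Discussion never leaks into broadcast channels.
        return ["ops-channel"]
    destinations = ["status-channel"]
    if user_facing:
        # External status page only when users are affected.
        destinations.append("status-page")
    return destinations
```

The point of the design is that discussion has exactly one destination, so the status channel only ever carries updates.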

During and after

During: focus on mitigation, not root cause. Roll back if you can. Send an update every 15 to 30 minutes, even if all it says is 'still investigating'.

After: hold a blameless postmortem within a week, give every action item an owner, and schedule the work.
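
The timing rules above are simple enough to check mechanically. The sketch below assumes a 30-minute ceiling for updates (the upper end of the 15-30 minute range) and a seven-day postmortem window; the function names, the `ActionItem` type, and the sample data are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

UPDATE_INTERVAL = timedelta(minutes=30)  # assumed: upper end of the 15-30 min range
POSTMORTEM_WINDOW = timedelta(days=7)    # "within a week"


@dataclass
class ActionItem:
    description: str
    owner: Optional[str] = None


def update_overdue(last_update: datetime, now: datetime) -> bool:
    """True once a full interval has passed without a status update."""
    return now - last_update >= UPDATE_INTERVAL


def postmortem_due(resolved_at: datetime) -> datetime:
    """Latest time by which the postmortem should have been held."""
    return resolved_at + POSTMORTEM_WINDOW


def unowned(items: list[ActionItem]) -> list[str]:
    """Descriptions of action items that still have no owner."""
    return [item.description for item in items if not item.owner]


start = datetime(2024, 1, 1, 12, 0)
items = [
    ActionItem("add rollback runbook", "ravi"),
    ActionItem("alert on error rate"),  # no owner yet
]
```

A reminder bot could call `update_overdue` on a timer during the incident, and a postmortem template could refuse to close while `unowned(items)` is non-empty.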

In short: explicit roles, separated channels, mitigation first, and a blameless postmortem. Practice the drill; don't invent it during an outage.