The Anatomy of a Blameless Postmortem
A good postmortem turns one incident into a permanent organizational improvement. A bad postmortem turns it into finger-pointing and learned helplessness. The difference is structure: every effective postmortem covers the same 7 sections.
The 7 Sections Every Postmortem Needs
- Summary — 2-3 sentences. What broke, who was affected, for how long. Write this last.
- Impact — concrete numbers. Users affected, revenue lost, SLA credits owed, error budget burned. Use the Downtime Cost Calculator for the dollar figure.
- Timeline — every event with a timestamp, from first symptom to "all clear." Include detection time, escalation, mitigation attempts, root cause discovery.
- Root cause — what actually went wrong. Use the "5 whys" technique to get past the surface.
- Contributing factors — what made it worse or harder to fix. Missing monitoring, unclear runbooks, fragile dependencies.
- What went well — yes, this matters. People remember and repeat what's praised.
- Action items — specific, owned, with deadlines. "Improve monitoring" is not an action item. "Add SLO alert for X with P1 priority, owned by Alice, due 2026-06-01" is.
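The action item rule above (specific, owned, with a deadline) can be checked mechanically before a postmortem is published. The following is a minimal sketch, not a standard tool: the `ActionItem` fields, the `problems` helper, and the `VAGUE_OPENERS` list are all hypothetical names chosen for illustration, and the vagueness test is only a rough heuristic.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Hypothetical openers that tend to signal a vague item; adjust for your team.
VAGUE_OPENERS = ("improve", "investigate", "look into", "be more careful")


@dataclass
class ActionItem:
    description: str
    owner: Optional[str] = None
    due: Optional[date] = None


def problems(item: ActionItem) -> list[str]:
    """Return the reasons an item is not yet actionable (empty list means OK)."""
    found = []
    if not item.owner:
        found.append("no owner")
    if item.due is None:
        found.append("no deadline")
    text = item.description.lower()
    # Heuristic: a short description that opens with a vague verb is rarely specific.
    if len(text.split()) < 5 and text.startswith(VAGUE_OPENERS):
        found.append("too vague")
    return found


vague = ActionItem("Improve monitoring")
specific = ActionItem(
    "Add SLO alert for X with P1 priority", owner="Alice", due=date(2026, 6, 1)
)
print(problems(vague))     # ['no owner', 'no deadline', 'too vague']
print(problems(specific))  # []
```

A check like this cannot judge whether an item is the right fix. It only catches items that are missing the basics.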
The Blameless Principle
Blameless does not mean "no one was at fault." It means we assume people acted with the best information they had at the time. The post-incident question is "what system failed to give them better information?" — not "why did Bob press the button?"
Practical rules: no individual names in root cause sections (use roles or "the on-call engineer"), no counterfactual "should have" statements, and no language that implies the responder could have known something they didn't.
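These rules can be partly enforced with a simple text check on the root cause section before review. The sketch below is an illustration only: the `BLAME_PATTERNS` phrases and the `flag_blame_language` helper are assumptions rather than an established linter, and a regex cannot judge tone, so treat its findings as prompts for a human editor.

```python
import re

# Hypothetical patterns; tune them to your team's writing.
BLAME_PATTERNS = {
    "counterfactual": re.compile(
        r"\b(should|could|would)(?:\s+not|n't)?\s+have\b", re.IGNORECASE
    ),
    "judgment": re.compile(
        r"\b(careless|negligent|failed to|human error)\b", re.IGNORECASE
    ),
}


def flag_blame_language(
    text: str, names: tuple[str, ...] = ()
) -> list[tuple[int, str, str]]:
    """Return (line_number, kind, matched_text) for each suspicious phrase."""
    findings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for kind, pattern in BLAME_PATTERNS.items():
            for match in pattern.finditer(line):
                findings.append((lineno, kind, match.group(0)))
        # Flag any individual named in the text; roles should be used instead.
        for name in names:
            if re.search(rf"\b{re.escape(name)}\b", line):
                findings.append((lineno, "named individual", name))
    return findings


draft = """Bob should have checked the dashboard before deploying.
The on-call engineer deployed with the information available at the time."""

for finding in flag_blame_language(draft, names=("Bob",)):
    print(finding)
# (1, 'counterfactual', 'should have')
# (1, 'named individual', 'Bob')
```

The second sentence of the draft passes because it describes what the responder knew, not what they ought to have known.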
Severity Levels (SEV1-SEV4)
- SEV1 — Complete outage or data loss. Page immediately. All-hands.
- SEV2 — Major degradation, many users affected. Response <15 min. Cross-team.
- SEV3 — Minor impact, workaround exists. Response <1 hour. Single team.
- SEV4 — Cosmetic. Next business day. Often skips full postmortem.
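Encoding the severity levels as data keeps paging and escalation tooling consistent with the written policy. The following is a rough sketch under stated assumptions: the `SeverityPolicy` type and `response_deadline` helper are hypothetical names, "page immediately" is modelled as a zero-length response target, and SEV4's "next business day" is modelled as no timed target at all.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class SeverityPolicy:
    description: str
    response_target: Optional[timedelta]  # None = next business day, no timed target
    scope: Optional[str]                  # None = not specified in the policy


POLICIES = {
    "SEV1": SeverityPolicy("Complete outage or data loss", timedelta(0), "all-hands"),
    "SEV2": SeverityPolicy(
        "Major degradation, many users affected", timedelta(minutes=15), "cross-team"
    ),
    "SEV3": SeverityPolicy(
        "Minor impact, workaround exists", timedelta(hours=1), "single team"
    ),
    "SEV4": SeverityPolicy("Cosmetic", None, None),
}


def response_deadline(sev: str, detected_at: datetime) -> Optional[datetime]:
    """Latest acceptable response time, or None when there is no timed target."""
    target = POLICIES[sev].response_target
    return None if target is None else detected_at + target


detected = datetime(2026, 1, 5, 9, 0)
print(response_deadline("SEV2", detected))  # 2026-01-05 09:15:00
print(response_deadline("SEV4", detected))  # None
```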
Postmortem Cadence
Write postmortems for every SEV1 and SEV2. For SEV3, write one when there's a pattern (3+ similar incidents in a quarter). Review action items in a recurring weekly meeting, because a postmortem whose actions never get completed changes nothing. Track action item closure rate as a team health metric.
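The cadence rule and the closure rate metric both reduce to a few lines of arithmetic. The sketch below assumes hypothetical helper names (`needs_postmortem`, `closure_rate`), assumes the count of similar incidents includes the current one, and assumes action items carry a status string where "done" means closed.

```python
def needs_postmortem(sev: str, similar_this_quarter: int = 0) -> bool:
    """Apply the cadence rule; similar_this_quarter includes the current incident."""
    if sev in ("SEV1", "SEV2"):
        return True
    if sev == "SEV3":
        return similar_this_quarter >= 3  # a pattern: 3+ similar incidents
    return False  # SEV4 usually skips a full postmortem


def closure_rate(statuses: list[str]) -> float:
    """Fraction of action items marked 'done'; 0.0 when there are none."""
    if not statuses:
        return 0.0
    return statuses.count("done") / len(statuses)


print(needs_postmortem("SEV2"))                        # True
print(needs_postmortem("SEV3", similar_this_quarter=2))  # False
print(needs_postmortem("SEV3", similar_this_quarter=3))  # True
print(closure_rate(["done", "done", "open", "done"]))  # 0.75
```

Reporting the closure rate per quarter, alongside the number of open items past their due date, gives the weekly review something concrete to act on.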