Captures the full incident timeline, blast radius, contributing factors, and concrete follow-up actions after production incidents so teams build institutional memory rather than repeating the same surprises. A well-written postmortem separates root cause from triggers, avoids blame, and produces tracked action items that prevent recurrence.
Use cases
- A customer-facing outage that lasted more than 30 minutes and affected a measurable percentage of users
- A data integrity incident where incorrect data was served or stored, even if the error was quickly corrected
- A repeat incident where the same failure mode occurred within 90 days of a previous postmortem
- A near-miss where a failure was caught by automation before it became user-visible but the risk was significant
- An incident triggered by a change that passed all CI checks and was approved by a senior engineer, revealing a gap in the review process
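The use cases above can be read as a checklist. The following is a minimal sketch of that checklist in Python, not a prescribed policy: the `Incident` fields, the `postmortem_reasons` helper, and the reason strings are hypothetical names chosen for illustration. Only the 30-minute and 90-day thresholds come from the list itself.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional

OUTAGE_THRESHOLD = timedelta(minutes=30)  # "more than 30 minutes" in the use cases
REPEAT_WINDOW = timedelta(days=90)        # "within 90 days" in the use cases


@dataclass
class Incident:
    # All field names are hypothetical, chosen for this sketch.
    occurred_on: date
    failure_mode: str
    customer_facing: bool = False
    outage_duration: timedelta = timedelta(0)
    users_affected_pct: float = 0.0
    data_integrity_affected: bool = False
    significant_near_miss: bool = False
    passed_ci_and_review: bool = False


def postmortem_reasons(incident: Incident,
                       last_postmortem_for_mode: Optional[date] = None) -> List[str]:
    """Return every use case the incident matches; an empty list means none."""
    reasons = []
    if (incident.customer_facing
            and incident.outage_duration > OUTAGE_THRESHOLD
            and incident.users_affected_pct > 0):
        reasons.append("customer-facing outage over 30 minutes")
    if incident.data_integrity_affected:
        reasons.append("data integrity incident")
    if last_postmortem_for_mode is not None:
        gap = incident.occurred_on - last_postmortem_for_mode
        if timedelta(0) <= gap <= REPEAT_WINDOW:
            reasons.append("repeat incident within 90 days")
    if incident.significant_near_miss:
        reasons.append("significant near-miss")
    if incident.passed_ci_and_review:
        reasons.append("change passed CI and review")
    return reasons
```

Returning the list of matched reasons, rather than a single boolean, keeps the justification for the postmortem visible in the document itself.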
Key features
- Freeze the factual timeline as soon as the incident is resolved while memories are fresh—capture when the alert fired, when the engineer engaged, when mitigation began, and when service was restored
- Separate root cause (the underlying systemic flaw that allowed the incident to occur) from triggers (the immediate event that started the cascade)—resist conflating the two
- Identify contributing factors: process gaps, missing automation, unclear ownership, or tooling failures that made the incident worse or harder to detect
- File specific, tracked remediations with named owners and deadlines—not vague suggestions like 'improve monitoring' but concrete actions like 'add alert on p99 latency > 2s for /api/checkout'
- Review the postmortem in a blameless meeting with all involved parties, update it based on discussion, and publish it to the team within 48 hours of the incident
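As an illustration of the timeline item above, here is a small sketch of a frozen timeline record that derives the usual intervals from the four timestamps. The `Timeline` class and its method names are hypothetical. Measuring the 48-hour publication window from the alert time is an assumption, since the text only says "of the incident".

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

PUBLISH_WINDOW = timedelta(hours=48)  # publication target from the list above


@dataclass(frozen=True)  # frozen: the recorded facts cannot be edited later
class Timeline:
    alert_fired: datetime
    engineer_engaged: datetime
    mitigation_began: datetime
    service_restored: datetime

    def __post_init__(self):
        stamps = [self.alert_fired, self.engineer_engaged,
                  self.mitigation_began, self.service_restored]
        if stamps != sorted(stamps):
            raise ValueError("timeline timestamps are out of order")

    def intervals(self) -> dict:
        """Durations measured from the moment the alert fired."""
        return {
            "time_to_engage": self.engineer_engaged - self.alert_fired,
            "time_to_mitigate": self.mitigation_began - self.alert_fired,
            "time_to_restore": self.service_restored - self.alert_fired,
        }

    def publish_by(self) -> datetime:
        # Assumption: the 48-hour window starts when the alert fired.
        return self.alert_fired + PUBLISH_WINDOW


timeline = Timeline(
    alert_fired=datetime(2024, 1, 10, 14, 0),
    engineer_engaged=datetime(2024, 1, 10, 14, 5),
    mitigation_began=datetime(2024, 1, 10, 14, 20),
    service_restored=datetime(2024, 1, 10, 15, 0),
)
for name, duration in timeline.intervals().items():
    print(f"{name}: {duration}")
```

Rejecting out-of-order timestamps at construction catches transcription mistakes while the people who can correct them are still available.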
When to Use This Skill
- When an incident has been resolved and you need to capture learnings before the team moves on to other work
- When a repeat incident reveals that a previous postmortem's action items were not completed
- When an incident had significant user impact and stakeholders need a formal explanation
Expected Output
A published postmortem document with a factual timeline, root cause analysis, contributing factors, and tracked action items with owners and deadlines.
Frequently Asked Questions
- How do I write a postmortem without blaming individuals?
- Focus on systemic failures: what process, tooling, or information gap allowed a human to make a mistake? Write 'the deployment process did not catch this' rather than 'the engineer deployed without checking.' The blameless culture is structural, not cosmetic.
- What if the root cause is unclear and we cannot agree on it?
- Document multiple hypotheses with the evidence for and against each. If consensus cannot be reached, err on the side of more monitoring or automation rather than leaving a gap unaddressed. A postmortem with imperfect analysis but action items is better than no postmortem.
- How do I follow up on postmortem action items that are not completed?
- Review open action items weekly in the engineering all-hands. Escalate any incomplete item whose absence leaves the team exposed to a similar incident. Treat overdue action items as high-priority incidents waiting to happen.
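The weekly review described above could start from a list like the one below. This is a sketch under stated assumptions: the `ActionItem` fields and the `untracked` and `overdue` helpers are hypothetical, and a real team would more likely query its issue tracker than keep items in memory.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class ActionItem:
    # Hypothetical fields for this sketch.
    description: str
    owner: Optional[str]
    deadline: Optional[date]
    done: bool = False


def untracked(items: List[ActionItem]) -> List[ActionItem]:
    """Items that lack a named owner or a deadline."""
    return [i for i in items if not i.owner or i.deadline is None]


def overdue(items: List[ActionItem], today: date) -> List[ActionItem]:
    """Open items past their deadline, oldest deadline first."""
    late = [i for i in items
            if not i.done and i.deadline is not None and i.deadline < today]
    return sorted(late, key=lambda i: i.deadline)


items = [
    ActionItem("add alert on p99 latency > 2s for /api/checkout",
               "alice", date(2024, 6, 1)),
    ActionItem("improve monitoring", None, None),
    ActionItem("add canary stage", "bob", date(2024, 5, 1), done=True),
    ActionItem("document rollback", "carol", date(2024, 5, 15)),
]
for item in overdue(items, today=date(2024, 6, 10)):
    print(f"OVERDUE since {item.deadline}: {item.description} ({item.owner})")
```

Sorting by deadline puts the longest-overdue item at the top of the review agenda, and `untracked` surfaces vague entries that were never given an owner or a date.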
Related
Postmortem trigger and root-cause taxonomy
Distills Appendix C (“Results of Postmortem Analysis”) of Google’s SRE workbook. It explains why Google catalogs standardized postmortem fields that link each outage to an observable trigger and to a deeper root-cause category, so reliability leaders can prioritize systemic fixes over reactions to individual anecdotes. The appendix cites a multi-year corpus (labeled 2010–2017 in the workbook) in which binary pushes accounted for roughly 37% of outage triggers and configuration pushes for about 31%, with additional slices for user-behavior spikes, pipelines, upstream providers, performance decay, capacity, and hardware. A companion table attributes outages to qualitative root causes such as faulty software (~41%), development-process gaps (~20%), emergent complexity (~17%), deployment planning weaknesses (~7%), and network failures (~3%). Teams use these distributions to check whether their own incident mix skews differently and to steer investment toward the failure classes that have historically dominated.
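One way to run that sanity check is to compare a team's own trigger counts with the two trigger shares quoted above. The sketch below is illustrative only: `trigger_skew` and the category labels are hypothetical, the reference figures are the approximate percentages from the summary, and every trigger without a quoted figure is lumped into "other".

```python
from collections import Counter
from typing import Dict, List

# Approximate trigger shares quoted in the summary (percent).
# "other" is simply the remainder, not a figure from the workbook.
REFERENCE_TRIGGER_SHARE = {
    "binary push": 37.0,
    "configuration push": 31.0,
    "other": 32.0,
}


def trigger_skew(triggers: List[str]) -> Dict[str, float]:
    """Percentage-point difference: team share minus reference share."""
    if not triggers:
        raise ValueError("need at least one incident trigger")
    counts = Counter(
        t if t in REFERENCE_TRIGGER_SHARE else "other" for t in triggers
    )
    total = len(triggers)
    return {
        category: round(100.0 * counts[category] / total - reference, 1)
        for category, reference in REFERENCE_TRIGGER_SHARE.items()
    }


team_triggers = (["binary push"] * 5
                 + ["configuration push"] * 2
                 + ["capacity"] * 3)
print(trigger_skew(team_triggers))
```

A positive value means the team sees that trigger more often than the quoted corpus did. With a small number of incidents the differences are noisy, so they are better read as prompts for discussion than as conclusions.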
Canary rollouts
Deploys a new version to a small percentage of production traffic first, monitors error budgets and latency against baseline, and automatically widens or rolls back based on pre-defined criteria. This keeps the blast radius of a bad deployment small—particularly important when AI agents are modifying deployment pipelines where a single bad command could affect many users.
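The widen-or-roll-back decision described above can be sketched as a single function. Everything here is an assumption made for illustration: the traffic steps, the thresholds, the `Metrics` and `Criteria` types, and the use of a plain error rate in place of a full error-budget calculation.

```python
from dataclasses import dataclass
from typing import Optional

TRAFFIC_STEPS = [1, 5, 25, 50, 100]  # hypothetical rollout stages, in percent


@dataclass
class Metrics:
    error_rate: float      # fraction of failed requests, 0.0 to 1.0
    p99_latency_ms: float


@dataclass
class Criteria:
    # Hypothetical pre-defined limits relative to the baseline.
    max_error_rate_increase: float = 0.001
    max_latency_ratio: float = 1.2


def next_traffic_percent(current: int, baseline: Metrics, canary: Metrics,
                         criteria: Optional[Criteria] = None) -> int:
    """Return the next traffic percentage, or 0 to signal a rollback."""
    criteria = criteria or Criteria()
    error_ok = (canary.error_rate - baseline.error_rate
                <= criteria.max_error_rate_increase)
    latency_ok = (canary.p99_latency_ms
                  <= baseline.p99_latency_ms * criteria.max_latency_ratio)
    if not (error_ok and latency_ok):
        return 0  # roll back: the canary is worse than the baseline allows
    for step in TRAFFIC_STEPS:
        if step > current:
            return step  # widen to the next stage
    return 100  # already fully rolled out


baseline = Metrics(error_rate=0.002, p99_latency_ms=200.0)
print(next_traffic_percent(5, baseline, Metrics(0.0025, 210.0)))
print(next_traffic_percent(5, baseline, Metrics(0.02, 210.0)))
```

Fixing the criteria before the rollout starts, rather than judging each stage ad hoc, is what allows the widening and the rollback to be automated.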
Agentic coding vendor readiness review
Turns platform reliability and multi-vendor coding-agent guidance into a checklist before standardizing on a single AI coding stack. Teams inventory host-platform SLAs (for example GitHub availability incidents documented on githubstatus.com), compare primary and backup agents (GitHub Copilot, Cursor, Claude Code, Codex, etc.), verify observability hooks through Braintrust or similar tracing, and rehearse workflows when the code host or agent API is degraded. The skill cites public status pages and vendor billing changes—such as usage-based Copilot pricing announced on github.blog—so procurement and engineering sign off with eyes open about downtime, leadership churn, and feature parity gaps reported in trade media.