Structured process for handling production incidents from detection to resolution and post-mortem. Covers severity assessment using P0-P3 grading, team coordination with a designated incident commander, communication templates for stakeholders and users, and structured post-mortem requirements to drive organizational learning from every significant outage.
Use cases
- A production service is completely down and users are unable to access core functionality
- A partial outage affecting a subset of users, such as a specific region or user tier
- Performance degradation triggering automated alerts but not yet a full outage
- A data integrity issue where incorrect data is being shown to users
- A security incident where unauthorized access is suspected or confirmed
Key features
- Assess severity and assign a grade: P0 for complete outage, P1 for major feature broken, P2 for degraded experience, P3 for minor issue with workarounds available
- Declare the incident in the designated channel with severity, impact description, and your name as incident commander, then assemble the response team
- Begin mitigation immediately: roll back the last deployment, disable a feature flag, or activate a circuit breaker to restore service before investigating root cause
- Communicate status to affected users via the status page within 15 minutes of declaration, and provide updates at regular intervals until resolution
- Investigate the root cause while continuing to monitor the service, relying on dashboards and structured logs rather than speculation about what might have changed
- When service is restored, update the status page immediately and schedule a post-mortem meeting within 48 hours with all involved parties
- Write the post-mortem document covering the full timeline, root cause analysis, contributing factors, and concrete action items with owners and deadlines
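The grading and declaration steps above can be sketched in code. This is a minimal illustration, not a prescribed implementation: the `grade` inputs, the `Incident` fields, and the message layout are hypothetical names chosen for the example, while the P0-P3 definitions and the 15-minute status page window come from the list above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum


class Severity(Enum):
    P0 = "complete outage"
    P1 = "major feature broken"
    P2 = "degraded experience"
    P3 = "minor issue with workarounds available"


# Status page must be updated within 15 minutes of declaration.
STATUS_PAGE_WINDOW = timedelta(minutes=15)


def grade(complete_outage: bool, major_feature_broken: bool,
          degraded_experience: bool) -> Severity:
    """Return the most severe grade that matches (hypothetical inputs)."""
    if complete_outage:
        return Severity.P0
    if major_feature_broken:
        return Severity.P1
    if degraded_experience:
        return Severity.P2
    return Severity.P3


@dataclass
class Incident:
    severity: Severity
    impact: str
    commander: str
    declared_at: datetime

    def declaration(self) -> str:
        """Message for the incident channel: severity, impact, commander."""
        return (f"[{self.severity.name}] {self.impact} | "
                f"Incident commander: {self.commander}")

    def status_page_due(self) -> datetime:
        """Latest time for the first status page update."""
        return self.declared_at + STATUS_PAGE_WINDOW


incident = Incident(
    severity=grade(False, True, False),
    impact="Checkout failing for EU users",
    commander="A. Engineer",
    declared_at=datetime(2024, 1, 1, 12, 0),
)
print(incident.declaration())
print("First status update due by", incident.status_page_due())
```

Checking the grades from most to least severe means an incident that matches several descriptions always receives the highest applicable grade.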
When to Use This Skill
- When a production alert fires and you are the first engineer to acknowledge it
- When a service degradation is reported by users before automated alerts have fired
- When a deployment causes unexpected behavior in production and rollback is being considered
Expected Output
A resolved incident with a timeline, root cause analysis, and an action-tracked post-mortem document filed within 48 hours of resolution.
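One way to make this output checkable is to model the post-mortem as a record and validate it before filing. The sketch below assumes Python 3.9 or later; the `PostMortem` and `ActionItem` types and their field names are hypothetical, while the required sections and the 48-hour window are taken from the description above.

```python
from dataclasses import dataclass, field
from datetime import date, datetime, timedelta

# Post-mortem is expected within 48 hours of resolution.
FILING_WINDOW = timedelta(hours=48)


@dataclass
class ActionItem:
    description: str
    owner: str
    deadline: date


@dataclass
class PostMortem:
    resolved_at: datetime
    timeline: list[str] = field(default_factory=list)
    root_cause: str = ""
    contributing_factors: list[str] = field(default_factory=list)
    action_items: list[ActionItem] = field(default_factory=list)

    def missing_sections(self) -> list[str]:
        """List the required sections that are still empty."""
        missing = []
        if not self.timeline:
            missing.append("timeline")
        if not self.root_cause.strip():
            missing.append("root cause analysis")
        if not self.contributing_factors:
            missing.append("contributing factors")
        if not self.action_items:
            missing.append("action items")
        elif any(not item.owner.strip() for item in self.action_items):
            missing.append("action item owners")
        return missing

    def filed_on_time(self, filed_at: datetime) -> bool:
        """True if filed no later than 48 hours after resolution."""
        return filed_at - self.resolved_at <= FILING_WINDOW


draft = PostMortem(resolved_at=datetime(2024, 1, 1, 12, 0))
draft.timeline.append("12:00 service restored after rollback")
print("Still missing:", draft.missing_sections())
```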
Frequently Asked Questions
- Should I always roll back during an incident before investigating?
- Roll back if the recent deployment is the likely cause and rollback is fast and safe. If rollback is risky or the cause is unclear, focus on mitigation (feature flags, traffic shifting) while investigating. Never let investigation delay communication to users.
- How do I run an effective post-mortem without blaming individuals?
- Focus on systemic factors: process gaps, tooling failures, missing automation, or unclear ownership. Use 'contributing factors' language rather than 'human error' framing. The goal is to improve systems, not to find fault.
- What if an incident recurs because the post-mortem action items were not completed?
- Escalate the stalled action items with a clear statement that they are now overdue. Recurring incidents caused by incomplete remediations should be treated as higher severity and escalated to senior leadership.
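The rollback guidance and the overdue-item check from the answers above can be written down as two small functions. This is only a sketch under stated assumptions: the function names, the boolean inputs, and the `ActionItem` fields are hypothetical, and a real team would draw this data from its own deployment and issue-tracking tools.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum


class Mitigation(Enum):
    ROLL_BACK = "roll back the last deployment"
    MITIGATE_AND_INVESTIGATE = (
        "mitigate with feature flags or traffic shifting while investigating"
    )


def choose_mitigation(deploy_is_likely_cause: bool,
                      rollback_is_fast_and_safe: bool) -> Mitigation:
    """Roll back only when the deployment is the likely cause and
    the rollback itself is fast and safe; otherwise mitigate first."""
    if deploy_is_likely_cause and rollback_is_fast_and_safe:
        return Mitigation.ROLL_BACK
    return Mitigation.MITIGATE_AND_INVESTIGATE


@dataclass
class ActionItem:
    description: str
    owner: str
    deadline: date
    done: bool = False


def overdue_items(items: list[ActionItem], today: date) -> list[ActionItem]:
    """Return unfinished action items whose deadline has passed."""
    return [item for item in items if not item.done and item.deadline < today]


items = [
    ActionItem("Add canary stage to deploy pipeline", "sam", date(2024, 1, 10)),
    ActionItem("Alert on error-rate spike", "lee", date(2024, 1, 10), done=True),
]
for item in overdue_items(items, today=date(2024, 2, 1)):
    print(f"OVERDUE: {item.description} (owner: {item.owner})")
```

Running the overdue check on a schedule, rather than only after a repeat incident, would surface stalled remediations before they cause a recurrence.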
Related
Systematic debugging
Replaces trial-and-error debugging with a hypothesis-driven process: state a falsifiable hypothesis, construct the smallest possible reproduction, and verify evidence before touching code. This structured approach is most valuable during production incidents, flaky CI builds, and confusing regressions where intuition-led debugging wastes hours on correlated but non-causal symptoms.
Postmortem trigger and root-cause taxonomy
Distills Appendix C (“Results of Postmortem Analysis”) from Google’s SRE workbook. It explains why Google catalogs standardized postmortem fields, separating observable outage triggers from deeper root-cause categories, so that reliability leaders can prioritize systemic fixes over anecdotal ones. The appendix cites a multi-year corpus (labeled 2010–2017 in the workbook) in which binary pushes accounted for roughly 37% of outage triggers and configuration pushes for about 31%, with smaller shares for user-behavior spikes, pipelines, upstream providers, performance decay, capacity, and hardware. A companion table attributes outages to root-cause categories such as faulty software (~41%), development-process gaps (~20%), emergent complexity (~17%), deployment planning weaknesses (~7%), and network failures (~3%). Teams use these distributions to check whether their own incident queues skew differently and to steer investment toward the failure classes that have historically dominated.
Content refresh
Runs a scheduled audit of existing tool, MCP, skill, and news entries to identify and fix stale pricing, broken documentation links, outdated capabilities, and weakened prose that quietly degrade directory quality. This maintenance rhythm keeps the directory from accumulating digital rot as tools evolve and entries age.