Incident response Skill for Codex / Claude Code

Structured process for handling production incidents from detection to resolution and post-mortem. Covers severity assessment using P0-P3 grading, team coordination with a designated incident commander, communication templates for stakeholders and users, and structured post-mortem requirements to drive organizational learning from every significant outage.

Category Operations

Platform Codex / Claude Code

Published 2026-04-29

Tags incident, operations, on-call

Use cases

A production service is completely down and users are unable to access core functionality
A partial outage affecting a subset of users, such as a specific region or user tier
Performance degradation triggering automated alerts but not yet a full outage
A data integrity issue where incorrect data is being shown to users
A security incident where unauthorized access is suspected or confirmed

Key features

Assess severity and assign a grade: P0 for complete outage, P1 for major feature broken, P2 for degraded experience, P3 for minor issue with workarounds available
Declare the incident in the designated channel with severity, impact description, and your name as incident commander, then assemble the response team
Begin mitigation immediately: roll back the last deployment, disable a feature flag, or activate a circuit breaker to restore service before investigating root cause
Communicate status to affected users via the status page within 15 minutes of declaration, and provide updates at regular intervals until resolution
Investigate root cause in parallel with monitoring, using dashboards and structured logs rather than speculation about what might have changed
When service is restored, update the status page immediately and schedule a post-mortem meeting within 48 hours with all involved parties
Write the post-mortem document covering the full timeline, root cause analysis, contributing factors, and concrete action items with owners and deadlines
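
The first steps above (grading, declaration, and the 15-minute status page window) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three yes/no inputs, the class and function names, and the message layout are all hypothetical, and the rule used to separate P2 from P3 is one possible reading of the grades, not a definition from this Skill.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from enum import Enum


class Severity(Enum):
    P0 = "complete outage"
    P1 = "major feature broken"
    P2 = "degraded experience"
    P3 = "minor issue with workarounds available"


def assess_severity(service_down: bool, major_feature_broken: bool,
                    workaround_available: bool) -> Severity:
    """Map three yes/no observations onto P0-P3 (hypothetical decision rule)."""
    if service_down:
        return Severity.P0
    if major_feature_broken:
        return Severity.P1
    # Assumption: a usable workaround is what separates P3 from P2.
    if workaround_available:
        return Severity.P3
    return Severity.P2


@dataclass
class Declaration:
    severity: Severity
    impact: str
    commander: str
    declared_at: datetime  # expected to be a UTC timestamp

    def status_page_deadline(self) -> datetime:
        # The first status page update is due within 15 minutes of declaration.
        return self.declared_at + timedelta(minutes=15)

    def message(self) -> str:
        # Hypothetical layout for the post in the incident channel.
        return (f"[{self.severity.name}] {self.impact} | "
                f"Incident commander: {self.commander} | "
                f"Declared: {self.declared_at:%Y-%m-%d %H:%M} UTC")
```

For example, `Declaration(assess_severity(True, False, False), "Checkout unavailable", "A. Engineer", datetime.now(timezone.utc))` yields a P0 declaration whose `status_page_deadline()` falls 15 minutes after the declaration time.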

When to Use This Skill

When a production alert fires and you are the first engineer to acknowledge it
When a service degradation is reported by users before automated alerts have fired
When a deployment causes unexpected behavior in production and rollback is being considered

Expected Output

A resolved incident with a timeline, root cause analysis, and an action-tracked post-mortem document filed within 48 hours of resolution.
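
One way to keep that output checkable is to model the post-mortem as a small record and test it for completeness. The sketch below assumes hypothetical class and field names; only the required sections and the 48-hour filing window come from the description above.

```python
from dataclasses import dataclass, field
from datetime import date, datetime, timedelta


@dataclass
class ActionItem:
    description: str
    owner: str
    deadline: date


@dataclass
class PostMortem:
    resolved_at: datetime
    timeline: list[str] = field(default_factory=list)
    root_cause: str = ""
    contributing_factors: list[str] = field(default_factory=list)
    action_items: list[ActionItem] = field(default_factory=list)

    def filing_deadline(self) -> datetime:
        # The document is expected within 48 hours of resolution.
        return self.resolved_at + timedelta(hours=48)

    def missing_sections(self) -> list[str]:
        """List the required sections that are still empty."""
        missing = []
        if not self.timeline:
            missing.append("timeline")
        if not self.root_cause.strip():
            missing.append("root cause")
        if not self.contributing_factors:
            missing.append("contributing factors")
        # Every action item needs a named owner; the deadline field is mandatory.
        if not self.action_items or any(
            not item.owner.strip() for item in self.action_items
        ):
            missing.append("action items with owners and deadlines")
        return missing
```

A document is ready to file when `missing_sections()` returns an empty list and the current time is still before `filing_deadline()`.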

Frequently Asked Questions

Should I always roll back during an incident before investigating? Roll back if the recent deployment is the likely cause and rollback is fast and safe. If rollback is risky or the cause is unclear, focus on mitigation (feature flags, traffic shifting) while investigating. Never let investigation delay communication to users.
How do I run an effective post-mortem without blaming individuals? Focus on systemic factors: process gaps, tooling failures, missing automation, or unclear ownership. Use 'contributing factors' language rather than 'human error' framing. The goal is to improve systems, not to find fault.
What if an incident recurs because the post-mortem action items were not completed? Escalate the stalled action items with a clear statement that they are now overdue. Recurring incidents caused by incomplete remediations should be treated as higher severity and escalated to senior leadership.
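
The rollback guidance and the overdue-action-item check from the answers above reduce to two small rules. The following is a minimal sketch, assuming hypothetical function names, inputs, and a hypothetical `Remediation` record; it is not a prescribed implementation.

```python
from dataclasses import dataclass
from datetime import date


def first_response(deploy_is_likely_cause: bool,
                   rollback_is_fast_and_safe: bool) -> str:
    """Pick the first move: roll back only when both conditions hold."""
    if deploy_is_likely_cause and rollback_is_fast_and_safe:
        return "roll back"
    # Risky rollback or unclear cause: mitigate and investigate in parallel.
    return "mitigate (feature flags, traffic shifting) while investigating"


@dataclass
class Remediation:
    description: str
    owner: str
    deadline: date
    done: bool = False


def overdue(items: list[Remediation], today: date) -> list[Remediation]:
    """Return unfinished remediations whose deadline has already passed."""
    return [item for item in items if not item.done and item.deadline < today]
```

Anything returned by `overdue` is a candidate for escalation, and more so if the same incident has recurred in the meantime.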

3 Indexed items

Systematic debugging

Operations

Replaces trial-and-error debugging with a hypothesis-driven process: state a falsifiable hypothesis, construct the smallest possible reproduction, and verify evidence before touching code. This structured approach is most valuable during production incidents, flaky CI builds, and confusing regressions where intuition-led debugging wastes hours on correlated but non-causal symptoms.

Agentic AI orchestration efficiency claims due diligence

Operations

Turns CEO and vendor narratives about agentic AI efficiency into a procurement and strategy checklist. The workflow separates quoted efficiency metrics (for example, token- or energy-per-user framing) from product launch facts, orchestration architecture claims, and third-party valuation context in the same article. It references CNBC reporting from June 3, 2026, in which Perplexity CEO Aravind Srinivas told CNBC's Elaine Yu that the long-term AI winner will maximize what he called the "most taken value per watt per user" by balancing accuracy, latency, cost, privacy, and intelligence. According to that report, Perplexity is emphasizing agentic orchestration with Perplexity Computer (announced in February) and Personal Computer on Windows (announced the prior Tuesday, with Mac already available), and Srinivas said Personal Computer routes processing between device and cloud. The report put Perplexity's last reported valuation at $20 billion, versus Anthropic near $1 trillion and OpenAI just over $850 billion, with Anthropic confidentially filing for a U.S. IPO that week. Srinivas also cited tripled annualized revenue since the start of the year, tied to integrated Anthropic model improvements. The workflow records these points without treating media valuations or CEO efficiency slogans as internal benchmarks.

Postmortem trigger and root-cause taxonomy

Operations

Distills Appendix C (“Results of Postmortem Analysis”) from Google’s SRE workbook. It explains why Google catalogs standardized postmortem fields, linking outages to observable triggers versus deeper root-cause categories, so reliability leaders can prioritize systemic fixes rather than anecdotal ones. The appendix cites a multi-year corpus (labeled 2010–2017 in the workbook) in which binary pushes accounted for roughly 37% of outage triggers and configuration pushes for about 31%, with additional slices for user-behavior spikes, pipelines, upstream providers, performance decay, capacity, and hardware. A companion table correlates outages with qualitative root causes such as faulty software (~41%), development-process gaps (~20%), emergent complexity (~17%), deployment planning weaknesses (~7%), and network failures (~3%). Teams use these distributions to check whether their own incident queues skew differently and to steer investment toward the failure classes that have historically dominated.
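
The comparison described above can be sketched as a share-versus-reference calculation. This example uses only the two trigger percentages quoted in the paragraph; the function name, the trigger labels, and the choice to report differences in percentage points are assumptions made for illustration.

```python
from collections import Counter

# Approximate reference shares quoted in the paragraph above.
REFERENCE_TRIGGER_SHARE = {"binary push": 0.37, "configuration push": 0.31}


def trigger_skew(incident_triggers: list[str]) -> dict[str, float]:
    """Return the team's share minus the reference share, in percentage points."""
    if not incident_triggers:
        raise ValueError("need at least one incident to compute shares")
    counts = Counter(incident_triggers)
    total = len(incident_triggers)
    return {
        trigger: round(100 * (counts[trigger] / total - reference), 1)
        for trigger, reference in REFERENCE_TRIGGER_SHARE.items()
    }
```

A positive value means the team sees that trigger more often than the reference corpus did; a negative value means less often. With only a handful of incidents the shares are noisy, so small differences should not drive investment decisions.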
