Distills Appendix C (“Results of Postmortem Analysis”) from Google’s SRE workbook: it explains why Google catalogs standardized postmortem fields—linking outages to observable triggers versus deeper root-cause categories—so reliability leaders can prioritize systemic fixes over anecdotal ones. The appendix cites a multi-year corpus (labeled 2010–2017 in the workbook) in which binary pushes accounted for roughly 37% of outage triggers and configuration pushes for about 31%, with additional slices for user-behavior spikes, pipelines, upstream providers, performance decay, capacity, and hardware. A companion table correlates outages with qualitative root causes such as faulty software (~41%), development-process gaps (~20%), emergent complexity (~17%), deployment planning weaknesses (~7%), and network failures (~3%). Teams use these distributions to sanity-check whether their own incident queues skew differently and to steer investment toward the failure classes that have historically dominated.
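As a shared reference point, the slices quoted above can be pinned down as plain data so reviews and dashboards quote the same figures. This is a minimal sketch using the approximate values from the passage; the category labels are paraphrased, and neither mapping is exhaustive.

```python
# Approximate, illustrative distributions as described for Appendix C
# ("Results of Postmortem Analysis"); these are historical Google
# aggregates, not targets for any other fleet.
APPENDIX_C_TRIGGER_MIX = {
    "binary_push": 0.37,
    "configuration_push": 0.31,
    # Remaining slices (user-behavior spikes, pipelines, upstream
    # providers, performance decay, capacity, hardware) are smaller
    # and omitted here because no figures are quoted above.
}

APPENDIX_C_ROOT_CAUSE_MIX = {
    "faulty_software": 0.41,
    "development_process_gap": 0.20,
    "emergent_complexity": 0.17,
    "deployment_planning_weakness": 0.07,
    "network_failure": 0.03,
}
```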
Use cases
- Quarterly reliability reviews need evidence that remediation bandwidth targets the statistically dominant outage classes—not whichever executive was loudest last week
- You are designing tagging schemas for incidents and must separate ‘what flipped the breaker’ triggers from systemic root themes
- New SRE hires ask why templated postmortems insist on both trigger and contributing-factor narratives
- Capacity or performance teams need historical justification for preventative programs beyond incident severity anecdotes
- Security or compliance stakeholders want benchmarking language showing how fleets compare to canonical industry aggregates
Key features
- Copy the appendix’s trigger taxonomy verbatim into your glossary so incident authors share vocabulary (binary push vs configuration push, etc.)
- Populate each retrospective with both trigger percentages (what changed near the outage) and root-cause labels (why the system tolerated the trigger)
- Compare your quarterly trigger mix against the illustrative Google histogram only as a heuristic—never as a KPI target absent local baselines
- Highlight mismatches where a rare trigger dominates your queue (e.g., hardware-led incidents above baseline) and schedule targeted programs (see the tagging sketch after this list)
- Pair quantitative distributions with qualitative storytelling so leadership understands that the percentages reference historical Google aggregates, not your live fleet
- Close the loop by tracking whether corrective actions correlate with shrinking slices of the offending root-cause category over subsequent quarters
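A minimal sketch of how the tagging and comparison steps above might look in practice; the `Incident` fields, the reference subset, and the 1.5x outlier factor are illustrative assumptions, not part of the appendix.

```python
from collections import Counter
from dataclasses import dataclass

# Reference trigger mix quoted above (approximate Google aggregates).
REFERENCE_TRIGGER_MIX = {"binary_push": 0.37, "configuration_push": 0.31}

@dataclass
class Incident:
    """One postmortem record tagged along both dimensions."""
    trigger: str      # what changed near the outage, e.g. "binary_push"
    root_cause: str   # why the system tolerated it, e.g. "faulty_software"

def trigger_mix(incidents: list[Incident]) -> dict[str, float]:
    """Share of incidents per trigger for the period under review."""
    counts = Counter(i.trigger for i in incidents)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {trigger: n / total for trigger, n in counts.items()}

def flag_outliers(incidents: list[Incident], factor: float = 1.5) -> list[str]:
    """Triggers whose local share exceeds the reference share by `factor`.

    Triggers absent from the reference mix are always flagged so rare
    classes (e.g. hardware-led incidents) surface for targeted programs.
    """
    flagged = []
    for trigger, share in trigger_mix(incidents).items():
        reference = REFERENCE_TRIGGER_MIX.get(trigger)
        if reference is None or share > factor * reference:
            flagged.append(trigger)
    return flagged
```

In a quarterly review, treat the output of `flag_outliers` as a prompt for investigation, not as evidence of regression against a Google baseline.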
When to Use This Skill
- During reliability strategy summits needing shared language rooted in textbook examples
- When codifying tagging requirements before importing incidents into warehouses or dashboards
- When coaching reviewers on the difference between surface triggers and underlying systemic faults
Expected Output
A succinct reference brief plus tagging guidance that aligns your incident warehouse fields with Appendix C’s illustrative trigger/root-cause split.
Frequently Asked Questions
- Are the cited percentages mandates for every company?
- No. Appendix C summarizes Google historical samples; adopt the taxonomy structure while replacing numbers with telemetry from your fleet.
- How does this coexist with bespoke postmortem templates?
- Treat Appendix C dimensions as orthogonal tags layered atop your narrative template so analytics stay comparable across services.
- What if our dominant trigger is absent from the table?
- Extend the taxonomy—Appendix C is illustrative, not exhaustive—and document the rationale for new categories so trends remain explainable (a brief sketch follows below)
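Where a dominant local trigger class is missing, the tag set can simply grow. A minimal sketch, with a hypothetical category name and placeholder rationale, keeps the justification next to the label:

```python
# Local extension of the trigger taxonomy; Appendix C is illustrative,
# not exhaustive, so new categories carry a documented rationale.
TRIGGER_TAXONOMY = {
    "binary_push": "Appendix C",
    "configuration_push": "Appendix C",
    # Hypothetical local addition; record when it was introduced and
    # which incidents motivated it so trends remain explainable.
    "third_party_certificate_expiry": "local addition, rationale documented in the glossary",
}
```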
Related
Example SLO document authoring
Operationalizes Appendix A from Google’s SRE workbook by translating the illustrative “Example Game Service” SLO dossier into a checklist teams can mimic: articulate the user-facing workload, nominate rolling measurement windows (the appendix uses four weeks), pair each subsystem with tightly defined SLIs (availability from load balancers excluding 5xx, latency percentile gates, freshness for derived tables, correctness via probers, completeness for pipelines), cite explicit numerator/denominator language, rationalize rounding policies, quantify per-objective error budgets, and reference the sibling error budget policy for enforcement.
Postmortem writing
Captures the full incident timeline, blast radius, contributing factors, and concrete follow-up actions after production incidents so teams build institutional memory rather than repeating the same surprises. A well-written postmortem separates root cause from triggers, avoids blame, and produces tracked action items that prevent recurrence.
Error budget policy drafting
Translates Google’s worked example error-budget policy into a repeatable playbook for tying release tempo to measured reliability: define goals (protect users from repeated SLO misses while preserving innovation incentives), spell out what happens when the rolling window consumes its budget (freeze changes except urgent defects or security work), codify outage investigation thresholds, and document escalation paths when stakeholders disagree about budget math.