Distills Appendix C (“Results of Postmortem Analysis”) from Google’s SRE workbook: it explains why Google catalogs standardized postmortem fields—linking outages to observable triggers versus deeper root-cause categories—so reliability leaders can prioritize systemic fixes over anecdotal ones. The appendix cites a multi-year corpus (labeled 2010–2017 in the workbook) in which binary pushes accounted for roughly 37% of outage triggers and configuration pushes for about 31%, with additional slices for user-behavior spikes, pipelines, upstream providers, performance decay, capacity, and hardware. A companion table correlates outages with qualitative root causes such as faulty software (~41%), development-process gaps (~20%), emergent complexity (~17%), deployment planning weaknesses (~7%), and network failures (~3%). Teams use these distributions to sanity-check whether their own incident queues skew differently and to steer investment toward the failure classes that have historically dominated.
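As a shared reference point, the slices quoted above can be pinned down as plain data so reviews and dashboards quote the same figures. This is a minimal sketch using the approximate values from the passage; the category labels are paraphrased, and neither mapping is exhaustive.

```python
# Approximate, illustrative distributions as described for Appendix C
# ("Results of Postmortem Analysis"); these are historical Google
# aggregates, not targets for any other fleet.
APPENDIX_C_TRIGGER_MIX = {
    "binary_push": 0.37,
    "configuration_push": 0.31,
    # Remaining slices (user-behavior spikes, pipelines, upstream
    # providers, performance decay, capacity, hardware) are smaller
    # and omitted here because no figures are quoted above.
}

APPENDIX_C_ROOT_CAUSE_MIX = {
    "faulty_software": 0.41,
    "development_process_gap": 0.20,
    "emergent_complexity": 0.17,
    "deployment_planning_weakness": 0.07,
    "network_failure": 0.03,
}
```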
Use cases
- Quarterly reliability reviews need evidence that remediation bandwidth targets the statistically dominant outage classes—not whichever executive was loudest last week
- You are designing tagging schemas for incidents and must separate ‘what flipped the breaker’ triggers from systemic root themes
- New SRE hires ask why templated postmortems insist on both trigger and contributing-factor narratives
- Capacity or performance teams need historical justification for preventative programs beyond incident severity anecdotes
- Security or compliance stakeholders want benchmarking language showing how fleets compare to canonical industry aggregates
Key features
- Copy the appendix’s trigger taxonomy verbatim into your glossary so incident authors share vocabulary (binary push vs configuration push, etc.)
- Populate each retrospective with both trigger percentages (what changed near the outage) and root-cause labels (why the system tolerated the trigger)
- Compare your quarterly trigger mix against the illustrative Google histogram only as a heuristic—never as a KPI target absent local baselines
- Highlight mismatches where a rare trigger dominates your queue (e.g., hardware-led incidents above baseline) and schedule targeted programs (see the tagging sketch after this list)
- Pair quantitative distributions with qualitative storytelling so leadership understands that the percentages reference historical Google aggregates, not your live fleet
- Close the loop by tracking whether corrective actions correlate with shrinking slices of the offending root-cause category over subsequent quarters
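A minimal sketch of how the tagging and comparison steps above might look in practice; the `Incident` fields, the reference subset, and the 1.5x outlier factor are illustrative assumptions, not part of the appendix.

```python
from collections import Counter
from dataclasses import dataclass

# Reference trigger mix quoted above (approximate Google aggregates).
REFERENCE_TRIGGER_MIX = {"binary_push": 0.37, "configuration_push": 0.31}

@dataclass
class Incident:
    """One postmortem record tagged along both dimensions."""
    trigger: str      # what changed near the outage, e.g. "binary_push"
    root_cause: str   # why the system tolerated it, e.g. "faulty_software"

def trigger_mix(incidents: list[Incident]) -> dict[str, float]:
    """Share of incidents per trigger for the period under review."""
    counts = Counter(i.trigger for i in incidents)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {trigger: n / total for trigger, n in counts.items()}

def flag_outliers(incidents: list[Incident], factor: float = 1.5) -> list[str]:
    """Triggers whose local share exceeds the reference share by `factor`.

    Triggers absent from the reference mix are always flagged so rare
    classes (e.g. hardware-led incidents) surface for targeted programs.
    """
    flagged = []
    for trigger, share in trigger_mix(incidents).items():
        reference = REFERENCE_TRIGGER_MIX.get(trigger)
        if reference is None or share > factor * reference:
            flagged.append(trigger)
    return flagged
```

In a quarterly review, treat the output of `flag_outliers` as a prompt for investigation, not as evidence of regression against a Google baseline.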
When to Use This Skill
- During reliability strategy summits needing shared language rooted in textbook examples
- When codifying tagging requirements before importing incidents into warehouses or dashboards
- When coaching reviewers on the difference between surface triggers and underlying systemic faults
Expected Output
A succinct reference brief plus tagging guidance that aligns your incident warehouse fields with Appendix C’s illustrative trigger/root-cause split.
Frequently Asked Questions
- Are the cited percentages mandates for every company?
- No. Appendix C summarizes Google historical samples; adopt the taxonomy structure while replacing numbers with telemetry from your fleet.
- How does this coexist with bespoke postmortem templates?
- Treat Appendix C dimensions as orthogonal tags layered atop your narrative template so analytics stay comparable across services.
- What if our dominant trigger is absent from the table?
- Extend the taxonomy—Appendix C is illustrative, not exhaustive—and document the rationale for new categories so trends remain explainable (a brief sketch follows below)
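Where a dominant local trigger class is missing, the tag set can simply grow. A minimal sketch, with a hypothetical category name and placeholder rationale, keeps the justification next to the label:

```python
# Local extension of the trigger taxonomy; Appendix C is illustrative,
# not exhaustive, so new categories carry a documented rationale.
TRIGGER_TAXONOMY = {
    "binary_push": "Appendix C",
    "configuration_push": "Appendix C",
    # Hypothetical local addition; record when it was introduced and
    # which incidents motivated it so trends remain explainable.
    "third_party_certificate_expiry": "local addition, rationale documented in the glossary",
}
```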
Related
Example SLO document authoring
Operationalizes Appendix A from Google’s SRE workbook by translating the illustrative “Example Game Service” SLO dossier into a checklist teams can mimic: articulate the user-facing workload, nominate rolling measurement windows (the appendix uses four weeks), pair each subsystem with tightly defined SLIs (availability from load balancers excluding 5xx, latency percentile gates, freshness for derived tables, correctness via probers, completeness for pipelines), cite explicit numerator/denominator language, rationalize rounding policies, quantify per-objective error budgets, and reference the sibling error budget policy for enforcement.
Postmortem writing
Captures the full incident timeline, blast radius, contributing factors, and concrete follow-up actions after production incidents so teams build institutional memory rather than repeating the same surprises. A well-written postmortem separates root cause from triggers, avoids blame, and produces tracked action items that prevent recurrence.
Error budget policy drafting
Translates Google’s worked example error-budget policy into a repeatable playbook for tying release tempo to measured reliability: define goals (protect users from repeated SLO misses while preserving innovation incentives), spell out what happens when the rolling window consumes its budget (freeze changes except urgent defects or security work), codify outage investigation thresholds, and document escalation paths when stakeholders disagree about budget math.