Operationalizes Appendix A of Google’s SRE workbook by turning the illustrative “Example Game Service” SLO document into a checklist teams can follow: describe the user-facing workload; choose a rolling measurement window (the appendix uses four weeks); pair each subsystem with tightly defined SLIs (availability measured at the load balancer, where any non-5xx response counts as success; latency percentile thresholds; freshness for derived tables; correctness via probers; completeness for pipelines); state each SLI’s numerator and denominator explicitly; justify rounding policies; quantify a separate error budget per objective; and link to the sibling error budget policy for enforcement.
Use cases
- A new customer-facing microservice launches and leadership requests an auditable reliability contract before GA
- Observability data exists but no document ties metrics to UX outcomes or quantified targets
- Multiple surfaces (REST API vs static HTTP vs batch pipeline) share infrastructure and need partitioned SLO clauses
- Compliance asks how synthetic correctness coverage maps to allowable defect rates documented numerically
- Post-incident retros determine that vague “four nines somewhere” wording prevented consistent freeze decisions
Key features
- Summarize architectural context and customer-visible interfaces so reviewers know what system is in scope
- Choose a canonical evaluation window—the workbook example adopts a four-week rolling period—and state it verbatim in the preamble
- For each subsystem, enumerate SLIs with plain-language numerator/denominator definitions (availability counting non-5xx at the LB, latency thresholds referencing concrete milliseconds, freshness windows for derived reads, completeness for batch jobs)
- Set SLO percentages per SLI plus rationale paragraphs explaining historical measurement windows, rounding heuristics, and explicit caveats about evidence quality
- Derive discrete error budgets (100% minus target) independently per objective so finance and product leaders can negotiate trade-offs granularly
- Cross-reference the enacted error budget policy so readers know freeze behavior when any budget drains, and annotate clarifications (LB blind spots, prober workload assumptions, etc.)
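The SLI and error-budget arithmetic described in the features above can be sketched in a few lines of Python. This is a minimal illustration, not the workbook's implementation: the `RequestCounts` type, the function names, the zero-traffic convention, and the 97% target used in the usage note are all assumptions chosen for the example.

```python
from dataclasses import dataclass
from typing import Sequence

WINDOW_DAYS = 28  # four-week rolling window, matching the workbook example


@dataclass
class RequestCounts:
    """Hypothetical load-balancer counters summed over the window."""
    total: int          # denominator: all responses served
    server_errors: int  # responses with HTTP status 500-599


def availability_sli(counts: RequestCounts) -> float:
    """Numerator: non-5xx responses. Denominator: all responses."""
    if counts.total == 0:
        return 1.0  # assumption: an idle window counts as fully available
    return (counts.total - counts.server_errors) / counts.total


def latency_sli(latencies_ms: Sequence[float], threshold_ms: float) -> float:
    """Proportion of requests that completed faster than the threshold."""
    if not latencies_ms:
        return 1.0  # same idle-window assumption as above
    fast = sum(1 for value in latencies_ms if value < threshold_ms)
    return fast / len(latencies_ms)


def error_budget(slo_target: float) -> float:
    """Error budget as a fraction: 100% minus the SLO target."""
    return 1.0 - slo_target


def budget_remaining(counts: RequestCounts, slo_target: float) -> float:
    """Share of the budget still unspent; negative means overspent."""
    allowed_failures = error_budget(slo_target) * counts.total
    if allowed_failures == 0:
        return 0.0  # a 100% target (or no traffic) leaves nothing to spend
    return 1.0 - counts.server_errors / allowed_failures
```

For instance, with 1,000,000 requests, 12,000 of them 5xx, and an illustrative 97% target, `availability_sli` gives 0.988 and `budget_remaining` gives roughly 0.6, meaning about 60% of that objective's budget is still available. Each objective would get its own call, since budgets are tracked per objective.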
When to Use This Skill
- When onboarding SRE practices modeled after Google workbook examples rather than reinventing unstructured uptime promises
- When pairing newly defined SLIs from monitoring design reviews with stakeholder sign-off artifacts
- When refactoring legacy SLAs into modern SLI/SLO narratives that align with iterative delivery
Expected Output
A concise SLO document mirroring Appendix A sections: overview, SLI/SLO table, rationale, discrete error budgets, clarifications/caveats, and links to enforcing policy.
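As a rough sketch of that output shape, the Python below renders a plain-text outline with one line per SLI and its derived budget. The `SloRow` type, the `render_outline` function, and the exact section titles are hypothetical; the rows you pass in should come from your own telemetry rather than from the appendix.

```python
from dataclasses import dataclass
from typing import List

# Section titles paraphrase the structure listed above (assumed wording).
SECTIONS = [
    "Overview",
    "SLIs and SLOs",
    "Rationale",
    "Error budgets",
    "Clarifications and caveats",
    "Error budget policy (link)",
]


@dataclass
class SloRow:
    subsystem: str     # e.g. "API", "HTTP server", "Data pipeline"
    category: str      # e.g. "Availability", "Latency", "Freshness"
    sli: str           # plain-language numerator/denominator definition
    target_pct: float  # SLO target as a percentage

    @property
    def budget_pct(self) -> float:
        """Per-objective error budget: 100% minus the target."""
        return round(100.0 - self.target_pct, 6)


def render_outline(service: str, window: str, rows: List[SloRow]) -> str:
    """Return a plain-text skeleton of the SLO document."""
    lines = [f"SLO document: {service}", f"Evaluation window: {window}", ""]
    for title in SECTIONS:
        lines.append(title)
        if title == "SLIs and SLOs":
            for row in rows:
                lines.append(
                    f"  {row.subsystem} / {row.category}: {row.sli} "
                    f">= {row.target_pct:g}% (budget {row.budget_pct:g}%)"
                )
    return "\n".join(lines)
```

Calling `render_outline` with a single availability row yields a skeleton whose remaining sections (rationale, caveats, policy link) are left for the authors to fill in by hand.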
Frequently Asked Questions
- Do we copy the exact gaming example numbers?
- Use the appendix as scaffolding; replace illustrative percentages, latency breakpoints, pipeline cadence, and prober sizing with telemetry grounded in your service—while preserving the explanatory structure Google uses.
- How granular should SLIs become?
- As granular as needed for each surface to be tracked and budgeted independently (API vs static web tiers vs freshness pipelines), as the separate rows in Appendix A show.
- What if instrumentation cannot yet prove user journeys?
- The workbook urges documenting evidence gaps plainly so future reviewers can prioritize better UX-linked metrics investments.
Related
Postmortem trigger and root-cause taxonomy
Distills Appendix C (“Results of Postmortem Analysis”) from Google’s SRE workbook: it explains why Google catalogs standardized postmortem fields, separating the observable trigger of an outage from its deeper root-cause category, so reliability leaders can prioritize systemic fixes over anecdotal ones. The appendix cites a multi-year corpus (labeled 2010–2017 in the workbook) in which binary pushes accounted for roughly 37% of outage triggers and configuration pushes for about 31%, with smaller slices for user-behavior spikes, pipelines, upstream providers, performance decay, capacity, and hardware. A companion table attributes outages to root-cause categories such as faulty software (~41%), development-process gaps (~20%), emergent complexity (~17%), deployment planning weaknesses (~7%), and network failures (~3%). Teams use these distributions to check whether their own incident mix skews differently and to steer investment toward the failure classes that have historically dominated.
Error budget policy drafting
Translates Google’s worked example error-budget policy into a repeatable playbook for tying release tempo to measured reliability: define goals (protect users from repeated SLO misses while preserving innovation incentives), spell out what happens when the rolling window consumes its budget (freeze changes except urgent defects or security work), codify outage investigation thresholds, and document escalation paths when stakeholders disagree about budget math.
Agentic coding vendor readiness review
Turns platform reliability and multi-vendor coding-agent guidance into a checklist before standardizing on a single AI coding stack. Teams inventory host-platform SLAs (for example GitHub availability incidents documented on githubstatus.com), compare primary and backup agents (GitHub Copilot, Cursor, Claude Code, Codex, etc.), verify observability hooks through Braintrust or similar tracing, and rehearse workflows when the code host or agent API is degraded. The skill cites public status pages and vendor billing changes—such as usage-based Copilot pricing announced on github.blog—so procurement and engineering sign off with eyes open about downtime, leadership churn, and feature parity gaps reported in trade media.