Skill Entry

Error budget policy drafting

Translates Google’s worked example error-budget policy into a repeatable playbook for tying release tempo to measured reliability. The playbook defines goals (protect users from repeated SLO misses while preserving innovation incentives), spells out what happens when the service exhausts its error budget within the rolling window (freeze changes except urgent defects or security work), codifies outage investigation thresholds, and documents escalation paths for when stakeholders disagree about budget math.

Category Operations
Platform Google SRE Workbook / Codex
Published 2026-05-13
Tags reliability, slo, policy

Use cases

  • Leadership asks for explicit rules linking deployment freezes to objective SLO measurements rather than subjective gut feel
  • Reliability champions need shared language explaining why change freezes are temporary guardrails, not punitive shutdowns
  • Incident reviewers must decide whether a large outage triggers mandatory postmortems tied to budget thresholds
  • Platform teams negotiating dependencies want clauses covering outages owned by another organization versus internal defects
  • Finance or product stakeholders escalate disagreements about whether miscategorized metrics distorted budget consumption

Key features

  • Document service scope (which binaries, clients, or datasets the policy covers) so everyone agrees where budgets apply
  • State explicit goals (protect users from recurring misses; incentivize balancing reliability with features) and non-goals (policy is not meant as punishment)
  • Describe the rolling measurement window used for budgets—Google’s appendix references four-week windows—and tie halt/resume criteria to documented SLO attainment
  • List freeze exceptions such as highest-severity defects or mandated security remediation so emergency fixes remain possible without violating governance intent
  • Add outage clauses mirroring the workbook thresholds—for example mandatory postmortems when a single incident consumes more than a defined fraction of the rolling budget—and specify required action-item severity
  • Publish escalation guidance when teams disagree about budget calculations or remediation priorities so disputes route to an accountable executive rather than stalling silently
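
The window, freeze, and postmortem rules listed above can be sketched in a few lines of Python. This is a minimal illustration, not the workbook's implementation: the names `BudgetWindow`, `release_freeze_required`, and `postmortem_required` are hypothetical, the budget is assumed to be request-based, and the default `threshold` of 0.20 is an illustrative value to replace with whatever fraction your policy defines.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BudgetWindow:
    """Request counts for one rolling window (for example, four weeks)."""
    slo_target: float      # e.g. 0.999 means 99.9% of requests must succeed
    total_requests: int
    failed_requests: int

    def __post_init__(self) -> None:
        # A target of exactly 1.0 leaves no budget and would divide by zero below.
        if not 0.0 < self.slo_target < 1.0:
            raise ValueError("slo_target must be strictly between 0 and 1")
        if self.total_requests <= 0:
            raise ValueError("total_requests must be positive")

    @property
    def allowed_failures(self) -> float:
        # The error budget: the number of requests the SLO permits to fail.
        return (1.0 - self.slo_target) * self.total_requests

    @property
    def budget_consumed(self) -> float:
        # Fraction of the budget spent; above 1.0 means the budget is exceeded.
        return self.failed_requests / self.allowed_failures


def release_freeze_required(window: BudgetWindow) -> bool:
    """Freeze non-exempt changes once the window's budget is exceeded."""
    return window.budget_consumed > 1.0


def postmortem_required(window: BudgetWindow,
                        incident_failures: int,
                        threshold: float = 0.20) -> bool:
    """True when one incident alone spent more than `threshold` of the budget.

    The 0.20 default is an assumption for illustration only.
    """
    return incident_failures / window.allowed_failures > threshold


if __name__ == "__main__":
    window = BudgetWindow(slo_target=0.999,
                          total_requests=1_000_000,
                          failed_requests=1_200)
    print("freeze:", release_freeze_required(window))
    print("postmortem:", postmortem_required(window, incident_failures=250))
```

Freeze exceptions (highest-severity defects, security remediation) are deliberately left out of the code: they are a human approval step, so the sketch only reports whether the freeze condition holds.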

When to Use This Skill

  • When adopting or revising organization-wide reliability governance tied to customer-facing SLOs
  • When incident retrospectives reveal mismatched expectations about whether releases should pause after repeated misses
  • When onboarding partner teams who inherit shared infrastructure budgets and need uniform enforcement rules

Expected Output

An error-budget policy document that aligns executives, product, and engineering on freeze triggers, exceptions, measurement windows, postmortem obligations, and escalation paths.

Frequently Asked Questions

Is freezing releases the default outcome?
Google’s template freezes broad changes only after the service exceeds its error budget for the measured window; services performing at or above their SLO continue shipping normally.
How does this relate to blamelessness?
The worked example stresses incentives rather than punishment—budget halts redirect attention toward systemic fixes while preserving psychological safety.
Can partner-caused outages consume our budget?
The appendix sketches carve-outs when outages originate outside the owning team or when miscategorized telemetry skews consumption; tailor similar clauses for your dependency graph.
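
One way to make such carve-outs auditable is to tag each incident with an attribution and report how much budget each category spent before deciding whether the freeze applies. The sketch below is a hypothetical illustration: the `Attribution` categories, the `Incident` record, and the `freeze_applies` helper are assumptions for this example, and which categories count as exempt is a policy decision your document must state explicitly.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, FrozenSet, Iterable


class Attribution(Enum):
    INTERNAL = "internal"                          # defect owned by this team
    EXTERNAL_DEPENDENCY = "external_dependency"    # outage owned by another org
    MISCATEGORIZED = "miscategorized"              # telemetry counted in error


@dataclass(frozen=True)
class Incident:
    name: str
    failed_requests: int
    attribution: Attribution


def failures_by_attribution(incidents: Iterable[Incident]) -> Dict[Attribution, int]:
    """Sum failed requests per attribution so disputes start from shared numbers."""
    totals = {attribution: 0 for attribution in Attribution}
    for incident in incidents:
        totals[incident.attribution] += incident.failed_requests
    return totals


# Assumed default: only internally owned failures count toward the freeze.
DEFAULT_EXEMPT: FrozenSet[Attribution] = frozenset(
    {Attribution.EXTERNAL_DEPENDENCY, Attribution.MISCATEGORIZED}
)


def freeze_applies(incidents: Iterable[Incident],
                   allowed_failures: float,
                   exempt: FrozenSet[Attribution] = DEFAULT_EXEMPT) -> bool:
    """True when non-exempt failures alone exceed the window's error budget."""
    counted = sum(i.failed_requests for i in incidents
                  if i.attribution not in exempt)
    return counted > allowed_failures


if __name__ == "__main__":
    incidents = [
        Incident("bad rollout", 600, Attribution.INTERNAL),
        Incident("upstream outage", 500, Attribution.EXTERNAL_DEPENDENCY),
        Incident("mislabelled probe traffic", 100, Attribution.MISCATEGORIZED),
    ]
    print(failures_by_attribution(incidents))
    print("freeze applies:", freeze_applies(incidents, allowed_failures=1000))
```

Keeping the exempt set as an explicit parameter means the same incident data can be re-evaluated under a stricter reading when a dispute is escalated.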

Related

Example SLO document authoring

Operations

Operationalizes Appendix A from Google’s SRE workbook by translating the illustrative “Example Game Service” SLO dossier into a checklist teams can mimic: articulate the user-facing workload, nominate rolling measurement windows (the appendix uses four weeks), pair each subsystem with tightly defined SLIs (availability measured at the load balancer with 5xx responses counted as failures, latency percentile gates, freshness for derived tables, correctness via probers, completeness for pipelines), state explicit numerator/denominator language, rationalize rounding policies, quantify per-objective error budgets, and reference the sibling error budget policy for enforcement.

AI economic benefit distribution readiness review

Operations

Converts public-policy and labor-relations guidance around AI-driven wealth into a planning checklist for organizations operating in semiconductor-heavy economies. Teams document how AI productivity gains translate—or fail to translate—into worker bonuses, public dividends, or reinvestment; assess concentration risk when chipmakers dominate equity indices; and prepare dialogue frameworks for recurring labor-management disputes as agentic automation scales. The skill cites CNBC reporting on South Korea's deputy prime minister urging that AI benefits reach the public amid Samsung strike negotiations, Kospi gains led by Samsung and SK Hynix, and debates over distributing AI-sector tax windfalls—without prescribing specific tax policies beyond verifying stakeholder messaging against cited facts.

Agentic coding vendor readiness review

Operations

Turns platform reliability and multi-vendor coding-agent guidance into a checklist before standardizing on a single AI coding stack. Teams inventory host-platform SLAs (for example GitHub availability incidents documented on githubstatus.com), compare primary and backup agents (GitHub Copilot, Cursor, Claude Code, Codex, etc.), verify observability hooks through Braintrust or similar tracing, and rehearse workflows when the code host or agent API is degraded. The skill cites public status pages and vendor billing changes—such as usage-based Copilot pricing announced on github.blog—so procurement and engineering sign off with eyes open about downtime, leadership churn, and feature parity gaps reported in trade media.