Skill Entry

Error budget policy for service reliability

Implements the Google SRE practice of tying product velocity to measured reliability: define a service-level objective (SLO), derive an error budget from permitted unavailability or bad events, and govern launches, refactors, and feature freezes based on remaining budget. This skill operationalizes the error-budget policy described in Google’s SRE Workbook so teams make explicit trade-offs instead of debating reliability anecdotally.

Category: Operations
Platform: Any engineering org
Published: 2026-05-11
Tags: sre, slo, reliability

Use cases

  • Deciding whether to freeze risky releases after repeated outages
  • Negotiating launch timing between product and infrastructure teams
  • Prioritizing reliability hardening when user-visible errors consume budget quickly
  • Explaining to leadership why a feature must wait until error budget recovers
  • Designing quarterly reliability goals aligned with customer expectations

Key features

  • Pick user-journey SLIs (latency, success rate, freshness) tied to real user pain, not vanity dashboards
  • Set an SLO target and compute the error budget as 100% minus the SLO (a 99.9% SLO leaves a 0.1% budget), applied over a rolling window
  • Define policy actions at budget thresholds: e.g., tighten change management, halt launches, or mandate fixits
  • Instrument burn-rate alerts so the team reacts before budget is fully depleted
  • Review policy quarterly: adjust SLOs when product promises or architecture materially change
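
The budget math and burn-rate ideas in the list above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the names `BudgetStatus`, `error_budget`, and `burn_rate` are hypothetical, and the sketch assumes an event-based SLI where good and bad events are counted over the rolling window.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BudgetStatus:
    allowed_bad: float  # bad events the SLO permits in the window
    consumed: float     # fraction of the budget used (can exceed 1.0)
    remaining: float    # fraction of the budget left (negative if overspent)


def error_budget(slo: float, total_events: int, bad_events: int) -> BudgetStatus:
    """Budget status for one rolling window; slo is a fraction such as 0.999."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    # Error budget = (100% - SLO) applied to the events seen in the window.
    allowed_bad = (1.0 - slo) * total_events
    consumed = bad_events / allowed_bad
    return BudgetStatus(allowed_bad, consumed, 1.0 - consumed)


def burn_rate(slo: float, total_events: int, bad_events: int) -> float:
    """Observed error rate divided by the error rate the SLO allows.

    A value of 1.0 means the budget would be exactly spent by the end of
    the window if this rate persisted; higher values spend it sooner.
    """
    if total_events == 0:
        return 0.0
    return (bad_events / total_events) / (1.0 - slo)


# Example: 99.9% SLO, 1,000,000 requests in the window, 250 of them bad.
status = error_budget(0.999, 1_000_000, 250)
print(f"allowed bad events: {status.allowed_bad:.0f}")
print(f"budget consumed: {status.consumed:.0%}, remaining: {status.remaining:.0%}")
```

The burn rate at which an alert should fire, and over which lookback window it is measured, is a policy choice that this sketch leaves open.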

When to Use This Skill

  • When outages recur and teams disagree on whether to slow shipping
  • When you need quantified guardrails instead of intuition-only reliability debates
  • When rolling out high-risk features that could breach customer-facing SLOs

Expected Output

A written error-budget policy covering SLIs, SLO, budget math, escalation thresholds, and launch freeze rules referenced to your monitoring data.
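
As a rough illustration of how the escalation thresholds in such a policy might be encoded, the Python sketch below maps the remaining budget fraction to an action. The threshold values (50%, 25%, 0%) and the name `policy_action` are assumptions made for the example; a real policy would take its numbers from what product and infrastructure teams agree on.

```python
# Hypothetical thresholds: (budget floor, action while remaining budget is above it).
# Ordered from the healthiest state to the most depleted.
POLICY = [
    (0.50, "normal release process"),
    (0.25, "tighten change management"),
    (0.00, "halt launches"),
]

# Action when the budget is fully spent or overspent.
EXHAUSTED_ACTION = "freeze feature work and mandate fixits"


def policy_action(remaining: float) -> str:
    """Return the policy action for a remaining-budget fraction (1.0 = untouched)."""
    for floor, action in POLICY:
        if remaining > floor:
            return action
    return EXHAUSTED_ACTION


for remaining in (0.8, 0.4, 0.1, -0.2):
    print(f"{remaining:+.0%} budget remaining: {policy_action(remaining)}")
```

Keeping the thresholds in a single table makes the quarterly policy review a matter of changing data rather than logic.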

Frequently Asked Questions

Is an error budget only about uptime?
No. Google’s framing applies to any SLI where you can count bad events (failed requests, violated latency thresholds, or stale data) against an allowed margin.

What if we have no historical SLO data?
Start with a conservative provisional SLO, then loosen or tighten it after collecting a few weeks of measured SLI data from production.

Does this replace incident response?
It complements it: incident response stops the bleeding; error budgets guide proactive pacing of change before incidents stack up.
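
To make the first answer concrete, here is a small Python sketch that counts bad events separately for availability, latency, and freshness SLIs. The `Event` fields and the default thresholds (300 ms latency, 60 s data age) are assumptions chosen for illustration, not values from any particular monitoring system.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class Event:
    ok: bool                            # did the request succeed?
    latency_ms: float                   # how long it took to serve
    data_age_s: Optional[float] = None  # None when freshness does not apply


def count_bad_events(
    events: Iterable[Event],
    latency_threshold_ms: float = 300.0,  # hypothetical latency threshold
    max_age_s: float = 60.0,              # hypothetical freshness threshold
) -> Dict[str, int]:
    """Count bad events per SLI; one event can be bad for several SLIs."""
    counts = {"availability": 0, "latency": 0, "freshness": 0}
    for event in events:
        if not event.ok:
            counts["availability"] += 1
        if event.latency_ms > latency_threshold_ms:
            counts["latency"] += 1
        if event.data_age_s is not None and event.data_age_s > max_age_s:
            counts["freshness"] += 1
    return counts


sample = [
    Event(ok=True, latency_ms=120.0, data_age_s=5.0),
    Event(ok=False, latency_ms=80.0),
    Event(ok=True, latency_ms=450.0, data_age_s=90.0),
]
print(count_bad_events(sample))
```

Each count can then be compared against its own SLO and error budget, since the SLIs are tracked independently.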
