Error budget policy for service reliability

Implements the Google SRE practice of tying product velocity to measured reliability: define a service-level objective (SLO), derive an error budget from permitted unavailability or bad events, and govern launches, refactors, and feature freezes based on remaining budget. This skill operationalizes the error-budget policy described in Google’s SRE Workbook so teams make explicit trade-offs instead of debating reliability anecdotally.

Category Operations

Platform Any engineering org

Published 2026-05-11

sre, slo, reliability

Use cases

Deciding whether to freeze risky releases after repeated outages
Negotiating launch timing between product and infrastructure teams
Prioritizing reliability hardening when user-visible errors consume budget quickly
Explaining to leadership why a feature must wait until error budget recovers
Designing quarterly reliability goals aligned with customer expectations

Key features

Pick user-journey SLIs (latency, success rate, freshness) tied to real user pain, not vanity dashboards
Set an SLO target and compute the error budget as 100% minus the SLO, applied over a rolling window
Define policy actions at budget thresholds: e.g., tighten change management, halt launches, or mandate fixits
Instrument burn-rate alerts so the team reacts before budget is fully depleted
Review policy quarterly: adjust SLOs when product promises or architecture materially change
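The budget math and burn rate described in the list above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed implementation: the function names, the `BudgetStatus` type, and the sample numbers (a 99.9% SLO over one million requests) are assumptions chosen for the example, and the event counts are assumed to come from your own monitoring system for the rolling window.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BudgetStatus:
    allowed_bad: float  # bad events the SLO permits in the window
    consumed: float     # fraction of the budget already spent
    remaining: float    # fraction of the budget still available (negative if overspent)


def error_budget_status(slo: float, total_events: int, bad_events: int) -> BudgetStatus:
    """Error budget = (1 - SLO) applied to the events seen in the rolling window."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    allowed_bad = (1.0 - slo) * total_events
    consumed = bad_events / allowed_bad
    return BudgetStatus(allowed_bad, consumed, 1.0 - consumed)


def burn_rate(slo: float, total_events: int, bad_events: int) -> float:
    """Observed error rate divided by the error rate the SLO allows.

    A value of 1.0 spends the budget exactly over the full window;
    higher values exhaust it proportionally sooner.
    """
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    return (bad_events / total_events) / (1.0 - slo)


# Hypothetical sample: 99.9% SLO, 1,000,000 requests in the window, 250 failed.
status = error_budget_status(0.999, 1_000_000, 250)
print(f"budget consumed: {status.consumed:.1%}")  # budget consumed: 25.0%

# Hypothetical short alerting window: 10,000 requests, 50 failed.
print(f"burn rate: {burn_rate(0.999, 10_000, 50):.1f}")  # burn rate: 5.0
```

A burn-rate alert would compare the second value against a threshold over a short window, so the team is paged while most of the budget is still unspent.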

When to Use This Skill

When outages recur and teams disagree on whether to slow shipping
When you need quantified guardrails instead of intuition-only reliability debates
When rolling out high-risk features that could breach customer-facing SLOs

Expected Output

A written error-budget policy covering SLIs, SLO, budget math, escalation thresholds, and launch freeze rules referenced to your monitoring data.
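One way to keep the written policy unambiguous is to encode its escalation thresholds as data that can be checked against monitoring output. The sketch below assumes hypothetical threshold values (50% and 25% of budget remaining) and a hypothetical `policy_action` helper; the action names echo the examples given under Key features, but the cut-offs are placeholders that each team would set for itself.

```python
# Hypothetical thresholds: (floor on remaining budget fraction, action).
# Entries are ordered from healthiest to most depleted.
POLICY = [
    (0.50, "normal release cadence"),
    (0.25, "tighten change management"),
    (0.00, "halt launches"),
]
# Applied when the budget is fully spent or overspent.
EXHAUSTED_ACTION = "halt launches and mandate fixits"


def policy_action(remaining: float) -> str:
    """Return the action for the first threshold the remaining budget exceeds."""
    for floor, action in POLICY:
        if remaining > floor:
            return action
    return EXHAUSTED_ACTION


for remaining in (0.60, 0.30, 0.10, -0.20):
    print(f"{remaining:+.0%} remaining -> {policy_action(remaining)}")
```

Keeping the thresholds in one table makes the quarterly review a matter of editing values rather than reinterpreting prose.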

Frequently Asked Questions

Is an error budget only about uptime? No. Google’s framing applies to any SLI where you can count bad events (failed requests, violated latency thresholds, or stale data) against an allowed margin.
What if we have no historical SLO data? Start with a conservative provisional SLO, then loosen or tighten it after collecting a few weeks of measured SLI data from production.
Does this replace incident response? No, it complements it: incident response stops the bleeding, while error budgets pace change proactively before incidents stack up.
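To illustrate the first answer, the sketch below counts bad events for a request-based SLI in which a request is bad if it failed or exceeded a latency threshold. The `Request` type, the `count_bad_events` helper, and the 300 ms threshold are assumptions made for this example; a real SLI would be computed from your own request logs or metrics.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Request:
    ok: bool           # whether the request succeeded
    latency_ms: float  # observed latency in milliseconds


def count_bad_events(requests: Iterable[Request], latency_threshold_ms: float) -> int:
    """Count requests that failed or were slower than the latency threshold."""
    return sum(
        1 for r in requests
        if not r.ok or r.latency_ms > latency_threshold_ms
    )


# Hypothetical sample: one failure, one slow success, two good requests.
sample = [
    Request(ok=True, latency_ms=120.0),
    Request(ok=False, latency_ms=80.0),
    Request(ok=True, latency_ms=450.0),
    Request(ok=True, latency_ms=299.0),
]
print(count_bad_events(sample, latency_threshold_ms=300.0))  # 2
```

The resulting count is what gets charged against the error budget, regardless of whether the service was technically "up".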

Related

Git worktrees for isolation

Operations

Uses Git worktrees to create isolated working directories attached to the same repository, each on a different branch, so parallel experiments or long-running tasks do not interfere with the main working tree or require repeated stash-and-reapply cycles. This is especially useful when one branch requires a heavy build or test run while work continues on another.

SEO audit for web properties

Operations

Diagnoses indexing, crawlability, and on-page SEO issues across an entire site using automated crawls, Lighthouse checks, and structured output. An SEO audit surfaces actionable findings ranked by priority before manual review, making it possible to address critical issues quickly rather than discovering them through traffic drops.

Canary rollouts

Operations

Deploys a new version to a small percentage of production traffic first, monitors error budgets and latency against baseline, and automatically widens or rolls back based on pre-defined criteria. This keeps the blast radius of a bad deployment small—particularly important when AI agents are modifying deployment pipelines where a single bad command could affect many users.
