Implements the Google SRE practice of tying product velocity to measured reliability: define a service-level objective (SLO), derive an error budget from permitted unavailability or bad events, and govern launches, refactors, and feature freezes based on remaining budget. This skill operationalizes the error-budget policy described in Google’s SRE Workbook so teams make explicit trade-offs instead of debating reliability anecdotally.
Use Cases
- Deciding whether to freeze risky releases after repeated outages
- Negotiating launch timing between product and infrastructure teams
- Prioritizing reliability hardening when user-visible errors consume budget quickly
- Explaining to leadership why a feature must wait until error budget recovers
- Designing quarterly reliability goals aligned with customer expectations
Key Features
- Pick user-journey SLIs (latency, success rate, freshness) tied to real user pain, not vanity dashboards
- Set an SLO target and compute the error budget as 100% minus the SLO, applied over a rolling window
- Define policy actions at budget thresholds: e.g., tighten change management, halt launches, or mandate fixits
- Instrument burn-rate alerts so the team reacts before budget is fully depleted
- Review policy quarterly: adjust SLOs when product promises or architecture materially change
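The budget math and threshold actions above can be sketched in a few lines. This is a minimal illustration, not any particular monitoring API; the function names and the 25% escalation threshold are assumptions chosen for the example.

```python
# Error-budget math for a rolling window, per the "100% minus SLO" rule.
# All names and thresholds here are illustrative.

def error_budget(slo: float) -> float:
    """Allowed fraction of bad events, e.g. SLO 0.999 -> budget 0.001."""
    return 1.0 - slo

def budget_remaining(good: int, bad: int, slo: float) -> float:
    """Fraction of the error budget still unspent over the window.

    Goes negative once the service has overspent its budget."""
    total = good + bad
    allowed_bad = error_budget(slo) * total
    return 1.0 - bad / allowed_bad if allowed_bad else 0.0

def policy_action(remaining: float) -> str:
    """Map remaining budget to an escalation tier (example thresholds)."""
    if remaining <= 0.0:
        return "halt launches"
    if remaining < 0.25:
        return "tighten change management"
    return "ship normally"

# 30-day window: 9,990,000 good and 10,000 bad requests against a 99.9% SLO.
# 10,000 bad events is exactly the allowed budget, so the budget is spent.
remaining = budget_remaining(9_990_000, 10_000, slo=0.999)
print(round(remaining, 2), policy_action(remaining))  # -> 0.0 halt launches
```

The same arithmetic works for any countable bad event (failed requests, slow requests, stale reads), which is what lets one policy govern several SLIs.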
When to Use This Skill
- When outages recur and teams disagree on whether to slow shipping
- When you need quantified guardrails instead of intuition-only reliability debates
- When rolling out high-risk features that could breach customer-facing SLOs
Expected Output
A written error-budget policy covering SLIs, the SLO, budget math, escalation thresholds, and launch-freeze rules, grounded in your monitoring data.
Frequently Asked Questions
- Is an error budget only about uptime?
- No. Google’s framing applies to any SLI where you can count bad events—failed requests, violated latency thresholds, or stale data—against an allowed margin.
- What if we have no historical SLO data?
- Start with a conservative, provisional SLO, then tighten or loosen it after collecting a few weeks of measured SLI data from production.
- Does this replace incident response?
- It complements it: incident response stops the bleeding during an outage, while error budgets proactively pace change before incidents stack up.
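The proactive pacing described above is usually driven by the burn-rate alerts listed under Key Features: burn rate measures how fast the window's budget is being spent, where 1.0 consumes exactly the whole budget over the full window. A minimal sketch follows; the multiwindow check and the 14.4 paging threshold follow the SRE Workbook's worked example (roughly 2% of a 30-day budget burned in one hour), but the function names are illustrative and the threshold should be tuned to your service.

```python
# Burn-rate check: how fast the rolling window's error budget is being spent.
# A burn rate of 1.0 spends exactly the whole budget over the full window.

def burn_rate(bad: int, total: int, slo: float) -> float:
    budget = 1.0 - slo  # allowed bad-event fraction
    return (bad / total) / budget if total else 0.0

def should_page(bad_1h, total_1h, bad_5m, total_5m, slo, threshold=14.4):
    """Multiwindow alert: both the long and the short window must be
    burning fast, so a brief spike that has already ended does not page."""
    return (burn_rate(bad_1h, total_1h, slo) >= threshold
            and burn_rate(bad_5m, total_5m, slo) >= threshold)

# 1.5% errors over both windows against a 99.9% SLO -> burn rate 15x,
# above the 14.4 paging threshold in both windows.
print(should_page(bad_1h=1_500, total_1h=100_000,
                  bad_5m=150, total_5m=10_000, slo=0.999))  # -> True
```

The short window confirms the problem is still happening; the long window confirms enough budget has actually been spent to justify waking someone up.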
Related
Git worktrees for isolation
Uses Git worktrees to create isolated working directories attached to the same repository, each on a different branch, so parallel experiments or long-running tasks do not interfere with the main working tree or require repeated stash-and-reapply cycles. This is especially useful when one branch requires a heavy build or test run while work continues on another.
SEO audit for web properties
Diagnoses indexing, crawlability, and on-page SEO issues across an entire site using automated crawls, Lighthouse checks, and structured output. An SEO audit surfaces actionable findings ranked by priority before manual review, making it possible to address critical issues quickly rather than discovering them through traffic drops.
Canary rollouts
Deploys a new version to a small percentage of production traffic first, monitors error budgets and latency against baseline, and automatically widens or rolls back based on pre-defined criteria. This keeps the blast radius of a bad deployment small—particularly important when AI agents are modifying deployment pipelines where a single bad command could affect many users.