Skill Entry

Observability baselines

Establishes golden signals (latency, traffic, errors, saturation), SLO windows, and dashboard checks before agents automate deployments, so that 'healthy' and 'degraded' have measurable definitions rather than subjective interpretations. This matters most when AI agents manage deploys, because an agent needs objective metrics to make decisions, not human gut feelings.

Category Operations
Platform Codex / Claude Code
Published 2026-04-17
observability, sre, metrics

Use cases

  • Onboarding a new service to the observability platform and needing to define what 'healthy' means from day one
  • Before automating a deployment pipeline and needing objective criteria for rollback versus proceed decisions
  • Analyzing a canary deployment where you need pre-defined thresholds to determine if the new version should be promoted
  • Setting up on-call runbooks where engineers need clear thresholds to decide when to escalate versus when to monitor
  • Defining SLOs for a new product feature where product and engineering need to agree on acceptable reliability levels
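
The rollback-versus-proceed and canary-promotion cases above can be reduced to a small gate function. The following is a minimal sketch, not a prescribed implementation: the names `Thresholds`, `CanarySample`, and `decide`, and the choice of error rate and p99 latency as the gated metrics, are assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    PROMOTE = "promote"
    HOLD = "hold"          # not enough traffic yet to judge
    ROLLBACK = "rollback"


@dataclass(frozen=True)
class Thresholds:
    """Hypothetical pre-agreed limits; real values come from the service's SLOs."""
    max_error_rate: float        # e.g. 0.01 means 1% of requests may fail
    max_p99_latency_ms: float
    min_requests: int            # below this, the sample is too small to trust


@dataclass(frozen=True)
class CanarySample:
    requests: int
    errors: int
    p99_latency_ms: float


def decide(sample: CanarySample, thresholds: Thresholds) -> Decision:
    """Return a deploy decision from measured values only, with no judgment calls."""
    if sample.requests == 0 or sample.requests < thresholds.min_requests:
        return Decision.HOLD
    error_rate = sample.errors / sample.requests
    if error_rate > thresholds.max_error_rate:
        return Decision.ROLLBACK
    if sample.p99_latency_ms > thresholds.max_p99_latency_ms:
        return Decision.ROLLBACK
    return Decision.PROMOTE
```

Keeping a separate `HOLD` outcome is a design choice in this sketch: it stops an agent from promoting or rolling back on a sample too small to be meaningful.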

Key features

  • Identify the SLIs (Service Level Indicators) most tied to user pain: typically latency, error rate, and throughput for request-driven services
  • Define SLO targets for each SLI with a clear window (for example, a 30-day rolling window or a calendar month) and document what happens when the SLO is breached
  • Set error budgets (how much unreliability is acceptable over the window) based on the SLO, and wire alerts to burn rate rather than only to threshold violations
  • Build dashboards that show current SLO status, error budget burn rate, and the top contributors to latency or errors
  • Link each alert to a runbook that specifies the action to take when the alert fires, so on-call engineers do not need to diagnose from first principles at 3am
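
The error budget and burn rate mentioned above follow from simple arithmetic on the SLO target. The sketch below shows one way to compute them; the names `Slo`, `error_budget`, `burn_rate`, and `should_page` are illustrative assumptions, and the two-window check is one common alerting pattern rather than something this entry prescribes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    name: str
    target: float      # e.g. 0.999 means 99.9% of requests must be good
    window_days: int   # e.g. 30 for a 30-day rolling window


def error_budget(slo: Slo) -> float:
    """Fraction of requests that may fail over the window without breaching the SLO."""
    return 1.0 - slo.target


def burn_rate(slo: Slo, bad: int, total: int) -> float:
    """Observed error rate divided by the budgeted error rate.

    1.0 means the budget would be used up exactly at the end of the window;
    higher values mean it runs out sooner.
    """
    if total == 0:
        return 0.0  # no traffic, nothing burned
    return (bad / total) / error_budget(slo)


def should_page(long_window_rate: float, short_window_rate: float, threshold: float) -> bool:
    """Page only if both a long and a short lookback window exceed the threshold.

    Requiring both windows (an assumption of this sketch) avoids paging on
    problems that have already stopped.
    """
    return long_window_rate > threshold and short_window_rate > threshold


# Illustrative values only: 10 failures out of 1000 requests against a 99.9% target.
slo = Slo(name="availability", target=0.999, window_days=30)
rate = burn_rate(slo, bad=10, total=1000)  # roughly 10x the sustainable rate
```

Alerting on burn rate rather than on a raw error-rate threshold ties the alert to how soon the budget would be exhausted, which is the quantity the SLO actually constrains.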

When to Use This Skill

  • When building a new service and wanting observability defined before the first production deploy
  • When automating deployments and needing objective criteria for the automation to make decisions
  • When defining SLOs for a product feature where reliability expectations need to be agreed upon between product and engineering

Expected Output

Documented SLOs with SLIs, error budgets, dashboard definitions, and runbooks linked to alerts, ready to be implemented in the observability platform.
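
One way to keep such a document checkable is to hold it as structured data and validate it before handing it to automation. This is a sketch under stated assumptions: `SloDoc`, `Alert`, and `validate` are hypothetical names, the fields are a guess at a minimal schema, and the runbook URL in the example is a placeholder.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Alert:
    name: str
    runbook_url: str = ""   # empty string means no runbook has been linked yet


@dataclass
class SloDoc:
    sli: str                # human-readable description of the indicator
    target: float           # fraction of good events, strictly between 0 and 1
    window_days: int
    alerts: List[Alert] = field(default_factory=list)


def validate(doc: SloDoc) -> List[str]:
    """Return a list of problems; an empty list means the document is complete."""
    problems = []
    if not 0.0 < doc.target < 1.0:
        problems.append("target must be strictly between 0 and 1")
    if doc.window_days <= 0:
        problems.append("window_days must be positive")
    if not doc.alerts:
        problems.append("no alerts defined")
    for alert in doc.alerts:
        if not alert.runbook_url:
            problems.append(f"alert {alert.name!r} has no runbook")
    return problems


# Placeholder example; the URL is not a real runbook location.
doc = SloDoc(
    sli="share of requests served in under 300 ms",
    target=0.99,
    window_days=30,
    alerts=[Alert("latency-burn", "https://runbooks.example/latency-burn")],
)
problems = validate(doc)
```

A check like this can run in CI so that an SLO without a window, or an alert without a runbook, is caught before an agent relies on it.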

Frequently Asked Questions

How many SLOs should a service have?
Three to five at most, each tied to a user-facing SLI such as latency, availability, or error rate. More SLOs create maintenance overhead and dilute focus. Choose the SLIs that most directly affect user experience.
What if we cannot agree on SLO targets with the product team?
Start with a less aggressive target that you are confident you can meet and improve it over time as the system matures. An achievable SLO that is met beats an ambitious SLO that is perpetually breached.
How do observability baselines differ from structured logging?
Structured logging defines how to emit log data. Observability baselines define which metrics to collect, what they mean, and how to interpret them together. Structured logging is a prerequisite for observability, not a substitute for it.
