Observability baselines Skill for Codex / Claude Code

Establishes golden signals (latency, traffic, errors, saturation), SLO windows, and dashboard checks before agents automate deployments, so that 'healthy' and 'degraded' have measurable definitions rather than subjective interpretations. This matters most when AI agents manage deploys: an agent needs objective metrics to act on, not human gut feelings.

Category Operations

Platform Codex / Claude Code

Published 2026-04-17

observability, sre, metrics

Use cases

Onboarding a new service to the observability platform and needing to define what 'healthy' means from day one
Before automating a deployment pipeline and needing objective criteria for rollback versus proceed decisions
Analyzing a canary deployment where you need pre-defined thresholds to determine if the new version should be promoted
Setting up on-call runbooks where engineers need clear thresholds to decide when to escalate versus when to monitor
Defining SLOs for a new product feature where product and engineering need to agree on acceptable reliability levels
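
The rollback-versus-proceed and canary-promotion cases above can be sketched as a small gate function. This is a minimal illustration, not a definitive implementation: the names Thresholds, Snapshot, and evaluate are hypothetical, and the idea of holding when traffic is too low to judge is an assumption added for the example.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    PROCEED = "proceed"
    HOLD = "hold"          # not enough data yet; keep monitoring
    ROLLBACK = "rollback"


@dataclass(frozen=True)
class Thresholds:
    """Hypothetical limits agreed on before the deploy starts."""
    max_error_rate: float        # fraction of failed requests, e.g. 0.01
    max_p99_latency_ms: float    # upper bound for 99th percentile latency
    min_requests: int            # below this, the sample is too small to judge


@dataclass(frozen=True)
class Snapshot:
    """Metrics observed for the new version over the evaluation period."""
    requests: int
    errors: int
    p99_latency_ms: float


def evaluate(snapshot: Snapshot, limits: Thresholds) -> Decision:
    """Turn observed metrics into a decision using only pre-defined limits."""
    if snapshot.requests == 0 or snapshot.requests < limits.min_requests:
        return Decision.HOLD
    error_rate = snapshot.errors / snapshot.requests
    if error_rate > limits.max_error_rate:
        return Decision.ROLLBACK
    if snapshot.p99_latency_ms > limits.max_p99_latency_ms:
        return Decision.ROLLBACK
    return Decision.PROCEED
```

Because the limits are plain data, they can be reviewed and agreed on before the pipeline is automated, and the same record can back both the on-call runbook and the deployment gate.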

Key features

Identify the SLIs (Service Level Indicators) most tied to user pain: typically latency, error rate, and throughput for request-driven services
Define SLO targets for each SLI with a clear window (for example, 30-day rolling or calendar-based) and document what happens when the SLO is breached
Set error budgets—how much unreliability is acceptable over the window—based on the SLO, and wire alerts to burn rate rather than just threshold violations
Build dashboards that show current SLO status, error budget burn rate, and the top contributors to latency or errors
Link each alert to a runbook that specifies the action to take when the alert fires, so on-call engineers do not need to diagnose from first principles at 3am
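
The relationship between an SLO target, its error budget, and burn rate can be worked through in a few lines. The sketch below assumes a request-based SLI (good requests over total requests); the Slo record and the function names are hypothetical and not part of any particular observability platform.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    """Hypothetical SLO record: a target success ratio over a rolling window."""
    name: str
    target: float      # e.g. 0.999 means 99.9% of requests must succeed
    window_days: int   # e.g. 30 for a 30-day rolling window


def error_budget(slo: Slo) -> float:
    """Fraction of requests that may fail over the window."""
    return 1.0 - slo.target


def burn_rate(slo: Slo, bad: int, total: int) -> float:
    """Observed error ratio divided by the budgeted error ratio.

    A value of 1.0 spends the budget exactly over the full window;
    higher values exhaust it proportionally sooner.
    """
    if total == 0:
        return 0.0
    return (bad / total) / error_budget(slo)


def days_to_exhaustion(slo: Slo, rate: float) -> float:
    """Days until the budget is gone if the given burn rate is sustained."""
    if rate <= 0:
        return float("inf")
    return slo.window_days / rate


def should_alert(rate: float, threshold: float) -> bool:
    """Alert on burn rate; the threshold value is a team decision."""
    return rate >= threshold
```

For a 99.9% target over 30 days, 5 failures in 1,000 requests is a burn rate of about 5, which would exhaust the budget in roughly 6 days if sustained. A plain threshold alert on the raw error rate would not convey that urgency.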

When to Use This Skill

When building a new service and wanting observability defined before the first production deploy
When automating deployments and needing objective criteria for the automation to make decisions
When defining SLOs for a product feature where reliability expectations need to be agreed upon between product and engineering

Expected Output

Documented SLOs with SLIs, error budgets, dashboard definitions, and runbooks linked to alerts, ready to be implemented in the observability platform.
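
One possible shape for that output is a small structured record that can be checked before it is handed to the observability platform. This is a sketch under stated assumptions: the field names, the SloDocument and Alert types, and the example.com runbook URL are all hypothetical placeholders.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class Alert:
    name: str
    burn_rate_threshold: float
    runbook_url: str   # every alert should point at a runbook


@dataclass
class SloDocument:
    service: str
    sli: str           # what is measured, in plain words
    target: float      # success ratio, strictly between 0 and 1
    window: str        # e.g. "30d rolling"
    alerts: List[Alert] = field(default_factory=list)

    def problems(self) -> List[str]:
        """List gaps that would make the document unusable for on-call."""
        issues = []
        if not 0.0 < self.target < 1.0:
            issues.append("target must be between 0 and 1")
        if not self.alerts:
            issues.append("no alerts defined")
        for alert in self.alerts:
            if not alert.runbook_url:
                issues.append(f"alert {alert.name!r} has no runbook")
        return issues

    def to_json(self) -> str:
        """Serialize for review or import into another tool."""
        return json.dumps(asdict(self), indent=2)


doc = SloDocument(
    service="checkout",
    sli="share of HTTP requests answered without a 5xx",
    target=0.999,
    window="30d rolling",
    alerts=[Alert("fast-burn", 14.0, "https://example.com/runbooks/fast-burn")],
)
```

Calling doc.problems() on a complete record returns an empty list, and doc.to_json() produces text that product and engineering can review together.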

Frequently Asked Questions

How many SLOs should a service have?: Three to five at most, usually one per user-facing indicator such as latency, availability, and error rate. More SLOs create maintenance overhead and dilute focus. Choose the SLIs that most directly affect user experience.
What if we cannot agree on SLO targets with the product team?: Start with a less aggressive target that you are confident you can meet and improve it over time as the system matures. An achievable SLO that is met beats an ambitious SLO that is perpetually breached.
How do observability baselines differ from structured logging?: Structured logging defines how to emit log data. Observability baselines define which metrics to collect, what they mean, and how to interpret them together. Structured logging is a prerequisite for observability, not a substitute for it.

3 Indexed items

Canary rollouts

Operations

Deploys a new version to a small percentage of production traffic first, monitors error budgets and latency against baseline, and automatically widens or rolls back based on pre-defined criteria. This keeps the blast radius of a bad deployment small—particularly important when AI agents are modifying deployment pipelines where a single bad command could affect many users.

Performance profiling

Operations

Finds genuine performance bottlenecks using CPU profiles, flame graphs, memory traces, and system metrics under realistic load before rewriting code. This prevents the common anti-pattern of spending days optimizing code paths that are not in the critical path, based on intuition rather than measurement.

Anthropic Fable/Mythos export ban lifted due diligence

Operations

Structures CNN reporting on June 30, 2026 that the US government lifted export controls on Anthropic's Claude Fable 5 and Mythos 5 into a policy, security, and release-governance checklist. The workflow separates verified facts from internal frontier-model access planning. The verified facts are: Anthropic said Commerce lifted export controls on Fable 5 and Mythos 5 and would begin restoring access; Commerce Secretary Howard Lutnick posted on X about lifting controls on Fable after two weeks working with Anthropic to analyze and approve Fable 5; Fable is Mythos with extra public guardrails; earlier export ban followed Amazon finding a jailbreak (Anthropic called jailbreaks simple and noted similar work-arounds on other public models); Anthropic implemented a new safeguard blocking reported vulnerabilities; Commerce had required suspending all foreign-national use including Anthropic employees; Mythos was later released to select government-approved entities; CNN notes experts say Mythos can exploit cyber vulnerabilities at unprecedented pace; White House also asked OpenAI to limit GPT 5.6 release to government-approved partners. Distinct from June ban-imposition reporting tracked by anthropic-mythos-export-control-due-diligence.
