Skill Entry

Production debugging

Diagnoses live production incidents using log triage, metric spike correlation, deploy window filtering, and safe reproduction steps without causing further disruption. Production debugging applies systematic debugging principles in a live environment where the cost of wrong actions is high and the ability to reproduce the issue is limited.

Category Debugging
Platform Codex / Claude Code
Published 2026-04-25
debugging, production, incident-response

Use cases

  • A service suddenly returning 5xx errors for a percentage of requests with no code deploy in the last hour
  • Suspected memory leak where heap usage grows gradually over days until the service restarts or degrades
  • Latency spike in production where p99 latency doubled for a specific endpoint without an obvious cause
  • An intermittent failure that occurs during peak traffic but not during off-peak, suggesting a resource contention issue
  • A third-party dependency degradation causing cascading failures in your service when your upstream provider is slow or returning errors

Key features

  • Check error rate and latency dashboards for the affected service, identifying the spike window and which endpoints or operations are degraded
  • Filter logs by service, severity, and time window, looking for exception patterns, unusual error types, or messages that appear only during the incident window
  • Correlate the incident with recent deployments: verify which version is running, whether there was a deploy in the last few hours, and what the baseline metrics looked like before the deploy
  • Check upstream and downstream service health: upstream failures often cascade downstream, and a latency spike in a dependency can surface as errors (for example, timeouts) in your service
  • Identify a minimal reproduction: a single curl request or small script that reproduces the failure without side effects, allowing you to verify the fix before deploying
  • Open a targeted fix branch rather than patching production directly, apply the minimal fix, verify that the reproduction no longer triggers the failure, then follow the standard deployment process with canary monitoring
  • Confirm metrics return to baseline after deployment before marking the incident resolved and filing the follow-up investigation report
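
The log-filtering step above can be sketched in a few lines. This is a minimal illustration, not a prescribed implementation: the `LogRecord` shape, the `"ERROR"` severity label, and the one-hour baseline are all assumptions, and a real log pipeline would query its own store rather than an in-memory list.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class LogRecord:
    # Hypothetical record shape; real log schemas will differ.
    timestamp: datetime
    service: str
    severity: str
    message: str


def incident_only_messages(records, service, start, end, baseline=timedelta(hours=1)):
    """Count ERROR messages seen in [start, end) that never appear
    in the baseline window immediately before the incident."""

    def errors_between(lo, hi):
        return Counter(
            r.message
            for r in records
            if r.service == service
            and r.severity == "ERROR"
            and lo <= r.timestamp < hi
        )

    during = errors_between(start, end)
    before = errors_between(start - baseline, start)
    # Messages that also occurred before the incident are likely background noise.
    return {msg: count for msg, count in during.items() if msg not in before}
```

Messages that show up in both windows are dropped, so what remains is the short list of error patterns unique to the incident window, which is usually the most useful starting point for a hypothesis.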

When to Use This Skill

  • When a production alert fires and you need to diagnose the root cause rapidly without causing additional disruption
  • When a service is degrading gradually and you need to identify the trigger before it becomes a full outage
  • When a user-reported issue cannot be reproduced in staging and needs investigation in the production environment

Expected Output

A root cause analysis with evidence from logs and metrics, a verified minimal reproduction, a deployed fix, and a confirmation that metrics returned to baseline.

Frequently Asked Questions

What is the most important rule of production debugging?
Never make changes directly on production. Open a fix branch, verify the fix in a non-production environment, and deploy through the standard pipeline with monitoring. The only exception is emergency mitigation (rollback, feature flag disable) that restores service while investigation continues.
How do I debug when I cannot identify a deploy as the trigger?
Check for upstream dependencies (database, cache, third-party APIs), scheduled jobs (cron tasks, batch jobs), traffic spikes (DDoS, viral content), and infrastructure events (auto-scaling, instance replacement). Look for any change, not just code deploys.
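
One way to act on "look for any change" is to merge every kind of change event into a single timeline and read it backwards from the incident start. The sketch below assumes a hypothetical `ChangeEvent` record and a two-hour lookback; where the events come from (deploy tool, scheduler, autoscaler) is left open.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class ChangeEvent:
    # Hypothetical shape: `kind` might be "deploy", "cron", "autoscale", "config", ...
    timestamp: datetime
    kind: str
    detail: str


def changes_before(events, incident_start, lookback=timedelta(hours=2)):
    """Return changes of any kind inside the lookback window,
    ordered with the change closest to the incident first."""
    window_start = incident_start - lookback
    candidates = [e for e in events if window_start <= e.timestamp <= incident_start]
    return sorted(candidates, key=lambda e: e.timestamp, reverse=True)
```

Ordering by proximity to the incident does not prove causation; it only suggests which change to test first.
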
What if multiple services are affected simultaneously?
This typically points to a shared dependency: a shared database, a caching layer, a load balancer, or a common service that multiple services call. Investigate the shared infrastructure before investigating individual services.
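
As a rough sketch of that reasoning, the candidates can be narrowed by intersecting the dependency lists of the affected services. The `dependency_map` structure and the service names in the usage example are hypothetical; in practice this data might come from a service catalog or tracing system.

```python
def rank_shared_dependencies(dependency_map, affected):
    """Return (dependency, healthy_dependents) pairs for dependencies shared by
    every affected service, with the fewest healthy dependents first."""
    affected = set(affected)
    if not affected:
        return []
    common = set.intersection(*(set(dependency_map[s]) for s in affected))
    healthy = [s for s in dependency_map if s not in affected]
    ranked = [
        # A dependency that healthy services also rely on is a weaker suspect.
        (dep, sum(dep in dependency_map[s] for s in healthy))
        for dep in common
    ]
    return sorted(ranked, key=lambda pair: (pair[1], pair[0]))


dependency_map = {
    "checkout": ["postgres-main", "redis-cache", "payments-api"],
    "search": ["redis-cache", "elasticsearch"],
    "profile": ["postgres-main", "redis-cache"],
    "static-site": ["cdn"],
}
suspects = rank_shared_dependencies(dependency_map, ["checkout", "search", "profile"])
```

Here `redis-cache` is the only dependency all three affected services share, and the healthy service does not use it, so it would be the first thing to inspect.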

Related

3 Indexed items

LangSmith production trace investigation playbook

Debugging

Turns LangSmith observability documentation into a repeatable incident workflow for LLM and agent outages: start from a failing run ID or thread, use the UI or LangSmith MCP tools (`fetch_runs`, `get_thread_history`) to reconstruct prompts, tool calls, and errors, then narrow scope with documented filters (run_type, is_root, FQL `filter` / `trace_filter` / `tree_filter`) before proposing code or prompt changes. The playbook cites official pagination rules (character-budget pages with `page_number` and `total_pages`) so investigators do not assume single-shot dumps, and it reminds teams to separate Cloud OAuth Remote MCP paths from self-hosted `LANGSMITH_ENDPOINT` configurations when collecting evidence.

Systematic debugging

Operations

Replaces trial-and-error debugging with a hypothesis-driven process: state a falsifiable hypothesis, construct the smallest possible reproduction, and verify evidence before touching code. This structured approach is most valuable during production incidents, flaky CI builds, and confusing regressions where intuition-led debugging wastes hours on correlated but non-causal symptoms.

Incident response

Operations

Structured process for handling production incidents from detection to resolution and post-mortem. Covers severity assessment using P0-P3 grading, team coordination with a designated incident commander, communication templates for stakeholders and users, and structured post-mortem requirements to drive organizational learning from every significant outage.