Diagnoses live production incidents using log triage, metric spike correlation, deploy window filtering, and safe reproduction steps without causing further disruption. Production debugging applies systematic debugging principles in a live environment where the cost of wrong actions is high and the ability to reproduce the issue is limited.
Use cases
- A service suddenly returning 5xx errors for a percentage of requests with no code deploy in the last hour
- Suspected memory leak where heap usage grows gradually over days until the service restarts or degrades
- Latency spike in production where p99 latency doubled for a specific endpoint without an obvious cause
- An intermittent failure that occurs during peak traffic but not during off-peak, suggesting a resource contention issue
- A third-party dependency degradation causing cascading failures in your service when your upstream provider is slow or returning errors
Key features
- Check error rate and latency dashboards for the affected service, identifying the spike window and which endpoints or operations are degraded
- Filter logs by service, severity, and time window, looking for exception patterns, unusual error types, or messages that appear only during the incident window
- Correlate the incident with recent deployments: verify which version is running, whether there was a deploy in the last few hours, and what the baseline metrics looked like before the deploy
- Check upstream and downstream service health: upstream failures often cascade downstream, and a latency spike in a dependency can manifest as an error in your service
- Identify a minimal reproduction: a single curl request or small script that reproduces the failure without side effects, allowing you to verify the fix before deploying
- Open a targeted fix branch rather than patching production directly, apply the minimal fix, verify that the reproduction no longer triggers the failure, then follow the standard deployment process with canary monitoring
- Confirm metrics return to baseline after deployment before marking the incident resolved and filing the follow-up investigation report
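The log-filtering and deploy-correlation steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production tool: the `LogEntry` shape, the helper names (`errors_in_window`, `incident_only_messages`, `deploys_near`), the three-hour lookback, and the sample data are all hypothetical, and a real setup would query a log platform and deploy history instead of in-memory lists.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class LogEntry:
    # Hypothetical log record shape; real log schemas will differ.
    timestamp: datetime
    service: str
    severity: str
    message: str


def errors_in_window(entries, service, start, end, severities=("ERROR", "FATAL")):
    """Keep entries for one service, at the given severities, inside [start, end]."""
    return [
        e for e in entries
        if e.service == service
        and e.severity in severities
        and start <= e.timestamp <= end
    ]


def incident_only_messages(entries, service, start, end):
    """Count error messages seen inside the window but never outside it."""
    inside = Counter(e.message for e in errors_in_window(entries, service, start, end))
    outside = {
        e.message for e in entries
        if e.service == service and not (start <= e.timestamp <= end)
    }
    return {msg: n for msg, n in inside.items() if msg not in outside}


def deploys_near(deploys, incident_start, lookback=timedelta(hours=3)):
    """Return (version, deployed_at) pairs that landed within `lookback` before the incident."""
    return [
        (version, at) for version, at in deploys
        if incident_start - lookback <= at <= incident_start
    ]


# Hypothetical sample data for illustration only.
t0 = datetime(2024, 1, 1, 12, 0)
logs = [
    LogEntry(t0 - timedelta(hours=2), "checkout", "ERROR", "cache miss retry"),
    LogEntry(t0 + timedelta(minutes=1), "checkout", "ERROR", "cache miss retry"),
    LogEntry(t0 + timedelta(minutes=2), "checkout", "ERROR", "pool exhausted"),
    LogEntry(t0 + timedelta(minutes=3), "checkout", "ERROR", "pool exhausted"),
    LogEntry(t0 + timedelta(minutes=3), "search", "ERROR", "pool exhausted"),
    LogEntry(t0 + timedelta(minutes=4), "checkout", "INFO", "request ok"),
]
deploys = [("v41", t0 - timedelta(days=1)), ("v42", t0 - timedelta(minutes=30))]

window_end = t0 + timedelta(minutes=10)
new_errors = incident_only_messages(logs, "checkout", t0, window_end)
suspect_deploys = deploys_near(deploys, t0)
print(new_errors, suspect_deploys)
```

The sketch compares messages by exact string, so messages that embed request IDs or timestamps would need to be normalized first for the "only during the incident" comparison to be meaningful.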
When to Use This Skill
- When a production alert fires and you need to diagnose the root cause rapidly without causing additional disruption
- When a service is degrading gradually and you need to identify the trigger before it becomes a full outage
- When a user-reported issue cannot be reproduced in staging and needs investigation in the production environment
Expected Output
A root cause analysis with evidence from logs and metrics, a verified minimal reproduction, a deployed fix, and a confirmation that metrics returned to baseline.
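One way to make the "metrics returned to baseline" confirmation explicit is a small comparison of pre-incident and post-deploy samples. The sketch below is an assumption-laden illustration: the function name `returned_to_baseline`, the 10% relative tolerance, the absolute floor, and the sample error rates are hypothetical and should be replaced by whatever thresholds your team already uses.

```python
from statistics import mean


def returned_to_baseline(baseline, post_deploy, tolerance=0.10, abs_floor=0.001):
    """Return True when the post-deploy mean is no worse than the baseline mean
    by more than `tolerance` (relative) or `abs_floor` (absolute), whichever is larger.

    Both arguments are lists of samples of a "lower is better" metric,
    such as error rate or p99 latency.
    """
    if not baseline or not post_deploy:
        raise ValueError("both sample lists must be non-empty")
    base = mean(baseline)
    current = mean(post_deploy)
    # The absolute floor keeps the check usable when the baseline is near zero.
    allowed = max(base * (1 + tolerance), base + abs_floor)
    return current <= allowed


# Hypothetical error-rate samples (fraction of failed requests per minute).
pre_incident = [0.010, 0.012, 0.011]
during_incident = [0.080, 0.090]
after_fix = [0.011, 0.010, 0.012]

still_degraded = not returned_to_baseline(pre_incident, during_incident)
recovered = returned_to_baseline(pre_incident, after_fix)
print(still_degraded, recovered)
```

Comparing means is the simplest possible check; for latency it is usually more informative to run the same comparison on p99 samples, since a mean can hide a degraded tail.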
Frequently Asked Questions
- What is the most important rule of production debugging?
- Never make changes directly on production. Open a fix branch, verify the fix in a non-production environment, and deploy through the standard pipeline with monitoring. The only exception is emergency mitigation (rollback, feature flag disable) that restores service while investigation continues.
- How do I debug when I cannot identify a deploy as the trigger?
- Check upstream dependencies (database, cache, third-party APIs), scheduled jobs (cron tasks, batch jobs), traffic spikes (DDoS, viral content), and infrastructure events (auto-scaling, instance replacement). Look for any change, not just code deploys.
- What if multiple services are affected simultaneously?
- This typically points to a shared dependency: a shared database, a caching layer, a load balancer, or an upstream service that multiple services call. Investigate the shared infrastructure before investigating individual services.
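When several services degrade at once, the shared-dependency check can be sketched as a set intersection over a dependency map. The service names, the map itself, and the helper `shared_dependencies` are hypothetical; in practice the map would come from a service catalog or tracing data rather than a hand-written dictionary.

```python
def shared_dependencies(dependency_map, affected_services):
    """Return the sorted dependencies common to every affected service."""
    dependency_sets = [set(dependency_map[name]) for name in affected_services]
    if not dependency_sets:
        return []
    return sorted(set.intersection(*dependency_sets))


# Hypothetical dependency map for illustration only.
dependency_map = {
    "checkout": ["postgres-main", "redis-cache", "payments-api"],
    "search": ["redis-cache", "elasticsearch"],
    "profile": ["postgres-main", "redis-cache"],
}

candidates = shared_dependencies(dependency_map, ["checkout", "search", "profile"])
print(candidates)
```

An empty result does not rule out shared infrastructure: load balancers, networks, and hosts are often missing from application-level dependency maps, so they may need to be checked separately.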
Related
LangSmith production trace investigation playbook
Turns LangSmith observability documentation into a repeatable incident workflow for LLM and agent outages: start from a failing run ID or thread, use the UI or LangSmith MCP tools (`fetch_runs`, `get_thread_history`) to reconstruct prompts, tool calls, and errors, then narrow scope with documented filters (run_type, is_root, FQL `filter` / `trace_filter` / `tree_filter`) before proposing code or prompt changes. The playbook cites official pagination rules (character-budget pages with `page_number` and `total_pages`) so investigators do not assume single-shot dumps, and it reminds teams to separate Cloud OAuth Remote MCP paths from self-hosted `LANGSMITH_ENDPOINT` configurations when collecting evidence.
Systematic debugging
Replaces trial-and-error debugging with a hypothesis-driven process: state a falsifiable hypothesis, construct the smallest possible reproduction, and verify evidence before touching code. This structured approach is most valuable during production incidents, flaky CI builds, and confusing regressions where intuition-led debugging wastes hours on correlated but non-causal symptoms.
Incident response
Structured process for handling production incidents from detection to resolution and post-mortem. Covers severity assessment using P0-P3 grading, team coordination with a designated incident commander, communication templates for stakeholders and users, and structured post-mortem requirements to drive organizational learning from every significant outage.