Skill Entry

Production debugging

A structured approach to diagnosing live production incidents without causing further damage. It covers log triage, metric spike correlation, deploy window filtering, and safe reproduction steps.

Category: debugging
Platform: Codex / Claude Code
Published: 2026-04-25
Tags: debugging, production, incident-response

Use cases

  • Service returning 5xx errors
  • Memory leak suspected
  • Latency spike in production

Key features

  • Check the error rate and latency dashboards for the affected service and identify the spike window
  • Filter logs by service, severity, and time window, looking for exceptions or unusual patterns
  • Correlate the incident with recent deployments: which version is running, and was there a deploy in the last hour?
  • Check the health of dependent services, since upstream failures often cascade downstream
  • Identify a safe minimal reproduction: a curl command or small script that reproduces the failure without side effects
  • Open a targeted fix branch; never debug directly in production
  • After the fix, verify that metrics return to baseline before marking the incident resolved
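
The log filtering, deploy correlation, and baseline check above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `LogEntry` and `Deploy` record shapes, the severity names, the one-hour lookback default, and the 10% tolerance are all hypothetical and not part of this skill entry. A real setup would query its own logging and deploy tooling instead of in-memory lists.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical record shapes; real log and deploy systems will differ.
@dataclass
class LogEntry:
    timestamp: datetime
    service: str
    severity: str
    message: str

@dataclass
class Deploy:
    timestamp: datetime
    service: str
    version: str

# Assumed severity ordering; unknown severities rank below everything.
SEVERITY_RANK = {"DEBUG": 0, "INFO": 1, "WARNING": 2, "ERROR": 3, "CRITICAL": 4}

def filter_logs(entries, service, min_severity, start, end):
    """Keep entries for one service, at or above min_severity, within [start, end]."""
    threshold = SEVERITY_RANK[min_severity]
    return [
        e for e in entries
        if e.service == service
        and SEVERITY_RANK.get(e.severity, -1) >= threshold
        and start <= e.timestamp <= end
    ]

def deploys_before_spike(deploys, service, spike_start, lookback=timedelta(hours=1)):
    """Return deploys of the service within `lookback` before the spike, newest first."""
    window_start = spike_start - lookback
    hits = [
        d for d in deploys
        if d.service == service and window_start <= d.timestamp <= spike_start
    ]
    return sorted(hits, key=lambda d: d.timestamp, reverse=True)

def back_to_baseline(recent_error_rates, baseline, tolerance=0.1):
    """True if there are samples and each is at most `tolerance` above the baseline."""
    limit = baseline * (1 + tolerance)
    return bool(recent_error_rates) and all(r <= limit for r in recent_error_rates)
```

A typical flow would call `filter_logs` with the spike window read off the dashboards, pass the window start to `deploys_before_spike` to see whether a recent release is a suspect, and use `back_to_baseline` on post-fix samples before closing the incident. All three functions only read their inputs, which keeps the triage itself free of side effects.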
