A structured approach to diagnosing live production incidents without causing further damage — covers log triage, metric spike correlation, deploy window filtering, and safe reproduction steps.
Use cases
- Service returning 5xx errors
- Memory leak suspected
- Latency spike in production
Key features
- Check error rate and latency dashboards for the affected service — identify the spike window
- Filter logs by service, severity, and time window — look for exceptions or unusual patterns
- Correlate the incident with recent deployments: check which version is running and whether a deploy went out in the last hour
- Check dependent service health — upstream failures often cascade downstream
- Identify a safe minimal reproduction: a curl or small script that reproduces the failure without side effects
- Open a targeted fix branch — never debug directly in production
- After the fix, verify metrics return to baseline before marking the incident resolved
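The log-filtering step above can be sketched in Python. The structured log format and field names below are illustrative assumptions, not any specific logging stack:

```python
from datetime import datetime, timezone

# Hypothetical structured log lines: "timestamp service severity message"
LOGS = [
    "2024-05-01T10:00:05Z checkout ERROR upstream timeout calling payments",
    "2024-05-01T10:00:07Z checkout INFO request served in 120ms",
    "2024-05-01T10:01:12Z payments ERROR connection pool exhausted",
    "2024-05-01T10:02:30Z checkout ERROR upstream timeout calling payments",
]

def filter_logs(lines, service, min_severity, start, end):
    """Keep lines for one service, at or above a severity, inside a time window."""
    levels = {"DEBUG": 0, "INFO": 1, "WARN": 2, "ERROR": 3}
    out = []
    for line in lines:
        ts_raw, svc, sev, _msg = line.split(" ", 3)
        ts = datetime.fromisoformat(ts_raw.replace("Z", "+00:00"))
        if svc == service and levels[sev] >= levels[min_severity] and start <= ts <= end:
            out.append(line)
    return out

window_start = datetime(2024, 5, 1, 10, 0, tzinfo=timezone.utc)
window_end = datetime(2024, 5, 1, 10, 5, tzinfo=timezone.utc)
hits = filter_logs(LOGS, "checkout", "ERROR", window_start, window_end)
print(len(hits))  # 2: only the checkout ERROR lines inside the window
```

In a real incident the same filter is usually expressed as a query in the log platform; the point is to narrow by service, severity, and spike window before reading anything.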
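Correlating the spike with the deploy window can be as simple as checking whether any deploy landed shortly before the spike began. The one-hour lookback and the deploy data below are assumed for illustration:

```python
from datetime import datetime, timedelta, timezone

# Hypothetical deploy history: (version, deploy time)
DEPLOYS = [
    ("v1.42.0", datetime(2024, 5, 1, 8, 15, tzinfo=timezone.utc)),
    ("v1.43.0", datetime(2024, 5, 1, 9, 50, tzinfo=timezone.utc)),
]

def deploys_before_spike(deploys, spike_start, lookback=timedelta(hours=1)):
    """Return deploys that landed within `lookback` before the spike began."""
    return [(v, t) for v, t in deploys if spike_start - lookback <= t <= spike_start]

spike_start = datetime(2024, 5, 1, 10, 0, tzinfo=timezone.utc)
suspects = deploys_before_spike(DEPLOYS, spike_start)
print(suspects)  # only v1.43.0, deployed 10 minutes before the spike
```

A deploy inside the lookback window is a lead, not proof; confirm by comparing the running version against the suspect release.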
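The final verification step can be made numerical rather than eyeballed: compare post-fix samples against the pre-incident baseline. The error-rate samples and the three-sigma tolerance here are assumed values:

```python
from statistics import mean, stdev

# Hypothetical per-minute error rates (%): pre-incident baseline and post-fix samples
baseline = [0.2, 0.3, 0.25, 0.2, 0.3]
post_fix = [0.35, 0.3, 0.25]

def back_to_baseline(baseline, recent, n_sigma=3.0):
    """True if every recent sample sits within n_sigma of the baseline mean."""
    mu, sigma = mean(baseline), stdev(baseline)
    return all(abs(x - mu) <= n_sigma * sigma for x in recent)

print(back_to_baseline(baseline, post_fix))  # True: metrics are back in range
```

Only once this holds across the dashboards that showed the spike should the incident be marked resolved.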