A structured approach to diagnosing live production incidents without causing further damage — covers log triage, metric spike correlation, deploy window filtering, and safe reproduction steps.
Use cases
- Service returning 5xx errors
- Memory leak suspected
- Latency spike in production
Key features
- Check error rate and latency dashboards for the affected service — identify the spike window
- Filter logs by service, severity, and time window — look for exceptions or unusual patterns
- Correlate the incident with recent deployments: check which version is running and whether a deploy went out in the last hour
- Check dependent service health — upstream failures often cascade downstream
- Identify a safe minimal reproduction: a curl or small script that reproduces the failure without side effects
- Open a targeted fix branch — never debug directly in production
- After the fix, verify metrics return to baseline before marking the incident resolved
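The log-filtering step above can be sketched in Python. The structured log format and field names below are illustrative assumptions, not any specific logging stack:

```python
from datetime import datetime, timezone

# Hypothetical structured log lines: "timestamp service severity message"
LOGS = [
    "2024-05-01T10:00:05Z checkout ERROR upstream timeout calling payments",
    "2024-05-01T10:00:07Z checkout INFO request served in 120ms",
    "2024-05-01T10:01:12Z payments ERROR connection pool exhausted",
    "2024-05-01T10:02:30Z checkout ERROR upstream timeout calling payments",
]

def filter_logs(lines, service, min_severity, start, end):
    """Keep lines for one service, at or above a severity, inside a time window."""
    levels = {"DEBUG": 0, "INFO": 1, "WARN": 2, "ERROR": 3}
    out = []
    for line in lines:
        ts_raw, svc, sev, _msg = line.split(" ", 3)
        ts = datetime.fromisoformat(ts_raw.replace("Z", "+00:00"))
        if svc == service and levels[sev] >= levels[min_severity] and start <= ts <= end:
            out.append(line)
    return out

window_start = datetime(2024, 5, 1, 10, 0, tzinfo=timezone.utc)
window_end = datetime(2024, 5, 1, 10, 5, tzinfo=timezone.utc)
hits = filter_logs(LOGS, "checkout", "ERROR", window_start, window_end)
print(len(hits))  # 2: only the checkout ERROR lines inside the window
```

In a real incident the same filter is usually expressed as a query in the log platform; the point is to narrow by service, severity, and spike window before reading anything.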
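Correlating the spike with the deploy window can be as simple as checking whether any deploy landed shortly before the spike began. The one-hour lookback and the deploy data below are assumed for illustration:

```python
from datetime import datetime, timedelta, timezone

# Hypothetical deploy history: (version, deploy time)
DEPLOYS = [
    ("v1.42.0", datetime(2024, 5, 1, 8, 15, tzinfo=timezone.utc)),
    ("v1.43.0", datetime(2024, 5, 1, 9, 50, tzinfo=timezone.utc)),
]

def deploys_before_spike(deploys, spike_start, lookback=timedelta(hours=1)):
    """Return deploys that landed within `lookback` before the spike began."""
    return [(v, t) for v, t in deploys if spike_start - lookback <= t <= spike_start]

spike_start = datetime(2024, 5, 1, 10, 0, tzinfo=timezone.utc)
suspects = deploys_before_spike(DEPLOYS, spike_start)
print(suspects)  # only v1.43.0, deployed 10 minutes before the spike
```

A deploy inside the lookback window is a lead, not proof; confirm by comparing the running version against the suspect release.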
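The final verification step can be made numerical rather than eyeballed: compare post-fix samples against the pre-incident baseline. The error-rate samples and the three-sigma tolerance here are assumed values:

```python
from statistics import mean, stdev

# Hypothetical per-minute error rates (%): pre-incident baseline and post-fix samples
baseline = [0.2, 0.3, 0.25, 0.2, 0.3]
post_fix = [0.35, 0.3, 0.25]

def back_to_baseline(baseline, recent, n_sigma=3.0):
    """True if every recent sample sits within n_sigma of the baseline mean."""
    mu, sigma = mean(baseline), stdev(baseline)
    return all(abs(x - mu) <= n_sigma * sigma for x in recent)

print(back_to_baseline(baseline, post_fix))  # True: metrics are back in range
```

Only once this holds across the dashboards that showed the spike should the incident be marked resolved.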