A structured approach to diagnosing live production incidents without causing further damage — covers log triage, metric spike correlation, deploy window filtering, and safe reproduction steps.
Use cases
- Service returning 5xx errors
- Memory leak suspected
- Latency spike in production
Key features
- Check error rate and latency dashboards for the affected service — identify the spike window
- Filter logs by service, severity, and time window — look for exceptions or unusual patterns
- Correlate the incident with recent deployments: which version is running, and was there a deploy in the last hour?
- Check dependent service health — upstream failures often cascade downstream
- Identify a safe minimal reproduction: a curl or small script that reproduces the failure without side effects
- Open a targeted fix branch — never debug directly in production
- After the fix, verify that metrics return to baseline before marking the incident resolved
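The triage steps above can be sketched in code. This is a minimal illustration, not a real log backend: the record shape, service names, and the one-hour deploy lookback are all hypothetical assumptions for the example.

```python
from datetime import datetime, timedelta

# Hypothetical in-memory log records; real triage would query your log backend.
LOGS = [
    {"ts": datetime(2024, 5, 1, 12, 1), "service": "checkout", "severity": "ERROR", "msg": "upstream timeout"},
    {"ts": datetime(2024, 5, 1, 12, 3), "service": "checkout", "severity": "INFO", "msg": "request ok"},
    {"ts": datetime(2024, 5, 1, 11, 0), "service": "search", "severity": "ERROR", "msg": "index rebuild"},
]

def filter_logs(logs, service, severity, window_start, window_end):
    """Filter logs by service, severity, and time window."""
    return [
        r for r in logs
        if r["service"] == service
        and r["severity"] == severity
        and window_start <= r["ts"] <= window_end
    ]

def deploys_in_window(deploys, spike_start, lookback=timedelta(hours=1)):
    """Return deploys that landed in the hour before the spike began."""
    return [d for d in deploys if spike_start - lookback <= d["ts"] <= spike_start]

# Example: a 5xx spike observed between 12:00 and 12:05.
spike_start = datetime(2024, 5, 1, 12, 0)
spike_end = datetime(2024, 5, 1, 12, 5)

errors = filter_logs(LOGS, "checkout", "ERROR", spike_start, spike_end)
suspects = deploys_in_window(
    [{"ts": datetime(2024, 5, 1, 11, 40), "version": "v2.3.1"}],
    spike_start,
)
```

Here `errors` narrows the log stream to the spike window and `suspects` answers "was there a deploy in the last hour?" in one pass, so the correlation step stays read-only and safe to run against production data.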
Related
Structured logging
Defines a small set of log fields (request id, user id, feature flag, latency bucket) so production debugging does not depend on grep across inconsistent printf strings.
Observability baselines
Defines golden signals, SLO windows, and dashboard checks before agents automate deploys—so assistants know what "healthy" means instead of guessing from noisy logs.
Performance profiling
Finds real bottlenecks using traces, flame graphs, and system metrics before rewriting code—so optimizations target measured latency, not guesses.