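A structured approach to diagnosing production incidents, covering log triage, metric spike correlation, deploy-window filtering, and safe reproduction steps.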
Use cases
- Service returning 5xx errors
- Memory leak suspected
- Latency spike in production
Key steps
- Check error rate and latency dashboards for the affected service — identify the spike window
- Filter logs by service, severity, and time window — look for exceptions or unusual patterns
- Correlate the incident with recent deployments: which version is running, was there a deploy in the last hour?
- Check dependent service health — upstream failures often cascade downstream
- Identify a safe minimal reproduction: a curl or small script that reproduces the failure without side effects
- Open a targeted fix branch — never debug directly in production
- After the fix is deployed, verify that metrics return to baseline before marking the incident resolved
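
The triage steps above can be sketched in a few small functions. This is a minimal illustration, not a definitive implementation: it assumes metrics arrive as `(timestamp, error_rate)` pairs sorted by time, logs as simple records, and deploys as `(timestamp, version)` pairs. The names `LogEntry`, `find_spike_window`, `filter_logs`, `deploys_before`, and `back_to_baseline`, along with the threshold and tolerance values, are hypothetical and do not refer to any particular monitoring tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class LogEntry:
    # Hypothetical log record shape; real log schemas will differ.
    timestamp: datetime
    service: str
    severity: str
    message: str


def find_spike_window(samples, threshold):
    """Return (start, end) of the first contiguous run above threshold, or None.

    samples: list of (timestamp, error_rate) pairs sorted by timestamp.
    """
    start = end = None
    for ts, rate in samples:
        if rate > threshold:
            if start is None:
                start = ts
            end = ts
        elif start is not None:
            break  # the first spike has ended
    return (start, end) if start is not None else None


def filter_logs(entries, service, severities, window):
    """Keep entries for one service, with a matching severity, inside the window."""
    start, end = window
    return [
        e for e in entries
        if e.service == service
        and e.severity in severities
        and start <= e.timestamp <= end
    ]


def deploys_before(deploys, spike_start, lookback=timedelta(hours=1)):
    """Return deploys that landed within `lookback` before the spike began.

    deploys: list of (timestamp, version) pairs.
    """
    return [
        (ts, version) for ts, version in deploys
        if spike_start - lookback <= ts <= spike_start
    ]


def back_to_baseline(samples, baseline, tolerance=0.1):
    """True if every sample is at most `tolerance` (fractional) above baseline."""
    return all(rate <= baseline * (1 + tolerance) for _, rate in samples)
```

A typical flow would call `find_spike_window` first, pass the resulting window to `filter_logs`, pass the window start to `deploys_before`, and run `back_to_baseline` on post-fix samples before closing the incident. Keeping each step a pure function over in-memory data means the checks can be rehearsed on exported data without touching production.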
Related recommendations
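LangSmith production trace troubleshooting playbook
Turns the LangSmith observability documentation into a repeatable production-incident workflow: start from a failed run or thread, use the UI or MCP (fetch_runs, get_thread_history) to reconstruct the prompt, tool calls, and errors, then narrow the scope with the FQL filters described in the documentation. It emphasizes pagination by character budget and distinguishes Cloud OAuth Remote MCP from a self-hosted LANGSMITH_ENDPOINT.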
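Systematic debugging
Replaces guesswork with a loop of hypothesis, verification, and minimal reproduction; suited to production incidents, flaky builds, and hard-to-reproduce behavioral regressions.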
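Production incident emergency response
A structured process for handling production incidents from detection to resolution, covering severity assessment, team coordination, communication templates, and postmortem requirements.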