Turns LangSmith observability documentation into a repeatable incident workflow for LLM and agent outages. Start from a failing run ID or thread, use the UI or the LangSmith MCP tools (`fetch_runs`, `get_thread_history`) to reconstruct prompts, tool calls, and errors, then narrow scope with documented filters (`run_type`, `is_root`, and the FQL `filter`, `trace_filter`, and `tree_filter` parameters) before proposing code or prompt changes. The playbook cites the official pagination rules (character-budget pages addressed by `page_number` and `total_pages`) so investigators do not assume a single response holds the full trace, and it reminds teams to keep Cloud OAuth Remote MCP setups separate from self-hosted `LANGSMITH_ENDPOINT` configurations when collecting evidence.
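FQL filters are passed to LangSmith as strings, so it can help to assemble a query once and reuse it across `filter`, `trace_filter`, or `tree_filter`. The sketch below assumes comparator syntax of the form `eq(field, value)` combined with `and(...)`; the helpers `eq`, `gt`, and `and_` are hypothetical, the field names in the example (`run_type`, `status`) are illustrative, and operator names and value formats should be checked against the LangSmith filtering docs before use.

```python
import json
from typing import Union

# Hypothetical helpers that render FQL-style comparator strings.
# Syntax is an assumption modeled on LangSmith's documented style; verify it.
Value = Union[str, int, float, bool]


def _literal(value: Value) -> str:
    # json.dumps gives "text" for strings, true/false for booleans, bare numbers.
    return json.dumps(value)


def eq(field: str, value: Value) -> str:
    return f"eq({field}, {_literal(value)})"


def gt(field: str, value: Value) -> str:
    return f"gt({field}, {_literal(value)})"


def and_(*clauses: str) -> str:
    if not clauses:
        raise ValueError("and_ needs at least one clause")
    if len(clauses) == 1:
        return clauses[0]
    return f"and({', '.join(clauses)})"


# Example: LLM runs that errored (field names are illustrative assumptions).
llm_errors = and_(eq("run_type", "llm"), eq("status", "error"))
print(llm_errors)  # and(eq(run_type, "llm"), eq(status, "error"))
```

Quoting values through `json.dumps` keeps string escaping consistent; whether the LangSmith parser accepts every JSON-style escape is itself an assumption worth testing on a real project.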
Use cases
- On-call receives a spike in 5xx or empty completions from a RAG route backed by LangChain
- Product reports a single customer thread where the assistant contradicted policy mid-conversation
- Release candidate shows higher p95 latency after a prompt swap without obvious infra regression
- Security review asks for evidence that tool calls stayed within approved scopes during an agent run
- Finance questions trace billing usage after a marketing campaign drives traffic
Key features
- Capture identifiers: project name, run UUID, thread ID, deployment version, and approximate timestamp window from the ticket.
- Pull the root run with `fetch_runs` (set `is_root`, pass `limit`, and use documented FQL filters) or open the equivalent trace in the LangSmith UI.
- If payloads are truncated, page through them with `get_thread_history` or with `fetch_runs` plus a `trace_id`, incrementing `page_number` until you have read every page reported by `total_pages`.
- Map the failure to a layer (retrieval miss, tool schema mismatch, model refusal, rate limit, or downstream HTTP error), citing the child spans that show it rather than guessing.
- Compare against the last known-good prompt revision via `get_prompt_by_name` or prompt hub history before editing production templates.
- Record mitigations, owners, and whether a dataset example or online eval should guard the regression per LangSmith evaluation docs.
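Comparing against the last known-good prompt revision is easiest as a line diff of the two template strings. This sketch uses Python's standard `difflib.unified_diff` and assumes both revisions have already been retrieved as plain text (for example via `get_prompt_by_name` or the prompt hub history); the `prompt_diff` helper, the template text, and the revision labels are hypothetical.

```python
import difflib
from typing import List


def prompt_diff(known_good: str, candidate: str,
                good_label: str = "known-good",
                candidate_label: str = "candidate") -> List[str]:
    """Return unified-diff lines between two prompt template revisions."""
    return list(
        difflib.unified_diff(
            known_good.splitlines(),
            candidate.splitlines(),
            fromfile=good_label,
            tofile=candidate_label,
            lineterm="",  # splitlines() already removed the newlines
        )
    )


# Hypothetical revisions: the candidate dropped the policy guardrail line.
good = "You are a support assistant.\nFollow the refund policy strictly.\nCite sources."
candidate = "You are a support assistant.\nCite sources."

for line in prompt_diff(good, candidate):
    print(line)
```

A removed guardrail shows up as a `-` line, which gives the incident note a concrete artifact to cite before anyone edits the production template.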
When to Use This Skill
- Production LLM regressions where unstructured application logs lack prompt/tool detail
- Handoffs between support and engineering that already standardize on LangSmith projects
- Postmortems requiring trace-backed timelines instead of chat screenshots
Expected Output
An incident note with run links, paginated evidence excerpts, hypothesized root cause tied to specific spans, and a verification plan (eval dataset or canary) before closing.
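To keep the hypothesized root cause tied to specific spans, the layer mapping from the feature list can be expressed as rules over child-span records that emit one evidence line per errored span. In this minimal sketch the span dict shape (`id`, `name`, `run_type`, `error`), the keyword rules, and the line format are all hypothetical; real LangSmith runs expose richer fields, and rules should be tuned to your own error strings. Refusals and retrieval misses usually appear in outputs rather than error fields, so this error-only sketch leaves them for manual review.

```python
from typing import Dict, List, Optional

# Hypothetical keyword rules mapping error text to a failure layer.
# The first matching rule wins, so more specific layers come first.
LAYER_RULES = [
    ("rate limit", ["429", "rate limit"]),
    ("tool schema mismatch", ["validation error", "schema"]),
    ("downstream HTTP error", ["502", "503", "504", "http"]),
]


def classify_span(span: Dict[str, str]) -> Optional[str]:
    """Return a failure layer for an errored span, or None if it has no error."""
    error = (span.get("error") or "").lower()
    if not error:
        return None
    for layer, needles in LAYER_RULES:
        if any(needle in error for needle in needles):
            return layer
    return "unclassified"  # a human needs to read this span


def evidence_lines(spans: List[Dict[str, str]]) -> List[str]:
    """Render one incident-note line per errored span, citing its span ID."""
    lines = []
    for span in spans:
        layer = classify_span(span)
        if layer is not None:
            lines.append(f"- {layer}: span {span['id']} ({span['name']})")
    return lines


spans = [
    {"id": "a1", "name": "retriever", "run_type": "retriever", "error": ""},
    {"id": "b2", "name": "ChatModel", "run_type": "llm",
     "error": "Error code: 429 - rate limit reached"},
]
for line in evidence_lines(spans):
    print(line)  # - rate limit: span b2 (ChatModel)
```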
Frequently Asked Questions
- Do we need the MCP server for every investigation?
- No. The UI suffices for many cases; MCP accelerates agent-assisted triage when Cursor or Claude Code already has LangSmith credentials configured as described in the LangSmith docs.
- Why emphasize character pagination?
- According to the LangSmith MCP docs, the server caps each page by a character budget (~25k characters by default) to keep assistants within context limits, so skipping pages silently drops tool outputs.
- Does this replace OWASP or threat modeling skills?
- No. It focuses on observability forensics; pair it with security skills when investigating prompt injection or data exfiltration hypotheses.
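The character-budget point follows from simple arithmetic: a payload of L characters split into pages of at most B characters needs ceil(L / B) pages, and any page you skip is a slice of the trace you never see. The sketch below pairs a simulated server (`make_page`, hypothetical; it slices on raw character offsets, whereas the real MCP server may split on run or message boundaries) with a client loop that requests `page_number` 1 through `total_pages`. The `{"content", "total_pages"}` response shape and 1-indexed page numbers are assumptions.

```python
import math
from typing import Callable, Dict, List, Union

Page = Dict[str, Union[str, int]]
DEFAULT_BUDGET = 25_000  # assumption: mirrors the ~25k default cited above


def make_page(text: str, page_number: int, budget: int = DEFAULT_BUDGET) -> Page:
    """Simulate a server that slices `text` into pages of at most `budget` chars."""
    total_pages = max(1, math.ceil(len(text) / budget))
    start = (page_number - 1) * budget  # assumption: page_number starts at 1
    return {"content": text[start:start + budget], "total_pages": total_pages}


def collect_all_pages(fetch_page: Callable[[int], Page]) -> List[str]:
    """Client loop: request pages 1..total_pages and keep them in order."""
    chunks: List[str] = []
    page_number, total_pages = 1, 1
    while page_number <= total_pages:
        page = fetch_page(page_number)
        total_pages = int(page["total_pages"])  # trust the server's count
        chunks.append(str(page["content"]))
        page_number += 1
    return chunks


payload = "x" * 60_000
print(make_page(payload, 1)["total_pages"])  # 3
pages = collect_all_pages(lambda n: make_page(payload, n))
print(len(pages), sum(len(p) for p in pages))  # 3 60000
# Stopping after page 1 would drop 35,000 of the 60,000 characters.
```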
Related
Production debugging
Diagnoses live production incidents using log triage, metric spike correlation, deploy window filtering, and safe reproduction steps without causing further disruption. Production debugging applies systematic debugging principles in a live environment where the cost of wrong actions is high and the ability to reproduce the issue is limited.
Designing with LLM structured outputs
This skill covers when and how to ask an LLM for machine-readable payloads: define a JSON Schema (or the vendor's equivalent), enable the structured-output feature your provider documents, validate responses in application code, and handle refusals or validation errors explicitly. It applies to tool-calling agents, extraction pipelines, configuration emitters, and any workflow where brittle text parsing creates production risk.
Postmortem writing
Captures the full incident timeline, blast radius, contributing factors, and concrete follow-up actions after production incidents so teams build institutional memory rather than repeating the same surprises. A well-written postmortem separates root cause from triggers, avoids blame, and produces tracked action items that prevent recurrence.