LangSmith production trace investigation playbook

Skill for LangSmith / LangChain

Turns LangSmith observability documentation into a repeatable incident workflow for LLM and agent outages. Start from a failing run ID or thread, use the UI or the LangSmith MCP tools (`fetch_runs`, `get_thread_history`) to reconstruct prompts, tool calls, and errors, then narrow scope with documented filters (`run_type`, `is_root`, and the FQL parameters `filter`, `trace_filter`, and `tree_filter`) before proposing code or prompt changes. The playbook cites the official pagination rules (character-budget pages with `page_number` and `total_pages`) so investigators do not assume a single call returns everything, and it reminds teams to keep Cloud OAuth Remote MCP paths separate from self-hosted `LANGSMITH_ENDPOINT` configurations when collecting evidence.

Category Debugging

Platform LangSmith / LangChain

Published 2026-05-19

langsmith tracing debugging

Use cases

On-call receives a spike in 5xx responses or empty completions from a RAG route backed by LangChain
Product reports a single customer thread where the assistant contradicted policy mid-conversation
Release candidate shows higher p95 latency after a prompt swap without obvious infra regression
Security review asks for evidence that tool calls stayed within approved scopes during an agent run
Finance questions the billed trace usage after a marketing campaign drives traffic

Key features

Capture identifiers: project name, run UUID, thread ID, deployment version, and approximate timestamp window from the ticket.
Pull the root run with `fetch_runs` (set `is_root`, pass `limit`, and use documented FQL filters) or open the equivalent trace in the LangSmith UI.
If payloads are truncated, page with `get_thread_history` or with `fetch_runs` plus `trace_id`, incrementing `page_number` until `total_pages` is exhausted.
Map the failure to a layer (retrieval miss, tool schema mismatch, model refusal, rate limit, or downstream HTTP error) and cite child spans rather than guessing.
Compare against the last known-good prompt revision via `get_prompt_by_name` or prompt hub history before editing production templates.
Record mitigations, owners, and whether a dataset example or online eval should guard the regression per LangSmith evaluation docs.
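The root-run query and the layer-mapping step above can be sketched as two small helpers. This is a minimal illustration, not the MCP server's own client: the argument name `project_name`, the exact value types, and the FQL time-window string are assumptions to check against the LangSmith docs, and the run dictionaries are assumed to carry `error` and `start_time` fields.

```python
from datetime import datetime, timezone


def build_fetch_runs_args(project_name, start, end, limit=20):
    """Assemble arguments for a root-run query over the ticket's time window.

    `is_root`, `limit`, and `filter` are the parameters named in the steps;
    `project_name` and the FQL expression below are assumptions.
    `start` and `end` should be timezone-aware datetimes.
    """
    def iso(ts):
        return ts.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")

    window = f'and(gt(start_time, "{iso(start)}"), lt(start_time, "{iso(end)}"))'
    return {
        "project_name": project_name,
        "is_root": True,
        "limit": limit,
        "filter": window,
    }


def first_failing_span(runs):
    """Return the earliest run with a non-empty `error`, or None.

    Assumes every `start_time` uses the same sortable format
    (for example ISO 8601 strings in UTC).
    """
    failed = [r for r in runs if r.get("error")]
    return min(failed, key=lambda r: r["start_time"]) if failed else None
```

Citing the earliest failing child span keeps the note anchored to evidence: a later tool error is often a consequence of an earlier retrieval or model failure, not the cause.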

When to Use This Skill

Production LLM regressions where unstructured application logs lack prompt/tool detail
Handoffs between support and engineering that already standardize on LangSmith projects
Postmortems requiring trace-backed timelines instead of chat screenshots

Expected Output

An incident note with run links, paginated evidence excerpts, hypothesized root cause tied to specific spans, and a verification plan (eval dataset or canary) before closing.
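One way to keep such a note honest is to model it as a small data structure that reports which required parts are still missing. The sketch below is illustrative only: the class names, field names, and rendering layout are hypothetical, not a LangSmith format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Evidence:
    """One paginated excerpt; page fields echo the tool response."""
    run_url: str
    page_number: int
    total_pages: int
    excerpt: str


@dataclass
class IncidentNote:
    summary: str
    root_cause: str         # hypothesized root cause
    span_name: str          # child span the hypothesis is tied to
    verification_plan: str  # for example an eval dataset or a canary
    evidence: List[Evidence] = field(default_factory=list)

    def missing_parts(self) -> List[str]:
        """List the required parts that are still empty."""
        checks = {
            "evidence": self.evidence,
            "root_cause": self.root_cause.strip(),
            "span_name": self.span_name.strip(),
            "verification_plan": self.verification_plan.strip(),
        }
        return [name for name, value in checks.items() if not value]

    def render(self) -> str:
        """Render the note as plain text for a ticket or postmortem."""
        lines = [
            f"Incident: {self.summary}",
            f"Hypothesized root cause: {self.root_cause} (span: {self.span_name})",
            "Evidence:",
        ]
        for e in self.evidence:
            lines.append(
                f"  {e.run_url} [page {e.page_number}/{e.total_pages}] {e.excerpt}"
            )
        lines.append(f"Verification plan: {self.verification_plan}")
        return "\n".join(lines)
```

A team could refuse to close the incident while `missing_parts()` returns anything, which enforces the "verification plan before closing" requirement mechanically.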

Frequently Asked Questions

Do we need the MCP server for every investigation? No. The UI suffices for many cases; MCP speeds up agent-assisted triage when Cursor or Claude Code already has LangSmith credentials configured per the docs.
Why emphasize character pagination? The LangSmith MCP docs cap each page by a character budget (about 25k by default) to keep assistants within context limits, so skipping pages loses tool outputs.
Does this replace OWASP or threat modeling skills? No. It focuses on observability forensics; pair it with security skills when investigating prompt injection or data exfiltration hypotheses.
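The pagination point can be made concrete with a loop that keeps requesting pages until `total_pages` is reached. This is a sketch under stated assumptions: `call_tool` is a hypothetical callable standing in for whatever MCP client is in use, the `result` key is an assumed name for the page content, and `make_fake_tool` exists only to simulate a character-budget server locally.

```python
def fetch_all_pages(call_tool, tool_name, arguments, max_pages=100):
    """Collect every page of a paginated tool response.

    `call_tool(tool_name, arguments)` is a hypothetical callable that is
    assumed to return a dict containing `total_pages` and the page content.
    `max_pages` is a safety cap against a misbehaving server.
    """
    pages = []
    page_number = 1
    while True:
        response = call_tool(tool_name, {**arguments, "page_number": page_number})
        pages.append(response)
        total_pages = int(response.get("total_pages", 1))
        if page_number >= total_pages or page_number >= max_pages:
            break
        page_number += 1
    return pages


def make_fake_tool(payload, budget):
    """Simulate a server that splits `payload` into character-budget pages."""
    chunks = [payload[i:i + budget] for i in range(0, len(payload), budget)] or [""]

    def call_tool(tool_name, arguments):
        n = arguments["page_number"]
        return {"page_number": n, "total_pages": len(chunks), "result": chunks[n - 1]}

    return call_tool


# Usage: a 60-character payload with a 25-character budget spans three pages.
fake = make_fake_tool("x" * 60, budget=25)
pages = fetch_all_pages(fake, "get_thread_history", {"thread_id": "demo-thread"})
full_text = "".join(p["result"] for p in pages)
```

Reading only the first page here would return 25 of the 60 characters, which is the failure mode the FAQ warns about.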

