Skills

Translates LiteLLM routing documentation into a pre-flight checklist to run before promoting multi-deployment LLM routes to production. Teams verify that the Router configuration covers primary and fallback model lists, retry policies, and the load-balancing strategy documented at docs.litellm.ai/docs/routing; confirm proxy virtual keys and spend limits if traffic flows through LiteLLM Proxy; and rehearse provider outage drills using OpenAI-mapped exceptions (AuthenticationError, RateLimitError, APIError). The skill also points operators to enable `store_model_in_db` when MCP tools must persist alongside router definitions and to validate that MCP server names comply with the SEP-986 guidance referenced in the LiteLLM v1.80.18 release notes.
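A minimal pre-flight sketch, assuming the Router configuration is available as a plain dictionary. The key names (`model_list`, `model_name`, `fallbacks`, `num_retries`, `routing_strategy`) are assumed to match the LiteLLM routing documentation and should be checked against the version in use; the function name `preflight_router_config` and the sample group names are hypothetical.

```python
from typing import Any


def preflight_router_config(config: dict[str, Any]) -> list[str]:
    """Return human-readable problems; an empty list means every check passed."""
    problems: list[str] = []

    # Count deployments per public model group name.
    groups: dict[str, int] = {}
    for deployment in config.get("model_list") or []:
        name = deployment.get("model_name")
        if not name:
            problems.append("deployment without model_name")
            continue
        groups[name] = groups.get(name, 0) + 1
    if not groups:
        problems.append("model_list is empty")

    # Each fallback entry is assumed to map one group name to a list of group names.
    fallbacks = config.get("fallbacks") or []
    if not fallbacks:
        problems.append("no fallbacks configured")
    for entry in fallbacks:
        for primary, targets in entry.items():
            for name in [primary, *targets]:
                if name not in groups:
                    problems.append(f"fallback references unknown model group: {name}")

    retries = config.get("num_retries")
    if not isinstance(retries, int) or retries < 0:
        problems.append("num_retries should be a non-negative integer")
    if not config.get("routing_strategy"):
        problems.append("routing_strategy is not set explicitly")
    return problems
```

Running the check in CI against the same dictionary that is later passed to the Router keeps the checklist and the deployed configuration from drifting apart.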

LangSmith production trace investigation playbook

Debugging

Turns LangSmith observability documentation into a repeatable incident workflow for LLM and agent outages: start from a failing run ID or thread, use the UI or LangSmith MCP tools (`fetch_runs`, `get_thread_history`) to reconstruct prompts, tool calls, and errors, then narrow scope with documented filters (run_type, is_root, FQL `filter` / `trace_filter` / `tree_filter`) before proposing code or prompt changes. The playbook cites official pagination rules (character-budget pages with `page_number` and `total_pages`) so investigators do not assume single-shot dumps, and it reminds teams to separate Cloud OAuth Remote MCP paths from self-hosted `LANGSMITH_ENDPOINT` configurations when collecting evidence.
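The pagination rule can be sketched as a loop that reads `total_pages` from the first response instead of assuming a single-shot dump. The `fetch_page` callable and the `result` key are hypothetical stand-ins for whatever the actual tool call returns; only `page_number` and `total_pages` come from the description above.

```python
from typing import Callable


def collect_all_pages(fetch_page: Callable[[int], dict]) -> list[str]:
    """Fetch page 1, read total_pages from it, then request the remaining pages."""
    first = fetch_page(1)
    chunks = [first["result"]]
    for page_number in range(2, first["total_pages"] + 1):
        page = fetch_page(page_number)
        chunks.append(page["result"])
    return chunks


def fake_fetch(page_number: int) -> dict:
    """Stand-in for a paginated tool call; serves three fixed pages."""
    pages = ["run A ...", "run B ...", "run C ..."]
    return {
        "page_number": page_number,
        "total_pages": len(pages),
        "result": pages[page_number - 1],
    }
```

Calling `collect_all_pages(fake_fetch)` returns all three page bodies in order, which is the behavior an investigator needs before drawing conclusions from a trace.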

OWASP GenAI LLM Top 10 (v1.1) threat review checklist

Security

Maps the authoritative OWASP "Top 10 for Large Language Model Applications" (version 1.1) taxonomy, LLM01 Prompt Injection through LLM10 Model Theft, into an actionable readiness checklist for architects red-teaming Retrieval-Augmented Generation, agents, plugins, training pipelines, or hosted inference gateways. Official project pages summarize each risk: prompt injection bypassing safeguards (LLM01), unchecked outputs enabling downstream exploits (LLM02), poisoned corpora distorting reasoning (LLM03), abusive workloads starving capacity (LLM04), brittle supply-chain dependencies (LLM05), sensitive data resurfacing inside generations (LLM06), insecurely designed plugins (LLM07), agents granted excessive privilege or autonomy (LLM08), misplaced trust in model output producing compliance failures (LLM09), and loss of proprietary model weights via API abuse (LLM10). The skill pairs each category with tangible controls (policy, monitoring, toolchain limits) anchored to genai.owasp.org releases rather than anecdotes.

Postmortem trigger and root-cause taxonomy

Distills Appendix C (“Results of Postmortem Analysis”) of Google’s Site Reliability Workbook. It explains why Google catalogs standardized postmortem fields, separating the observable trigger of an outage from its deeper root-cause category, so reliability leaders can prioritize systemic fixes over one-off patches. The appendix draws on a multi-year corpus (labeled 2010–2017 in the workbook) in which binary pushes accounted for roughly 37% of outage triggers and configuration pushes about 31%, with smaller slices for user-behavior spikes, pipelines, upstream providers, performance decay, capacity, and hardware. A companion table attributes outages to root-cause categories such as faulty software (~41%), development-process gaps (~20%), emergent complexity (~17%), deployment planning weaknesses (~7%), and network failures (~3%). Teams use these distributions to check whether their own incident mix skews differently and to steer investment toward the failure classes that have historically dominated.
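The sanity check described above can be sketched as a comparison between a team's own trigger counts and the two largest shares quoted from the appendix. The function name, the trigger labels, and the sample data in the usage note are hypothetical; the baseline figures are the rounded percentages cited above.

```python
from collections import Counter

# Approximate trigger shares quoted above from the workbook appendix.
WORKBOOK_TRIGGER_SHARE = {"binary push": 0.37, "configuration push": 0.31}


def trigger_skew(incident_triggers: list[str]) -> dict[str, float]:
    """Return (team share minus workbook share) for each trigger with a known baseline."""
    total = len(incident_triggers)
    if total == 0:
        return {}
    counts = Counter(incident_triggers)
    return {
        trigger: round(counts[trigger] / total - baseline, 2)
        for trigger, baseline in WORKBOOK_TRIGGER_SHARE.items()
    }
```

A team with ten incidents, six of them triggered by configuration pushes, would see a positive skew of about 0.29 for that trigger, a signal that configuration rollout safety deserves attention first.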

Example SLO document authoring

Operationalizes Appendix A of Google’s Site Reliability Workbook by translating the illustrative “Example Game Service” SLO document into a checklist teams can mimic: describe the user-facing workload; choose a rolling measurement window (the appendix uses four weeks); pair each subsystem with tightly defined SLIs (availability measured at the load balancer, counting any non-5xx response as successful; latency percentile thresholds; freshness for derived tables; correctness via probers; completeness for pipelines); state each SLI with explicit numerator/denominator language; justify rounding policies; quantify per-objective error budgets; and reference the sibling error budget policy for enforcement.
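The numerator/denominator framing and the per-objective error budget can be made concrete with a small calculation. This is a sketch under stated assumptions: the function names and the sample target in the usage note are hypothetical, and the status-code list stands in for load balancer metrics collected over the measurement window.

```python
def availability_sli(status_codes: list[int]) -> float:
    """Good requests (any status outside 500-599) divided by all requests."""
    if not status_codes:
        raise ValueError("no requests in the measurement window")
    good = sum(1 for code in status_codes if not 500 <= code <= 599)
    return good / len(status_codes)


def error_budget_remaining(sli: float, slo_target: float) -> float:
    """Fraction of the error budget still unspent; a negative value means the SLO is missed."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    allowed_failure = 1.0 - slo_target
    actual_failure = 1.0 - sli
    return 1.0 - actual_failure / allowed_failure
```

With 98 good responses out of 100 and a hypothetical 97% target, two thirds of the budget is spent and one third remains.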

Error budget policy drafting

Translates Google’s worked example error-budget policy into a repeatable playbook for tying release tempo to measured reliability: define goals (protect users from repeated SLO misses while preserving innovation incentives), spell out what happens when the rolling window consumes its budget (freeze changes except urgent defects or security work), codify outage investigation thresholds, and document escalation paths when stakeholders disagree about budget math.
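The freeze rule can be sketched as a single gate function. The change kinds (`feature`, `urgent_defect`, `security`) are hypothetical labels for the exemptions named above, not terms taken from the policy itself.

```python
from dataclasses import dataclass


@dataclass
class Change:
    description: str
    kind: str  # hypothetical labels: "feature", "urgent_defect", "security"


def may_release(change: Change, budget_remaining: float) -> bool:
    """Allow any change while budget remains; once it is spent, allow only exempt kinds."""
    if budget_remaining > 0:
        return True
    return change.kind in {"urgent_defect", "security"}
```

Wiring a gate like this into the release pipeline turns the written policy into a check that runs on every change rather than a document consulted after the fact.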

NIST AI Risk Management Framework (AI RMF 1.0) lifecycle checklist

Planning

Anchors facilitation workshops to NIST's voluntary Artificial Intelligence Risk Management Framework (AI RMF 1.0, formally NIST.AI.100-1 with DOI https://doi.org/10.6028/NIST.AI.100-1). The playbook issued alongside the Framework emphasizes structuring programs around the four mutually reinforcing core functions (GOVERN, MAP, MEASURE, and MANAGE, with GOVERN cutting across the other three) rather than improvising unrelated security tickets. NIST also publishes companion assets such as the Trustworthy and Responsible AI Resource Center playbook (airc.nist.gov), a roadmap, crosswalks, and, for generative workloads, the Generative Artificial Intelligence Profile (NIST AI 600-1, July 26, 2024, DOI https://doi.org/10.6028/NIST.AI.600-1), so teams can reconcile novel failure modes against documented categories of trustworthiness. This operational skill folds those authoritative layers into scripted prompts for cross-functional councils that must produce evidence of documentation, escalation paths, quantitative trustworthiness analyses, prioritized mitigations, and alignment with externally referenced stakeholder expectations, not marketing slides.

Creating and maintaining Cursor skills

Defines how to author, revise, and validate SKILL.md files so agent skills stay executable, scoped, and testable. It focuses on turning vague know-how into reusable operational instructions with clear triggers, deterministic steps, and verification checks.

Designing with LLM structured outputs

This skill covers when and how to ask an LLM for machine-readable payloads: define a JSON Schema (or the vendor's equivalent), enable the structured-output feature your provider documents, validate responses in application code, and handle refusals or validation errors explicitly. It applies to tool-calling agents, extraction pipelines, configuration emitters, and any workflow where brittle text parsing creates production risk.
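The validate-and-handle-refusal step can be sketched with the standard library alone. This is a simplified stand-in for full JSON Schema validation: the expected fields, the exception name, and the `refusal` argument are hypothetical, and a real integration would read the refusal signal from wherever the provider documents it.

```python
import json
from typing import Optional

# Hypothetical expected shape: field name -> required Python type.
EXPECTED_FIELDS = {"invoice_id": str, "total": float, "currency": str}


class StructuredOutputError(Exception):
    """Raised when the reply is a refusal or fails validation."""


def parse_reply(raw_text: Optional[str], refusal: Optional[str] = None) -> dict:
    """Parse and validate a model reply, failing loudly instead of guessing."""
    if refusal:
        raise StructuredOutputError(f"model refused: {refusal}")
    if raw_text is None:
        raise StructuredOutputError("empty reply")
    try:
        payload = json.loads(raw_text)
    except json.JSONDecodeError as exc:
        raise StructuredOutputError(f"not valid JSON: {exc}") from exc
    if not isinstance(payload, dict):
        raise StructuredOutputError("top-level value must be an object")
    for field, expected_type in EXPECTED_FIELDS.items():
        if field not in payload:
            raise StructuredOutputError(f"missing field: {field}")
        value = payload[field]
        # JSON integers are accepted where a float is expected; booleans are not.
        if expected_type is float and isinstance(value, int) and not isinstance(value, bool):
            value = float(value)
        if not isinstance(value, expected_type):
            raise StructuredOutputError(f"wrong type for {field}")
        payload[field] = value
    return payload
```

Raising a dedicated exception keeps refusals and malformed payloads on an explicit error path, so callers can retry, escalate, or log them instead of passing partial data downstream.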

Maintaining Cursor Project Rules

Follow Cursor's official Rules documentation when you want persistent Agent guidance tied to a repository. Project rules encode architecture expectations, risky-folder guardrails, or repeatable workflows; Cursor applies them via Always Apply, intelligent relevance, glob-scoped attachments, or manual @mentions. Use .mdc frontmatter for finer control and reference templates with @file instead of pasting large snippets.

Structured AI meeting notes

Converts raw meeting transcripts into structured, actionable notes with decision logs, assigned action items, and key context preserved for future AI retrieval. This skill bridges the gap between what was discussed in a meeting and what AI agents need to know when acting on outcomes days or weeks later.

Incident response

Structured process for handling production incidents from detection to resolution and post-mortem. Covers severity assessment using P0-P3 grading, team coordination with a designated incident commander, communication templates for stakeholders and users, and structured post-mortem requirements to drive organizational learning from every significant outage.

Context-Aware QA Skill

Context-Aware QA is a prompting technique in which an AI model is instructed to retrieve and cite authoritative sources before answering factual questions. By combining retrieval-augmented generation (RAG) with explicit verification instructions, it is intended to reduce hallucinations in production AI systems.

Production debugging

Debugging

Diagnoses live production incidents using log triage, metric spike correlation, deploy window filtering, and safe reproduction steps without causing further disruption. Production debugging applies systematic debugging principles in a live environment where the cost of wrong actions is high and the ability to reproduce the issue is limited.

Safe dependency upgrades

Maintenance

A structured checklist for upgrading dependencies managed by npm, pip, Cargo, or similar package managers without breaking production. It covers changelog analysis, semver risk assessment, lockfile handling, and smoke testing so that routine dependency updates do not become sources of production incidents.
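The semver risk assessment can be sketched as a classifier over plain `X.Y.Z` version strings. The function name is hypothetical, pre-release and build suffixes are deliberately not handled, and treating a minor bump on a `0.x` release as major is a conservative assumption rather than a rule from any one package manager.

```python
def semver_risk(current: str, target: str) -> str:
    """Classify an upgrade as 'major', 'minor', 'patch', or 'none'."""
    cur = [int(part) for part in current.split(".")]
    new = [int(part) for part in target.split(".")]
    if len(cur) != 3 or len(new) != 3:
        raise ValueError("expected versions in X.Y.Z form")
    if new[0] != cur[0]:
        return "major"
    if new[1] != cur[1]:
        # Assumption: 0.x releases may break on a minor bump, so flag them as major.
        return "major" if cur[0] == 0 else "minor"
    if new[2] != cur[2]:
        return "patch"
    return "none"
```

The classification only sets the depth of review: a "major" result calls for reading the changelog and migration notes, while a "patch" result may need only the smoke tests.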

RAG pipeline construction

Builds production-ready retrieval-augmented generation pipelines with deliberate chunking strategies, embedding model selection, vector store configuration, hybrid search blending, and reranking so agents answer from your documents with reduced hallucination and cited sources. This skill focuses on the engineering decisions that separate a working prototype from a production-quality RAG system.
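One common way to blend keyword and vector results is reciprocal rank fusion, sketched below. It is offered as an illustration of hybrid search blending, not as the method this skill prescribes; the function name and document ids are hypothetical, and `k = 60` is a conventional default rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked id lists; each list contributes 1 / (k + rank) per document."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties are broken by document id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
```

Because the fusion works on ranks rather than raw scores, it avoids having to normalize keyword scores against cosine similarities before combining them.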

Multi-agent handoff design

Planning

Designs clean handoff protocols between specialized agents so work passes between planner, coder, reviewer, and executor agents without losing context, creating circular dependencies, or introducing race conditions. Handoff design treats agent-to-agent communication as an API contract with versioning, error handling, and explicit acknowledgment requirements.
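Treating a handoff as an API contract can be sketched with a versioned message type and an explicit acknowledgment. All names here (`Handoff`, `Ack`, `receive`, the role strings, the supported version set) are hypothetical and only illustrate the shape of such a contract.

```python
from dataclasses import dataclass, field

SUPPORTED_SCHEMA_VERSIONS = {1}  # hypothetical contract versions this receiver understands


@dataclass(frozen=True)
class Handoff:
    schema_version: int
    sender: str
    receiver: str
    task: str
    context: dict = field(default_factory=dict)


@dataclass(frozen=True)
class Ack:
    accepted: bool
    reason: str = ""


def receive(handoff: Handoff, my_role: str) -> Ack:
    """Validate the contract and always answer with an explicit acknowledgment."""
    if handoff.schema_version not in SUPPORTED_SCHEMA_VERSIONS:
        return Ack(False, f"unsupported schema_version {handoff.schema_version}")
    if handoff.receiver != my_role:
        return Ack(False, f"addressed to {handoff.receiver}, not {my_role}")
    if not handoff.task.strip():
        return Ack(False, "empty task")
    return Ack(True)
```

A rejected acknowledgment gives the sending agent a concrete reason to act on, which is what distinguishes a contract from a fire-and-forget message.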

Documentation from code

Extracts architecture decisions, API contracts, and usage patterns directly from code to produce accurate documentation that stays in sync with implementation. Documentation-from-code treats code as the source of truth and generates prose from it rather than maintaining documentation as a separate artifact that diverges over time.

SEO audit for web properties

Diagnoses indexing, crawlability, and on-page SEO issues across an entire site using automated crawls, Lighthouse checks, and structured output. An SEO audit surfaces actionable findings ranked by priority before manual review, making it possible to address critical issues quickly rather than discovering them through traffic drops.

Agentic workflow design

Planning

Structures multi-step agent tasks with explicit inputs, outputs, fallback behavior, and handoff protocols so agents reliably complete complex workflows instead of stopping at the first blocker. Agentic workflow design applies software engineering discipline to AI agent pipelines, treating each step as a function with typed inputs and outputs.
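The idea of a step as a function with typed inputs, typed outputs, and a fallback can be sketched as follows. The helper `run_step` and the two example functions are hypothetical; the broad exception handler is a deliberate simplification for illustration.

```python
from typing import Callable, TypeVar

In = TypeVar("In")
Out = TypeVar("Out")


def run_step(
    step: Callable[[In], Out],
    fallback: Callable[[In, Exception], Out],
    value: In,
) -> Out:
    """Run one workflow step; on any failure, pass the input and error to the fallback."""
    try:
        return step(value)
    except Exception as exc:  # broad on purpose: any blocker triggers the fallback
        return fallback(value, exc)


def parse_port(raw: str) -> int:
    """Hypothetical step: typed input (str) to typed output (int)."""
    port = int(raw)
    if not 0 < port < 65536:
        raise ValueError(f"port out of range: {port}")
    return port


def default_port(raw: str, error: Exception) -> int:
    """Hypothetical fallback: use a safe default instead of stopping the workflow."""
    return 8080
```

Because the fallback returns the same output type as the step, the next step in the workflow does not need to know which path produced its input.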

Codebase indexing

Builds and maintains semantic indexes of a codebase so AI coding assistants can retrieve relevant context (file relationships, symbol usage, historical decisions) without re-parsing the whole repository on every query. Indexing matters most for large codebases that do not fit in the model's context window.

AI product requirement writing

Writes product requirements documents that AI agents can act on reliably, with explicit constraints, edge cases, and acceptance criteria that minimize the gap between what you mean and what the agent builds. This skill bridges the ambiguity of natural language product specs and the precision that AI agents require to produce consistent results.

Security review for AI-generated code

Security

Reviews AI-generated code for security failure modes that AI assistants commonly miss: prompt injection risks, credential exposure, dependency vulnerabilities, insecure deserialization, and access control gaps. This skill catches what agents miss when they optimize for functionality over safety, especially in code that handles user input, authentication, or external data.

Fine-tuning preparation

Curates, deduplicates, and formats training datasets for fine-tuning so that the resulting model actually improves on target behaviors rather than learning noise. Fine-tuning preparation covers dataset quality filtering, output format consistency, train/test splits, and avoiding common pitfalls like data leakage that invalidate fine-tuning results.

Evaluation and benchmarking

Builds evaluation suites with ground-truth answers, automated scoring, and regression detection so you can measure whether model or prompt changes actually improve outcomes before shipping. Without systematic evaluation, teams ship changes that seem better anecdotally but may degrade specific edge cases silently.
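Automated scoring and regression detection can be sketched with exact-match comparison against ground truth. The function names and the sample questions in the usage note are hypothetical, and exact match is only one of many possible scoring rules.

```python
def is_correct(prediction: str, expected: str) -> bool:
    """Case-insensitive exact match after trimming whitespace."""
    return prediction.strip().lower() == expected.strip().lower()


def accuracy(predictions: dict[str, str], ground_truth: dict[str, str]) -> float:
    """Share of ground-truth cases answered correctly; missing predictions count as wrong."""
    if not ground_truth:
        raise ValueError("ground truth is empty")
    hits = sum(
        is_correct(predictions.get(case_id, ""), expected)
        for case_id, expected in ground_truth.items()
    )
    return hits / len(ground_truth)


def regressions(
    baseline: dict[str, str], candidate: dict[str, str], ground_truth: dict[str, str]
) -> list[str]:
    """Case ids the baseline answered correctly but the candidate gets wrong."""
    return sorted(
        case_id
        for case_id, expected in ground_truth.items()
        if is_correct(baseline.get(case_id, ""), expected)
        and not is_correct(candidate.get(case_id, ""), expected)
    )
```

A candidate can raise overall accuracy while still breaking a case the baseline handled, which is why the regression list is reported alongside the aggregate score.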

Multi-agent orchestration

Coordinates multiple AI agents on shared tasks with explicit handoff protocols, shared state management, and conflict resolution so parallel work stays coherent. Multi-agent orchestration is more structured than simple parallel dispatch because agents take on distinct roles with explicit dependencies rather than running identical briefs on independent data.

AI cost optimization

Audits token usage, model selection, caching strategy, and prompt compression to prevent runaway inference costs as AI features scale. This is especially important for high-volume agentic workflows where repeated calls compound quickly, and where the gap between a well-optimized and a careless implementation can be orders of magnitude in cost.

Prompt engineering

Crafts prompts with explicit task framing, role definition, output constraints, citation requirements, and few-shot examples so model responses are consistent, grounded in evidence, and actionable for downstream tasks. Prompt engineering reduces the variability and hallucination risk that comes from under-specified prompts.

RAG implementation

Builds retrieval-augmented generation pipelines that ground model responses in your own documents rather than generic training knowledge. A RAG implementation covers document ingestion, semantic chunking, embedding, vector storage, hybrid search, reranking, and answer synthesis—so assistants answer from your data with cited sources.
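The chunking stage can be illustrated with fixed-size word windows that overlap. This is a simpler stand-in for semantic chunking, not an implementation of it; the function name and default sizes are hypothetical.

```python
def chunk_words(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into windows of chunk_size words, each sharing `overlap` words with the previous one."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks: list[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks
```

The overlap keeps a sentence that straddles a boundary retrievable from at least one chunk, at the cost of storing some words twice.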

Observability baselines

Establishes golden signals (latency, traffic, errors, saturation), SLO windows, and dashboard checks before agents automate deployments so that 'healthy' and 'degraded' have measurable definitions rather than subjective interpretations. This is essential when AI agents are managing deploys because agents need objective metrics to make decisions, not human gut feelings.

Postmortem writing

Captures the full incident timeline, blast radius, contributing factors, and concrete follow-up actions after production incidents so teams build institutional memory rather than repeating the same surprises. A well-written postmortem separates root cause from triggers, avoids blame, and produces tracked action items that prevent recurrence.

Library docs in the loop

Keeps AI assistant answers anchored to the actual library documentation, changelog, and typed signatures that are shipped rather than to memory or stale blog summaries. This is essential during major version bumps, unfamiliar SDK integration, or on-call hotfixes where confident but incorrect guesses about API behavior cause more damage than the original bug.

Contract testing

Locks API expectations between services using consumer-driven contracts so that when one team changes its implementation in a breaking way, the change fails in CI rather than during a coordinated production deployment. Contract testing prevents the common integration failure pattern where both sides of an API appear to work in isolation but break when connected in production.

Canary rollouts

Deploys a new version to a small percentage of production traffic first, monitors error budgets and latency against baseline, and automatically widens or rolls back based on pre-defined criteria. This keeps the blast radius of a bad deployment small—particularly important when AI agents are modifying deployment pipelines where a single bad command could affect many users.
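The pre-defined criteria can be sketched as a pure decision function. The thresholds (an absolute error-rate delta and a p95 latency ratio) and the parameter names are hypothetical; real criteria would come from the service's SLOs.

```python
def canary_decision(
    baseline_error_rate: float,
    canary_error_rate: float,
    baseline_p95_ms: float,
    canary_p95_ms: float,
    max_error_delta: float = 0.005,
    max_latency_ratio: float = 1.2,
) -> str:
    """Return 'rollback' if the canary is worse than baseline beyond the limits, else 'widen'."""
    if canary_error_rate - baseline_error_rate > max_error_delta:
        return "rollback"
    if canary_p95_ms > baseline_p95_ms * max_latency_ratio:
        return "rollback"
    return "widen"
```

Keeping the decision free of side effects makes it easy to test the criteria in isolation before an automated pipeline is allowed to act on them.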

Structured logging

Defines a consistent set of log fields—request ID, user ID, feature flag, latency bucket, error code—so production debugging does not rely on grep across inconsistent printf-style strings. Structured JSON or key=value logging enables dashboards, alerts, and log aggregation tools to parse and query logs programmatically rather than through manual text searching.
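A minimal sketch of JSON logging with Python's standard `logging` module, emitting the fields listed above. The class name `JsonFormatter` and the choice to emit `null` for absent fields are assumptions made for illustration.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with a fixed set of fields."""

    FIELDS = ("request_id", "user_id", "feature_flag", "latency_bucket", "error_code")

    def format(self, record: logging.LogRecord) -> str:
        entry = {"level": record.levelname, "message": record.getMessage()}
        for name in self.FIELDS:
            # Fields arrive via logger.info(..., extra={...}); absent ones become null.
            entry[name] = getattr(record, name, None)
        return json.dumps(entry, sort_keys=True)
```

Attaching the formatter to a handler and logging with `extra={"request_id": "r-1", "error_code": "E42"}` yields one parseable line per event, with every field present even when its value is unknown.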

Threat modeling

Systematically identifies threats to a system by mapping data flows, defining trust boundaries, and enumerating adversaries and misuse cases before shipping. This produces a security-focused diagram and prioritized mitigation list that makes subsequent security reviews faster and more substantive than starting from a generic checklist.

Safe refactoring

Executes refactoring changes in small, test-backed steps so behavior is preserved while structure improves. Each refactoring operation (rename, extract, inline, move) is validated by the test suite before proceeding to the next, which guards against subtle behavioral regressions that would otherwise be caught only in production.

Humanizer

Removes the common AI-generated writing patterns—significance inflation, filler -ing constructions, em-dash chains, and formulaic closers—that make machine-generated prose feel generic or overproduced. Runs a final 'still obviously AI?' audit pass before shipping any prose intended for human readers.

Performance profiling

Finds genuine performance bottlenecks using CPU profiles, flame graphs, memory traces, and system metrics under realistic load before rewriting code. This prevents the common anti-pattern of spending days optimizing code paths that are not in the critical path, based on intuition rather than measurement.
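Measuring before rewriting can be sketched with the standard library's `cProfile` and `pstats`. The helper `profile_call` and the workload `slow_sum` are hypothetical; a real investigation would profile the actual code path under realistic load.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top: int = 5) -> str:
    """Run func under cProfile and return a report of the top entries by cumulative time."""
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        func(*args)
    finally:
        profiler.disable()
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top)
    return buffer.getvalue()


def slow_sum(n: int) -> int:
    """Hypothetical workload used only to give the profiler something to measure."""
    return sum(i * i for i in range(n))
```

Calling `print(profile_call(slow_sum, 200_000))` shows where cumulative time is spent, which is the evidence to collect before deciding what to optimize.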

Chinese Humanizer

Tightens Chinese drafts by removing translationese, slogan-like endings, stacked abstractions, and stiff AI rhythm while preserving factual accuracy. This addresses the specific failure modes of machine-translated or AI-generated Chinese text: word-for-word English structures, Western rhetorical patterns that feel unnatural to Chinese readers, and filler phrases that add length without meaning.

Source verification

Checks whether a claim is backed by a primary source, a current official page, or a reputable secondary source before that claim becomes published copy. This skill is essential for AI tool directories, MCP server listings, and news summaries where accuracy and trustworthiness directly affect reader decisions and SEO credibility.

Content refresh

Runs a scheduled audit of existing tool, MCP, skill, and news entries to identify and address stale pricing, broken documentation links, outdated capabilities, and weakened prose that quietly degrades directory quality. This maintenance rhythm prevents the directory from accumulating digital rot as tools evolve and entries grow outdated.

SEO indexing check

Reviews sitemap completeness, canonical URL configuration, hreflang pairing for bilingual sites, robots.txt directives, and Search Console signals before publishing a content batch. This is especially important for bilingual static sites where indexing misconfigurations can cause search engines to index the wrong locale or deprioritize pages unfairly.

API design and versioning

Shapes REST or RPC API surfaces with consistent resource modeling, predictable error responses, paginated list endpoints, and an explicit deprecation policy before implementation locks you into contracts that are costly to change. Good API design prevents client breakage, reduces support burden, and makes feature additions less disruptive.

Requesting code review

Frames a pull request so reviewers understand the risk profile, what has been tested, and where to focus their limited attention. This produces faster, more useful reviews because reviewers spend less time reconstructing context and more time evaluating the actual changes.

Executing implementation plans

Executes a pre-written implementation plan in disciplined order, stopping at defined checkpoints to verify assumptions before moving forward. This skill prevents the common pattern of diverging from the plan silently when reality proves it wrong, and it creates natural opportunities to course-correct before small errors compound into large rework.

Writing implementation plans

Converts vague or frozen requirements into precise, step-by-step implementation plans with file-level touchpoints, decision checkpoints, and verifiable acceptance criteria before any code is written. This bridges the gap between what stakeholders want and what engineers can actually ship, reducing mid-sprint surprises and wasted refactors.

Git worktrees for isolation

Uses Git worktrees to create isolated working directories attached to the same repository, each on a different branch, so parallel experiments or long-running tasks do not interfere with the main working tree or require repeated stash-and-reapply cycles. This is especially useful when one branch requires a heavy build or test run while work continues on another.

Test-driven development

Drives development through red-green-refactor cycles where you write a failing test that names the desired behavior before writing any implementation code. TDD produces tests that document intent, catches regressions immediately, and forces small, verifiable increments—making it especially valuable for complex features, bug fixes with known failure cases, and any code that needs a long-term safety net.

Dispatching parallel agents

Distributes embarrassingly parallel work across multiple AI agents with clear briefs and crisp handoff protocols, then aggregates their results through a single integrator. This technique maximizes throughput when tasks are independent and the coordination overhead is low, making it ideal for research chunks, file batches, or parallel data processing.

Systematic debugging

Replaces trial-and-error debugging with a hypothesis-driven process: state a falsifiable hypothesis, construct the smallest possible reproduction, and verify evidence before touching code. This structured approach is most valuable during production incidents, flaky CI builds, and confusing regressions where intuition-led debugging wastes hours on correlated but non-causal symptoms.

Subagent-driven development

Coordinates multiple AI subagents on slices of a larger plan where each subagent handles a defined scope while a single parent agent retains accountability for integration, quality, and final delivery. This approach is valuable when a single agent working sequentially would be too slow, but you still need coherent end-to-end quality rather than fragmented outputs.

Image generation

Design

Creates or edits bitmap artwork for covers, concept mockups, and rapid visual exploration when the deliverable requires photographic quality, complex textures, or artistic styles that are impractical to hand-code in SVG or CSS. Image generation accelerates the early design phase by producing concrete visual references before committing to a final style.

Finishing a development branch

Systematically closes out a development branch by running verification, cleaning up the commit history, pushing with proper tracking, and making an explicit choice between merge, squash, or follow-up tickets. This prevents the common pattern of abandoned branches, stale PRs, and lost context when work is not deliberately concluded.

Plugin scaffolding

Bootstraps a complete plugin project structure with manifest files, entry points, configuration schemas, and baseline tests so new Codex or editor extensions follow a consistent, reviewable template from day one. This eliminates the setup tax for creating new plugins and ensures every plugin in a codebase shares the same conventions for configuration, logging, and error handling.

Brainstorming before build

Explores goals, constraints, risks, and design options before committing to a specific implementation path. This technique is most valuable when facing product or UX decisions where the wrong choice is expensive to reverse—new features with uncertain user value, architectural pivots, or cross-functional dependencies where each team has a different mental model of the problem.

Frontend design

Design

Creates production-grade UI layouts and components with deliberate spacing, typography hierarchy, color application, and motion design so the interface communicates structure and state clearly. This skill is applied when building new UI sections, redesigning existing pages, or establishing component patterns that need to feel intentional and cohesive rather than defaults from a component library.

Verify before you ship

Runs the minimal set of checks—tests, builds, manual verifications, or environment-specific validations—that confirm a task is truly complete before it is marked done. This practice prevents the common pattern where 'done' means 'written' rather than 'working in production,' and it creates a shared definition of completion across the team.

OpenAI documentation lookup

Prioritizes official OpenAI documentation, model cards, and API references when researching integration details, model capabilities, or API behavior changes. This avoids the noise and staleness of third-party blog posts that may summarize older model versions or incomplete information.

Receiving code review