Skills / Category

Operations

Browse skills related to Operations.

Multi-region LLM provider readiness review

Operations

Converts export-control and multi-vendor routing guidance into a planning checklist for teams that cannot assume a single geography or chip supplier will stay available. Practitioners document primary and contingency model routes (including gateways such as Helicone or LiteLLM Router configs), quantify revenue or latency exposure if a region is blocked, and set investor/customer messaging when leadership advises to "expect nothing" from a market—as publicly reported when semiconductor vendors discuss China licensing uncertainty. The skill cross-checks legal/compliance sign-off, drills failover to alternate regions or domestic stacks, and records evidence before production launches tied to geopolitically sensitive deployments.

LiteLLM Router fallback readiness review

Operations

Translates LiteLLM routing documentation into a pre-flight checklist before promoting multi-deployment LLM routes to production. Teams verify Router configuration covers primary and fallback model lists, retry policies, and load-balancing strategy documented at docs.litellm.ai/docs/routing, confirm proxy virtual keys and spend limits if traffic flows through LiteLLM Proxy, and rehearse provider outage drills using OpenAI-mapped exceptions (AuthenticationError, RateLimitError, APIError). The skill also points operators to enable `store_model_in_db` when MCP tools must persist alongside router definitions and to validate MCP server names comply with SEP-986 guidance referenced in LiteLLM v1.80.18 release notes.

Postmortem trigger and root-cause taxonomy

Operations

Distills Appendix C (“Results of Postmortem Analysis”) from Google’s SRE workbook: it explains why Google catalogs standardized postmortem fields—linking outages to observable triggers versus deeper root-cause categories—so reliability leaders can prioritize systemic fixes rather than anecdotal fixes. The appendix cites a multi-year corpus (labeled 2010–2017 in the workbook) highlighting that binary pushes accounted for roughly 37% of outage triggers while configuration pushes were about 31%, with additional slices for user-behavior spikes, pipelines, upstream providers, performance decay, capacity, and hardware. A companion table correlates outages with qualitative root causes such as faulty software (~41%), development-process gaps (~20%), emergent complexity (~17%), deployment planning weaknesses (~7%), and network failures (~3%). Teams use these distributions to sanity-check whether their incident queues skew differently and to steer investment into the failure classes that statistically dominate historically.

Example SLO document authoring

Operations

Operationalizes Appendix A from Google’s SRE workbook by translating the illustrative “Example Game Service” SLO dossier into a checklist teams can mimic: articulate the user-facing workload, nominate rolling measurement windows (the appendix uses four weeks), pair each subsystem with tightly defined SLIs (availability from load balancers excluding 5xx, latency percentile gates, freshness for derived tables, correctness via probers, completeness for pipelines), cite explicit numerator/denominator language, rationalize rounding policies, quantify per-objective error budgets, and cite the sibling error budget policy for enforcement.

Error budget policy drafting

Operations

Translates Google’s worked example error-budget policy into a repeatable playbook for tying release tempo to measured reliability: define goals (protect users from repeated SLO misses while preserving innovation incentives), spell out what happens when the rolling window consumes its budget (freeze changes except urgent defects or security work), codify outage investigation thresholds, and document escalation paths when stakeholders disagree about budget math.

Incident response

Operations

Structured process for handling production incidents from detection to resolution and post-mortem. Covers severity assessment using P0-P3 grading, team coordination with a designated incident commander, communication templates for stakeholders and users, and structured post-mortem requirements to drive organizational learning from every significant outage.

SEO audit for web properties

Operations

Diagnoses indexing, crawlability, and on-page SEO issues across an entire site using automated crawls, Lighthouse checks, and structured output. An SEO audit surfaces actionable findings ranked by priority before manual review, making it possible to address critical issues quickly rather than discovering them through traffic drops.

Evaluation and benchmarking

Operations

Builds evaluation suites with ground-truth answers, automated scoring, and regression detection so you can measure whether model or prompt changes actually improve outcomes before shipping. Without systematic evaluation, teams ship changes that seem better anecdotally but may degrade specific edge cases silently.

AI cost optimization

Operations

Audits token usage, model selection, caching strategy, and prompt compression to prevent runaway inference costs as AI features scale. This is especially important for high-volume agentic workflows where repeated calls compound quickly, and where the gap between a well-optimized and a careless implementation can be orders of magnitude in cost.

Observability baselines

Operations

Establishes golden signals (latency, traffic, errors, saturation), SLO windows, and dashboard checks before agents automate deployments so that 'healthy' and 'degraded' have measurable definitions rather than subjective interpretations. This is essential when AI agents are managing deploys because agents need objective metrics to make decisions, not human gut feelings.

Postmortem writing

Operations

Captures the full incident timeline, blast radius, contributing factors, and concrete follow-up actions after production incidents so teams build institutional memory rather than repeating the same surprises. A well-written postmortem separates root cause from triggers, avoids blame, and produces tracked action items that prevent recurrence.

Canary rollouts

Operations

Deploys a new version to a small percentage of production traffic first, monitors error budgets and latency against baseline, and automatically widens or rolls back based on pre-defined criteria. This keeps the blast radius of a bad deployment small—particularly important when AI agents are modifying deployment pipelines where a single bad command could affect many users.

Structured logging

Operations

Defines a consistent set of log fields—request ID, user ID, feature flag, latency bucket, error code—so production debugging does not rely on grep across inconsistent printf-style strings. Structured JSON or key=value logging enables dashboards, alerts, and log aggregation tools to parse and query logs programmatically rather than through manual text searching.

Performance profiling

Operations

Finds genuine performance bottlenecks using CPU profiles, flame graphs, memory traces, and system metrics under realistic load before rewriting code. This prevents the common anti-pattern of spending days optimizing code paths that are not in the critical path, based on intuition rather than measurement.

Content refresh

Operations

Runs a scheduled audit of existing tool, MCP, skill, and news entries to identify and address stale pricing, broken documentation links, outdated capabilities, and weakened prose that quietly degrades directory quality. This maintenance rhythm prevents the directory from accumulating digital rot as tools evolve and entries grow outdated.

SEO indexing check

Operations

Reviews sitemap completeness, canonical URL configuration, hreflang pairing for bilingual sites, robots.txt directives, and Search Console signals before publishing a content batch. This is especially important for bilingual static sites where indexing misconfigurations can cause search engines to index the wrong locale or deprioritize pages unfairly.

Git worktrees for isolation

Operations

Uses Git worktrees to create isolated working directories attached to the same repository, each on a different branch, so parallel experiments or long-running tasks do not interfere with the main working tree or require repeated stash-and-reapply cycles. This is especially useful when one branch requires a heavy build or test run while work continues on another.

Systematic debugging

Operations

Replaces trial-and-error debugging with a hypothesis-driven process: state a falsifiable hypothesis, construct the smallest possible reproduction, and verify evidence before touching code. This structured approach is most valuable during production incidents, flaky CI builds, and confusing regressions where intuition-led debugging wastes hours on correlated but non-causal symptoms.

Finishing a development branch

Operations

Systematically closes out a development branch by running verification, cleaning up the commit history, pushing with proper tracking, and making an explicit choice between merge, squash, or follow-up tickets. This prevents the common pattern of abandoned branches, stale PRs, and lost context when work is not deliberately concluded.

Verify before you ship

Operations

Runs the minimal set of checks—tests, builds, manual verifications, or environment-specific validations—that confirm a task is truly complete before it is marked done. This practice prevents the common pattern where 'done' means 'written' rather than 'working in production,' and it creates a shared definition of completion across the team.