Containerized inference microservices with OpenAI-compatible APIs for LLMs and more

NVIDIA NIM documents performance-optimized inference microservices at docs.api.nvidia.com/nim and docs.nvidia.com/nim that expose industry-standard APIs (OpenAI-compatible `/v1/chat/completions`, `/v1/completions`, `/v1/responses`, Anthropic-compatible `/v1/messages`) from containerized models backed by TensorRT-LLM, vLLM, or SGLang per deployment. Teams can self-host GPU-accelerated models on cloud, data center, or RTX workstations, or prototype via NVIDIA-hosted NIM API endpoints through the Developer Program. Management endpoints such as `/v1/health/ready` and `/v1/metrics` support readiness probes and Prometheus metrics on self-hosted containers per the LLM API reference.

Category Developer Tools

Pricing Developer Program hosted APIs for prototyping; NVIDIA AI Enterprise for production self-host (see nvidia.com/nim)

Platforms Web / API / Docker / Kubernetes

inferencegpucontainers

Use cases

Swap OpenAI client `base_url` to a local NIM container for on-prem LLM inference
Deploy Kubernetes-scaled NIM microservices with Prometheus metrics
Prototype models on NVIDIA-hosted endpoints before self-hosting under AI Enterprise
Run Anthropic-style `/v1/messages` against a NIM LLM container
Pick TRT-LLM or vLLM engines per GPU infrastructure per deployment guides

Key features

OpenAI-compatible chat, completion, and responses endpoints per NIM LLM reference
Anthropic-compatible `/v1/messages` routing through vLLM backend per docs
Self-hosted containers with `/v1/health/live` and `/v1/health/ready` probes
Catalog spans LLMs, vision, speech, and other use cases on docs.api.nvidia.com/nim
Hosted NIM APIs via Developer Program for unlimited prototyping (FAQ)

Who Is It For?

ML engineers shipping GPU inference on NVIDIA hardware
Platform teams standardizing on OpenAI-compatible internal gateways
Developers evaluating self-host vs hosted NIM endpoints

Frequently Asked Questions

How is NIM different from calling NVIDIA cloud APIs directly?: NIM packages optimized inference containers and standard APIs; you can self-host the same microservice pattern or use hosted dev endpoints per NVIDIA docs.
Which API style should I use?: Docs list OpenAI-compatible `/v1/chat/completions` for most chat apps and `/v1/messages` when you need Anthropic-compatible clients.
Do I need NVIDIA AI Enterprise?: Developer Program covers prototyping on hosted APIs; production self-host typically requires NVIDIA AI Enterprise licensing per product pages.

3 Indexed items

CoreWeave

Developer ToolsUsage-based GPU infer…

CoreWeave documents inference products at docs.coreweave.com/products/inference spanning Serverless, Dedicated (BYOW on H100/B200/A100-class GPUs), and CKS options, all exposing OpenAI API-compatible endpoints per the inference introduction. The Inference API at api.coreweave.com (v1alpha1) manages gateways, deployments, and capacity claims over REST/JSON, gRPC, or Connect with Bearer tokens requiring Inference Viewer or Inference Admin roles. Getting-started guides walk through gateway creation with IAM authentication, body-based routing on the model field, and chat completion requests against deployed weights in CoreWeave Object Storage.

Baseten

Developer ToolsUsage-based inference…

Baseten documents a training and inference platform at docs.baseten.co where teams deploy models via the open-source Truss framework or call hosted Model APIs without standing up infrastructure. Config-only Truss deployments point at Hugging Face checkpoints, select GPU resources, and engines such as TensorRT-LLM; `truss push` builds optimized containers and exposes OpenAI-compatible sync endpoints like `https://model-{model_id}.api.baseten.co/environments/production/sync/v1`. Custom architectures use a Truss `Model` class with `load` and `predict` in `model.py`. Model APIs provide immediate OpenAI-SDK-style access to catalog models (DeepSeek, Qwen, GLM, and others listed in docs) using `BASETEN_API_KEY`.

AssemblyAI

Developer ToolsPay-as-you-go per aud…

AssemblyAI documents Voice AI APIs at assemblyai.com/docs where developers transcribe and analyze audio via REST at `https://api.assemblyai.com` and real-time WebSockets at `wss://streaming.assemblyai.com` (EU pre-recorded host `api.eu.assemblyai.com` per cloud residency docs). Pre-recorded transcription requires an explicit `speech_models` array on every `POST /v2/transcript` request—docs recommend `universal-3-pro` with `universal-2` fallback for 99-language coverage. The platform also publishes a Voice Agent API for speech-to-speech agents, Speech Understanding features (diarization, sentiment, summarization), Guardrails, and an LLM Gateway to run frontier models on transcripts.

NVIDIA NIM