Containerized inference microservices with OpenAI-compatible APIs for LLMs and more
NVIDIA NIM documents performance-optimized inference microservices at docs.api.nvidia.com/nim and docs.nvidia.com/nim that expose industry-standard APIs (OpenAI-compatible `/v1/chat/completions`, `/v1/completions`, `/v1/responses`, Anthropic-compatible `/v1/messages`) from containerized models backed by TensorRT-LLM, vLLM, or SGLang per deployment. Teams can self-host GPU-accelerated models on cloud, data center, or RTX workstations, or prototype via NVIDIA-hosted NIM API endpoints through the Developer Program. Management endpoints such as `/v1/health/ready` and `/v1/metrics` support readiness probes and Prometheus metrics on self-hosted containers per the LLM API reference.
Use cases
- Swap OpenAI client `base_url` to a local NIM container for on-prem LLM inference
- Deploy Kubernetes-scaled NIM microservices with Prometheus metrics
- Prototype models on NVIDIA-hosted endpoints before self-hosting under AI Enterprise
- Run Anthropic-style `/v1/messages` against a NIM LLM container
- Pick TRT-LLM or vLLM engines per GPU infrastructure per deployment guides
Key features
- OpenAI-compatible chat, completion, and responses endpoints per NIM LLM reference
- Anthropic-compatible `/v1/messages` routing through vLLM backend per docs
- Self-hosted containers with `/v1/health/live` and `/v1/health/ready` probes
- Catalog spans LLMs, vision, speech, and other use cases on docs.api.nvidia.com/nim
- Hosted NIM APIs via Developer Program for unlimited prototyping (FAQ)
Who Is It For?
- ML engineers shipping GPU inference on NVIDIA hardware
- Platform teams standardizing on OpenAI-compatible internal gateways
- Developers evaluating self-host vs hosted NIM endpoints
Frequently Asked Questions
- How is NIM different from calling NVIDIA cloud APIs directly?
- NIM packages optimized inference containers and standard APIs; you can self-host the same microservice pattern or use hosted dev endpoints per NVIDIA docs.
- Which API style should I use?
- Docs list OpenAI-compatible `/v1/chat/completions` for most chat apps and `/v1/messages` when you need Anthropic-compatible clients.
- Do I need NVIDIA AI Enterprise?
- Developer Program covers prototyping on hosted APIs; production self-host typically requires NVIDIA AI Enterprise licensing per product pages.
Related
Related
3 Indexed items
Baseten
Baseten documents a training and inference platform at docs.baseten.co where teams deploy models via the open-source Truss framework or call hosted Model APIs without standing up infrastructure. Config-only Truss deployments point at Hugging Face checkpoints, select GPU resources, and engines such as TensorRT-LLM; `truss push` builds optimized containers and exposes OpenAI-compatible sync endpoints like `https://model-{model_id}.api.baseten.co/environments/production/sync/v1`. Custom architectures use a Truss `Model` class with `load` and `predict` in `model.py`. Model APIs provide immediate OpenAI-SDK-style access to catalog models (DeepSeek, Qwen, GLM, and others listed in docs) using `BASETEN_API_KEY`.
AssemblyAI
AssemblyAI documents Voice AI APIs at assemblyai.com/docs where developers transcribe and analyze audio via REST at `https://api.assemblyai.com` and real-time WebSockets at `wss://streaming.assemblyai.com` (EU pre-recorded host `api.eu.assemblyai.com` per cloud residency docs). Pre-recorded transcription requires an explicit `speech_models` array on every `POST /v2/transcript` request—docs recommend `universal-3-pro` with `universal-2` fallback for 99-language coverage. The platform also publishes a Voice Agent API for speech-to-speech agents, Speech Understanding features (diarization, sentiment, summarization), Guardrails, and an LLM Gateway to run frontier models on transcripts.
fal
fal documents a serverless platform at fal.ai/docs where teams deploy custom models as Python `fal.App` classes with `@fal.endpoint` handlers on auto-scaling H100/A100/B200 runners, or call 1,000+ hosted Model APIs through a unified client. The workflow uses `fal run` for temporary cloud testing and `fal deploy` for persistent endpoints (for example `your-username/my-model` via `fal_client.subscribe` or `https://queue.fal.run/`). Docs describe `setup()` for one-time model loading, machine_type GPU selection, auth modes (private vs public), per-second Serverless billing versus hourly fal Compute for training, and built-in App Analytics with Prometheus-compatible metrics.