Train, deploy, and serve models with Truss, Model APIs, and OpenAI-compatible endpoints
Baseten documents a training and inference platform at docs.baseten.co where teams deploy models via the open-source Truss framework or call hosted Model APIs without standing up infrastructure. Config-only Truss deployments point at Hugging Face checkpoints, select GPU resources, and engines such as TensorRT-LLM; `truss push` builds optimized containers and exposes OpenAI-compatible sync endpoints like `https://model-{model_id}.api.baseten.co/environments/production/sync/v1`. Custom architectures use a Truss `Model` class with `load` and `predict` in `model.py`. Model APIs provide immediate OpenAI-SDK-style access to catalog models (DeepSeek, Qwen, GLM, and others listed in docs) using `BASETEN_API_KEY`.
Use cases
- Ship a Hugging Face LLM to a GPU endpoint without writing Dockerfiles
- Prototype with Model APIs then promote a tuned checkpoint via Truss training flows
- Serve agent backends that already use the OpenAI SDK by swapping base URL and API key
- Run custom inference logic in `predict` while Baseten manages containers and scaling
- Benchmark TensorRT-LLM optimized builds against baseline PyTorch serving
Key features
- Truss `config.yaml` deployments for supported open LLMs, embeddings, and image models per build-your-first-model guide
- OpenAI-compatible HTTP APIs on engine-based deployments with documented `BASETEN_API_KEY` authentication
- Custom `model.py` Model lifecycle (`__init__`, `load`, `predict`) for preprocessing and unsupported architectures
- Development vs production promotion paths (`/development/predict` to `/production/predict`) documented in deployment guides
- Model APIs for zero-setup inference on catalog checkpoints without a private deployment
Who Is It For?
- ML engineers deploying open-weight models to production APIs
- Platform teams standardizing on Truss packaging for internal model catalogs
- Startups needing managed GPU inference without operating Kubernetes
Frequently Asked Questions
- Do I always need a custom model.py file?
- No. Baseten docs show config-only Truss deployments for many popular open architectures; custom Python is for unsupported engines or bespoke preprocessing.
- How do I authenticate API calls?
- Docs use `Authorization: Api-Key` headers with `BASETEN_API_KEY` for deployment endpoints and Model APIs.
- What is the difference between Model APIs and Truss deployments?
- Model APIs are hosted catalog endpoints you can call immediately; Truss deployments package your chosen model and hardware into a dedicated Baseten endpoint you manage.
Related
Related
3 Indexed items
Fireworks AI
Fireworks AI documents a REST platform at docs.fireworks.ai where developers call language, image, and embedding models with Bearer API keys from the dashboard or `firectl api-key create`. Models use globally unique IDs such as `accounts/<account>/models/<model-id>` and can be served via serverless inference for popular open weights (for example Llama 3.1 70B listed on fireworks.ai/models) or private dedicated GPU deployments for custom base models and LoRA addons. Official docs distinguish serverless per-token billing with best-effort uptime from dedicated deployments billed per GPU-second with private capacity, and state that prompts and generated outputs are not logged except for documented exceptions such as the FireFunction model or opt-in advanced features.
Together AI
Together AI operates a developer platform for running prominent open-source and vendor-weight models from Together-hosted GPUs. Documentation centers on issuing API keys, installing the Together Python (`together`) or npm (`together-ai`) SDKs, or calling HTTPS endpoints such as `https://api.together.ai/v1/chat/completions` with Bearer authentication. Guides cover streaming chat completions, function calling, structured outputs, model catalog browsing, GPU reservations for steady traffic, and fine-tuning or dedicated cluster offerings published in the broader docs hierarchy.
Modal
Modal documents a serverless cloud at modal.com where engineers run compute-intensive Python with zero infrastructure configuration: deploy OpenAI-compatible LLM services, batch workflows, job queues, GPU training and fine-tuning, and thousands of isolated Sandboxes for agent-generated code. Official guides show defining apps with `@app.function`, container images via `modal.Image`, and GPU types in code rather than YAML. Modal states pricing is per-second serverless usage with pooled capacity across major clouds, and supports calling functions from JavaScript/Go clients in addition to Python.