Serverless and dedicated inference for open and custom LLM, image, and embedding models

Fireworks AI documents a REST platform at docs.fireworks.ai where developers call language, image, and embedding models with Bearer API keys from the dashboard or `firectl api-key create`. Models use globally unique IDs such as `accounts/<account>/models/<model-id>` and can be served via serverless inference for popular open weights (for example Llama 3.1 70B listed on fireworks.ai/models) or private dedicated GPU deployments for custom base models and LoRA addons. Official docs distinguish serverless per-token billing with best-effort uptime from dedicated deployments billed per GPU-second with private capacity, and state that prompts and generated outputs are not logged except for documented exceptions such as the FireFunction model or opt-in advanced features.

Category Developer Tools

Pricing Serverless per-token pricing on fireworks.ai/pricing; dedicated deployments billed per GPU-second

Platforms Web / API / CLI

inferencellmfine-tuning

Use cases

Run Llama and other open models without provisioning GPUs via serverless endpoints
Deploy private LoRA fine-tunes on dedicated hardware tied to a base model
Manage model lifecycle (upload, fine-tune, deploy) through Fireworks account APIs
Compare serverless latency/cost trade-offs before committing to dedicated capacity
Integrate image and embedding endpoints alongside text models through one API key

Key features

REST API with Authorization Bearer API keys documented in api-reference introduction
Serverless inference catalog for community models plus user-uploaded custom base models
Dedicated deployments supporting base models and LoRA addons per models overview
Fine-tuning, dataset, and deployment management APIs listed in Fireworks docs index
Data-privacy statement: no prompt/output logging by default aside from noted exceptions

Who Is It For?

ML engineers shipping fine-tuned open-weight models to production
Startups needing serverless LLM access with optional dedicated scaling path
Platform teams evaluating per-token vs GPU-second economics

Frequently Asked Questions

Can custom models use serverless inference?: Docs state user-provided models—including fine-tunes—require dedicated deployments; serverless is for Fireworks-predeployed catalog models.
Are prompts stored?: Fireworks docs say prompt/generated data is not logged by default except for FireFunction (30-day logging) or explicit opt-in features.
How do LoRA addons deploy?: LoRA addons must run on a dedicated deployment for their corresponding base model per the models overview.

3 Indexed items

Baseten

Developer ToolsUsage-based inference…

Baseten documents a training and inference platform at docs.baseten.co where teams deploy models via the open-source Truss framework or call hosted Model APIs without standing up infrastructure. Config-only Truss deployments point at Hugging Face checkpoints, select GPU resources, and engines such as TensorRT-LLM; `truss push` builds optimized containers and exposes OpenAI-compatible sync endpoints like `https://model-{model_id}.api.baseten.co/environments/production/sync/v1`. Custom architectures use a Truss `Model` class with `load` and `predict` in `model.py`. Model APIs provide immediate OpenAI-SDK-style access to catalog models (DeepSeek, Qwen, GLM, and others listed in docs) using `BASETEN_API_KEY`.

fal

Developer ToolsPer-second Serverless…

fal documents a serverless platform at fal.ai/docs where teams deploy custom models as Python `fal.App` classes with `@fal.endpoint` handlers on auto-scaling H100/A100/B200 runners, or call 1,000+ hosted Model APIs through a unified client. The workflow uses `fal run` for temporary cloud testing and `fal deploy` for persistent endpoints (for example `your-username/my-model` via `fal_client.subscribe` or `https://queue.fal.run/`). Docs describe `setup()` for one-time model loading, machine_type GPU selection, auth modes (private vs public), per-second Serverless billing versus hourly fal Compute for training, and built-in App Analytics with Prometheus-compatible metrics.

AssemblyAI

Developer ToolsPay-as-you-go per aud…

AssemblyAI documents Voice AI APIs at assemblyai.com/docs where developers transcribe and analyze audio via REST at `https://api.assemblyai.com` and real-time WebSockets at `wss://streaming.assemblyai.com` (EU pre-recorded host `api.eu.assemblyai.com` per cloud residency docs). Pre-recorded transcription requires an explicit `speech_models` array on every `POST /v2/transcript` request—docs recommend `universal-3-pro` with `universal-2` fallback for 99-language coverage. The platform also publishes a Voice Agent API for speech-to-speech agents, Speech Understanding features (diarization, sentiment, summarization), Guardrails, and an LLM Gateway to run frontier models on transcripts.

Fireworks AI