Speech-to-text, streaming transcription, Voice Agent API, and LLM Gateway for voice workflows
AssemblyAI documents its Voice AI APIs at assemblyai.com/docs. Developers transcribe and analyze audio over REST at `https://api.assemblyai.com` and over real-time WebSockets at `wss://streaming.assemblyai.com`; the cloud residency docs list `api.eu.assemblyai.com` as the EU host for pre-recorded audio. Pre-recorded transcription requires an explicit `speech_models` array on every `POST /v2/transcript` request, and the docs recommend `universal-3-pro` with a `universal-2` fallback for 99-language coverage. The platform also publishes a Voice Agent API for speech-to-speech agents, Speech Understanding features (diarization, sentiment, summarization), Guardrails, and an LLM Gateway for running frontier models on transcripts.
Use cases
- Batch transcribe podcasts or calls with Universal-3 Pro and store transcript IDs
- Build live captions or agent-assist with streaming STT
- Run voice agents that need both STT and downstream LLM summarization
- Redact PII or moderate content with Guardrails on transcript pipelines
- Serve EU customers via `api.eu.assemblyai.com` residency endpoint
Key features
- Pre-recorded STT via `POST /v2/transcript` with required `speech_models` parameter
- Streaming STT WebSocket at `wss://streaming.assemblyai.com` with API-key auth
- Voice Agent API for speech-to-speech agents over a single WebSocket
- Speech Understanding suite (diarization, sentiment, topics, auto chapters)
- LLM Gateway to apply Anthropic, OpenAI, Google, and other models to transcripts
Who Is It For?
- Developers shipping voice-enabled SaaS products
- Teams needing both async and real-time transcription in one vendor
- ML engineers applying LLMs to spoken-data workflows
Frequently Asked Questions
- Is there a default speech model?
- No. The AssemblyAI docs state that every pre-recorded request must include `speech_models`; omitting it fails the request.
- How do I authenticate API calls?
- Pass your API key in the `Authorization` header for REST and LLM Gateway calls. Per the docs, streaming accepts the key as a query parameter or in the initial WebSocket message.
- What is the recommended model stack?
- Docs recommend `universal-3-pro` for accuracy and suggest `['universal-3-pro','universal-2']` when you need automatic fallback for unsupported languages.
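
The requirements in the answers above (an explicit `speech_models` array, the API key in the `Authorization` header, and the optional EU host) can be sketched as a small request builder. This is a minimal illustration, not AssemblyAI's SDK: the `build_transcript_request` helper is hypothetical, and the `audio_url` body field and the raw-key header format are assumptions not stated on this page. The code only constructs the request; nothing is sent.

```python
import json
import urllib.request

API_BASE = "https://api.assemblyai.com"        # REST host named on this page
EU_API_BASE = "https://api.eu.assemblyai.com"  # EU pre-recorded host named on this page


def build_transcript_request(api_key, audio_url,
                             speech_models=("universal-3-pro", "universal-2"),
                             eu=False):
    """Build (but do not send) a POST /v2/transcript request.

    Hypothetical helper. `audio_url` as the body field name is an assumption.
    """
    if not speech_models:
        # The page says a request without speech_models fails, so refuse early.
        raise ValueError("speech_models must list at least one model")
    base = EU_API_BASE if eu else API_BASE
    body = {"audio_url": audio_url, "speech_models": list(speech_models)}
    return urllib.request.Request(
        url=f"{base}/v2/transcript",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": api_key,  # assumption: raw key, no scheme prefix
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_transcript_request("YOUR_API_KEY", "https://example.com/call.mp3")
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

Passing the returned object to `urllib.request.urlopen` would send it. Keeping construction separate makes the payload easy to inspect before any network call.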
Related
Deepgram
Deepgram documents speech-to-text at developers.deepgram.com. WebSocket streaming runs on `/v1/listen` for general real-time transcription (Nova-3 model, diarization, and search features per the API reference) and on `/v2/listen` for conversational Flux models with integrated end-of-turn detection (StartOfTurn, EndOfTurn, EagerEndOfTurn events). Official SDKs expose `deepgram.listen.v1.connect` and `deepgram.listen.v2.connect` for binary audio streams. The docs contrast Flux, which is optimized for voice agents with lower turn-detection latency, with Nova-3 for meetings, IVR, and agent-assist workloads, and include latency measurement guides targeting sub-300 ms streaming for Nova-3. Self-hosted deployments can run Flux on dedicated Engine nodes with `/v2/listen` enabled, per the self-hosted configuration guides.
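
The split between the two streaming paths can be sketched as a small routing helper. This is an illustration only: the `listen_url` function and the workload labels are hypothetical, and the `wss://api.deepgram.com` host is an assumption, since this page names only the paths.

```python
from urllib.parse import urlencode

DEEPGRAM_WS_BASE = "wss://api.deepgram.com"  # assumption: host is not stated on this page

# Hypothetical workload labels, grouped the way the description above groups them.
FLUX_WORKLOADS = {"voice_agent"}                     # turn-taking conversations
NOVA_WORKLOADS = {"meeting", "ivr", "agent_assist"}  # general real-time transcription


def listen_url(workload, base=DEEPGRAM_WS_BASE, **query):
    """Pick /v2/listen for Flux-style workloads and /v1/listen otherwise."""
    if workload in FLUX_WORKLOADS:
        path = "/v2/listen"
    elif workload in NOVA_WORKLOADS:
        path = "/v1/listen"
    else:
        raise ValueError(f"unknown workload: {workload!r}")
    qs = urlencode(query)
    return f"{base}{path}?{qs}" if qs else f"{base}{path}"


print(listen_url("voice_agent"))
print(listen_url("meeting", diarize="true"))
```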
Fireworks AI
Fireworks AI documents a REST platform at docs.fireworks.ai where developers call language, image, and embedding models with Bearer API keys created in the dashboard or with `firectl api-key create`. Models use globally unique IDs such as `accounts/<account>/models/<model-id>`. They can be served via serverless inference for popular open weights (for example, Llama 3.1 70B listed on fireworks.ai/models) or on private dedicated GPU deployments for custom base models and LoRA add-ons. The official docs distinguish serverless per-token billing with best-effort uptime from dedicated deployments billed per GPU-second with private capacity. They also state that prompts and generated outputs are not logged, except for documented exceptions such as the FireFunction model or opt-in advanced features.
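
The model ID shape quoted above, `accounts/<account>/models/<model-id>`, can be built and checked with a few lines. This is a sketch under assumptions: both helper names are hypothetical, and the rule that neither segment contains a slash is inferred from the pattern, not taken from the Fireworks docs.

```python
import re

# Assumption: each segment is non-empty and contains no "/".
_MODEL_PATH = re.compile(r"^accounts/(?P<account>[^/]+)/models/(?P<model_id>[^/]+)$")


def model_path(account, model_id):
    """Join an account and model name into the globally unique ID form."""
    path = f"accounts/{account}/models/{model_id}"
    if not _MODEL_PATH.match(path):
        raise ValueError(f"invalid account or model id: {path!r}")
    return path


def parse_model_path(path):
    """Split a full model ID back into (account, model_id)."""
    match = _MODEL_PATH.match(path)
    if match is None:
        raise ValueError(f"not a full model id: {path!r}")
    return match["account"], match["model_id"]


# "example-account" and "example-model" are placeholder values.
full_id = model_path("example-account", "example-model")
print(full_id)
print(parse_model_path(full_id))
```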
Baseten
Baseten documents a training and inference platform at docs.baseten.co where teams deploy models via the open-source Truss framework or call hosted Model APIs without standing up infrastructure. Config-only Truss deployments point at Hugging Face checkpoints and select GPU resources and an engine such as TensorRT-LLM; `truss push` builds optimized containers and exposes OpenAI-compatible sync endpoints like `https://model-{model_id}.api.baseten.co/environments/production/sync/v1`. Custom architectures use a Truss `Model` class with `load` and `predict` in `model.py`. Model APIs provide immediate OpenAI-SDK-style access to catalog models (DeepSeek, Qwen, GLM, and others listed in the docs) using `BASETEN_API_KEY`.
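
The endpoint template and `BASETEN_API_KEY` mentioned above can be combined into the settings an OpenAI-compatible client would need. This is a minimal sketch: `sync_base_url` and `client_settings` are hypothetical helpers, the model ID shown is a placeholder, and how a particular client library consumes these values is left as an assumption.

```python
import os


def sync_base_url(model_id):
    """Fill the production sync endpoint template quoted above."""
    if not model_id:
        raise ValueError("model_id is required")
    return f"https://model-{model_id}.api.baseten.co/environments/production/sync/v1"


def client_settings(model_id, env=os.environ):
    """Return base URL and API key for an OpenAI-compatible client.

    Reads BASETEN_API_KEY from `env`, which defaults to the process environment.
    """
    api_key = env.get("BASETEN_API_KEY")
    if not api_key:
        raise RuntimeError("BASETEN_API_KEY is not set")
    return {"base_url": sync_base_url(model_id), "api_key": api_key}


# "abc123" is a placeholder model ID; the key is injected here for demonstration.
print(client_settings("abc123", env={"BASETEN_API_KEY": "demo-key"}))
```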