Speech-to-text, streaming transcription, Voice Agent API, and LLM Gateway for voice workflows

AssemblyAI documents Voice AI APIs at assemblyai.com/docs where developers transcribe and analyze audio via REST at `https://api.assemblyai.com` and real-time WebSockets at `wss://streaming.assemblyai.com` (EU pre-recorded host `api.eu.assemblyai.com` per cloud residency docs). Pre-recorded transcription requires an explicit `speech_models` array on every `POST /v2/transcript` request—docs recommend `universal-3-pro` with `universal-2` fallback for 99-language coverage. The platform also publishes a Voice Agent API for speech-to-speech agents, Speech Understanding features (diarization, sentiment, summarization), Guardrails, and an LLM Gateway to run frontier models on transcripts.

Category Developer Tools

Pricing Pay-as-you-go per audio hour; enterprise plans (see assemblyai.com/pricing)

Platforms Web / API / JavaScript / Python

speech-to-textstreamingvoice-agents

Use cases

Batch transcribe podcasts or calls with Universal-3 Pro and store transcript IDs
Build live captions or agent-assist with streaming STT
Run voice agents that need both STT and downstream LLM summarization
Redact PII or moderate content with Guardrails on transcript pipelines
Serve EU customers via `api.eu.assemblyai.com` residency endpoint

Key features

Pre-recorded STT via `POST /v2/transcript` with required `speech_models` parameter
Streaming STT WebSocket at `wss://streaming.assemblyai.com` with API-key auth
Voice Agent API for speech-to-speech agents over a single WebSocket
Speech Understanding suite (diarization, sentiment, topics, auto chapters)
LLM Gateway to apply Anthropic, OpenAI, Google, and other models to transcripts

Who Is It For?

Developers shipping voice-enabled SaaS products
Teams needing both async and real-time transcription in one vendor
ML engineers applying LLMs to spoken-data workflows

Frequently Asked Questions

Is there a default speech model?: No—AssemblyAI docs state every pre-recorded request must include `speech_models`; omitting it fails the request.
How do I authenticate API calls?: Pass your API key in the `Authorization` header for REST and LLM Gateway; streaming accepts the key as a query parameter or in the initial WebSocket message per docs.
What is the recommended model stack?: Docs recommend `universal-3-pro` for accuracy and suggest `['universal-3-pro','universal-2']` when you need automatic fallback for unsupported languages.

3 Indexed items

Deepgram

Developer ToolsPay-as-you-go per aud…

Deepgram documents speech-to-text at developers.deepgram.com with WebSocket streaming on `/v1/listen` for general real-time transcription (Nova-3 model, diarization, and search features per API reference) and `/v2/listen` for conversational Flux models with integrated end-of-turn detection (StartOfTurn, EndOfTurn, EagerEndOfTurn events). Official SDKs expose `deepgram.listen.v1.connect` and `deepgram.listen.v2.connect` for binary audio streams. Docs contrast Flux—optimized for voice agents with lower turn-detection latency—against Nova-3 for meetings, IVR, and agent-assist workloads, and describe latency measurement guides targeting sub-300 ms streaming for Nova-3. Self-hosted deployments can run Flux on dedicated Engine nodes with `/v2/listen` enabled per self-hosted configuration guides.

Fireworks AI

Developer ToolsServerless per-token …

Fireworks AI documents a REST platform at docs.fireworks.ai where developers call language, image, and embedding models with Bearer API keys from the dashboard or `firectl api-key create`. Models use globally unique IDs such as `accounts/<account>/models/<model-id>` and can be served via serverless inference for popular open weights (for example Llama 3.1 70B listed on fireworks.ai/models) or private dedicated GPU deployments for custom base models and LoRA addons. Official docs distinguish serverless per-token billing with best-effort uptime from dedicated deployments billed per GPU-second with private capacity, and state that prompts and generated outputs are not logged except for documented exceptions such as the FireFunction model or opt-in advanced features.

NVIDIA NIM

Developer ToolsDeveloper Program hos…

NVIDIA NIM documents performance-optimized inference microservices at docs.api.nvidia.com/nim and docs.nvidia.com/nim that expose industry-standard APIs (OpenAI-compatible `/v1/chat/completions`, `/v1/completions`, `/v1/responses`, Anthropic-compatible `/v1/messages`) from containerized models backed by TensorRT-LLM, vLLM, or SGLang per deployment. Teams can self-host GPU-accelerated models on cloud, data center, or RTX workstations, or prototype via NVIDIA-hosted NIM API endpoints through the Developer Program. Management endpoints such as `/v1/health/ready` and `/v1/metrics` support readiness probes and Prometheus metrics on self-hosted containers per the LLM API reference.

AssemblyAI