F

AI Tool

Fireworks AI

Serverless and dedicated inference for open and custom LLM, image, and embedding models

Fireworks AI documents a REST platform at docs.fireworks.ai where developers call language, image, and embedding models with Bearer API keys from the dashboard or `firectl api-key create`. Models use globally unique IDs such as `accounts/<account>/models/<model-id>` and can be served via serverless inference for popular open weights (for example Llama 3.1 70B listed on fireworks.ai/models) or private dedicated GPU deployments for custom base models and LoRA addons. Official docs distinguish serverless per-token billing with best-effort uptime from dedicated deployments billed per GPU-second with private capacity, and state that prompts and generated outputs are not logged except for documented exceptions such as the FireFunction model or opt-in advanced features.

Category Developer Tools
Pricing Serverless per-token pricing on fireworks.ai/pricing; dedicated deployments billed per GPU-second
Platforms Web / API / CLI
inferencellmfine-tuning

Use cases

  • Run Llama and other open models without provisioning GPUs via serverless endpoints
  • Deploy private LoRA fine-tunes on dedicated hardware tied to a base model
  • Manage model lifecycle (upload, fine-tune, deploy) through Fireworks account APIs
  • Compare serverless latency/cost trade-offs before committing to dedicated capacity
  • Integrate image and embedding endpoints alongside text models through one API key

Key features

  • REST API with Authorization Bearer API keys documented in api-reference introduction
  • Serverless inference catalog for community models plus user-uploaded custom base models
  • Dedicated deployments supporting base models and LoRA addons per models overview
  • Fine-tuning, dataset, and deployment management APIs listed in Fireworks docs index
  • Data-privacy statement: no prompt/output logging by default aside from noted exceptions

Who Is It For?

  • ML engineers shipping fine-tuned open-weight models to production
  • Startups needing serverless LLM access with optional dedicated scaling path
  • Platform teams evaluating per-token vs GPU-second economics

Frequently Asked Questions

Can custom models use serverless inference?
Docs state user-provided models—including fine-tunes—require dedicated deployments; serverless is for Fireworks-predeployed catalog models.
Are prompts stored?
Fireworks docs say prompt/generated data is not logged by default except for FireFunction (30-day logging) or explicit opt-in features.
How do LoRA addons deploy?
LoRA addons must run on a dedicated deployment for their corresponding base model per the models overview.

Related

Related

3 Indexed items

Together AI

Developer ToolsUsage-based inference + optional dedicated endpoints / fine-tuning (see Together pricing docs)

Together AI operates a developer platform for running prominent open-source and vendor-weight models from Together-hosted GPUs. Documentation centers on issuing API keys, installing the Together Python (`together`) or npm (`together-ai`) SDKs, or calling HTTPS endpoints such as `https://api.together.ai/v1/chat/completions` with Bearer authentication. Guides cover streaming chat completions, function calling, structured outputs, model catalog browsing, GPU reservations for steady traffic, and fine-tuning or dedicated cluster offerings published in the broader docs hierarchy.

Groq Cloud API

Developer ToolsFree tier + Pay-as-you-go (published USD rates)

GroqCloud exposes hosted language, speech, and compound workloads through Groq’s HTTP APIs. Documentation highlights compatibility with OpenAI client libraries when you point `base_url` at Groq’s OpenAI-compatible endpoint and supply a Groq API key, alongside first-party Groq SDKs for Python and JavaScript. Pricing pages publish per-model token rates (USD) for on-demand inference.

Replicate

Developer ToolsPay-per-prediction billing + prepaid credits (see Replicate billing docs)

Replicate is a hosted platform for executing third-party and custom machine-learning models over HTTP without provisioning GPUs yourself. Official documentation explains how to authenticate with API tokens, create asynchronous predictions, stream outputs, retrieve model metadata, wire webhooks for completion events, and optionally deploy or fine-tune checkpoints (for example FLUX image workflows) published to the Replicate catalog.