Serverless Python cloud for GPUs, sandboxes, batch jobs, and LLM inference
Modal documents a serverless cloud at modal.com where engineers run compute-intensive Python with zero infrastructure configuration: deploy OpenAI-compatible LLM services, batch workflows, job queues, GPU training and fine-tuning, and thousands of isolated Sandboxes for agent-generated code. Official guides show defining apps with `@app.function`, container images via `modal.Image`, and GPU types in code rather than YAML. Modal states pricing is per-second serverless usage with pooled capacity across major clouds, and supports calling functions from JavaScript/Go clients in addition to Python.
Use cases
- Serve open-weight LLMs with sub-second cold starts without managing Kubernetes
- Run massively parallel batch inference or data processing jobs
- Fine-tune diffusion or language models on latest GPUs via code-defined environments
- Host coding agents in Sandboxes with LangGraph examples linked from docs
- Prototype with `modal run` locally then scale to production serverless functions
Key features
- Python `@app.function` deployments with programmatic GPU and image configuration per docs
- Documented examples for LLM inference, batch processing, and real-time transcription
- Sandboxes for secure execution of AI-generated code at scale
- GPU-backed Notebooks launched in seconds per platform overview
- Multi-cloud capacity pooling described in introduction guide
Who Is It For?
- ML engineers who want GPU workloads without cluster operations
- Agent builders needing isolated code execution environments
- Teams shipping inference APIs without maintaining cloud infrastructure
Frequently Asked Questions
- Do I need Docker or Kubernetes knowledge?
- Modal docs emphasize code-defined images and functions with no YAML cluster config required for basic usage.
- How do I get started?
- Official flow: create modal.com account, `pip install modal`, run `modal setup` to authenticate, then `modal run` your script.
- Is Modal only for Python authors?
- Functions are authored in Python, but docs list JavaScript/Go SDKs to invoke Modal resources.
Related
Related
3 Indexed items
Fireworks AI
Fireworks AI documents a REST platform at docs.fireworks.ai where developers call language, image, and embedding models with Bearer API keys from the dashboard or `firectl api-key create`. Models use globally unique IDs such as `accounts/<account>/models/<model-id>` and can be served via serverless inference for popular open weights (for example Llama 3.1 70B listed on fireworks.ai/models) or private dedicated GPU deployments for custom base models and LoRA addons. Official docs distinguish serverless per-token billing with best-effort uptime from dedicated deployments billed per GPU-second with private capacity, and state that prompts and generated outputs are not logged except for documented exceptions such as the FireFunction model or opt-in advanced features.
Groq Cloud API
GroqCloud exposes hosted language, speech, and compound workloads through Groq’s HTTP APIs. Documentation highlights compatibility with OpenAI client libraries when you point `base_url` at Groq’s OpenAI-compatible endpoint and supply a Groq API key, alongside first-party Groq SDKs for Python and JavaScript. Pricing pages publish per-model token rates (USD) for on-demand inference.
Portkey
Portkey documents an AI gateway at docs.portkey.ai that unifies access to more than 250 models through a Portkey SDK or OpenAI-compatible base URL (`PORTKEY_GATEWAY_URL`) with provider routing headers. Official quickstarts show three-line Python or TypeScript integrations that start monitoring LLM requests for resilience, security, and performance. Portkey states the open-source gateway is free to self-host while the managed service includes a free tier of 10k requests per month, edge-hosted workers adding roughly 20–40ms latency versus direct API calls, ISO 27001 and SOC 2 certifications, and optional configurations that skip storing request/response bodies.