Serverless Python cloud for GPUs, sandboxes, batch jobs, and LLM inference

Modal documents a serverless cloud at modal.com where engineers run compute-intensive Python with zero infrastructure configuration: deploy OpenAI-compatible LLM services, batch workflows, job queues, GPU training and fine-tuning, and thousands of isolated Sandboxes for agent-generated code. Official guides show defining apps with `@app.function`, container images via `modal.Image`, and GPU types in code rather than YAML. Modal states pricing is per-second serverless usage with pooled capacity across major clouds, and supports calling functions from JavaScript/Go clients in addition to Python.

Category: Developer Tools

Pricing: Per-second serverless usage, per modal.com/pricing

Platforms: Web / Python / JavaScript / Go

serverless gpu inference

Use cases

Serve open-weight LLMs with sub-second cold starts without managing Kubernetes
Run massively parallel batch inference or data processing jobs
Fine-tune diffusion or language models on the latest GPUs via code-defined environments
Host coding agents in Sandboxes, following the LangGraph examples linked from the docs
Prototype with `modal run` from a local terminal, then scale to production serverless functions

Key features

Python `@app.function` deployments with programmatic GPU and image configuration, per the docs
Documented examples for LLM inference, batch processing, and real-time transcription
Sandboxes for secure execution of AI-generated code at scale
GPU-backed Notebooks launched in seconds, per the platform overview
Multi-cloud capacity pooling, described in the introduction guide

Who Is It For?

ML engineers who want GPU workloads without cluster operations
Agent builders needing isolated code execution environments
Teams shipping inference APIs without maintaining cloud infrastructure

Frequently Asked Questions

Do I need Docker or Kubernetes knowledge? The Modal docs emphasize code-defined images and functions, with no YAML cluster config required for basic usage.
How do I get started? The official flow is: create a modal.com account, `pip install modal`, run `modal setup` to authenticate, then `modal run` your script.
Is Modal only for Python authors? Functions are authored in Python, but the docs list JavaScript and Go SDKs for invoking Modal resources.

3 Indexed items

fal

Developer Tools · Per-second Serverless…

fal documents a serverless platform at fal.ai/docs where teams deploy custom models as Python `fal.App` classes with `@fal.endpoint` handlers on auto-scaling H100/A100/B200 runners, or call 1,000+ hosted Model APIs through a unified client. The workflow uses `fal run` for temporary cloud testing and `fal deploy` for persistent endpoints (for example `your-username/my-model` via `fal_client.subscribe` or `https://queue.fal.run/`). Docs describe `setup()` for one-time model loading, machine_type GPU selection, auth modes (private vs public), per-second Serverless billing versus hourly fal Compute for training, and built-in App Analytics with Prometheus-compatible metrics.

RunPod

Developer Tools · Per-second serverless…

RunPod documents a serverless platform at docs.runpod.io where teams deploy containerized AI handlers without managing servers, paying only for compute time used. Developers write Python handler functions with the Runpod SDK (`runpod.serverless.start`), package Docker images, and expose queue-based endpoints at `https://api.runpod.ai/v2/{ENDPOINT_ID}/runsync` or `/run` with `Authorization: Bearer RUNPOD_API_KEY`. Docs cover streaming handlers, load-balancing endpoints with custom HTTP frameworks, Pods for persistent GPUs, network volumes, and a REST API at rest.runpod.io for programmatic resource management.
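A request to the queue-based `/runsync` endpoint described above might be assembled as in this sketch, which uses only the Python standard library and builds the request without sending it. The helper name `build_runsync_request`, the endpoint ID `my-endpoint-id`, the example payload, and the assumption that the job data is wrapped in an `"input"` key are illustrative and not taken from the listing.

```python
import json
import os
import urllib.request


def build_runsync_request(
    endpoint_id: str, api_key: str, job_input: dict
) -> urllib.request.Request:
    """Build (but do not send) a POST request for a queue-based endpoint."""
    url = f"https://api.runpod.ai/v2/{endpoint_id}/runsync"
    # Assumption: the handler expects the job payload under an "input" key.
    body = json.dumps({"input": job_input}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


# Placeholder values; a real call would need a valid RUNPOD_API_KEY.
req = build_runsync_request(
    endpoint_id="my-endpoint-id",
    api_key=os.environ.get("RUNPOD_API_KEY", "dummy-key"),
    job_input={"prompt": "Hello"},
)
# Sending is deliberately left out: urllib.request.urlopen(req) would do it.
```

Swapping `/runsync` for `/run` in the URL would target the asynchronous path that the description also mentions.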

Baseten

Developer Tools · Usage-based inference…

Baseten documents a training and inference platform at docs.baseten.co where teams deploy models via the open-source Truss framework or call hosted Model APIs without standing up infrastructure. Config-only Truss deployments point at Hugging Face checkpoints, select GPU resources, and engines such as TensorRT-LLM; `truss push` builds optimized containers and exposes OpenAI-compatible sync endpoints like `https://model-{model_id}.api.baseten.co/environments/production/sync/v1`. Custom architectures use a Truss `Model` class with `load` and `predict` in `model.py`. Model APIs provide immediate OpenAI-SDK-style access to catalog models (DeepSeek, Qwen, GLM, and others listed in docs) using `BASETEN_API_KEY`.
