Modal LLM Serving
Quick Start
- Verify the actual local Modal environment before writing code:

  ```shell
  modal --version
  python -c "import modal,sys; print(modal.__version__); print(sys.executable)"
  modal profile current
  ```
- Do not assume the default `python` interpreter matches the environment behind the `modal` CLI.
- Confirm that the request is really about self-hosted open-weight text generation on Modal, and identify which variant applies:
  - Standard online API
  - Cold-start-sensitive vLLM deployment
  - Low-latency interactive serving
  - High-throughput or offline batch text inference
- Read `references/performance-playbook.md` and then exactly one primary reference.
- Default to vLLM plus `@modal.web_server` unless the user explicitly optimizes for lowest latency or offline throughput.
- Ground every implementation in the actual workload: target latency or throughput, model size and precision, GPU type and count, region, concurrency target, and cold-start tolerance.
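The workload parameters listed above can be captured up front in one small spec object, so every later choice (engine, GPU, concurrency) traces back to explicit numbers. This is an illustrative sketch; the class, field names, and default values are assumptions, not part of the skill.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class WorkloadSpec:
    """Hypothetical spec object; field names and defaults are illustrative."""
    model: str                    # pinned model reference, e.g. Hub repo + revision
    model_size_b: float           # parameter count, in billions
    precision: str                # e.g. "bf16", "fp8"
    gpu_type: str                 # e.g. "A100"
    gpu_count: int = 1
    region: Optional[str] = None
    target_p50_latency_ms: Optional[int] = None  # set for interactive serving
    target_tokens_per_s: Optional[int] = None    # set for batch throughput
    concurrency_target: int = 8
    cold_start_tolerance_s: Optional[int] = None

    def is_latency_sensitive(self) -> bool:
        # Latency targets push toward SGLang or snapshot-based cold starts.
        return self.target_p50_latency_ms is not None

# Example: a throughput-oriented batch workload with no latency target.
spec = WorkloadSpec(
    model="Qwen/Qwen2.5-7B-Instruct",
    model_size_b=7,
    precision="bf16",
    gpu_type="A100",
    target_tokens_per_s=5000,
)
```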
Choose the Workflow
- Use vLLM with `@modal.web_server` for the default OpenAI-compatible serving path. Read `references/vllm-online-serving.md`.
- Use vLLM with memory snapshots and a sleep or wake flow only when cold-start latency is a first-class requirement. Read `references/vllm-cold-starts.md`.
- Use SGLang with `modal.experimental.http_server`, explicit region selection, and sticky routing when the user cares most about latency. Read `references/sglang-low-latency.md`.
- Use the vLLM Python `LLM` interface inside `@app.cls` or another batch worker when the task is about tokens per second or tokens per dollar rather than per-request HTTP behavior. Read `references/vllm-throughput.md`.
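The default path above can be sketched as a single Modal app: vLLM's OpenAI-compatible server launched behind `@modal.web_server`. This is a deployment sketch, not a definitive implementation; the model name, GPU type, knob values, and cache path are placeholders to replace with workload-specific choices from the primary reference.

```python
# Hypothetical deployment sketch -- model, GPU, and knob values are assumptions.
import subprocess

import modal

app = modal.App("vllm-openai-sketch")

# Cache Hub weights in a Volume so restarts skip the download.
hf_cache = modal.Volume.from_name("hf-cache", create_if_missing=True)

image = (
    modal.Image.debian_slim(python_version="3.12")
    .pip_install("vllm")  # pin an exact version in real use
    .env({"HF_XET_HIGH_PERFORMANCE": "1"})  # faster Hub downloads
)

@app.function(
    image=image,
    gpu="A10G",  # placeholder; size to the model and precision
    volumes={"/root/.cache/huggingface": hf_cache},
    timeout=60 * 20,
)
@modal.concurrent(max_inputs=8)  # start conservative; benchmark before raising
@modal.web_server(port=8000, startup_timeout=60 * 5)
def serve():
    # vLLM exposes an OpenAI-compatible API on the advertised port.
    subprocess.Popen(
        [
            "vllm", "serve", "Qwen/Qwen2.5-0.5B-Instruct",  # pin a revision in real use
            "--host", "0.0.0.0",
            "--port", "8000",
        ]
    )
```

Deploy with `modal deploy`, then hit `/health` on the generated URL before sending traffic.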
Default Rules
- Pin model revisions when pulling from Hugging Face or another mutable registry.
- Cache model weights in one Modal Volume and engine compilation artifacts in another when the runtime produces them.
- Set `HF_XET_HIGH_PERFORMANCE=1` for Hub downloads unless the environment has a specific reason not to.
- Include a readiness check before reporting success. Add a smoke-test path such as a `local_entrypoint` or a small client that hits `/health` and sends one OpenAI-compatible request.
- Treat `max_inputs`, `target_inputs`, tensor parallelism, and related knobs as workload-specific. Start conservative and benchmark before increasing them.
- Expose only the ports and endpoints the task actually needs.
- Keep one serving pattern per file unless the user explicitly asks for a comparison artifact or benchmark harness.
- Use SGLang only when lowest latency is the explicit objective and the extra setup is justified.
- Use the vLLM Python `LLM` interface only for offline or batch inference that does not need per-request HTTP behavior.
- Use snapshot-based cold-start reduction only when startup latency matters enough to justify extra operational complexity.
- Keep the scope on self-hosted text generation engines. Do not stretch this skill to cover embeddings, generic `transformers` pipelines, diffusion inference, or purely hosted-API usage.
- If the task is really about training or post-training, stop and use `modal-finetuning`.
- If the task is really about detached job orchestration, retries, or `.map` and `.spawn`, stop and use `modal-batch-processing`.
- If the task is really about isolated interactive execution or sandbox lifecycle, stop and use `modal-sandbox`.
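The readiness-check rule above can be satisfied with a tiny stdlib client. This is a sketch: the base URL, model name, and prompt are placeholders, and it assumes the server exposes vLLM's standard `/health` and `/v1/chat/completions` routes.

```python
import json
import urllib.request

def build_chat_request(model: str, prompt: str) -> dict:
    """Minimal OpenAI-compatible /v1/chat/completions payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 16,
    }

def smoke_test(base_url: str, model: str) -> None:
    # 1) Readiness: /health should return 200 before any traffic is sent.
    with urllib.request.urlopen(f"{base_url}/health", timeout=10) as resp:
        assert resp.status == 200, "server not ready"
    # 2) One real OpenAI-compatible request end to end.
    body = json.dumps(build_chat_request(model, "Say OK.")).encode()
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        out = json.load(resp)
    assert out["choices"][0]["message"]["content"], "empty completion"
```

Run `smoke_test` against the deployed URL (hypothetical example: `smoke_test("https://example--vllm.modal.run", "my-model")`) before reporting success.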
Validate
- Run `npx skills add . --list` after editing the package metadata or skill descriptions.
- Keep `evals/evals.json` and `evals/trigger-evals.json` aligned with the actual workflow boundary of the skill.
- Run `python3 -m py_compile skills/modal-llm-serving/scripts/qwen3_throughput.py` when changing the throughput artifact.
References