vLLM Server Management

Deploy production-grade LLM inference servers with vLLM, an open-source serving engine that achieves high throughput through PagedAttention and continuous batching.

When to Use This Skill

Use this skill when:

Serving open-source LLMs (Llama, Mistral, Qwen, Gemma) at scale
Building an OpenAI-compatible API endpoint for self-hosted models
Optimizing LLM throughput and latency for production traffic
Running multi-GPU inference with tensor or pipeline parallelism
Deploying quantized models to reduce GPU memory requirements

Prerequisites

NVIDIA GPU(s) with CUDA 12.1+ (A100/H100 recommended for production)
Docker or Python 3.9+ with pip
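Enough VRAM for weights plus KV cache: about 140GB+ in total for a 70B model in FP16 (40GB+ with 4-bit quantization); 16GB+ for a 7B model in FP16 (8GB+ with 4-bit quantization)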
nvidia-container-toolkit for Docker GPU passthrough
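
The VRAM figures above follow from simple arithmetic: weights take roughly (parameter count) x (bytes per parameter). The sketch below is a back-of-the-envelope estimate only; the function name is illustrative, and it deliberately ignores KV cache, activations, and CUDA overhead, so real requirements are higher.

```python
def weight_memory_gib(params_billion: float, bits_per_param: int) -> float:
    """Approximate GiB needed to hold the model weights alone.

    Assumption: every parameter is stored at the same precision.
    KV cache, activations, and runtime overhead are NOT included.
    """
    total_bytes = params_billion * 1e9 * bits_per_param / 8
    return total_bytes / 2**30


# Compare full precision with 4-bit quantization.
for label, params, bits in [
    ("70B, FP16", 70, 16),
    ("70B, 4-bit", 70, 4),
    ("7B, FP16", 7, 16),
]:
    print(f"{label}: {weight_memory_gib(params, bits):.1f} GiB")
```

By this estimate a 70B model needs about 130 GiB for FP16 weights, so it fits in roughly 40GB only when quantized to 4 bits.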

Quick Start

Install vLLM

pip install vllm

Serve a model (OpenAI-compatible API)

vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --api-key your-secret-key

Test the endpoint

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer your-secret-key" \
  -d '{ "model": "meta-llama/Llama-3.1-8B-Instruct", "messages": [{"role": "user", "content": "Hello!"}] }'
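
The same request can be issued from Python. This is a minimal sketch using only the standard library; the helper names `build_chat_request` and `send` are illustrative, and the host, key, and model simply mirror the curl example above.

```python
import json
import urllib.request


def build_chat_request(base_url, api_key, model, user_message):
    """Build (but do not send) a POST to the OpenAI-compatible chat endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def send(request, timeout_s=60):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Usage against a running server (commented out so the sketch has no side effects):
# req = build_chat_request("http://localhost:8000", "your-secret-key",
#                          "meta-llama/Llama-3.1-8B-Instruct", "Hello!")
# print(send(req))
```

Separating request construction from sending keeps the payload easy to inspect before any network call is made.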

Docker Deployment

docker run --runtime nvidia --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --api-key your-secret-key

Docker Compose (Production)

services:
  vllm:
    image: vllm/vllm-openai:latest
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
      - HUGGING_FACE_HUB_TOKEN=${HF_TOKEN}
    volumes:
      - model-cache:/root/.cache/huggingface
    ports:
      - "8000:8000"
    ipc: host
    command: >
      --model meta-llama/Llama-3.1-70B-Instruct
      --tensor-parallel-size 2
      --max-model-len 32768
      --gpu-memory-utilization 0.90
      --api-key ${VLLM_API_KEY}
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3

volumes:
  model-cache:
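
Deployment scripts often need to wait for the same /health endpoint the healthcheck polls, since loading a large model can take minutes. The following is one possible sketch; the function names, attempt count, and interval are assumptions, not vLLM defaults.

```python
import time
import urllib.error
import urllib.request


def http_health_check(url="http://localhost:8000/health", timeout_s=5):
    """Return True if the server answers the health endpoint with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or non-2xx: treat as "not ready yet".
        return False


def wait_until_healthy(check=http_health_check, attempts=60, interval_s=5.0,
                       sleep=time.sleep):
    """Poll `check` until it succeeds; return the attempt number that passed.

    Raises TimeoutError if every attempt fails. `check` and `sleep` are
    injectable so the loop can be exercised without a real server.
    """
    for attempt in range(1, attempts + 1):
        if check():
            return attempt
        if attempt < attempts:
            sleep(interval_s)
    raise TimeoutError(f"server not healthy after {attempts} attempts")


# Usage (commented out; requires a running server):
# wait_until_healthy()
```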

Key Configuration Options

Multi-GPU Tensor Parallelism

Split one model across 4 GPUs

vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.90
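
Not every GPU count is a usable tensor-parallel size: the model's attention heads have to divide evenly across the GPUs. The sketch below lists the candidate sizes for a given head count. The function name is illustrative, and the head count of 64 for a 70B Llama model is an assumption to be checked against the model's config.json.

```python
def valid_tensor_parallel_sizes(num_attention_heads: int, gpu_count: int) -> list[int]:
    """Return tensor-parallel sizes (up to gpu_count) that divide the head count."""
    return [
        tp_size
        for tp_size in range(1, gpu_count + 1)
        if num_attention_heads % tp_size == 0
    ]


# Assumed: 64 attention heads (verify in the model's config.json), 8 GPUs available.
print(valid_tensor_parallel_sizes(num_attention_heads=64, gpu_count=8))
```

Under these assumptions, sizes such as 3 or 6 are rejected, which is why deployments usually use powers of two.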

Quantization (Lower VRAM)

AWQ quantization (70B on 2x A100 40GB)

vllm serve casperhansen/llama-3-70b-instruct-awq \
  --quantization awq \
  --tensor-parallel-size 2

GPTQ quantization

vllm serve TheBloke/Llama-2-70B-Chat-GPTQ \
  --quantization gptq

FP8 (hardware-native on H100-class GPUs)

vllm serve meta-llama/Llama-3.1-405B-Instruct \
  --quantization fp8 \
  --tensor-parallel-size 8

Structured Output & Tools

vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --enable-auto-tool-choice \
  --tool-call-parser llama3_json \
  --guided-decoding-backend outlines
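
With tool calling enabled on the server, clients describe tools in the OpenAI `tools` format. The payload below is a hedged sketch: `get_weather` and its `city` parameter are hypothetical and exist only to show the shape of the request body.

```python
import json


def build_tool_call_payload(model, user_message):
    """Build a chat-completions body that offers one (hypothetical) tool."""
    weather_tool = {
        "type": "function",
        "function": {
            "name": "get_weather",  # hypothetical tool, for illustration only
            "description": "Look up the current weather for a city.",
            "parameters": {  # JSON Schema describing the tool's arguments
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "tools": [weather_tool],
        "tool_choice": "auto",  # let the model decide whether to call the tool
    }


payload = build_tool_call_payload(
    "meta-llama/Llama-3.1-8B-Instruct", "What is the weather in Paris?"
)
print(json.dumps(payload, indent=2))
```

POST this body to /v1/chat/completions; when the model chooses the tool, the call appears under `tool_calls` in the response message.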

LoRA Adapters

vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --enable-lora \
  --lora-modules sql-lora=/path/to/sql-lora \
  code-lora=/path/to/code-lora \
  --max-lora-rank 64
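
Each adapter registered through --lora-modules is addressed by its name: a client selects it by passing that name as the `model` field of the request. The sketch below parses the same name=path pairs and builds such a request body; both helper names are illustrative.

```python
def parse_lora_modules(specs):
    """Turn ["name=/path", ...] (the --lora-modules syntax) into a dict."""
    modules = {}
    for spec in specs:
        name, separator, path = spec.partition("=")
        if not separator or not name or not path:
            raise ValueError(f"expected name=path, got {spec!r}")
        modules[name] = path
    return modules


def build_lora_payload(modules, adapter_name, prompt, max_tokens=128):
    """Build a chat-completions body that routes to one registered adapter."""
    if adapter_name not in modules:
        raise KeyError(f"adapter {adapter_name!r} is not registered")
    return {
        "model": adapter_name,  # the adapter name, not the base model ID
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }


modules = parse_lora_modules(
    ["sql-lora=/path/to/sql-lora", "code-lora=/path/to/code-lora"]
)
print(build_lora_payload(modules, "sql-lora", "List all users created today."))
```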

Performance Tuning

Maximize throughput for batch workloads

# --max-num-seqs: max concurrent sequences
# --max-num-batched-tokens: tokens per batch
# --gpu-memory-utilization: use 95% of VRAM
# --swap-space: CPU swap (GiB)
vllm serve <model> \
  --max-num-seqs 256 \
  --max-num-batched-tokens 8192 \
  --gpu-memory-utilization 0.95 \
  --swap-space 4

Minimize latency for interactive use

vllm serve <model> \
  --max-num-seqs 32

Add --enforce-eager only to shorten startup or save memory: it disables CUDA graph capture, which usually slows token generation.

Benchmarking

Install vLLM (the benchmark tools ship with the package)

pip install vllm

Run an offline batch job (input in OpenAI batch JSONL format)

python -m vllm.entrypoints.openai.run_batch \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --input-file prompts.jsonl \
  --output-file results.jsonl
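
The batch runner reads one JSON request per line, in the OpenAI batch file layout. The sketch below generates such a file from a list of prompts. The helper names and the `request-N` ID scheme are illustrative; the accepted fields can vary by vLLM version.

```python
import json


def to_batch_lines(model, prompts):
    """Return one JSON line per prompt in the OpenAI batch request layout."""
    lines = []
    for index, prompt in enumerate(prompts, start=1):
        record = {
            "custom_id": f"request-{index}",  # echoed back in the results file
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": model,
                "messages": [{"role": "user", "content": prompt}],
            },
        }
        lines.append(json.dumps(record))
    return lines


def write_batch_file(path, model, prompts):
    """Write the batch requests to `path`, one JSON object per line."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(to_batch_lines(model, prompts)) + "\n")


# Usage (commented out to avoid writing files as a side effect):
# write_batch_file("prompts.jsonl", "meta-llama/Llama-3.1-8B-Instruct",
#                  ["Summarize PagedAttention.", "What is continuous batching?"])
```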

Benchmark with vllm bench

vllm bench throughput \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --num-prompts 1000 \
  --input-len 512 \
  --output-len 128

Monitoring

Check running server stats

curl http://localhost:8000/metrics # Prometheus metrics

Key metrics to watch:

vllm:num_requests_running - active requests

vllm:gpu_cache_usage_perc - KV cache utilization

vllm:generation_tokens_total - generated-token counter; take its rate for throughput

vllm:time_to_first_token_seconds - TTFT latency

vllm:e2e_request_latency_seconds - end-to-end latency
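
The /metrics endpoint returns Prometheus text format, which is simple enough to read in a script when no Prometheus server is available. The parser below is a simplified sketch, not a full implementation of the exposition format: it sums samples across label sets and assumes no trailing timestamps.

```python
def parse_prometheus_metrics(text, prefix="vllm:"):
    """Extract {metric_name: value} for samples whose name starts with prefix.

    Simplifications: labels are discarded, values for the same name are
    summed across label sets, and histogram series (_bucket, _sum, _count)
    are returned under their own names without further interpretation.
    """
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks and HELP/TYPE lines
            continue
        name_and_labels, _, raw_value = line.rpartition(" ")
        name = name_and_labels.split("{", 1)[0]
        if not name.startswith(prefix):
            continue
        try:
            values[name] = values.get(name, 0.0) + float(raw_value)
        except ValueError:
            continue  # not a numeric sample; ignore it
    return values


# Usage (commented out; requires a running server):
# import urllib.request
# with urllib.request.urlopen("http://localhost:8000/metrics") as response:
#     stats = parse_prometheus_metrics(response.read().decode("utf-8"))
# print(stats.get("vllm:num_requests_running"))
```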

Common Issues

Issue | Cause | Fix
CUDA out of memory | Model too large for VRAM | Serve an AWQ-quantized checkpoint with --quantization awq, or lower --max-model-len; reduce --gpu-memory-utilization if other processes share the GPU
Slow cold start | Model not cached | Pre-pull with huggingface-cli download <model>
Low throughput | Too few concurrent requests | Increase --max-num-seqs
KV cache full errors | Context length too long | Set --max-model-len lower
tokenizer error | Tokenizer mismatch | Use --tokenizer to specify the correct tokenizer
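
To see why a long context can exhaust the KV cache, it helps to estimate the cache size per token: two tensors (keys and values) per layer, each of size KV heads x head dimension. The sketch below does that arithmetic. The shape used (32 layers, 8 KV heads, head dimension 128, FP16) is an assumption for an 8B Llama-style model; check it against the model's config.json.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV cache one token occupies (keys + values across all layers)."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def kv_cache_gib(context_tokens, **model_shape):
    """GiB of KV cache needed to hold `context_tokens` tokens."""
    return context_tokens * kv_cache_bytes_per_token(**model_shape) / 2**30


# Assumed shape for an 8B Llama-style model with FP16 cache values.
llama_8b_shape = dict(num_layers=32, num_kv_heads=8, head_dim=128)

for tokens in (8192, 32768, 131072):
    print(f"{tokens} tokens: {kv_cache_gib(tokens, **llama_8b_shape):.1f} GiB")
```

Because the cost is linear in context length, halving --max-model-len halves the worst-case cache a single request can claim.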

Best Practices

Use --gpu-memory-utilization 0.90 to leave headroom for CUDA kernels.
Pin model versions with --revision for reproducible deployments.
Set HF_HUB_OFFLINE=1 in production to prevent unexpected downloads.
Use AWQ or GPTQ quantization before tensor parallelism: reduce VRAM needs first, then add GPUs.
Enable --enable-chunked-prefill for long-context workloads.
Monitor vllm:gpu_cache_usage_perc: above 95%, new requests start queuing.

Related Skills

llm-inference-scaling - Auto-scaling vLLM deployments
gpu-server-management - GPU driver setup
llm-gateway - Load balancing across vLLM instances
llm-cost-optimization - Cost management
model-serving-kubernetes - K8s deployment
