High-Performance Inference
Optimize LLM inference for production with vLLM 0.14.x, quantization, and speculative decoding.
vLLM 0.14.0 (Jan ): PyTorch 2.9.0, CUDA 12.9, AttentionConfig API, Python 3.12+ recommended.
Overview
- Deploying LLMs with low latency requirements
- Reducing GPU memory for larger models
- Maximizing throughput for batch inference
- Edge/mobile deployment with constrained resources
- Cost optimization through efficient hardware utilization
Quick Reference
# Basic vLLM server
vllm serve meta-llama/Meta-Llama-3.1-70B-Instruct \
  --tensor-parallel-size 4 \
  --max-model-len 8192

# With quantization + speculative decoding
# (--quantization awq expects a checkpoint that was already quantized with AWQ)
vllm serve meta-llama/Meta-Llama-3.1-70B-Instruct \
  --quantization awq \
  --speculative-config '{"method": "ngram", "num_speculative_tokens": 5}' \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.9
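When the server is launched from a deployment script, it can help to assemble the command programmatically so the JSON passed to `--speculative-config` is always well formed. The following is a minimal sketch that only uses the flags shown above; `build_serve_command` is a hypothetical helper, not part of vLLM.

```python
import json
import shlex


def build_serve_command(model, *, tensor_parallel_size=1, max_model_len=None,
                        quantization=None, speculative_config=None,
                        gpu_memory_utilization=None):
    """Return the argv list for a `vllm serve` invocation (hypothetical helper)."""
    argv = ["vllm", "serve", model,
            "--tensor-parallel-size", str(tensor_parallel_size)]
    if max_model_len is not None:
        argv += ["--max-model-len", str(max_model_len)]
    if quantization is not None:
        argv += ["--quantization", quantization]
    if speculative_config is not None:
        # The flag takes a JSON object, so serialize the dict in one place.
        argv += ["--speculative-config", json.dumps(speculative_config)]
    if gpu_memory_utilization is not None:
        argv += ["--gpu-memory-utilization", str(gpu_memory_utilization)]
    return argv


cmd = build_serve_command(
    "meta-llama/Meta-Llama-3.1-70B-Instruct",
    tensor_parallel_size=4,
    quantization="awq",
    speculative_config={"method": "ngram", "num_speculative_tokens": 5},
    gpu_memory_utilization=0.9,
)
# shlex.join quotes the JSON argument so the line can be pasted into a shell.
print(shlex.join(cmd))
```

Passing the list form to `subprocess.run(cmd)` avoids shell quoting issues entirely; the printed string is only for logging or copy-paste.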
vLLM 0.14.x Key Features
| Feature | Benefit |
|---------|---------|
| PagedAttention | Up to 24x throughput via efficient KV cache |
| Continuous Batching | Dynamic request batching for max utilization |
| CUDA Graphs | Fast model execution with graph capture |
| Tensor Parallelism | Scale across multiple GPUs |
| Prefix Caching | Reuse KV cache for shared prefixes |
| AttentionConfig | New API replacing VLLM_ATTENTION_BACKEND env |
| Semantic Router | vLLM SR v0.1 "Iris" for intelligent LLM routing |
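To make the prefix caching row concrete, here is a toy illustration of block-level KV reuse. It is not vLLM's implementation: `BLOCK_SIZE`, `block_hashes`, and `ToyPrefixCache` are hypothetical names, and the "cache" only tracks block hashes rather than real key/value tensors. The idea it sketches is that each block's hash covers the whole prefix up to that block, so two requests sharing a system prompt map to the same leading blocks.

```python
from hashlib import sha256

BLOCK_SIZE = 4  # tokens per KV block; illustrative value only


def block_hashes(token_ids, block_size=BLOCK_SIZE):
    """Hash each full block chained with the hash of everything before it."""
    hashes = []
    prev = b""
    full = len(token_ids) - len(token_ids) % block_size  # ignore a partial tail block
    for start in range(0, full, block_size):
        block = token_ids[start:start + block_size]
        prev = sha256(prev + repr(block).encode()).digest()
        hashes.append(prev)
    return hashes


class ToyPrefixCache:
    """Tracks which prefix blocks have been seen (stand-in for cached KV blocks)."""

    def __init__(self):
        self.cached = set()

    def admit(self, token_ids):
        """Return (reused_blocks, new_blocks) for one request."""
        reused = new = 0
        for h in block_hashes(token_ids):
            if h in self.cached:
                reused += 1  # KV for this block would not be recomputed
            else:
                new += 1
                self.cached.add(h)
        return reused, new


system_prompt = list(range(8))  # 8 shared tokens = 2 full blocks
cache = ToyPrefixCache()
first = cache.admit(system_prompt + [100, 101, 102, 103])
second = cache.admit(system_prompt + [200, 201, 202, 203])
print(first, second)  # (0, 3) (2, 1)
```

The second request reuses the two blocks covering the shared prefix and only computes the block that differs, which is where the savings for shared system prompts come from.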
Python vLLM Integration
from vllm import LLM, SamplingParams

# Initialize with optimization flags
# (quantization="awq" expects a checkpoint that was already quantized with AWQ)
llm = LLM(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    quantization="awq",
    tensor_parallel_size=2,
    gpu_memory_utilization=0.9,
    enable_prefix_caching=True,
)

# Sampling parameters
sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=1024,
)

# Generate (example prompt; replace with your own)
prompts = ["Explain PagedAttention in one paragraph."]
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)
Quantization Methods
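| Method | Bits | Memory Savings (vs FP16) | Speed (vs FP16) | Quality |
|--------|------|--------------------------|-----------------|---------|
| FP16 | 16 | Baseline | Baseline | Best |
| INT8 | 8 | 50% | +10-20% | Very Good |
| AWQ | 4 | 75% | +20-40% | Good |
| GPTQ | 4 | 75% | +15-30% | Good |
| FP8 | 8 | 50% | +30-50% | Very Good |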
When to Use Each:
- FP16: Maximum quality, sufficient memory
- INT8/FP8: Balance of quality and efficiency
- AWQ: Best 4-bit quality, activation-aware
- GPTQ: Faster quantization, good quality
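The memory savings column follows from simple arithmetic on bits per weight. The sketch below estimates weight memory only; it ignores quantization metadata (scales and zero points), the KV cache, and activations, so real footprints will be somewhat higher. The function names are illustrative.

```python
def weight_memory_gb(num_params_billion, bits_per_weight):
    """Approximate weight memory in GB (1 GB = 1e9 bytes), weights only."""
    return num_params_billion * bits_per_weight / 8


def savings_vs_fp16(bits_per_weight):
    """Fraction of weight memory saved relative to 16-bit weights."""
    return 1 - bits_per_weight / 16


for name, bits in [("FP16", 16), ("INT8/FP8", 8), ("AWQ/GPTQ", 4)]:
    gb = weight_memory_gb(70, bits)  # a 70B-parameter model as the example
    print(f"{name:9s} {gb:6.1f} GB  saves {savings_vs_fp16(bits):.0%}")
```

For a 70B-parameter model this gives 140 GB at 16 bits, 70 GB at 8 bits, and 35 GB at 4 bits, which is why 4-bit methods can fit such a model on far fewer GPUs.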
Speculative Decoding
Accelerate generation by drafting several tokens cheaply and letting the target model verify them in a single forward pass:
# N-gram based (no extra model)
speculative_config = {
    "method": "ngram",
    "num_speculative_tokens": 5,
    "prompt_lookup_max": 5,
    "prompt_lookup_min": 2,
}

# Draft model (higher quality)
speculative_config = {
    "method": "draft_model",
    "model": "meta-llama/Llama-3.2-1B-Instruct",  # draft model checkpoint
    "num_speculative_tokens": 3,
}
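To show what the n-gram method does conceptually, here is a small sketch of prompt-lookup drafting: find the most recent earlier occurrence of the last few tokens and propose whatever followed it. This is an illustration of the idea, not vLLM's code; `propose_ngram` is a hypothetical function whose parameter names simply mirror the config keys above.

```python
def propose_ngram(token_ids, num_speculative_tokens=5,
                  prompt_lookup_max=5, prompt_lookup_min=2):
    """Propose draft tokens by matching the trailing n-gram earlier in the context."""
    n_tokens = len(token_ids)
    # Prefer the longest n-gram: a longer match is a stronger signal.
    for n in range(prompt_lookup_max, prompt_lookup_min - 1, -1):
        if n_tokens <= n:
            continue
        suffix = token_ids[-n:]
        # Scan backwards so the most recent earlier occurrence wins.
        for start in range(n_tokens - n - 1, -1, -1):
            if token_ids[start:start + n] == suffix:
                end = start + n
                return token_ids[end:end + num_speculative_tokens]
    return []  # no match: fall back to normal one-token decoding


context = [1, 2, 3, 4, 5, 6, 7, 9, 2, 3, 4]
print(propose_ngram(context))                            # [5, 6, 7, 9, 2]
print(propose_ngram(context, num_speculative_tokens=3))  # [5, 6, 7]
print(propose_ngram([1, 2, 3, 4]))                       # []
```

The proposed tokens are only guesses; the target model still checks them, so output quality is unchanged and the benefit depends on how repetitive the context is (code, quoted documents, and structured output tend to match well).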
Expected Gains: 1.5-2.5x throughput for autoregressive tasks, depending on how many drafted tokens the target model accepts.
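A rough way to reason about such gains is the standard expected-tokens-per-step estimate for speculative decoding, which assumes each drafted token is accepted independently with the same probability. The sketch below computes that estimate; the acceptance rates used are illustrative inputs, not measurements, and the result is an upper bound on speedup because it ignores the cost of producing the drafts.

```python
def expected_tokens_per_step(acceptance_rate, num_speculative_tokens):
    """Expected tokens emitted per target-model forward pass.

    Assumes i.i.d. acceptance with probability `acceptance_rate`:
    (1 - a^(k+1)) / (1 - a) for k drafted tokens.
    """
    a, k = acceptance_rate, num_speculative_tokens
    if a >= 1.0:
        return float(k + 1)  # every draft accepted, plus one bonus token
    return (1 - a ** (k + 1)) / (1 - a)


for a in (0.0, 0.5, 0.8):
    print(f"acceptance={a:.1f}  tokens/step={expected_tokens_per_step(a, 5):.2f}")
```

With 5 drafted tokens, an acceptance rate of 0.5 yields roughly 2 tokens per step and 0.8 yields roughly 3.7, which illustrates why tuning `num_speculative_tokens` only pays off when the acceptance rate is high.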
Key Decisions
| Decision | Recommendation |
|----------|----------------|
| Quantization | AWQ for 4-bit, FP8 for H100/H200 |
| Batch size | Dynamic via continuous batching |
| GPU memory | 0.85-0.95 utilization |
| Parallelism | Tensor parallel across GPUs |
| KV cache | Enable prefix caching for shared contexts |
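The GPU memory recommendation can be sanity-checked with a back-of-the-envelope KV cache budget: whatever remains after the weights, within the utilization cap, is what the cache can use. The sketch below is an approximation under stated assumptions. The model shape (32 layers, 8 KV heads, head dimension 128, 16-bit cache) is assumed for an 8B Llama-style model, the 80 GB GPU and 16 GB of weights are example inputs, and activation and runtime overhead are lumped into an optional `overhead_gb`.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Bytes of KV cache per token: keys and values for every layer."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def kv_token_budget(gpu_mem_gb, gpu_memory_utilization, weights_gb,
                    bytes_per_token, overhead_gb=0.0):
    """Approximate number of tokens whose KV cache fits on one GPU."""
    usable_gb = gpu_mem_gb * gpu_memory_utilization - weights_gb - overhead_gb
    return max(0, int(usable_gb * 1e9 // bytes_per_token))


# Assumed shape for an 8B Llama-style model with a 16-bit KV cache.
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
print(per_token)  # 131072 bytes (128 KiB) per token

for util in (0.85, 0.90, 0.95):
    budget = kv_token_budget(80, util, weights_gb=16, bytes_per_token=per_token)
    print(f"utilization={util:.2f}  approx KV tokens={budget}")
```

Dividing the token budget by `--max-model-len` gives a rough count of full-length sequences that can be resident at once, which is a useful check before raising utilization toward the top of the range.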
Common Mistakes
- Using GPTQ without calibration data (poor quality)
- Over-allocating GPU memory (OOM on peak loads)
- Ignoring warmup requests (cold start latency)
- Not benchmarking actual workload patterns
- Mixing quantization with incompatible features
Performance Benchmarking
from vllm import LLM, SamplingParams
import time


def benchmark_throughput(llm, prompts, sampling_params, num_runs=3):
    """Benchmark tokens per second."""
    total_tokens = 0
    total_time = 0

    for _ in range(num_runs):
        start = time.perf_counter()
        outputs = llm.generate(prompts, sampling_params)
        elapsed = time.perf_counter() - start

        tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
        total_tokens += tokens
        total_time += elapsed

    return total_tokens / total_time  # tokens/sec
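Since cold starts are listed among the common mistakes, a latency benchmark would typically discard a few warmup runs and report percentiles rather than a single average. The following sketch does that for any zero-argument callable; `benchmark_latency` is a hypothetical helper, and the demo uses a trivial stand-in instead of a real model call so it runs anywhere.

```python
import statistics
import time


def benchmark_latency(run_once, num_warmup=2, num_runs=10):
    """Time `run_once()` after warmup; return p50/p95 latency in seconds."""
    for _ in range(num_warmup):
        run_once()  # discarded: absorbs one-time costs such as lazy initialization
    samples = []
    for _ in range(num_runs):
        start = time.perf_counter()
        run_once()
        samples.append(time.perf_counter() - start)
    samples.sort()
    # Nearest-rank style index for the 95th percentile.
    p95_index = min(len(samples) - 1, round(0.95 * (len(samples) - 1)))
    return {
        "p50": statistics.median(samples),
        "p95": samples[p95_index],
        "runs": len(samples),
    }


# Stand-in workload; with vLLM this could be
# lambda: llm.generate(prompts, sampling_params)
calls = []
stats = benchmark_latency(lambda: calls.append(1), num_warmup=2, num_runs=5)
print(stats["runs"], len(calls))  # 5 7
```

Reporting p95 alongside p50 makes tail latency visible, which matters more than mean throughput for interactive workloads.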
Advanced Patterns
See references/ for:
- vLLM Deployment: PagedAttention, batching, production config
- Quantization Guide: AWQ, GPTQ, INT8, FP8 comparison
- Speculative Decoding: Draft models, n-gram, throughput tuning
- Edge Deployment: Mobile, resource-constrained optimization
Related Skills
- llm-streaming - Streaming token responses
- function-calling - Tool use with inference
- ollama-local - Local inference with Ollama
- prompt-caching - Reduce redundant computation
- semantic-caching - Cache full responses
Capability Details
vllm-deployment
Keywords: vllm, inference server, deploy, serve, production
Solves:
- Deploy LLMs with vLLM for production
- Configure tensor parallelism and batching
- Optimize GPU memory utilization
quantization
Keywords: quantize, AWQ, GPTQ, INT8, FP8, compress, reduce memory
Solves:
- Reduce model memory footprint
- Choose appropriate quantization method
- Maintain quality with lower precision
speculative-decoding
Keywords: speculative, draft model, faster generation, predict tokens
Solves:
- Accelerate autoregressive generation
- Configure draft models or n-gram speculation
- Tune speculative token count
edge-inference
Keywords: edge, mobile, embedded, constrained, optimization
Solves:
- Deploy on resource-constrained devices
- Optimize for mobile/edge hardware
- Balance quality and resource usage
throughput-optimization
Keywords: throughput, latency, performance, benchmark, optimize
Solves:
- Maximize requests per second
- Reduce time to first token
- Benchmark and tune performance