
High-Performance Inference

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Copy the following command and send it to your AI assistant to install the skill:

npx skills add yonatangross/orchestkit/yonatangross-orchestkit-high-performance-inference


Optimize LLM inference for production with vLLM 0.14.x, quantization, and speculative decoding.

vLLM 0.14.0 (Jan ) ships with PyTorch 2.9.0 and CUDA 12.9, introduces the AttentionConfig API, and recommends Python 3.12+.

Overview

Use this skill when:

  • Deploying LLMs with low latency requirements

  • Reducing GPU memory for larger models

  • Maximizing throughput for batch inference

  • Edge/mobile deployment with constrained resources

  • Cost optimization through efficient hardware utilization

Quick Reference

Basic vLLM server

vllm serve meta-llama/Meta-Llama-3.1-70B-Instruct \
  --tensor-parallel-size 4 \
  --max-model-len 8192

With quantization + speculative decoding

vllm serve meta-llama/Meta-Llama-3.1-70B-Instruct \
  --quantization awq \
  --speculative-config '{"method": "ngram", "num_speculative_tokens": 5}' \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.9
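
The server exposes an OpenAI-compatible HTTP API. A minimal client sketch, assuming the default local port 8000; the request is built locally and the actual POST is left commented out, since it needs a running server:

```python
import json

# Assumed endpoint for a local vLLM server (default port 8000).
BASE_URL = "http://localhost:8000/v1/chat/completions"

# Build an OpenAI-compatible chat completion request.
payload = {
    "model": "meta-llama/Meta-Llama-3.1-70B-Instruct",
    "messages": [
        {"role": "user", "content": "Summarize PagedAttention in one sentence."}
    ],
    "max_tokens": 128,
    "temperature": 0.7,
}

body = json.dumps(payload)
print(body)

# To actually send it (requires the server above to be running):
#   import urllib.request
#   req = urllib.request.Request(
#       BASE_URL, data=body.encode(),
#       headers={"Content-Type": "application/json"})
#   print(urllib.request.urlopen(req).read().decode())
```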

vLLM 0.14.x Key Features

| Feature | Benefit |
| --- | --- |
| PagedAttention | Up to 24x throughput via efficient KV cache |
| Continuous Batching | Dynamic request batching for max utilization |
| CUDA Graphs | Fast model execution with graph capture |
| Tensor Parallelism | Scale across multiple GPUs |
| Prefix Caching | Reuse KV cache for shared prefixes |
| AttentionConfig | New API replacing the VLLM_ATTENTION_BACKEND env var |
| Semantic Router | vLLM SR v0.1 "Iris" for intelligent LLM routing |

Python vLLM Integration

from vllm import LLM, SamplingParams

# Initialize with optimization flags
llm = LLM(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    quantization="awq",
    tensor_parallel_size=2,
    gpu_memory_utilization=0.9,
    enable_prefix_caching=True,
)

# Sampling parameters
sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=1024,
)

# Generate
prompts = ["Explain PagedAttention in one paragraph."]
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)

Quantization Methods

| Method | Bits | Memory Savings | Speed | Quality |
| --- | --- | --- | --- | --- |
| FP16 | 16 | Baseline | Baseline | Best |
| INT8 | 8 | 50% | +10-20% | Very Good |
| AWQ | 4 | 75% | +20-40% | Good |
| GPTQ | 4 | 75% | +15-30% | Good |
| FP8 | 8 | 50% | +30-50% | Very Good |

When to Use Each:

  • FP16: Maximum quality, sufficient memory

  • INT8/FP8: Balance of quality and efficiency

  • AWQ: Best 4-bit quality, activation-aware

  • GPTQ: Faster quantization, good quality
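
The memory-savings column follows directly from bits per weight. A quick weight-only sizing sketch (approximate; excludes KV cache and activation memory):

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate weight-only memory footprint in gigabytes."""
    return num_params * bits_per_weight / 8 / 1e9

params_70b = 70e9
fp16 = weight_memory_gb(params_70b, 16)  # 140 GB
awq4 = weight_memory_gb(params_70b, 4)   # 35 GB, the 75% savings in the table

print(f"FP16: {fp16:.0f} GB, AWQ 4-bit: {awq4:.0f} GB, "
      f"savings: {1 - awq4 / fp16:.0%}")
```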

Speculative Decoding

Accelerate generation by predicting multiple tokens:

N-gram based (no extra model)

speculative_config = {
    "method": "ngram",
    "num_speculative_tokens": 5,
    "prompt_lookup_max": 5,
    "prompt_lookup_min": 2,
}
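
To see what prompt lookup does, here is a toy version of the idea in plain Python (an illustration, not vLLM's implementation): match the tail of the sequence against earlier text and propose the tokens that followed the match.

```python
def propose_ngram(prompt, generated, lookup_min=2, lookup_max=5,
                  num_speculative=5):
    """Propose draft tokens by matching the tail of the sequence
    against an earlier occurrence in the prompt + generated text."""
    context = prompt + generated
    for n in range(lookup_max, lookup_min - 1, -1):  # prefer longest match
        if len(context) < n:
            continue
        tail = context[-n:]
        # Search backwards for the most recent earlier occurrence of the tail.
        for start in range(len(context) - n - 1, -1, -1):
            if context[start:start + n] == tail:
                follow = context[start + n:start + n + num_speculative]
                if follow:
                    return follow
    return []  # no match: fall back to normal decoding

tokens = "the cat sat on the mat and the cat".split()
# Tail "the cat" occurred earlier, so the tokens after it are proposed:
print(propose_ngram(tokens, []))  # ['sat', 'on', 'the', 'mat', 'and']
```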

Draft model (higher quality)

speculative_config = {
    "method": "draft_model",
    "draft_model": "meta-llama/Llama-3.2-1B-Instruct",
    "num_speculative_tokens": 3,
}

Expected Gains: 1.5-2.5x throughput for autoregressive tasks.
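
These gains can be sanity-checked against a simplified analytical model, assuming an independent per-token acceptance probability alpha, k speculative tokens, and one bonus token per verification step; draft_cost is the draft model's assumed cost relative to the target (0 for n-gram lookup). Real speedups depend on the workload.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens emitted per target-model forward pass, with k
    speculative tokens accepted independently with probability alpha."""
    return (1 - alpha ** (k + 1)) / (1 - alpha)

def speedup(alpha: float, k: int, draft_cost: float = 0.0) -> float:
    """Throughput speedup vs. plain decoding (one token per pass)."""
    return expected_tokens_per_step(alpha, k) / (1 + k * draft_cost)

# alpha=0.7, k=5, free drafts (n-gram): ~2.9x under this model
print(round(speedup(0.7, 5), 2))
```

A draft model eats into the gain through its own cost, which is why small drafts (e.g. a 1B model against a 70B target) are preferred.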

Key Decisions

| Decision | Recommendation |
| --- | --- |
| Quantization | AWQ for 4-bit, FP8 for H100/H200 |
| Batch size | Dynamic via continuous batching |
| GPU memory | 0.85-0.95 utilization |
| Parallelism | Tensor parallel across GPUs |
| KV cache | Enable prefix caching for shared contexts |
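
Most of that 0.85-0.95 budget goes to KV cache once weights are loaded. A rough sizing sketch, using the published Llama-3.1-8B dimensions (32 layers, 8 KV heads via GQA, head dim 128) and a hypothetical 20 GB cache budget:

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """Bytes of KV cache per token: keys + values across all layers."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes

per_token = kv_bytes_per_token(layers=32, kv_heads=8, head_dim=128)  # fp16
print(per_token)  # 131072 bytes = 128 KiB per token

budget_gb = 20  # hypothetical KV budget left after weights
max_tokens = budget_gb * 1024**3 // per_token
print(max_tokens)  # total cacheable tokens across all concurrent sequences
```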

Common Mistakes

  • Using GPTQ without calibration data (poor quality)

  • Over-allocating GPU memory (OOM on peak loads)

  • Ignoring warmup requests (cold start latency)

  • Not benchmarking actual workload patterns

  • Mixing quantization with incompatible features
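
The warmup mistake is easy to avoid with a generic helper; a sketch assuming any callable that takes a prompt (e.g. a thin wrapper around llm.generate or an HTTP call to the server):

```python
import time

def warmup(generate, num_requests: int = 3, prompt: str = "Hello") -> float:
    """Run a few short requests before accepting traffic, so CUDA graph
    capture and cache allocation don't land on the first real user.
    Returns total warmup time in seconds."""
    start = time.perf_counter()
    for _ in range(num_requests):
        generate(prompt)
    return time.perf_counter() - start

# Stub standing in for the real generate callable:
calls = []
elapsed = warmup(lambda p: calls.append(p), num_requests=3)
print(len(calls))  # 3
```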

Performance Benchmarking

import time

from vllm import LLM, SamplingParams

def benchmark_throughput(llm, prompts, sampling_params, num_runs=3):
    """Benchmark tokens per second."""
    total_tokens = 0
    total_time = 0

    for _ in range(num_runs):
        start = time.perf_counter()
        outputs = llm.generate(prompts, sampling_params)
        elapsed = time.perf_counter() - start

        tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
        total_tokens += tokens
        total_time += elapsed

    return total_tokens / total_time  # tokens/sec
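
Throughput alone hides tail latency; a small nearest-rank percentile helper for per-request timings, usable alongside the benchmark above (the sample latencies here are made up):

```python
def percentile(latencies, p):
    """Nearest-rank percentile of a list of latencies in seconds."""
    ordered = sorted(latencies)
    rank = max(0, min(len(ordered) - 1, round(p / 100 * (len(ordered) - 1))))
    return ordered[rank]

# One slow outlier dominates p99 but barely moves the median:
lat = [0.12, 0.15, 0.11, 0.95, 0.13, 0.14, 0.12, 0.16, 0.13, 0.14]
print(percentile(lat, 50), percentile(lat, 99))  # 0.13 0.95
```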

Advanced Patterns

See references/ for:

  • vLLM Deployment: PagedAttention, batching, production config

  • Quantization Guide: AWQ, GPTQ, INT8, FP8 comparison

  • Speculative Decoding: Draft models, n-gram, throughput tuning

  • Edge Deployment: Mobile, resource-constrained optimization

Related Skills

  • llm-streaming: Streaming token responses

  • function-calling: Tool use with inference

  • ollama-local: Local inference with Ollama

  • prompt-caching: Reduce redundant computation

  • semantic-caching: Cache full responses

Capability Details

vllm-deployment

Keywords: vllm, inference server, deploy, serve, production

Solves:

  • Deploy LLMs with vLLM for production

  • Configure tensor parallelism and batching

  • Optimize GPU memory utilization

quantization

Keywords: quantize, AWQ, GPTQ, INT8, FP8, compress, reduce memory

Solves:

  • Reduce model memory footprint

  • Choose appropriate quantization method

  • Maintain quality with lower precision

speculative-decoding

Keywords: speculative, draft model, faster generation, predict tokens

Solves:

  • Accelerate autoregressive generation

  • Configure draft models or n-gram speculation

  • Tune speculative token count

edge-inference

Keywords: edge, mobile, embedded, constrained, optimization

Solves:

  • Deploy on resource-constrained devices

  • Optimize for mobile/edge hardware

  • Balance quality and resource usage

throughput-optimization

Keywords: throughput, latency, performance, benchmark, optimize

Solves:

  • Maximize requests per second

  • Reduce time to first token

  • Benchmark and tune performance

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.
