# LLM Serving Patterns
## When to Use This Skill

Use this skill when:

- Designing LLM inference infrastructure
- Choosing between serving frameworks (vLLM, TGI, TensorRT-LLM)
- Implementing quantization for production deployment
- Optimizing batching and throughput
- Building streaming response systems
- Scaling LLM deployments cost-effectively
**Keywords:** LLM serving, inference, vLLM, TGI, TensorRT-LLM, quantization, INT8, INT4, FP16, batching, continuous batching, streaming, SSE, WebSocket, KV cache, PagedAttention, speculative decoding
## LLM Serving Architecture Overview

```
                Clients (API, Chat UI, Agents)
                              │
                              ▼
┌────────────────────────────────────────────────────────────┐
│ Load Balancer / API Gateway                                │
│ • Rate limiting  • Authentication  • Request routing       │
└────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌────────────────────────────────────────────────────────────┐
│ Inference Server                                           │
│                                                            │
│ Request Queue ──▶ Batching Engine ──▶ KV Cache Management  │
│                            │                               │
│                            ▼                               │
│ Model Execution Engine                                     │
│ • Tensor operations  • Attention  • Token sampling         │
└────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌────────────────────────────────────────────────────────────┐
│ GPU/TPU Cluster                                            │
│ • Model sharding • Tensor parallelism • Pipeline parallel  │
└────────────────────────────────────────────────────────────┘
```
## Serving Framework Comparison

| Framework | Strengths | Best For | Considerations |
|-----------|-----------|----------|----------------|
| vLLM | PagedAttention, high throughput, continuous batching | General LLM serving, high concurrency | Python-native, active community |
| TGI (Text Generation Inference) | Production-ready, Hugging Face integration | Enterprise deployment, HF models | Rust backend, Docker-first |
| TensorRT-LLM | NVIDIA optimization, lowest latency | NVIDIA GPUs, latency-critical | NVIDIA-only, complex setup |
| Triton Inference Server | Multi-model, multi-framework | Heterogeneous model serving | Enterprise complexity |
| Ollama | Simple local deployment | Development, edge deployment | Limited scaling features |
| llama.cpp | CPU inference, quantization | Resource-constrained, edge | C++ integration required |
### Framework Selection Decision Tree

```
Need lowest latency on NVIDIA GPUs?
├── Yes → TensorRT-LLM
└── No
    └── Need high throughput with many concurrent users?
        ├── Yes → vLLM (PagedAttention)
        └── No
            └── Need enterprise features + HF integration?
                ├── Yes → TGI
                └── No
                    └── Simple local/edge deployment?
                        ├── Yes → Ollama or llama.cpp
                        └── No → vLLM (general purpose)
```
## Quantization Techniques

### Precision Levels

| Precision | Bits | Memory Reduction | Quality Impact | Use Case |
|-----------|------|------------------|----------------|----------|
| FP32 | 32 | Baseline | None | Training, reference |
| FP16/BF16 | 16 | 2x | Minimal | Standard serving |
| INT8 | 8 | 4x | Low | Production serving |
| INT4 | 4 | 8x | Moderate | Resource-constrained |
| INT2 | 2 | 16x | Significant | Experimental |
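To make the memory-reduction column concrete, here is a back-of-the-envelope weight-memory calculation (a sketch only; a real deployment also needs headroom for the KV cache and activations):

```python
def weight_memory_gb(num_params: float, bits: int) -> float:
    """Approximate weight memory at a given precision."""
    return num_params * bits / 8 / 1e9  # params x bytes/param -> GB

# A 70B-parameter model as a representative example
for bits in (32, 16, 8, 4):
    print(f"{bits:>2}-bit: {weight_memory_gb(70e9, bits):6.1f} GB")
# 32-bit: 280.0 GB | 16-bit: 140.0 GB | 8-bit: 70.0 GB | 4-bit: 35.0 GB
```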
### Quantization Methods

| Method | Description | Quality | Speed |
|--------|-------------|---------|-------|
| PTQ (Post-Training Quantization) | Quantize after training, no retraining | Good | Fast to apply |
| QAT (Quantization-Aware Training) | Simulate quantization during training | Better | Requires training |
| GPTQ | One-shot weight quantization | Very good | Moderate |
| AWQ (Activation-aware Weight Quantization) | Preserves salient weights | Excellent | Moderate |
| GGUF/GGML | llama.cpp format, CPU-optimized | Good | Very fast inference |
| SmoothQuant | Migrates quantization difficulty from activations to weights | Excellent | Moderate |
### Quantization Selection

Quality vs. efficiency trade-off:

```
Quality ────────────────────────────────────────────────────▶ Efficiency

FP32 ──── FP16 ──── INT8+AWQ ──── INT8+GPTQ ──── INT4 ──── INT2
Best      Great     Good          Good           Fair      Poor
```
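As a concrete starting point, the sketch below loads a checkpoint with INT8 post-training quantization via Hugging Face transformers and bitsandbytes. The model name is a placeholder, and exact options vary across library versions:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder: any causal LM checkpoint

# PTQ-style INT8: weights quantized at load time, no retraining required
quant_config = BitsAndBytesConfig(load_in_8bit=True)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",          # place shards on available GPUs
    torch_dtype=torch.float16,  # non-quantized modules stay FP16
)

inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=8)[0]))
```

Prequantized GPTQ and AWQ checkpoints published on the Hub typically load through the same `from_pretrained` call once the corresponding backend library is installed.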
## Batching Strategies

### Static Batching

```
Request 1: [tokens: 100] ─┐
Request 2: [tokens:  50] ─┼──▶ [Batch: pad to 100] ──▶ Process ──▶ All complete
Request 3: [tokens:  80] ─┘
```

Problem: short requests wait for long ones (head-of-line blocking).
### Continuous Batching (Preferred)

```
Time ──────────────────────────────────────────────────────────▶

Req 1: [████████████████████████████████] ──▶ Complete
Req 2: [████████████] ──▶ Complete ──▶ Req 4 starts [████████████████]
Req 3: [████████████████████] ──▶ Complete ──▶ Req 5 starts [████████]
```

- New requests join the batch as others complete
- No padding waste
- Optimal GPU utilization
### Batching Parameters

| Parameter | Description | Trade-off |
|-----------|-------------|-----------|
| `max_batch_size` | Maximum concurrent requests | Memory vs. throughput |
| `max_waiting_tokens` | Tokens to wait before forcing a batch | Latency vs. throughput |
| `max_num_seqs` | Maximum sequences in a batch | Memory vs. concurrency |
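For illustration, here is how these knobs surface in vLLM's offline engine (argument names follow vLLM's conventions and may shift between versions; the checkpoint is a placeholder):

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-2-7b-hf",  # placeholder checkpoint
    max_num_seqs=256,             # max sequences in the running batch
    max_num_batched_tokens=8192,  # token budget per scheduler step
    gpu_memory_utilization=0.90,  # VRAM fraction for weights + KV cache
)

params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Explain continuous batching in one sentence."], params)
print(outputs[0].outputs[0].text)
```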
## KV Cache Management

### The KV Cache Problem

Attention(Q, K, V) = softmax(QK^T / √d_k) · V

For each token generated:

- The new token's query must attend to the keys and values of ALL previous tokens
- Caching K and V avoids recomputing them, but the cache grows with sequence length
- Memory: O(batch_size × seq_len × num_layers × hidden_dim)

Example (70B model, 4K context):

- KV cache per request: ~8GB
- 10 concurrent requests: ~80GB of GPU memory
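That formula translates into a quick estimator (a sketch; the ~8GB figure above is in the same ballpark as a 70B-class model with full multi-head attention, and GQA/MQA shrink `num_kv_heads` substantially):

```python
def kv_cache_gb(num_layers: int, num_kv_heads: int, head_dim: int,
                seq_len: int, batch: int = 1, bytes_per_elem: int = 2) -> float:
    """KV cache size: 2 (K and V) x layers x KV heads x head_dim x tokens."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch * bytes_per_elem / 1e9

# 70B-class model: 80 layers, 64 heads x 128 dim, FP16, 4K context
print(f"{kv_cache_gb(80, 64, 128, 4096):.1f} GB")  # ~10.7 GB per request
# The same model with GQA (8 KV heads) needs ~8x less
print(f"{kv_cache_gb(80, 8, 128, 4096):.1f} GB")   # ~1.3 GB per request
```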
### PagedAttention (vLLM Innovation)

```
Traditional KV cache:

┌──────────────────────────────────────────┐
│ Request 1 KV Cache (contiguous, fixed)   │  ← wastes memory
├──────────────────────────────────────────┤
│ Request 2 KV Cache (contiguous, fixed)   │
├──────────────────────────────────────────┤
│ FRAGMENTED/WASTED SPACE                  │
└──────────────────────────────────────────┘
```

```
PagedAttention:

┌────┬────┬────┬────┬────┬────┬────┬────┐
│ R1 │ R2 │ R1 │ R3 │ R2 │ R1 │ R3 │ R2 │  ← pages allocated on demand
└────┴────┴────┴────┴────┴────┴────┴────┘
```

- Non-contiguous memory allocation
- Near-zero memory waste
- 2-4x higher throughput
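The sketch below illustrates the block-table idea with a toy allocator (illustrative only, not vLLM's actual implementation): pages come from one shared pool, a request holds only the blocks it has actually filled, and finished requests return their pages immediately:

```python
class PagedKVCache:
    """Toy block-table allocator in the spirit of PagedAttention."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size             # tokens per physical page
        self.free = list(range(num_blocks))      # shared free-page pool
        self.tables: dict[str, list[int]] = {}   # request id -> page list
        self.lengths: dict[str, int] = {}        # request id -> tokens stored

    def append_token(self, req: str) -> tuple[int, int]:
        """Reserve a KV slot for one new token; returns (page, offset)."""
        n = self.lengths.get(req, 0)
        table = self.tables.setdefault(req, [])
        if n % self.block_size == 0:             # current page full: grab another
            if not self.free:
                raise MemoryError("KV cache exhausted: preempt or swap a request")
            table.append(self.free.pop())
        self.lengths[req] = n + 1
        return table[-1], n % self.block_size

    def release(self, req: str) -> None:
        """Return a finished request's pages to the pool for reuse."""
        self.free.extend(self.tables.pop(req, []))
        self.lengths.pop(req, None)
```

Because allocation happens per 16-token page rather than per maximum context length, the only waste is the unfilled tail of each request's last page.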
### KV Cache Optimization Strategies

| Strategy | Description | Memory Savings |
|----------|-------------|----------------|
| PagedAttention | Virtual memory for the KV cache | ~50% reduction |
| Prefix Caching | Reuse KV cache for common prefixes | System prompt: 100% |
| Quantized KV Cache | INT8/FP8 for KV values | 50-75% reduction |
| Sliding Window | Limited attention context | Linear memory |
| MQA/GQA | Grouped query attention | Architecture-dependent |
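As one example, prefix caching is a single engine flag in vLLM (flag name per vLLM's engine arguments; availability depends on version, and the checkpoint is a placeholder):

```python
from vllm import LLM

# KV blocks for a shared prefix (e.g. the system prompt) are hashed and
# reused across requests instead of being recomputed for each one
llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_prefix_caching=True)
```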
## Streaming Response Patterns

### Server-Sent Events (SSE)

```
Client                                  Server
  │                                        │
  │──── POST /v1/chat/completions ────────▶│
  │     (stream: true)                     │
  │                                        │
  │◀──── HTTP 200 OK ──────────────────────│
  │      Content-Type: text/event-stream   │
  │                                        │
  │◀──── data: {"token": "Hello"} ─────────│
  │◀──── data: {"token": " world"} ────────│
  │◀──── data: {"token": "!"} ─────────────│
  │◀──── data: [DONE] ─────────────────────│
  │                                        │
```

SSE benefits:

- HTTP/1.1 compatible
- Auto-reconnection support
- Simple to implement
- Wide client support
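A minimal SSE endpoint sketch using FastAPI; the token generator is a stand-in for a real inference engine, and the route name simply mirrors the diagram above:

```python
import asyncio
import json

from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

async def fake_tokens():
    """Stand-in for an inference engine's token stream."""
    for tok in ("Hello", " world", "!"):
        await asyncio.sleep(0.05)  # simulate per-token decode latency
        yield tok

@app.post("/v1/chat/completions")
async def stream_completion():
    async def event_stream():
        async for tok in fake_tokens():
            yield f"data: {json.dumps({'token': tok})}\n\n"
        yield "data: [DONE]\n\n"  # end-of-stream sentinel
    return StreamingResponse(event_stream(), media_type="text/event-stream")
```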
### WebSocket Streaming

```
Client                                  Server
  │                                        │
  │──── WebSocket Upgrade ────────────────▶│
  │◀──── 101 Switching Protocols ──────────│
  │                                        │
  │──── {"prompt": "Hello"} ──────────────▶│
  │                                        │
  │◀──── {"token": "Hi"} ──────────────────│
  │◀──── {"token": " there"} ──────────────│
  │◀──── {"token": "!"} ───────────────────│
  │◀──── {"done": true} ───────────────────│
  │                                        │
```

WebSocket benefits:

- Bidirectional communication
- Lower latency
- Better for chat applications
- Connection persistence
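The same flow over WebSockets, again sketched with FastAPI and a stand-in token source:

```python
from fastapi import FastAPI, WebSocket

app = FastAPI()

@app.websocket("/ws/generate")
async def generate(ws: WebSocket):
    await ws.accept()
    request = await ws.receive_json()   # e.g. {"prompt": "Hello"}
    # A real server would feed request["prompt"] to the inference engine;
    # here the tokens are canned to keep the sketch self-contained.
    for tok in ("Hi", " there", "!"):
        await ws.send_json({"token": tok})
    await ws.send_json({"done": True})
    await ws.close()
```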
### Streaming Implementation Considerations

| Aspect | SSE | WebSocket |
|--------|-----|-----------|
| Reconnection | Built-in | Manual |
| Scalability | Per-request | Connection pool |
| Load Balancing | Standard HTTP | Sticky sessions |
| Firewall/Proxy | Usually works | May need config |
| Best For | One-way streaming | Interactive chat |
## Speculative Decoding

### Concept

```
Standard decoding:

Large Model: [T1] → [T2] → [T3] → [T4] → [T5]
             10ms   10ms   10ms   10ms   10ms  = 50ms total
```

```
Speculative decoding:

Draft Model:  [T1, T2, T3, T4, T5]   (drafted in parallel, 5ms)
                        │
                        ▼
Large Model:  [Verify T1-T5 in one pass]   (15ms)

              Accept: T1, T2, T3 ✓   Reject: T4, T5 ✗
                        │
                        ▼
              [Generate T4, T5 correctly]
```

Total: ~25ms vs. 50ms (2x speedup at a 60% acceptance rate)
### Speculative Decoding Trade-offs

| Factor | Impact |
|--------|--------|
| Draft model quality | Higher match rate = more speedup |
| Draft model size | Larger = better quality, slower |
| Speculation depth | More tokens = higher risk/reward |
| Verification cost | Must be < sequential generation |
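A toy draft-and-verify round that mirrors the diagram. Both models are stand-in functions; a real implementation scores every draft position in one batched forward pass and uses rejection sampling to keep the output distribution exact:

```python
def speculative_step(prefix, draft_model, target_model, k=5):
    """Draft k tokens cheaply, keep the longest target-agreed run plus one fix."""
    draft = []
    for _ in range(k):                       # cheap autoregressive drafting
        draft.append(draft_model(prefix + draft))

    accepted = []
    for tok in draft:                        # verification (one pass in practice)
        if target_model(prefix + accepted) == tok:
            accepted.append(tok)             # draft matched: token is "free"
        else:
            accepted.append(target_model(prefix + accepted))  # correct and stop
            break
    return accepted

# Stand-in "models": each predicts the next word of a canned sentence
sentence = "the quick brown fox jumps over the lazy dog".split()
target_fn = lambda ctx: sentence[len(ctx)]
draft_fn = lambda ctx: sentence[len(ctx)] if len(ctx) < 3 else "banana"

print(speculative_step([], draft_fn, target_fn))  # ['the', 'quick', 'brown', 'fox']
```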
## Scaling Strategies

### Horizontal Scaling

```
┌─────────────────────────────────────────┐
│              Load Balancer              │
│    (Round-robin, Least-connections)     │
└─────────────────────────────────────────┘
        │            │            │
        ▼            ▼            ▼
   ┌─────────┐  ┌─────────┐  ┌─────────┐
   │  vLLM   │  │  vLLM   │  │  vLLM   │
   │ Node 1  │  │ Node 2  │  │ Node 3  │
   │ (GPU×4) │  │ (GPU×4) │  │ (GPU×4) │
   └─────────┘  └─────────┘  └─────────┘
```
### Model Parallelism

| Strategy | Description | Use Case |
|----------|-------------|----------|
| Tensor Parallelism | Split each layer's weights across GPUs | Single large model |
| Pipeline Parallelism | Different layers on different GPUs | Very large models |
| Data Parallelism | Same model, different batches | High throughput |

```
Tensor Parallelism (TP=4):

┌─────────────────────────────────────────┐
│                 Layer N                 │
│   GPU0   │   GPU1   │   GPU2   │  GPU3  │
│   25%    │   25%    │   25%    │  25%   │
└─────────────────────────────────────────┘
```

```
Pipeline Parallelism (PP=4):

GPU0: Layers 0-7
GPU1: Layers 8-15
GPU2: Layers 16-23
GPU3: Layers 24-31
```
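In vLLM, both strategies are engine arguments (a sketch; `tensor_parallel_size` must match the GPUs visible to the process, and the checkpoint is a placeholder):

```python
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-2-70b-hf",  # placeholder: too large for one GPU
    tensor_parallel_size=4,             # TP=4, as in the diagram above
    # pipeline_parallel_size=2,         # optionally add PP across nodes
)
```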
## Latency Optimization Checklist

### Pre-deployment

- Choose appropriate quantization (INT8 for production)
- Enable continuous batching
- Configure KV cache size appropriately
- Set optimal batch size for hardware
- Enable prefix caching for system prompts
### Runtime

- Monitor GPU memory utilization
- Track p50/p95/p99 latencies
- Measure time-to-first-token (TTFT); see the sketch after this list
- Monitor tokens-per-second (TPS)
- Set appropriate timeouts
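TTFT and TPS can be measured from outside the server with a plain streaming request. This sketch assumes an OpenAI-compatible endpoint; the URL and payload are placeholders, and it approximates one token per SSE chunk:

```python
import time
import requests

url = "http://localhost:8000/v1/completions"  # placeholder endpoint
payload = {"model": "my-model", "prompt": "Hello", "max_tokens": 64, "stream": True}

start = time.perf_counter()
first_token_at = None
chunks = 0

with requests.post(url, json=payload, stream=True, timeout=60) as resp:
    for line in resp.iter_lines():
        if not line or line == b"data: [DONE]":
            continue
        chunks += 1                               # ~1 token per SSE chunk
        if first_token_at is None:
            first_token_at = time.perf_counter()  # time-to-first-token

total = time.perf_counter() - start
print(f"TTFT: {(first_token_at - start) * 1000:.0f} ms")
print(f"TPS:  {chunks / total:.1f} tokens/s")
```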
### Infrastructure

- Use the fastest available interconnect (NVLink, InfiniBand)
- Minimize network hops
- Place inference close to users (edge)
- Consider dedicated inference hardware
## Cost Optimization

### Cost Drivers

| Factor | Impact | Optimization |
|--------|--------|--------------|
| GPU hours | Highest | Quantization, batching |
| Memory | High | PagedAttention, KV cache optimization |
| Network | Medium | Response compression, edge deployment |
| Storage | Low | Model deduplication |
### Cost Estimation Formula

```
Monthly cost = (requests/month × avg tokens/request × GPU-seconds/token × $/GPU-hour) / 3600
```

Example:

- 10M requests/month
- 500 tokens per request on average
- 0.001 GPU-seconds/token (optimized)
- $2/GPU-hour

Cost = (10M × 500 × 0.001 × 2) / 3600 ≈ $2,778/month
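The same formula as a small helper, reproducing the example above:

```python
def monthly_cost(requests: float, tokens_per_req: float,
                 gpu_sec_per_token: float, price_per_gpu_hour: float) -> float:
    """Monthly cost = total GPU-seconds x hourly price / 3600 s per hour."""
    gpu_seconds = requests * tokens_per_req * gpu_sec_per_token
    return gpu_seconds * price_per_gpu_hour / 3600

print(f"${monthly_cost(10e6, 500, 0.001, 2.0):,.0f}/month")  # $2,778/month
```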
## Common Patterns

### Multi-model Routing

```
┌─────────────────────────────────────────┐
│                 Router                  │
│  • Classify request complexity          │
│  • Route to the appropriate model       │
└─────────────────────────────────────────┘
        │            │            │
        ▼            ▼            ▼
   ┌─────────┐  ┌─────────┐  ┌─────────┐
   │  Small  │  │ Medium  │  │  Large  │
   │  Model  │  │  Model  │  │  Model  │
   │  (7B)   │  │  (13B)  │  │  (70B)  │
   │  Fast   │  │Balanced │  │ Quality │
   └─────────┘  └─────────┘  └─────────┘
```
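A toy router sketch: the complexity check here is a keyword-and-length heuristic purely for illustration (production routers usually use a small classifier model), and the tier names are made up:

```python
MODELS = {"small": "7b-fast", "medium": "13b-balanced", "large": "70b-quality"}

def route(prompt: str) -> str:
    """Pick a model tier from crude request-complexity signals."""
    hard_markers = ("prove", "analyze", "multi-step", "legal", "medical")
    if len(prompt) > 2000 or any(m in prompt.lower() for m in hard_markers):
        return MODELS["large"]      # complex or long: spend for quality
    if len(prompt) > 500:
        return MODELS["medium"]
    return MODELS["small"]          # cheap and fast for simple queries

print(route("What's 2 + 2?"))                           # 7b-fast
print(route("Analyze the legal implications of ..."))   # 70b-quality
```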
### Caching Strategies

| Cache Type | What to Cache | TTL |
|------------|---------------|-----|
| Prompt cache | Common system prompts | Long |
| KV cache | Prefix tokens | Session |
| Response cache | Exact query matches | Varies |
| Embedding cache | Document embeddings | Long |
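An exact-match response cache is the simplest of the four (a sketch; it keys on a hash of the normalized prompt plus the sampling parameters, since exact-match reuse is only safe when decoding is deterministic):

```python
import hashlib
import json

_cache: dict[str, str] = {}

def cache_key(prompt: str, params: dict) -> str:
    """Stable key over the normalized prompt and sampling parameters."""
    blob = json.dumps({"prompt": prompt.strip().lower(), "params": params},
                      sort_keys=True)
    return hashlib.sha256(blob.encode()).hexdigest()

def cached_generate(prompt: str, params: dict, generate) -> str:
    """Pay for inference once per unique (prompt, params) pair."""
    key = cache_key(prompt, params)
    if key not in _cache:
        _cache[key] = generate(prompt, params)
    return _cache[key]

# Usage: cached_generate("What is SSE?", {"temperature": 0.0}, my_llm_call)
```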
## Related Skills

- ml-system-design: End-to-end ML pipeline design
- rag-architecture: Retrieval-augmented generation patterns
- vector-databases: Vector search for LLM context
- ml-inference-optimization: General inference optimization
- estimation-techniques: Capacity planning for LLM systems
## Version History

- v1.0.0 (2025-12-26): Initial release - LLM serving patterns for systems design interviews

## Last Updated

Date: 2025-12-26