# LLM Serving Patterns
## When to Use This Skill

Use this skill when:

- Designing LLM inference infrastructure
- Choosing between serving frameworks (vLLM, TGI, TensorRT-LLM)
- Implementing quantization for production deployment
- Optimizing batching and throughput
- Building streaming response systems
- Scaling LLM deployments cost-effectively
**Keywords:** LLM serving, inference, vLLM, TGI, TensorRT-LLM, quantization, INT8, INT4, FP16, batching, continuous batching, streaming, SSE, WebSocket, KV cache, PagedAttention, speculative decoding
## LLM Serving Architecture Overview

```
                Clients (API, Chat UI, Agents)
                              │
                              ▼
┌────────────────────────────────────────────────────────────┐
│ Load Balancer / API Gateway                                │
│ • Rate limiting  • Authentication  • Request routing       │
└────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌────────────────────────────────────────────────────────────┐
│ Inference Server                                           │
│                                                            │
│ Request Queue ──▶ Batching Engine ──▶ KV Cache Management  │
│                            │                               │
│                            ▼                               │
│ Model Execution Engine                                     │
│ • Tensor operations  • Attention  • Token sampling         │
└────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌────────────────────────────────────────────────────────────┐
│ GPU/TPU Cluster                                            │
│ • Model sharding • Tensor parallelism • Pipeline parallel  │
└────────────────────────────────────────────────────────────┘
```
## Serving Framework Comparison

| Framework | Strengths | Best For | Considerations |
|-----------|-----------|----------|----------------|
| vLLM | PagedAttention, high throughput, continuous batching | General LLM serving, high concurrency | Python-native, active community |
| TGI (Text Generation Inference) | Production-ready, Hugging Face integration | Enterprise deployment, HF models | Rust backend, Docker-first |
| TensorRT-LLM | NVIDIA optimization, lowest latency | NVIDIA GPUs, latency-critical | NVIDIA-only, complex setup |
| Triton Inference Server | Multi-model, multi-framework | Heterogeneous model serving | Enterprise complexity |
| Ollama | Simple local deployment | Development, edge deployment | Limited scaling features |
| llama.cpp | CPU inference, quantization | Resource-constrained, edge | C++ integration required |
### Framework Selection Decision Tree

```
Need lowest latency on NVIDIA GPUs?
├── Yes → TensorRT-LLM
└── No
    └── Need high throughput with many concurrent users?
        ├── Yes → vLLM (PagedAttention)
        └── No
            └── Need enterprise features + HF integration?
                ├── Yes → TGI
                └── No
                    └── Simple local/edge deployment?
                        ├── Yes → Ollama or llama.cpp
                        └── No → vLLM (general purpose)
```
## Quantization Techniques

### Precision Levels

| Precision | Bits | Memory Reduction | Quality Impact | Use Case |
|-----------|------|------------------|----------------|----------|
| FP32 | 32 | Baseline | None | Training, reference |
| FP16/BF16 | 16 | 2x | Minimal | Standard serving |
| INT8 | 8 | 4x | Low | Production serving |
| INT4 | 4 | 8x | Moderate | Resource-constrained |
| INT2 | 2 | 16x | Significant | Experimental |
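To make the memory-reduction column concrete, here is a back-of-the-envelope weight-memory calculation (a sketch only; a real deployment also needs headroom for the KV cache and activations):

```python
def weight_memory_gb(num_params: float, bits: int) -> float:
    """Approximate weight memory at a given precision."""
    return num_params * bits / 8 / 1e9  # params x bytes/param -> GB

# A 70B-parameter model as a representative example
for bits in (32, 16, 8, 4):
    print(f"{bits:>2}-bit: {weight_memory_gb(70e9, bits):6.1f} GB")
# 32-bit: 280.0 GB | 16-bit: 140.0 GB | 8-bit: 70.0 GB | 4-bit: 35.0 GB
```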
### Quantization Methods

| Method | Description | Quality | Speed |
|--------|-------------|---------|-------|
| PTQ (Post-Training Quantization) | Quantize after training, no retraining | Good | Fast to apply |
| QAT (Quantization-Aware Training) | Simulate quantization during training | Better | Requires training |
| GPTQ | One-shot weight quantization | Very good | Moderate |
| AWQ (Activation-aware Weight Quantization) | Preserves salient weights | Excellent | Moderate |
| GGUF/GGML | llama.cpp format, CPU-optimized | Good | Very fast inference |
| SmoothQuant | Migrates quantization difficulty from activations to weights | Excellent | Moderate |
### Quantization Selection

Quality vs. efficiency trade-off:

```
Quality ────────────────────────────────────────────────────▶ Efficiency

FP32 ──── FP16 ──── INT8+AWQ ──── INT8+GPTQ ──── INT4 ──── INT2
Best      Great     Good          Good           Fair      Poor
```
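As a concrete starting point, the sketch below loads a checkpoint with INT8 post-training quantization via Hugging Face transformers and bitsandbytes. The model name is a placeholder, and exact options vary across library versions:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder: any causal LM checkpoint

# PTQ-style INT8: weights quantized at load time, no retraining required
quant_config = BitsAndBytesConfig(load_in_8bit=True)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",          # place shards on available GPUs
    torch_dtype=torch.float16,  # non-quantized modules stay FP16
)

inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=8)[0]))
```

Prequantized GPTQ and AWQ checkpoints published on the Hub typically load through the same `from_pretrained` call once the corresponding backend library is installed.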
## Batching Strategies

### Static Batching

```
Request 1: [tokens: 100] ─┐
Request 2: [tokens:  50] ─┼──▶ [Batch: pad to 100] ──▶ Process ──▶ All complete
Request 3: [tokens:  80] ─┘
```

Problem: short requests wait for long ones (head-of-line blocking).
### Continuous Batching (Preferred)

```
Time ──────────────────────────────────────────────────────────▶

Req 1: [████████████████████████████████] ──▶ Complete
Req 2: [████████████] ──▶ Complete ──▶ Req 4 starts [████████████████]
Req 3: [████████████████████] ──▶ Complete ──▶ Req 5 starts [████████]
```

- New requests join the batch as others complete
- No padding waste
- Optimal GPU utilization
### Batching Parameters

| Parameter | Description | Trade-off |
|-----------|-------------|-----------|
| `max_batch_size` | Maximum concurrent requests | Memory vs. throughput |
| `max_waiting_tokens` | Tokens to wait before forcing a batch | Latency vs. throughput |
| `max_num_seqs` | Maximum sequences in a batch | Memory vs. concurrency |
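For illustration, here is how these knobs surface in vLLM's offline engine (argument names follow vLLM's conventions and may shift between versions; the checkpoint is a placeholder):

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-2-7b-hf",  # placeholder checkpoint
    max_num_seqs=256,             # max sequences in the running batch
    max_num_batched_tokens=8192,  # token budget per scheduler step
    gpu_memory_utilization=0.90,  # VRAM fraction for weights + KV cache
)

params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Explain continuous batching in one sentence."], params)
print(outputs[0].outputs[0].text)
```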
## KV Cache Management

### The KV Cache Problem

Attention(Q, K, V) = softmax(QK^T / √d_k) · V

For each token generated:

- The new token's query must attend to the keys and values of ALL previous tokens
- Caching K and V avoids recomputing them, but the cache grows with sequence length
- Memory: O(batch_size × seq_len × num_layers × hidden_dim)

Example (70B model, 4K context):

- KV cache per request: ~8GB
- 10 concurrent requests: ~80GB of GPU memory
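That formula translates into a quick estimator (a sketch; the ~8GB figure above is in the same ballpark as a 70B-class model with full multi-head attention, and GQA/MQA shrink `num_kv_heads` substantially):

```python
def kv_cache_gb(num_layers: int, num_kv_heads: int, head_dim: int,
                seq_len: int, batch: int = 1, bytes_per_elem: int = 2) -> float:
    """KV cache size: 2 (K and V) x layers x KV heads x head_dim x tokens."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch * bytes_per_elem / 1e9

# 70B-class model: 80 layers, 64 heads x 128 dim, FP16, 4K context
print(f"{kv_cache_gb(80, 64, 128, 4096):.1f} GB")  # ~10.7 GB per request
# The same model with GQA (8 KV heads) needs ~8x less
print(f"{kv_cache_gb(80, 8, 128, 4096):.1f} GB")   # ~1.3 GB per request
```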
### PagedAttention (vLLM Innovation)

```
Traditional KV cache:

┌──────────────────────────────────────────┐
│ Request 1 KV Cache (contiguous, fixed)   │  ← wastes memory
├──────────────────────────────────────────┤
│ Request 2 KV Cache (contiguous, fixed)   │
├──────────────────────────────────────────┤
│ FRAGMENTED/WASTED SPACE                  │
└──────────────────────────────────────────┘
```

```
PagedAttention:

┌────┬────┬────┬────┬────┬────┬────┬────┐
│ R1 │ R2 │ R1 │ R3 │ R2 │ R1 │ R3 │ R2 │  ← pages allocated on demand
└────┴────┴────┴────┴────┴────┴────┴────┘
```

- Non-contiguous memory allocation
- Near-zero memory waste
- 2-4x higher throughput
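The sketch below illustrates the block-table idea with a toy allocator (illustrative only, not vLLM's actual implementation): pages come from one shared pool, a request holds only the blocks it has actually filled, and finished requests return their pages immediately:

```python
class PagedKVCache:
    """Toy block-table allocator in the spirit of PagedAttention."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size             # tokens per physical page
        self.free = list(range(num_blocks))      # shared free-page pool
        self.tables: dict[str, list[int]] = {}   # request id -> page list
        self.lengths: dict[str, int] = {}        # request id -> tokens stored

    def append_token(self, req: str) -> tuple[int, int]:
        """Reserve a KV slot for one new token; returns (page, offset)."""
        n = self.lengths.get(req, 0)
        table = self.tables.setdefault(req, [])
        if n % self.block_size == 0:             # current page full: grab another
            if not self.free:
                raise MemoryError("KV cache exhausted: preempt or swap a request")
            table.append(self.free.pop())
        self.lengths[req] = n + 1
        return table[-1], n % self.block_size

    def release(self, req: str) -> None:
        """Return a finished request's pages to the pool for reuse."""
        self.free.extend(self.tables.pop(req, []))
        self.lengths.pop(req, None)
```

Because allocation happens per 16-token page rather than per maximum context length, the only waste is the unfilled tail of each request's last page.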
### KV Cache Optimization Strategies

| Strategy | Description | Memory Savings |
|----------|-------------|----------------|
| PagedAttention | Virtual memory for the KV cache | ~50% reduction |
| Prefix Caching | Reuse KV cache for common prefixes | System prompt: 100% |
| Quantized KV Cache | INT8/FP8 for KV values | 50-75% reduction |
| Sliding Window | Limited attention context | Linear memory |
| MQA/GQA | Grouped query attention | Architecture-dependent |
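As one example, prefix caching is a single engine flag in vLLM (flag name per vLLM's engine arguments; availability depends on version, and the checkpoint is a placeholder):

```python
from vllm import LLM

# KV blocks for a shared prefix (e.g. the system prompt) are hashed and
# reused across requests instead of being recomputed for each one
llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_prefix_caching=True)
```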
## Streaming Response Patterns

### Server-Sent Events (SSE)

```
Client                                  Server
  │                                        │
  │──── POST /v1/chat/completions ────────▶│
  │     (stream: true)                     │
  │                                        │
  │◀──── HTTP 200 OK ──────────────────────│
  │      Content-Type: text/event-stream   │
  │                                        │
  │◀──── data: {"token": "Hello"} ─────────│
  │◀──── data: {"token": " world"} ────────│
  │◀──── data: {"token": "!"} ─────────────│
  │◀──── data: [DONE] ─────────────────────│
  │                                        │
```

SSE benefits:

- HTTP/1.1 compatible
- Auto-reconnection support
- Simple to implement
- Wide client support
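A minimal SSE endpoint sketch using FastAPI; the token generator is a stand-in for a real inference engine, and the route name simply mirrors the diagram above:

```python
import asyncio
import json

from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

async def fake_tokens():
    """Stand-in for an inference engine's token stream."""
    for tok in ("Hello", " world", "!"):
        await asyncio.sleep(0.05)  # simulate per-token decode latency
        yield tok

@app.post("/v1/chat/completions")
async def stream_completion():
    async def event_stream():
        async for tok in fake_tokens():
            yield f"data: {json.dumps({'token': tok})}\n\n"
        yield "data: [DONE]\n\n"  # end-of-stream sentinel
    return StreamingResponse(event_stream(), media_type="text/event-stream")
```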
### WebSocket Streaming

```
Client                                  Server
  │                                        │
  │──── WebSocket Upgrade ────────────────▶│
  │◀──── 101 Switching Protocols ──────────│
  │                                        │
  │──── {"prompt": "Hello"} ──────────────▶│
  │                                        │
  │◀──── {"token": "Hi"} ──────────────────│
  │◀──── {"token": " there"} ──────────────│
  │◀──── {"token": "!"} ───────────────────│
  │◀──── {"done": true} ───────────────────│
  │                                        │
```

WebSocket benefits:

- Bidirectional communication
- Lower latency
- Better for chat applications
- Connection persistence
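The same flow over WebSockets, again sketched with FastAPI and a stand-in token source:

```python
from fastapi import FastAPI, WebSocket

app = FastAPI()

@app.websocket("/ws/generate")
async def generate(ws: WebSocket):
    await ws.accept()
    request = await ws.receive_json()   # e.g. {"prompt": "Hello"}
    # A real server would feed request["prompt"] to the inference engine;
    # here the tokens are canned to keep the sketch self-contained.
    for tok in ("Hi", " there", "!"):
        await ws.send_json({"token": tok})
    await ws.send_json({"done": True})
    await ws.close()
```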
### Streaming Implementation Considerations

| Aspect | SSE | WebSocket |
|--------|-----|-----------|
| Reconnection | Built-in | Manual |
| Scalability | Per-request | Connection pool |
| Load Balancing | Standard HTTP | Sticky sessions |
| Firewall/Proxy | Usually works | May need config |
| Best For | One-way streaming | Interactive chat |
## Speculative Decoding

### Concept

```
Standard decoding:

Large Model: [T1] → [T2] → [T3] → [T4] → [T5]
             10ms   10ms   10ms   10ms   10ms  = 50ms total
```

```
Speculative decoding:

Draft Model:  [T1, T2, T3, T4, T5]   (drafted in parallel, 5ms)
                        │
                        ▼
Large Model:  [Verify T1-T5 in one pass]   (15ms)

              Accept: T1, T2, T3 ✓   Reject: T4, T5 ✗
                        │
                        ▼
              [Generate T4, T5 correctly]
```

Total: ~25ms vs. 50ms (2x speedup at a 60% acceptance rate)
### Speculative Decoding Trade-offs

| Factor | Impact |
|--------|--------|
| Draft model quality | Higher match rate = more speedup |
| Draft model size | Larger = better quality, slower |
| Speculation depth | More tokens = higher risk/reward |
| Verification cost | Must be < sequential generation |
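A toy draft-and-verify round that mirrors the diagram. Both models are stand-in functions; a real implementation scores every draft position in one batched forward pass and uses rejection sampling to keep the output distribution exact:

```python
def speculative_step(prefix, draft_model, target_model, k=5):
    """Draft k tokens cheaply, keep the longest target-agreed run plus one fix."""
    draft = []
    for _ in range(k):                       # cheap autoregressive drafting
        draft.append(draft_model(prefix + draft))

    accepted = []
    for tok in draft:                        # verification (one pass in practice)
        if target_model(prefix + accepted) == tok:
            accepted.append(tok)             # draft matched: token is "free"
        else:
            accepted.append(target_model(prefix + accepted))  # correct and stop
            break
    return accepted

# Stand-in "models": each predicts the next word of a canned sentence
sentence = "the quick brown fox jumps over the lazy dog".split()
target_fn = lambda ctx: sentence[len(ctx)]
draft_fn = lambda ctx: sentence[len(ctx)] if len(ctx) < 3 else "banana"

print(speculative_step([], draft_fn, target_fn))  # ['the', 'quick', 'brown', 'fox']
```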
## Scaling Strategies

### Horizontal Scaling

```
┌─────────────────────────────────────────┐
│              Load Balancer              │
│    (Round-robin, Least-connections)     │
└─────────────────────────────────────────┘
        │            │            │
        ▼            ▼            ▼
   ┌─────────┐  ┌─────────┐  ┌─────────┐
   │  vLLM   │  │  vLLM   │  │  vLLM   │
   │ Node 1  │  │ Node 2  │  │ Node 3  │
   │ (GPU×4) │  │ (GPU×4) │  │ (GPU×4) │
   └─────────┘  └─────────┘  └─────────┘
```
### Model Parallelism

| Strategy | Description | Use Case |
|----------|-------------|----------|
| Tensor Parallelism | Split each layer's weights across GPUs | Single large model |
| Pipeline Parallelism | Different layers on different GPUs | Very large models |
| Data Parallelism | Same model, different batches | High throughput |

```
Tensor Parallelism (TP=4):

┌─────────────────────────────────────────┐
│                 Layer N                 │
│   GPU0   │   GPU1   │   GPU2   │  GPU3  │
│   25%    │   25%    │   25%    │  25%   │
└─────────────────────────────────────────┘
```

```
Pipeline Parallelism (PP=4):

GPU0: Layers 0-7
GPU1: Layers 8-15
GPU2: Layers 16-23
GPU3: Layers 24-31
```
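In vLLM, both strategies are engine arguments (a sketch; `tensor_parallel_size` must match the GPUs visible to the process, and the checkpoint is a placeholder):

```python
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-2-70b-hf",  # placeholder: too large for one GPU
    tensor_parallel_size=4,             # TP=4, as in the diagram above
    # pipeline_parallel_size=2,         # optionally add PP across nodes
)
```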
## Latency Optimization Checklist

### Pre-deployment

- Choose appropriate quantization (INT8 for production)
- Enable continuous batching
- Configure KV cache size appropriately
- Set optimal batch size for hardware
- Enable prefix caching for system prompts
### Runtime

- Monitor GPU memory utilization
- Track p50/p95/p99 latencies
- Measure time-to-first-token (TTFT); see the sketch after this list
- Monitor tokens-per-second (TPS)
- Set appropriate timeouts
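TTFT and TPS can be measured from outside the server with a plain streaming request. This sketch assumes an OpenAI-compatible endpoint; the URL and payload are placeholders, and it approximates one token per SSE chunk:

```python
import time
import requests

url = "http://localhost:8000/v1/completions"  # placeholder endpoint
payload = {"model": "my-model", "prompt": "Hello", "max_tokens": 64, "stream": True}

start = time.perf_counter()
first_token_at = None
chunks = 0

with requests.post(url, json=payload, stream=True, timeout=60) as resp:
    for line in resp.iter_lines():
        if not line or line == b"data: [DONE]":
            continue
        chunks += 1                               # ~1 token per SSE chunk
        if first_token_at is None:
            first_token_at = time.perf_counter()  # time-to-first-token

total = time.perf_counter() - start
print(f"TTFT: {(first_token_at - start) * 1000:.0f} ms")
print(f"TPS:  {chunks / total:.1f} tokens/s")
```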
### Infrastructure

- Use the fastest available interconnect (NVLink, InfiniBand)
- Minimize network hops
- Place inference close to users (edge)
- Consider dedicated inference hardware
## Cost Optimization

### Cost Drivers

| Factor | Impact | Optimization |
|--------|--------|--------------|
| GPU hours | Highest | Quantization, batching |
| Memory | High | PagedAttention, KV cache optimization |
| Network | Medium | Response compression, edge deployment |
| Storage | Low | Model deduplication |
### Cost Estimation Formula

```
Monthly cost = (requests/month × avg tokens/request × GPU-seconds/token × $/GPU-hour) / 3600
```

Example:

- 10M requests/month
- 500 tokens per request on average
- 0.001 GPU-seconds/token (optimized)
- $2/GPU-hour

Cost = (10M × 500 × 0.001 × 2) / 3600 ≈ $2,778/month
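The same formula as a small helper, reproducing the example above:

```python
def monthly_cost(requests: float, tokens_per_req: float,
                 gpu_sec_per_token: float, price_per_gpu_hour: float) -> float:
    """Monthly cost = total GPU-seconds x hourly price / 3600 s per hour."""
    gpu_seconds = requests * tokens_per_req * gpu_sec_per_token
    return gpu_seconds * price_per_gpu_hour / 3600

print(f"${monthly_cost(10e6, 500, 0.001, 2.0):,.0f}/month")  # $2,778/month
```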
## Common Patterns

### Multi-model Routing

```
┌─────────────────────────────────────────┐
│                 Router                  │
│  • Classify request complexity          │
│  • Route to the appropriate model       │
└─────────────────────────────────────────┘
        │            │            │
        ▼            ▼            ▼
   ┌─────────┐  ┌─────────┐  ┌─────────┐
   │  Small  │  │ Medium  │  │  Large  │
   │  Model  │  │  Model  │  │  Model  │
   │  (7B)   │  │  (13B)  │  │  (70B)  │
   │  Fast   │  │Balanced │  │ Quality │
   └─────────┘  └─────────┘  └─────────┘
```
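A toy router sketch: the complexity check here is a keyword-and-length heuristic purely for illustration (production routers usually use a small classifier model), and the tier names are made up:

```python
MODELS = {"small": "7b-fast", "medium": "13b-balanced", "large": "70b-quality"}

def route(prompt: str) -> str:
    """Pick a model tier from crude request-complexity signals."""
    hard_markers = ("prove", "analyze", "multi-step", "legal", "medical")
    if len(prompt) > 2000 or any(m in prompt.lower() for m in hard_markers):
        return MODELS["large"]      # complex or long: spend for quality
    if len(prompt) > 500:
        return MODELS["medium"]
    return MODELS["small"]          # cheap and fast for simple queries

print(route("What's 2 + 2?"))                           # 7b-fast
print(route("Analyze the legal implications of ..."))   # 70b-quality
```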
### Caching Strategies

| Cache Type | What to Cache | TTL |
|------------|---------------|-----|
| Prompt cache | Common system prompts | Long |
| KV cache | Prefix tokens | Session |
| Response cache | Exact query matches | Varies |
| Embedding cache | Document embeddings | Long |
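An exact-match response cache is the simplest of the four (a sketch; it keys on a hash of the normalized prompt plus the sampling parameters, since exact-match reuse is only safe when decoding is deterministic):

```python
import hashlib
import json

_cache: dict[str, str] = {}

def cache_key(prompt: str, params: dict) -> str:
    """Stable key over the normalized prompt and sampling parameters."""
    blob = json.dumps({"prompt": prompt.strip().lower(), "params": params},
                      sort_keys=True)
    return hashlib.sha256(blob.encode()).hexdigest()

def cached_generate(prompt: str, params: dict, generate) -> str:
    """Pay for inference once per unique (prompt, params) pair."""
    key = cache_key(prompt, params)
    if key not in _cache:
        _cache[key] = generate(prompt, params)
    return _cache[key]

# Usage: cached_generate("What is SSE?", {"temperature": 0.0}, my_llm_call)
```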
## Related Skills

- ml-system-design: End-to-end ML pipeline design
- rag-architecture: Retrieval-augmented generation patterns
- vector-databases: Vector search for LLM context
- ml-inference-optimization: General inference optimization
- estimation-techniques: Capacity planning for LLM systems
## Version History

- v1.0.0 (2025-12-26): Initial release - LLM serving patterns for systems design interviews

## Last Updated

Date: 2025-12-26