LLM Inference
High-performance inference engines for serving large language models.
Engine Comparison
Engine Best For Hardware Throughput Setup
vLLM Production serving GPU Highest Medium
llama.cpp Local/edge, CPU CPU/GPU Good Easy
TGI HuggingFace models GPU High Easy
Ollama Local desktop CPU/GPU Good Easiest
TensorRT-LLM NVIDIA production NVIDIA GPU Highest Complex
Decision Guide
Scenario Recommendation
Production API server vLLM or TGI
Maximum throughput vLLM
Local development Ollama or llama.cpp
CPU-only deployment llama.cpp
Edge/embedded llama.cpp
Apple Silicon llama.cpp with Metal
Quick experimentation Ollama
Privacy-sensitive (no cloud) llama.cpp
vLLM
Production-grade serving with PagedAttention for optimal GPU memory usage.
Key Innovations
Feature What It Does
PagedAttention Non-contiguous KV cache, better memory utilization
Continuous batching Dynamic request grouping for throughput
Speculative decoding Small model drafts, large model verifies
Strengths: Highest throughput, OpenAI-compatible API, multi-GPU Limitations: GPU required, more complex setup
Key concept: Serves OpenAI-compatible endpoints—drop-in replacement for OpenAI API.
llama.cpp
C++ inference for running models anywhere—laptops, phones, Raspberry Pi.
Quantization Formats (GGUF)
Format Size (7B) Quality Use Case
Q8_0 ~7 GB Highest When you have RAM
Q6_K ~6 GB High Good balance
Q5_K_M ~5 GB Good Balanced
Q4_K_M ~4 GB OK Memory constrained
Q2_K ~2.5 GB Low Minimum viable
Recommendation: Q4_K_M for best quality/size balance.
Memory Requirements
Model Size Q4_K_M RAM Needed
7B ~4 GB 8 GB
13B ~7 GB 16 GB
30B ~17 GB 32 GB
70B ~38 GB 64 GB
Platform Optimization
Platform Key Setting
Apple Silicon n_gpu_layers=-1 (Metal offload)
CUDA GPU n_gpu_layers=-1
- offload_kqv=True
CPU only n_gpu_layers=0
- set n_threads to core count
Strengths: Runs anywhere, GGUF format, Metal/CUDA support Limitations: Lower throughput than vLLM, single-user focused
Key concept: GGUF format + quantization = run large models on consumer hardware.
Key Optimization Concepts
Technique What It Does When to Use
KV Cache Reuse attention computations Always (automatic)
Continuous Batching Group requests dynamically High-throughput serving
Tensor Parallelism Split model across GPUs Large models
Quantization Reduce precision (fp16→int4) Memory constrained
Speculative Decoding Small model drafts, large verifies Latency sensitive
GPU Offloading Move layers to GPU When GPU available
Common Parameters
Parameter Purpose Typical Value
n_ctx Context window size 2048-8192
n_gpu_layers Layers to offload -1 (all) or 0 (none)
temperature Randomness 0.0-1.0
max_tokens Output limit 100-2000
n_threads CPU threads Match core count
Troubleshooting
Issue Solution
Out of memory Reduce n_ctx, use smaller quant
Slow inference Enable GPU offload, use faster quant
Model won't load Check GGUF integrity, check RAM
Metal not working Reinstall with -DLLAMA_METAL=on
Poor quality Use higher quant (Q5_K_M, Q6_K)
Resources
-
vLLM: https://docs.vllm.ai
-
llama.cpp: https://github.com/ggerganov/llama.cpp
-
Ollama: https://ollama.ai
-
GGUF Models: https://huggingface.co/TheBloke