LLM Gateway

A unified API gateway that routes LLM requests across providers and self-hosted models — with rate limiting, cost tracking, caching, and failover.

When to Use This Skill

Use this skill when:

Running multiple LLM backends (OpenAI, Anthropic, vLLM, Ollama) behind a single endpoint
Enforcing per-team or per-user rate limits and spend budgets
Implementing automatic fallback when a provider is down
Adding semantic caching to reduce API costs by 20–50%
Centralizing API key management instead of distributing keys to every app

Prerequisites

Docker and Docker Compose
A PostgreSQL or SQLite database (for LiteLLM state)
LLM API keys (OpenAI, Anthropic, etc.) or self-hosted vLLM endpoints
Optional: Redis for caching and rate limiting

LiteLLM Proxy — Quick Start

LiteLLM is the de facto open-source LLM gateway with OpenAI-compatible API.

Run with Docker

docker run -d
--name litellm-proxy
-p 4000:4000
-e OPENAI_API_KEY=$OPENAI_API_KEY
-e ANTHROPIC_API_KEY=$ANTHROPIC_API_KEY
-v $(pwd)/litellm-config.yaml:/app/config.yaml
ghcr.io/berriai/litellm:main-latest
--config /app/config.yaml
--detailed_debug

LiteLLM Configuration

litellm-config.yaml

model_list:

OpenAI models

model_name: gpt-4o litellm_params: model: openai/gpt-4o api_key: os.environ/OPENAI_API_KEY rpm: 10000 tpm: 2000000
model_name: gpt-4o-mini litellm_params: model: openai/gpt-4o-mini api_key: os.environ/OPENAI_API_KEY

Anthropic

model_name: claude-sonnet-4-6 litellm_params: model: anthropic/claude-sonnet-4-6 api_key: os.environ/ANTHROPIC_API_KEY

Self-hosted vLLM instances (load balanced)

model_name: llama-3.1-8b litellm_params: model: openai/meta-llama/Llama-3.1-8B-Instruct api_base: http://vllm-1:8000/v1 api_key: fake # vLLM key
model_name: llama-3.1-8b litellm_params: model: openai/meta-llama/Llama-3.1-8B-Instruct api_base: http://vllm-2:8000/v1 # second replica — auto load balanced api_key: fake

Fallback: cheap model if primary fails

model_name: gpt-4o litellm_params: model: openai/gpt-4o-mini # fallback to cheaper model api_key: os.environ/OPENAI_API_KEY

router_settings: routing_strategy: least-busy # or: latency-based, simple-shuffle num_retries: 3 retry_after: 5 allowed_fails: 2 cooldown_time: 60

Fallback configuration

fallbacks: - gpt-4o: [claude-sonnet-4-6] - claude-sonnet-4-6: [gpt-4o]

litellm_settings:

Semantic caching

cache: true cache_params: type: redis host: redis port: 6379 similarity_threshold: 0.90 # cache if >90% semantic similarity

Logging

success_callback: ["langfuse"] failure_callback: ["langfuse"] langfuse_public_key: os.environ/LANGFUSE_PUBLIC_KEY langfuse_secret_key: os.environ/LANGFUSE_SECRET_KEY

general_settings: master_key: os.environ/LITELLM_MASTER_KEY database_url: postgresql://litellm:password@postgres:5432/litellm store_model_in_db: true

Docker Compose: Full Gateway Stack

services: litellm: image: ghcr.io/berriai/litellm:main-latest command: ["--config", "/app/config.yaml", "--port", "4000"] volumes: - ./litellm-config.yaml:/app/config.yaml ports: - "4000:4000" environment: - OPENAI_API_KEY=${OPENAI_API_KEY} - ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY} - LITELLM_MASTER_KEY=${LITELLM_MASTER_KEY} - DATABASE_URL=postgresql://litellm:password@postgres:5432/litellm depends_on: postgres: condition: service_healthy redis: condition: service_started restart: unless-stopped

postgres: image: postgres:16-alpine environment: POSTGRES_DB: litellm POSTGRES_USER: litellm POSTGRES_PASSWORD: password volumes: - postgres-data:/var/lib/postgresql/data healthcheck: test: ["CMD-SHELL", "pg_isready -U litellm"] interval: 5s retries: 5 restart: unless-stopped

redis: image: redis:7-alpine command: redis-server --maxmemory 2gb --maxmemory-policy allkeys-lru volumes: - redis-data:/data restart: unless-stopped

volumes: postgres-data: redis-data:

Virtual Keys & Rate Limiting

Create a virtual API key for a team (via LiteLLM API)

curl -X POST http://localhost:4000/key/generate
-H "Authorization: Bearer $LITELLM_MASTER_KEY"
-H "Content-Type: application/json"
-d '{ "team_id": "team-backend", "key_alias": "backend-team-key", "models": ["gpt-4o-mini", "llama-3.1-8b"], "max_budget": 100, # USD limit "budget_duration": "monthly", "rpm_limit": 100, # requests per minute "tpm_limit": 500000 # tokens per minute }'

View spend

curl http://localhost:4000/spend/keys
-H "Authorization: Bearer $LITELLM_MASTER_KEY"

Nginx Load Balancer (Alternative/Complement)

nginx.conf — round-robin across vLLM replicas

upstream vllm_backends { least_conn; server vllm-1:8000 max_fails=3 fail_timeout=30s; server vllm-2:8000 max_fails=3 fail_timeout=30s; server vllm-3:8000 max_fails=3 fail_timeout=30s; keepalive 32; }

server { listen 80; server_name llm-api.internal;

# Rate limiting
limit_req_zone $http_authorization zone=per_key:10m rate=100r/m;
limit_req zone=per_key burst=20 nodelay;

location /v1/ {
    proxy_pass http://vllm_backends;
    proxy_http_version 1.1;
    proxy_set_header Connection "";
    proxy_set_header Host $host;
    proxy_read_timeout 300s;        # long timeout for streaming
    proxy_buffering off;            # required for SSE streaming
    proxy_cache_bypass 1;
}

}

Monitoring Gateway Health

Check LiteLLM health

curl http://localhost:4000/health

Model-level health

curl http://localhost:4000/health/liveliness

Spend by model

curl http://localhost:4000/spend/models
-H "Authorization: Bearer $LITELLM_MASTER_KEY"

Active virtual keys

curl http://localhost:4000/key/list
-H "Authorization: Bearer $LITELLM_MASTER_KEY"

Common Issues

Issue Cause Fix

ConnectionRefusedError to backend Backend not reachable Check api_base URL; verify backend is healthy

Rate limit errors (429) Budget/RPM exceeded Increase limits or rotate to fallback model

Slow streaming responses proxy_buffering enabled Set proxy_buffering off in Nginx

Cache miss rate high Threshold too strict Lower similarity_threshold to 0.85

Postgres connection errors DB not ready Add depends_on with condition: service_healthy

Best Practices

Use virtual keys per team/app — never expose raw provider API keys.
Enable cache: true with Redis for repeated or similar queries; can cut costs 30–50%.
Set num_retries: 3 with fallbacks to handle provider outages gracefully.
Log all requests to Langfuse or OpenTelemetry for cost attribution and debugging.
Use least-busy routing strategy for self-hosted models to avoid GPU saturation.

Related Skills

vllm-server - Backend inference server
llm-inference-scaling - Auto-scaling backends
llm-caching - Semantic cache patterns
llm-cost-optimization - Cost management

llm-gateway

Safety Notice

Copy this and send it to your AI assistant to learn