llm-gateway

A unified API gateway that routes LLM requests across providers and self-hosted models — with rate limiting, cost tracking, caching, and failover.

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Installation

Install the skill with:

npx skills add bagelhole/devops-security-agent-skills/bagelhole-devops-security-agent-skills-llm-gateway

LLM Gateway


When to Use This Skill

Use this skill when:

  • Running multiple LLM backends (OpenAI, Anthropic, vLLM, Ollama) behind a single endpoint

  • Enforcing per-team or per-user rate limits and spend budgets

  • Implementing automatic fallback when a provider is down

  • Adding semantic caching to reduce API costs by 20–50%

  • Centralizing API key management instead of distributing keys to every app

Prerequisites

  • Docker and Docker Compose

  • A PostgreSQL or SQLite database (for LiteLLM state)

  • LLM API keys (OpenAI, Anthropic, etc.) or self-hosted vLLM endpoints

  • Optional: Redis for caching and rate limiting

LiteLLM Proxy — Quick Start

LiteLLM is the de facto standard open-source LLM gateway. It exposes an OpenAI-compatible API, so existing OpenAI SDK clients work against it unchanged.

Run with Docker

docker run -d \
  --name litellm-proxy \
  -p 4000:4000 \
  -e OPENAI_API_KEY=$OPENAI_API_KEY \
  -e ANTHROPIC_API_KEY=$ANTHROPIC_API_KEY \
  -v $(pwd)/litellm-config.yaml:/app/config.yaml \
  ghcr.io/berriai/litellm:main-latest \
  --config /app/config.yaml \
  --detailed_debug
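Once the proxy is up, any OpenAI-compatible client can target it. A minimal standard-library sketch of the request shape follows; the URL assumes the container above, and `build_chat_request` is an illustrative helper, not part of LiteLLM:

```python
import json
import urllib.request

def build_chat_request(model, prompt, api_key="sk-virtual-key"):
    """Build an OpenAI-compatible chat completion request aimed at the
    local LiteLLM proxy (assumes it listens on localhost:4000)."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    return urllib.request.Request(
        "http://localhost:4000/v1/chat/completions",
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = build_chat_request("gpt-4o", "Hello")
```

Sending `req` with `urllib.request.urlopen(req)` (or pointing the OpenAI SDK's `base_url` at port 4000) routes the call through the gateway rather than directly to a provider.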

LiteLLM Configuration

litellm-config.yaml

model_list:
  # OpenAI models
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY
      rpm: 10000
      tpm: 2000000
  - model_name: gpt-4o-mini
    litellm_params:
      model: openai/gpt-4o-mini
      api_key: os.environ/OPENAI_API_KEY

  # Anthropic
  - model_name: claude-sonnet-4-6
    litellm_params:
      model: anthropic/claude-sonnet-4-6
      api_key: os.environ/ANTHROPIC_API_KEY

  # Self-hosted vLLM instances (load balanced)
  - model_name: llama-3.1-8b
    litellm_params:
      model: openai/meta-llama/Llama-3.1-8B-Instruct
      api_base: http://vllm-1:8000/v1
      api_key: fake  # vLLM key
  - model_name: llama-3.1-8b
    litellm_params:
      model: openai/meta-llama/Llama-3.1-8B-Instruct
      api_base: http://vllm-2:8000/v1  # second replica — auto load balanced
      api_key: fake

  # Fallback: cheap model if primary fails
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o-mini  # fallback to cheaper model
      api_key: os.environ/OPENAI_API_KEY

router_settings:
  routing_strategy: least-busy  # or: latency-based, simple-shuffle
  num_retries: 3
  retry_after: 5
  allowed_fails: 2
  cooldown_time: 60
  # Fallback configuration
  fallbacks:
    - gpt-4o: [claude-sonnet-4-6]
    - claude-sonnet-4-6: [gpt-4o]

litellm_settings:
  # Semantic caching
  cache: true
  cache_params:
    type: redis
    host: redis
    port: 6379
    similarity_threshold: 0.90  # cache hit if >90% semantic similarity
  # Logging
  success_callback: ["langfuse"]
  failure_callback: ["langfuse"]
  langfuse_public_key: os.environ/LANGFUSE_PUBLIC_KEY
  langfuse_secret_key: os.environ/LANGFUSE_SECRET_KEY

general_settings:
  master_key: os.environ/LITELLM_MASTER_KEY
  database_url: postgresql://litellm:password@postgres:5432/litellm
  store_model_in_db: true
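The retry-plus-fallback behaviour configured above can be sketched in a few lines. This is an illustration of the routing logic, not LiteLLM's implementation; `call` is a hypothetical function that invokes one model and may raise on failure:

```python
# Illustrative sketch of retry + fallback routing (not LiteLLM internals).
FALLBACKS = {
    "gpt-4o": ["claude-sonnet-4-6"],
    "claude-sonnet-4-6": ["gpt-4o"],
}

def route_with_fallback(model, call, num_retries=3):
    """Try the requested model first, then each configured fallback in order.

    `call` is a hypothetical provider-invocation function: call(model) -> response.
    """
    for candidate in [model] + FALLBACKS.get(model, []):
        for _attempt in range(num_retries):
            try:
                return candidate, call(candidate)
            except ConnectionError:
                continue  # transient failure: retry this candidate
    raise RuntimeError("all candidates failed")
```

With the config above, a request for gpt-4o that keeps failing is retried up to `num_retries` times and then re-routed to claude-sonnet-4-6.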

Docker Compose: Full Gateway Stack

services:
  litellm:
    image: ghcr.io/berriai/litellm:main-latest
    command: ["--config", "/app/config.yaml", "--port", "4000"]
    volumes:
      - ./litellm-config.yaml:/app/config.yaml
    ports:
      - "4000:4000"
    environment:
      - OPENAI_API_KEY=${OPENAI_API_KEY}
      - ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY}
      - LITELLM_MASTER_KEY=${LITELLM_MASTER_KEY}
      - DATABASE_URL=postgresql://litellm:password@postgres:5432/litellm
    depends_on:
      postgres:
        condition: service_healthy
      redis:
        condition: service_started
    restart: unless-stopped

  postgres:
    image: postgres:16-alpine
    environment:
      POSTGRES_DB: litellm
      POSTGRES_USER: litellm
      POSTGRES_PASSWORD: password
    volumes:
      - postgres-data:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U litellm"]
      interval: 5s
      retries: 5
    restart: unless-stopped

  redis:
    image: redis:7-alpine
    command: redis-server --maxmemory 2gb --maxmemory-policy allkeys-lru
    volumes:
      - redis-data:/data
    restart: unless-stopped

volumes:
  postgres-data:
  redis-data:

Virtual Keys & Rate Limiting

Create a virtual API key for a team (via LiteLLM API)

curl -X POST http://localhost:4000/key/generate \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "team_id": "team-backend",
    "key_alias": "backend-team-key",
    "models": ["gpt-4o-mini", "llama-3.1-8b"],
    "max_budget": 100,
    "budget_duration": "monthly",
    "rpm_limit": 100,
    "tpm_limit": 500000
  }'

Here max_budget is a USD spend limit, rpm_limit caps requests per minute, and tpm_limit caps tokens per minute.
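The limits attached to a key can be pictured as a per-minute counter plus a running budget. A toy sketch using fixed-window counting follows; LiteLLM's actual enforcement lives server-side and is more sophisticated:

```python
import time

class VirtualKey:
    """Toy fixed-window limiter mirroring rpm_limit / max_budget above.
    Illustrative only, not LiteLLM's implementation."""

    def __init__(self, rpm_limit=100, max_budget=100.0):
        self.rpm_limit = rpm_limit
        self.max_budget = max_budget   # USD
        self.window = None             # current minute bucket
        self.count = 0                 # requests in this window
        self.spend = 0.0               # cumulative USD spend

    def allow(self, cost_usd=0.0, now=None):
        now = time.time() if now is None else now
        window = int(now // 60)
        if window != self.window:      # new minute: reset the request counter
            self.window, self.count = window, 0
        if self.count >= self.rpm_limit:
            return False               # rpm_limit exceeded
        if self.spend + cost_usd > self.max_budget:
            return False               # max_budget exceeded
        self.count += 1
        self.spend += cost_usd
        return True
```

The request counter resets every minute, while spend accumulates until the budget duration rolls over (omitted here for brevity).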

View spend

curl http://localhost:4000/spend/keys \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY"

Nginx Load Balancer (Alternative/Complement)

nginx.conf — least-connections load balancing across vLLM replicas

upstream vllm_backends {
    least_conn;
    server vllm-1:8000 max_fails=3 fail_timeout=30s;
    server vllm-2:8000 max_fails=3 fail_timeout=30s;
    server vllm-3:8000 max_fails=3 fail_timeout=30s;
    keepalive 32;
}

# Rate limiting: limit_req_zone must be declared at http{} level,
# keyed here on the Authorization header (one bucket per API key)
limit_req_zone $http_authorization zone=per_key:10m rate=100r/m;

server {
    listen 80;
    server_name llm-api.internal;

    location /v1/ {
        limit_req zone=per_key burst=20 nodelay;
        proxy_pass http://vllm_backends;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
        proxy_set_header Host $host;
        proxy_read_timeout 300s;        # long timeout for streaming
        proxy_buffering off;            # required for SSE streaming
        proxy_cache_bypass 1;
    }
}

Monitoring Gateway Health

Proxy liveness

curl http://localhost:4000/health/liveliness

Per-model health (probes each configured backend; requires the master key)

curl http://localhost:4000/health \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY"

Spend by model

curl http://localhost:4000/spend/models \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY"

Active virtual keys

curl http://localhost:4000/key/list \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY"
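Spend endpoint responses can be rolled up client-side for reporting. A sketch, assuming a simplified list-of-rows JSON shape (the real LiteLLM schema carries more fields than shown here):

```python
import json

# Assumed, simplified response shape: a list of {"model": ..., "spend": ...} rows.
sample = '''[
  {"model": "gpt-4o", "spend": 12.5},
  {"model": "gpt-4o", "spend": 2.5},
  {"model": "llama-3.1-8b", "spend": 0.0}
]'''

def spend_by_model(raw):
    """Aggregate per-row spend into a total per model."""
    totals = {}
    for row in json.loads(raw):
        totals[row["model"]] = totals.get(row["model"], 0.0) + row["spend"]
    return totals
```

Feeding the body of a spend-endpoint response through `spend_by_model` yields one USD total per model name, which is convenient for per-team chargeback reports.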

Common Issues

| Issue | Cause | Fix |
|---|---|---|
| ConnectionRefusedError to backend | Backend not reachable | Check api_base URL; verify backend is healthy |
| Rate limit errors (429) | Budget/RPM exceeded | Increase limits or rotate to fallback model |
| Slow streaming responses | proxy_buffering enabled | Set proxy_buffering off in Nginx |
| High cache miss rate | Threshold too strict | Lower similarity_threshold to 0.85 |
| Postgres connection errors | DB not ready | Add depends_on with condition: service_healthy |

Best Practices

  • Use virtual keys per team/app — never expose raw provider API keys.

  • Enable cache: true with Redis for repeated or similar queries; can cut costs 30–50%.

  • Set num_retries: 3 with fallbacks to handle provider outages gracefully.

  • Log all requests to Langfuse or OpenTelemetry for cost attribution and debugging.

  • Use least-busy routing strategy for self-hosted models to avoid GPU saturation.

Related Skills

  • vllm-server - Backend inference server

  • llm-inference-scaling - Auto-scaling backends

  • llm-caching - Semantic cache patterns

  • llm-cost-optimization - Cost management
