LLM Cost Optimization

Cut LLM costs by 50–90% with the right combination of caching, model selection, prompt optimization, and self-hosting.

When to Use This Skill

Use this skill when:

LLM API spend is growing faster than revenue
You need to attribute AI costs to teams, products, or customers
You are implementing caching to avoid redundant LLM calls
You are deciding when to switch from API providers to self-hosted models
You want to shorten prompts without sacrificing quality

Cost Levers by Impact

| Strategy | Typical Savings | Effort |
|---|---|---|
| Semantic caching | 20–50% | Low |
| Model right-sizing | 30–70% | Low |
| Prompt compression | 10–30% | Medium |
| Provider caching (prompt cache) | 10–25% | Low |
| Batching offline workloads | 50% (Batch API) | Medium |
| Self-hosting 7–8B models | 80–95% at scale | High |
| Quantization | 30–50% VRAM cost | Medium |
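
Savings from stacked levers do not simply add up: each lever only applies to the spend that remains after the previous one. The sketch below illustrates this, assuming the levers act independently (a simplification, since caching and right-sizing interact in practice). The function name `combined_savings` and the example percentages are illustrative, not part of any library.

```python
def combined_savings(savings_fractions: list[float]) -> float:
    """Return total fractional savings when levers are applied in sequence.

    Each lever removes a fraction of whatever spend is left, so the
    remaining spend is the product of (1 - s) across all levers.
    """
    remaining = 1.0
    for s in savings_fractions:
        if not 0.0 <= s <= 1.0:
            raise ValueError(f"savings fraction out of range: {s}")
        remaining *= 1.0 - s
    return 1.0 - remaining


# Hypothetical stack: 30% from semantic caching, then 50% from right-sizing.
total = combined_savings([0.30, 0.50])
print(f"Combined savings: {total:.0%}")
```

Under these assumed numbers the combined saving is 65%, not 80%, which is worth keeping in mind when forecasting.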

Track Costs First

Use LiteLLM's cost tracking (automatic per-model pricing)

import litellm

response = litellm.completion(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Hello"}],
)
cost = litellm.completion_cost(response)
print(f"Cost: ${cost:.6f}")

Add custom cost callbacks

def log_cost(kwargs, completion_response, start_time, end_time):
    cost = kwargs.get("response_cost") or 0  # may be missing or None
    model = kwargs.get("model")
    user = kwargs.get("user")
    # Send to your analytics DB (db is a placeholder for your own client)
    db.record_cost(user=user, model=model, cost=cost)

litellm.success_callback = [log_cost]
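
A callback like `log_cost` emits one record per call, so attributing spend to teams or customers still needs an aggregation step. The sketch below is a minimal in-memory stand-in for the analytics DB, exposing the same `record_cost(user=..., model=..., cost=...)` shape used above. `CostLedger` is a hypothetical helper, not a LiteLLM class; a real deployment would persist these records.

```python
from collections import defaultdict
from typing import Dict, Optional, Tuple


class CostLedger:
    """In-memory stand-in for an analytics DB (hypothetical helper)."""

    def __init__(self) -> None:
        self._totals: Dict[Tuple[str, str], float] = defaultdict(float)

    def record_cost(self, user: Optional[str], model: Optional[str], cost: float) -> None:
        # Bucket missing identifiers so unattributed spend stays visible.
        key = (user or "unattributed", model or "unknown")
        self._totals[key] += cost

    def by_user(self) -> Dict[str, float]:
        """Sum spend per user across all models."""
        totals: Dict[str, float] = defaultdict(float)
        for (user, _model), cost in self._totals.items():
            totals[user] += cost
        return dict(totals)


# Illustrative records; the user names and costs are made up.
db = CostLedger()
db.record_cost(user="team-search", model="gpt-4o-mini", cost=0.0021)
db.record_cost(user="team-search", model="gpt-4o", cost=0.0350)
db.record_cost(user=None, model="gpt-4o-mini", cost=0.0004)
print(db.by_user())
```

Keeping an explicit "unattributed" bucket makes it obvious how much traffic is missing a `user` tag.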

Model Right-Sizing

Route by task complexity — don't use GPT-4o for everything

def get_model_for_task(task_type: str) -> str:
    routing = {
        "classification": "gpt-4o-mini",  # ~17× cheaper than gpt-4o at the prices below
        "summarization": "gpt-4o-mini",
        "extraction": "gpt-4o-mini",
        "simple_qa": "gpt-4o-mini",
        "complex_reasoning": "gpt-4o",
        "code_generation": "claude-sonnet-4-6",
        "creative_writing": "claude-opus-4-6",
    }
    # Unknown task types fall back to the cheapest model
    return routing.get(task_type, "gpt-4o-mini")

Cost comparison (per 1M tokens, 2025 approx.)

gpt-4o-mini: input $0.15 / output $0.60

gpt-4o: input $2.50 / output $10.00

claude-sonnet-4-6: input $3.00 / output $15.00

llama-3.1-8b (self): ~$0.05–0.10 all-in (GPU amortized)
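
To turn the per-token prices into a monthly figure, multiply by the expected token volume. The sketch below does that with the approximate prices listed above; the function name `estimate_cost` and the workload (1M requests per month, 500 input and 100 output tokens each) are assumptions for illustration.

```python
# Prices per 1M tokens, copied from the comparison above (approximate).
PRICES_PER_MILLION = {
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
    "gpt-4o": {"input": 2.50, "output": 10.00},
    "claude-sonnet-4-6": {"input": 3.00, "output": 15.00},
}


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of one request from its token counts."""
    price = PRICES_PER_MILLION[model]
    return (
        input_tokens * price["input"] + output_tokens * price["output"]
    ) / 1_000_000


# Hypothetical workload: 1M requests/month, 500 input + 100 output tokens each.
for model in PRICES_PER_MILLION:
    monthly = estimate_cost(model, 500, 100) * 1_000_000
    print(f"{model}: ${monthly:,.2f}/month")
```

With this assumed workload, gpt-4o-mini comes to about $135 per month against about $2,250 for gpt-4o, which is the gap the routing table is meant to exploit.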

Prompt Caching (Provider-Side)

Anthropic — cache long system prompts (saves 90% on cached tokens)

import anthropic

client = anthropic.Anthropic()

with open("large-context.txt") as f:
    large_context = f.read()  # large document to cache

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are a helpful assistant.",
        },
        {
            "type": "text",
            "text": large_context,
            "cache_control": {"type": "ephemeral"},  # cache the prompt up to this block
        },
    ],
    messages=[{"role": "user", "content": "Summarize the key points."}],
)

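First call: the cached part is written at a premium (about 1.25× the base input price). Subsequent calls within the cache lifetime: about 90% discount on the cached part.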

print(f"Cache read tokens: {response.usage.cache_read_input_tokens}")

OpenAI: prompt caching is automatic for prompts of 1,024 tokens or more that share a prefix

No code change needed; check usage.prompt_tokens_details.cached_tokens
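
Whether provider-side caching pays off depends on how many calls reuse the prefix before the cache entry expires. The sketch below compares the input cost of a shared prefix with and without caching. It assumes Anthropic-style pricing, where a cache write costs about 1.25× and a cache read about 0.1× the base input price; both multipliers, the function name `cached_prefix_cost`, and the example workload are assumptions to check against current provider pricing.

```python
def cached_prefix_cost(
    prefix_tokens: int,
    calls: int,
    input_price_per_million: float,
    write_multiplier: float = 1.25,  # assumed premium for writing the cache
    read_multiplier: float = 0.10,   # assumed price factor for cache reads
) -> dict:
    """Compare the input cost of a shared prefix with and without caching.

    Assumes the first call writes the cache and every later call hits it,
    i.e. all calls arrive before the cache entry expires.
    """
    if prefix_tokens < 1 or calls < 1:
        raise ValueError("prefix_tokens and calls must be at least 1")
    base = prefix_tokens * input_price_per_million / 1_000_000
    uncached = base * calls
    cached = base * write_multiplier + base * read_multiplier * (calls - 1)
    return {
        "uncached_usd": uncached,
        "cached_usd": cached,
        "savings_fraction": 1 - cached / uncached,
    }


# Hypothetical: 50k-token document, 100 questions, $3.00 per 1M input tokens.
print(cached_prefix_cost(50_000, 100, 3.00))

# A single call never recoups the write premium.
print(cached_prefix_cost(50_000, 1, 3.00))
```

Under these assumptions the break-even is the second call, so caching suits prefixes that are reused at least a few times within the cache lifetime.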

Batching with OpenAI Batch API (50% Discount)

import json

from openai import OpenAI

client = OpenAI()

Prepare batch requests

# texts is assumed to be a list of strings to classify
requests = [
    {
        "custom_id": f"task-{i}",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "gpt-4o-mini",
            "messages": [{"role": "user", "content": f"Classify: {text}"}],
            "max_tokens": 50,
        },
    }
    for i, text in enumerate(texts)
]

Write JSONL file

with open("batch.jsonl", "w") as f:
    for req in requests:
        f.write(json.dumps(req) + "\n")

Upload and create batch

with open("batch.jsonl", "rb") as f:
    batch_file = client.files.create(file=f, purpose="batch")

batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
print(f"Batch ID: {batch.id}")  # poll status with client.batches.retrieve(batch.id)
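
Once the batch completes, the results arrive as a JSONL file whose lines are not guaranteed to match the input order, so they need to be joined back on `custom_id`. The sketch below parses that output, assuming each line carries `custom_id`, `response` (with `status_code` and `body`), and `error` fields; `parse_batch_output` is a hypothetical helper and the sample lines are made up.

```python
import json
from typing import Dict


def parse_batch_output(jsonl_text: str) -> Dict[str, str]:
    """Map each custom_id to its completion text, skipping failed requests."""
    results: Dict[str, str] = {}
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        response = record.get("response") or {}
        if record.get("error") or response.get("status_code") != 200:
            continue  # failed request; log or retry it separately
        message = response["body"]["choices"][0]["message"]
        results[record["custom_id"]] = message["content"]
    return results


# Made-up sample: one successful line and one failed line.
sample_output = "\n".join([
    json.dumps({
        "custom_id": "task-0",
        "response": {
            "status_code": 200,
            "body": {"choices": [{"message": {"role": "assistant", "content": "positive"}}]},
        },
        "error": None,
    }),
    json.dumps({
        "custom_id": "task-1",
        "response": None,
        "error": {"code": "example_error", "message": "request failed"},
    }),
])
print(parse_batch_output(sample_output))
```

In practice the text would come from the batch's output file, for example `client.files.content(batch.output_file_id).text` once `batch.status` is `"completed"`.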

Semantic Caching

import hashlib
import json

import numpy as np
import redis
from sentence_transformers import SentenceTransformer

r = redis.Redis(host="localhost", port=6379)
embed_model = SentenceTransformer("BAAI/bge-small-en-v1.5")

SIMILARITY_THRESHOLD = 0.92
CACHE_TTL = 3600 * 24  # 24 hours


def cached_llm_call(prompt: str, llm_fn) -> str:
    # 1. Exact match (no embedding or LLM call needed)
    prompt_hash = hashlib.sha256(prompt.encode()).hexdigest()
    exact_key = f"exact:{prompt_hash}"
    if cached := r.get(exact_key):
        return cached.decode()

    # 2. Semantic match: linear scan over cached embeddings.
    #    Fine for small caches; use a vector index once the cache grows.
    query_vec = embed_model.encode(prompt)
    for key in r.scan_iter("sem:*"):  # SCAN does not block Redis the way KEYS does
        raw = r.get(key)
        if raw is None:  # entry expired between SCAN and GET
            continue
        data = json.loads(raw)
        cached_vec = np.asarray(data["embedding"])
        similarity = np.dot(query_vec, cached_vec) / (
            np.linalg.norm(query_vec) * np.linalg.norm(cached_vec)
        )
        if similarity >= SIMILARITY_THRESHOLD:
            return data["response"]

    # 3. Cache miss: call the LLM
    response = llm_fn(prompt)

    # Store exact match
    r.setex(exact_key, CACHE_TTL, response)

    # Store embedding and response for semantic lookup
    r.setex(f"sem:{prompt_hash}", CACHE_TTL, json.dumps({
        "embedding": query_vec.tolist(),
        "response": response,
        "prompt": prompt,
    }))
    return response
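
The lookup above returns the first cached entry that clears the threshold, which may not be the closest one. A possible refinement is to keep the highest-scoring entry instead. The sketch below shows that selection step in isolation, using only the standard library and plain lists in place of Redis and the embedding model; `best_cached_response` and the toy two-dimensional vectors are illustrative.

```python
import math
from typing import Dict, List, Optional, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def best_cached_response(
    query_vec: Sequence[float],
    entries: List[Dict],
    threshold: float = 0.92,
) -> Optional[str]:
    """Return the response of the most similar entry at or above threshold."""
    best_score, best_response = threshold, None
    for entry in entries:
        score = cosine_similarity(query_vec, entry["embedding"])
        if score >= best_score:
            best_score, best_response = score, entry["response"]
    return best_response


# Toy cache with two entries; real embeddings have hundreds of dimensions.
cache = [
    {"embedding": [1.0, 0.0], "response": "answer A"},
    {"embedding": [0.9, 0.1], "response": "answer B"},
]
print(best_cached_response([0.9, 0.1], cache))  # both clear 0.92; B is closer
print(best_cached_response([0.0, 1.0], cache))  # nothing is similar enough
```

The threshold itself is worth tuning on real traffic: too low and users receive answers to a different question, too high and the hit rate collapses.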

Prompt Compression

LLMLingua — compress long prompts by 3–20× with minimal quality loss

from llmlingua import PromptCompressor

compressor = PromptCompressor(
    model_name="microsoft/llmlingua-2-bert-base-multilingual-cased-meetingbank",
    use_llmlingua2=True,  # needed when loading an LLMLingua-2 model
    device_map="cpu",
)

# long_context is assumed to hold the prompt text to compress
compressed = compressor.compress_prompt(
    long_context,
    rate=0.5,  # keep roughly 50% of tokens
)
print(f"Original: {len(long_context.split())} words")
print(f"Compressed: {len(compressed['compressed_prompt'].split())} words")
print(f"Savings: {compressed['saving']}")

Self-Hosting Break-Even Calculator

def break_even_analysis(
    monthly_api_spend_usd: float,
    gpu_cost_per_hour_usd: float = 2.50,  # e.g., A10G on AWS
    utilization: float = 0.70,  # fraction of hours the GPU is running (and billed)
) -> dict:
    monthly_gpu_cost = gpu_cost_per_hour_usd * 24 * 30 * utilization
    # GPU cost as a fraction of API spend; lower favors self-hosting
    break_even = monthly_gpu_cost / monthly_api_spend_usd
    recommendation = (
        "Self-host now: strong ROI" if break_even < 0.5
        else "Self-host if traffic grows 2×" if break_even < 0.8
        else "Stick with API: not enough scale yet"
    )
    return {
        "monthly_gpu_cost": f"${monthly_gpu_cost:.0f}",
        "monthly_api_spend": f"${monthly_api_spend_usd:.0f}",
        "gpu_as_pct_of_api": f"{break_even*100:.0f}%",
        "recommendation": recommendation,
    }

Example: $5k/month on OpenAI, $2.50/hr A10G

print(break_even_analysis(5000))

→ gpu_cost ~$1,260/mo = 25% of API spend → self-host now
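
The same arithmetic can be inverted to ask how much monthly API spend is needed before the calculator recommends self-hosting. The sketch below uses the calculator's default GPU price and utilization; `min_spend_for_self_hosting` is a hypothetical helper, and like the calculator it ignores engineering and operations overhead.

```python
def min_spend_for_self_hosting(
    gpu_cost_per_hour_usd: float = 2.50,
    utilization: float = 0.70,
    target_ratio: float = 0.5,  # GPU cost as a fraction of API spend
) -> float:
    """Monthly API spend at which GPU cost equals target_ratio of that spend."""
    if not 0 < target_ratio <= 1:
        raise ValueError("target_ratio must be in (0, 1]")
    monthly_gpu_cost = gpu_cost_per_hour_usd * 24 * 30 * utilization
    return monthly_gpu_cost / target_ratio


# Thresholds matching the two cut-offs used by the calculator (0.5 and 0.8).
print(f"Strong ROI above:   ${min_spend_for_self_hosting(target_ratio=0.5):,.0f}/month")
print(f"Worth watching above: ${min_spend_for_self_hosting(target_ratio=0.8):,.0f}/month")
```

With the default assumptions these come to roughly $2,520 and $1,575 per month for a single GPU.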

Cost Dashboard (Grafana)

Emit cost metrics to Prometheus

from prometheus_client import Counter

llm_cost_total = Counter(
    "llm_cost_usd_total",
    "Total LLM spend in USD",
    ["model", "team", "task_type"],
)
llm_tokens_total = Counter(
    "llm_tokens_total",
    "Total tokens used",
    ["model", "token_type"],  # token_type: prompt, completion, cached
)


def track_call(model, team, task_type, response):
    # calculate_cost is a placeholder for your own pricing lookup
    cost = calculate_cost(model, response.usage)
    llm_cost_total.labels(model=model, team=team, task_type=task_type).inc(cost)
    llm_tokens_total.labels(model=model, token_type="prompt").inc(
        response.usage.prompt_tokens)
    llm_tokens_total.labels(model=model, token_type="completion").inc(
        response.usage.completion_tokens)

Best Practices

Use gpt-4o-mini or claude-haiku for 80% of tasks; they are roughly 10–30× cheaper than larger models such as gpt-4o.
Enable prompt caching for system prompts longer than about 1,024 tokens, the minimum cacheable length on OpenAI and most Anthropic models.
Audit your top 5 prompts by token count — compress or cache them.
Set hard budget limits with LiteLLM virtual keys before costs spiral.
Consider self-hosting 7B–8B models once API spend exceeds roughly $2k/month.
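
Hard budget limits are normally enforced at the gateway, but the idea can be sketched in-process: check the estimated cost before a call and record the actual cost afterwards. `BudgetGuard` and `BudgetExceeded` below are hypothetical names used for illustration and are not part of LiteLLM; the sketch is single-process and does not reset at month boundaries.

```python
class BudgetExceeded(RuntimeError):
    """Raised when a call would push spend past the configured limit."""


class BudgetGuard:
    """Track spend against a fixed limit (illustrative, single-process only)."""

    def __init__(self, monthly_limit_usd: float) -> None:
        self.monthly_limit_usd = monthly_limit_usd
        self.spent_usd = 0.0

    def check(self, estimated_cost_usd: float) -> None:
        """Raise before the call if it would exceed the limit."""
        projected = self.spent_usd + estimated_cost_usd
        if projected > self.monthly_limit_usd:
            raise BudgetExceeded(
                f"projected ${projected:.2f} exceeds limit "
                f"${self.monthly_limit_usd:.2f}"
            )

    def record(self, actual_cost_usd: float) -> None:
        """Add the actual cost of a completed call to the running total."""
        self.spent_usd += actual_cost_usd


# Illustrative usage with made-up costs.
guard = BudgetGuard(monthly_limit_usd=100.0)
guard.check(estimated_cost_usd=0.05)  # passes: well under the limit
guard.record(actual_cost_usd=0.04)
print(f"Spent so far: ${guard.spent_usd:.2f}")
```

Checking before the call rather than after means a runaway loop fails fast instead of being discovered on the invoice.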

Related Skills

llm-gateway - Centralized cost control
llm-caching - Semantic caching patterns
vllm-server - Self-hosted inference
agent-observability - Token and cost telemetry
