# Prompt Caching Skill
Leverage Anthropic's prompt caching to dramatically reduce latency and costs for repeated prompts.
## When to Use This Skill

- RAG systems with large static documents
- Multi-turn conversations with long instructions
- Code analysis with large codebase context
- Batch processing with shared prefixes
- Document analysis and summarization
## Core Concepts

### Cache Control Placement

```python
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are a helpful assistant with access to a large knowledge base...",
            "cache_control": {"type": "ephemeral"},  # Cache this content
        }
    ],
    messages=[{"role": "user", "content": "What is...?"}],
)
```
### Cache Hierarchy

Cache breakpoints are checked in this order (see the sketch below):

- **Tools** - Tool definitions cached first
- **System** - System prompts cached second
- **Messages** - Conversation history cached last
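A breakpoint caches the entire prefix up to and including its block, so a breakpoint at a lower level also covers the levels above it. A minimal sketch with one breakpoint per level; the `get_weather` tool and `long_instructions` are illustrative placeholders, and `client` comes from the setup above:

```python
long_instructions = "...large, stable system instructions..."

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    tools=[
        {
            "name": "get_weather",  # hypothetical tool
            "description": "Get the current weather for a location.",
            "input_schema": {
                "type": "object",
                "properties": {"location": {"type": "string"}},
                "required": ["location"],
            },
            "cache_control": {"type": "ephemeral"},  # Breakpoint 1: tools
        }
    ],
    system=[
        {
            "type": "text",
            "text": long_instructions,
            "cache_control": {"type": "ephemeral"},  # Breakpoint 2: system
        }
    ],
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "First question",
                    "cache_control": {"type": "ephemeral"},  # Breakpoint 3: messages
                }
            ],
        }
    ],
)
```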
### TTL Options

| TTL | Write Cost | Read Cost | Use Case |
|---|---|---|---|
| 5 minutes (default) | 1.25x base | 0.1x base | Interactive sessions |
| 1 hour | 2.0x base | 0.1x base | Batch processing, stable docs |
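The TTL is chosen per breakpoint via the optional `ttl` field; omitting it gives the 5-minute default. A small sketch, with `stable_docs` as a placeholder:

```python
stable_docs = "...large static document..."

# 5-minute TTL: the default when ttl is omitted
interactive_block = {
    "type": "text",
    "text": stable_docs,
    "cache_control": {"type": "ephemeral"},
}

# 1-hour TTL: worth the higher write cost when many reads follow
batch_block = {
    "type": "text",
    "text": stable_docs,
    "cache_control": {"type": "ephemeral", "ttl": "1h"},
}
```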
### Cache Requirements

- Minimum tokens: 1024-4096, varies by model (see the check below)
- Maximum breakpoints: 4 per request
- Supported models: Claude Opus 4.5, Sonnet 4.5, Haiku 4.5
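Content below the minimum is silently left uncached, so it can be worth verifying the size before adding a breakpoint. A sketch using the SDK's token-counting endpoint; the 1024 threshold here is an assumption for Sonnet-class models, so check your model's minimum:

```python
document_text = "...candidate content for caching..."

count = client.messages.count_tokens(
    model="claude-sonnet-4-20250514",
    system=[{"type": "text", "text": document_text}],
    messages=[{"role": "user", "content": "placeholder"}],
)

system_block = {"type": "text", "text": document_text}
if count.input_tokens >= 1024:  # assumed Sonnet-class minimum; varies by model
    system_block["cache_control"] = {"type": "ephemeral"}
```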
## Implementation Patterns

### Pattern 1: Single Breakpoint (Recommended)

Best for: document analysis, Q&A with static context.

```python
system = [
    {
        "type": "text",
        "text": large_document_content,
        "cache_control": {"type": "ephemeral"},  # Single breakpoint at end
    }
]
```
### Pattern 2: Multi-Turn Conversation

The cache grows with the conversation. Note that `cache_control` belongs on a content block, not on the message object, so the final user turn uses the block form:

```python
messages = [
    {"role": "user", "content": "First question"},
    {"role": "assistant", "content": "First answer"},
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "Follow-up question",
                "cache_control": {"type": "ephemeral"},  # Cache the conversation up to here
            }
        ],
    },
]
```
### Pattern 3: RAG with Multiple Breakpoints

```python
system = [
    {
        "type": "text",
        "text": "Tool definitions and instructions",
        "cache_control": {"type": "ephemeral"},  # Breakpoint 1: Tools
    },
    {
        "type": "text",
        "text": retrieved_documents,
        "cache_control": {"type": "ephemeral"},  # Breakpoint 2: Documents
    },
]
```
### Pattern 4: Batch Processing with 1-Hour TTL

```python
# Warm the cache before the batch
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=100,
    system=[{
        "type": "text",
        "text": shared_context,
        "cache_control": {"type": "ephemeral", "ttl": "1h"},
    }],
    messages=[{"role": "user", "content": "Initialize cache"}],
)

# Now run the batch - all requests hit the cache
for item in batch_items:
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        system=[{
            "type": "text",
            "text": shared_context,
            "cache_control": {"type": "ephemeral", "ttl": "1h"},
        }],
        messages=[{"role": "user", "content": item}],
    )
```
## Performance Monitoring

### Check Cache Usage

```python
response = client.messages.create(...)

# Monitor these fields
cache_write = response.usage.cache_creation_input_tokens  # New cache written
cache_read = response.usage.cache_read_input_tokens       # Cache hit!
uncached = response.usage.input_tokens                    # After breakpoint

print(f"Cache hit rate: {cache_read / (cache_read + cache_write + uncached) * 100:.1f}%")
```
### Cost Calculation

```python
def calculate_cost(usage, model="claude-sonnet-4-20250514"):
    """Estimate input-side cost; output tokens are billed separately."""
    # Example rates (check current pricing)
    base_input_rate = 0.003  # per 1K tokens

    write_cost = (usage.cache_creation_input_tokens / 1000) * base_input_rate * 1.25
    read_cost = (usage.cache_read_input_tokens / 1000) * base_input_rate * 0.1
    uncached_cost = (usage.input_tokens / 1000) * base_input_rate

    return write_cost + read_cost + uncached_cost
```
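For example, applied to the `response` from the monitoring snippet above:

```python
print(f"Estimated input cost: ${calculate_cost(response.usage):.4f}")
```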
## Cache Invalidation

Changes that invalidate the cache:

| Change | Impact |
|---|---|
| Tool definitions | Entire cache invalidated |
| System prompt | System + messages invalidated |
| Any content before a breakpoint | That breakpoint + later invalidated |
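In practice this means ordering content from most to least stable and keeping volatile data after the last breakpoint, so per-request changes never touch the cached prefix. A sketch with illustrative variable names:

```python
stable_instructions = "...instructions that rarely change..."
per_request_context = "...data that changes every request..."

system = [
    {
        "type": "text",
        "text": stable_instructions,
        "cache_control": {"type": "ephemeral"},  # breakpoint closes the stable prefix
    },
    {
        "type": "text",
        "text": per_request_context,  # after the breakpoint; stays uncached
    },
]
```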
## Best Practices

**DO:**

- Place the breakpoint at the END of static content
- Keep tools/instructions stable across requests
- Use the 1-hour TTL for batch processing
- Monitor `cache_read_input_tokens` for savings

**DON'T:**

- Place a breakpoint in the middle of dynamic content
- Change tool definitions frequently
- Expect caching to work with <1024 tokens
- Ignore the 20-block lookback limit
## Integration with Extended Thinking

### Cache + Extended Thinking

```python
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=16000,
    thinking={"type": "enabled", "budget_tokens": 10000},
    system=[{
        "type": "text",
        "text": large_context,
        "cache_control": {"type": "ephemeral"},
    }],
    messages=[{"role": "user", "content": "Analyze this..."}],
)
```
## See Also

- [[llm-integration]] - Claude API basics
- [[extended-thinking]] - Deep reasoning
- [[batch-processing]] - Bulk processing