# Prompt Caching Skill
Leverage Anthropic's prompt caching to dramatically reduce latency and costs for repeated prompts.
## When to Use This Skill

- RAG systems with large static documents
- Multi-turn conversations with long instructions
- Code analysis with large codebase context
- Batch processing with shared prefixes
- Document analysis and summarization
## Core Concepts

### Cache Control Placement

```python
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are a helpful assistant with access to a large knowledge base...",
            "cache_control": {"type": "ephemeral"},  # Cache this content
        }
    ],
    messages=[{"role": "user", "content": "What is...?"}],
)
```
### Cache Hierarchy

Cache breakpoints are checked in this order (see the sketch below):

- **Tools** - Tool definitions cached first
- **System** - System prompts cached second
- **Messages** - Conversation history cached last
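A breakpoint caches the entire prefix up to and including its block, so a breakpoint at a lower level also covers the levels above it. A minimal sketch with one breakpoint per level; the `get_weather` tool and `long_instructions` are illustrative placeholders, and `client` comes from the setup above:

```python
long_instructions = "...large, stable system instructions..."

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    tools=[
        {
            "name": "get_weather",  # hypothetical tool
            "description": "Get the current weather for a location.",
            "input_schema": {
                "type": "object",
                "properties": {"location": {"type": "string"}},
                "required": ["location"],
            },
            "cache_control": {"type": "ephemeral"},  # Breakpoint 1: tools
        }
    ],
    system=[
        {
            "type": "text",
            "text": long_instructions,
            "cache_control": {"type": "ephemeral"},  # Breakpoint 2: system
        }
    ],
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "First question",
                    "cache_control": {"type": "ephemeral"},  # Breakpoint 3: messages
                }
            ],
        }
    ],
)
```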
### TTL Options

| TTL | Write Cost | Read Cost | Use Case |
|---|---|---|---|
| 5 minutes (default) | 1.25x base | 0.1x base | Interactive sessions |
| 1 hour | 2.0x base | 0.1x base | Batch processing, stable docs |
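The TTL is chosen per breakpoint via the optional `ttl` field; omitting it gives the 5-minute default. A small sketch, with `stable_docs` as a placeholder:

```python
stable_docs = "...large static document..."

# 5-minute TTL: the default when ttl is omitted
interactive_block = {
    "type": "text",
    "text": stable_docs,
    "cache_control": {"type": "ephemeral"},
}

# 1-hour TTL: worth the higher write cost when many reads follow
batch_block = {
    "type": "text",
    "text": stable_docs,
    "cache_control": {"type": "ephemeral", "ttl": "1h"},
}
```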
### Cache Requirements

- Minimum tokens: 1024-4096, varies by model (see the check below)
- Maximum breakpoints: 4 per request
- Supported models: Claude Opus 4.5, Sonnet 4.5, Haiku 4.5
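Content below the minimum is silently left uncached, so it can be worth verifying the size before adding a breakpoint. A sketch using the SDK's token-counting endpoint; the 1024 threshold here is an assumption for Sonnet-class models, so check your model's minimum:

```python
document_text = "...candidate content for caching..."

count = client.messages.count_tokens(
    model="claude-sonnet-4-20250514",
    system=[{"type": "text", "text": document_text}],
    messages=[{"role": "user", "content": "placeholder"}],
)

system_block = {"type": "text", "text": document_text}
if count.input_tokens >= 1024:  # assumed Sonnet-class minimum; varies by model
    system_block["cache_control"] = {"type": "ephemeral"}
```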
## Implementation Patterns

### Pattern 1: Single Breakpoint (Recommended)

Best for: document analysis, Q&A with static context.

```python
system = [
    {
        "type": "text",
        "text": large_document_content,
        "cache_control": {"type": "ephemeral"},  # Single breakpoint at end
    }
]
```
### Pattern 2: Multi-Turn Conversation

The cache grows with the conversation. Note that `cache_control` belongs on a content block, not on the message object, so the final user turn uses the block form:

```python
messages = [
    {"role": "user", "content": "First question"},
    {"role": "assistant", "content": "First answer"},
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "Follow-up question",
                "cache_control": {"type": "ephemeral"},  # Cache the conversation up to here
            }
        ],
    },
]
```
### Pattern 3: RAG with Multiple Breakpoints

```python
system = [
    {
        "type": "text",
        "text": "Tool definitions and instructions",
        "cache_control": {"type": "ephemeral"},  # Breakpoint 1: Tools
    },
    {
        "type": "text",
        "text": retrieved_documents,
        "cache_control": {"type": "ephemeral"},  # Breakpoint 2: Documents
    },
]
```
### Pattern 4: Batch Processing with 1-Hour TTL

```python
# Warm the cache before the batch
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=100,
    system=[{
        "type": "text",
        "text": shared_context,
        "cache_control": {"type": "ephemeral", "ttl": "1h"},
    }],
    messages=[{"role": "user", "content": "Initialize cache"}],
)

# Now run the batch - all requests hit the cache
for item in batch_items:
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        system=[{
            "type": "text",
            "text": shared_context,
            "cache_control": {"type": "ephemeral", "ttl": "1h"},
        }],
        messages=[{"role": "user", "content": item}],
    )
```
## Performance Monitoring

### Check Cache Usage

```python
response = client.messages.create(...)

# Monitor these fields
cache_write = response.usage.cache_creation_input_tokens  # New cache written
cache_read = response.usage.cache_read_input_tokens       # Cache hit!
uncached = response.usage.input_tokens                    # After breakpoint

print(f"Cache hit rate: {cache_read / (cache_read + cache_write + uncached) * 100:.1f}%")
```
### Cost Calculation

```python
def calculate_cost(usage, model="claude-sonnet-4-20250514"):
    """Estimate input-side cost; output tokens are billed separately."""
    # Example rates (check current pricing)
    base_input_rate = 0.003  # per 1K tokens

    write_cost = (usage.cache_creation_input_tokens / 1000) * base_input_rate * 1.25
    read_cost = (usage.cache_read_input_tokens / 1000) * base_input_rate * 0.1
    uncached_cost = (usage.input_tokens / 1000) * base_input_rate

    return write_cost + read_cost + uncached_cost
```
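For example, applied to the `response` from the monitoring snippet above:

```python
print(f"Estimated input cost: ${calculate_cost(response.usage):.4f}")
```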
## Cache Invalidation

Changes that invalidate the cache:

| Change | Impact |
|---|---|
| Tool definitions | Entire cache invalidated |
| System prompt | System + messages invalidated |
| Any content before a breakpoint | That breakpoint + later invalidated |
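In practice this means ordering content from most to least stable and keeping volatile data after the last breakpoint, so per-request changes never touch the cached prefix. A sketch with illustrative variable names:

```python
stable_instructions = "...instructions that rarely change..."
per_request_context = "...data that changes every request..."

system = [
    {
        "type": "text",
        "text": stable_instructions,
        "cache_control": {"type": "ephemeral"},  # breakpoint closes the stable prefix
    },
    {
        "type": "text",
        "text": per_request_context,  # after the breakpoint; stays uncached
    },
]
```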
## Best Practices

**DO:**

- Place the breakpoint at the END of static content
- Keep tools/instructions stable across requests
- Use the 1-hour TTL for batch processing
- Monitor `cache_read_input_tokens` for savings

**DON'T:**

- Place a breakpoint in the middle of dynamic content
- Change tool definitions frequently
- Expect caching to work with <1024 tokens
- Ignore the 20-block lookback limit
## Integration with Extended Thinking

### Cache + Extended Thinking

```python
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=16000,
    thinking={"type": "enabled", "budget_tokens": 10000},
    system=[{
        "type": "text",
        "text": large_context,
        "cache_control": {"type": "ephemeral"},
    }],
    messages=[{"role": "user", "content": "Analyze this..."}],
)
```
## See Also

- [[llm-integration]] - Claude API basics
- [[extended-thinking]] - Deep reasoning
- [[batch-processing]] - Bulk processing