prompt-caching

Leverage Anthropic's prompt caching to dramatically reduce latency and costs for repeated prompts.

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.


Install skill "prompt-caching" with this command: npx skills add lobbi-docs/claude/lobbi-docs-claude-prompt-caching

Prompt Caching Skill


When to Use This Skill

  • RAG systems with large static documents

  • Multi-turn conversations with long instructions

  • Code analysis with large codebase context

  • Batch processing with shared prefixes

  • Document analysis and summarization

Core Concepts

Cache Control Placement

import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are a helpful assistant with access to a large knowledge base...",
            "cache_control": {"type": "ephemeral"}  # Cache this content
        }
    ],
    messages=[{"role": "user", "content": "What is...?"}]
)

Cache Hierarchy

Cache breakpoints are checked in this order:

  • Tools - Tool definitions cached first

  • System - System prompts cached second

  • Messages - Conversation history cached last
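The hierarchy above can be sketched as request assembly: pin a breakpoint to the end of the tools section and the end of the system section. This is a sketch of the payload shape only; the tool definition, helper name, and contents are illustrative, not part of the SDK.

```python
# Sketch: assemble a request whose cacheable sections follow the
# tools -> system -> messages hierarchy. Contents are placeholders.

def build_request(tools, system_text, history):
    """Attach a cache breakpoint to the end of each static section."""
    tools = [dict(t) for t in tools]
    if tools:
        tools[-1]["cache_control"] = {"type": "ephemeral"}  # checked first
    system = [{
        "type": "text",
        "text": system_text,
        "cache_control": {"type": "ephemeral"},  # checked second
    }]
    return {"tools": tools, "system": system, "messages": history}

req = build_request(
    tools=[{"name": "search", "description": "Search docs",
            "input_schema": {"type": "object"}}],
    system_text="You are a helpful assistant...",
    history=[{"role": "user", "content": "What is prompt caching?"}],
)
```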

TTL Options

TTL                  Write Cost  Read Cost  Use Case

5 minutes (default)  1.25x base  0.1x base  Interactive sessions

1 hour               2.0x base   0.1x base  Batch processing, stable docs
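Whether the write premium pays for itself is simple arithmetic. A sketch using the multipliers from the table and an illustrative base rate (check current pricing before relying on the dollar figures):

```python
# Sketch: when does caching pay off? Multipliers come from the TTL
# table; the base rate is illustrative, not current pricing.
BASE = 0.003 / 1000  # $ per input token (illustrative)

def cost(tokens, reads, write_mult, read_mult=0.1):
    """One cache write plus `reads` cache hits over `tokens` tokens."""
    return tokens * BASE * write_mult + reads * tokens * BASE * read_mult

tokens, reads = 50_000, 10
uncached = (reads + 1) * tokens * BASE           # every request pays full price
five_min = cost(tokens, reads, write_mult=1.25)  # 5-minute TTL
one_hour = cost(tokens, reads, write_mult=2.0)   # 1-hour TTL

print(f"uncached ${uncached:.2f}  5m ${five_min:.2f}  1h ${one_hour:.2f}")
```

Even a handful of hits recovers the write premium; the 1-hour TTL only costs more on the initial write.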

Cache Requirements

  • Minimum tokens: 1024-4096 (varies by model)

  • Maximum breakpoints: 4 per request

  • Supported models: Claude Opus 4.5, Sonnet 4.5, Haiku 4.5
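Because undersized content is silently left uncached, a guard can help avoid paying attention to breakpoints that will never hit. The 4-characters-per-token heuristic and the `system_block` helper are assumptions for illustration, not part of the SDK; use the API's token-counting endpoint for real numbers.

```python
# Sketch: only attach cache_control when content clears the model's
# minimum cacheable size. chars/4 is a rough token estimate (assumption).
MIN_CACHE_TOKENS = 1024  # varies by model (1024-4096)

def system_block(text):
    block = {"type": "text", "text": text}
    if len(text) // 4 >= MIN_CACHE_TOKENS:  # ~4 chars per token
        block["cache_control"] = {"type": "ephemeral"}
    return block

small = system_block("You are a helpful assistant.")  # too short to cache
big = system_block("x" * 10_000)                      # ~2,500 tokens, cacheable
```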

Implementation Patterns

Pattern 1: Single Breakpoint (Recommended)

Best for: Document analysis, Q&A with static context

system = [
    {
        "type": "text",
        "text": large_document_content,
        "cache_control": {"type": "ephemeral"}  # Single breakpoint at end
    }
]

Pattern 2: Multi-Turn Conversation

Cache grows with conversation

messages = [
    {"role": "user", "content": "First question"},
    {"role": "assistant", "content": "First answer"},
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "Follow-up question",
                # cache_control goes on a content block, not the message itself
                "cache_control": {"type": "ephemeral"}  # Cache entire conversation
            }
        ]
    }
]
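In a loop, the breakpoint has to move: strip the old marker and cache up to the newest user turn so each round reuses the previously cached prefix. `add_user_turn` is a hypothetical helper, not an SDK function:

```python
# Sketch: keep the cache breakpoint on the newest user turn.
# add_user_turn is a hypothetical helper for illustration.

def add_user_turn(messages, text):
    """Strip old breakpoints, then cache the conversation up to the new turn."""
    for msg in messages:
        if isinstance(msg["content"], list):
            for block in msg["content"]:
                block.pop("cache_control", None)
    messages.append({
        "role": "user",
        "content": [{
            "type": "text",
            "text": text,
            "cache_control": {"type": "ephemeral"},
        }],
    })
    return messages

history = []
add_user_turn(history, "First question")
history.append({"role": "assistant", "content": "First answer"})
add_user_turn(history, "Follow-up question")
```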

Pattern 3: RAG with Multiple Breakpoints

system = [
    {
        "type": "text",
        "text": "Tool definitions and instructions",
        "cache_control": {"type": "ephemeral"}  # Breakpoint 1: Tools
    },
    {
        "type": "text",
        "text": retrieved_documents,
        "cache_control": {"type": "ephemeral"}  # Breakpoint 2: Documents
    }
]

Pattern 4: Batch Processing with 1-Hour TTL

# Warm the cache before the batch
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=100,
    system=[{
        "type": "text",
        "text": shared_context,
        "cache_control": {"type": "ephemeral", "ttl": "1h"}
    }],
    messages=[{"role": "user", "content": "Initialize cache"}]
)

# Now run the batch - all requests hit the cache
for item in batch_items:
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        system=[{
            "type": "text",
            "text": shared_context,
            "cache_control": {"type": "ephemeral", "ttl": "1h"}
        }],
        messages=[{"role": "user", "content": item}]
    )

Performance Monitoring

Check Cache Usage

response = client.messages.create(...)

# Monitor these fields
cache_write = response.usage.cache_creation_input_tokens  # New cache written
cache_read = response.usage.cache_read_input_tokens       # Cache hit!
uncached = response.usage.input_tokens                    # After breakpoint

print(f"Cache hit rate: {cache_read / (cache_read + cache_write + uncached) * 100:.1f}%")

Cost Calculation

def calculate_cost(usage, model="claude-sonnet-4-20250514"):
    # Example rates (check current pricing)
    base_input_rate = 0.003  # per 1K tokens

    write_cost = (usage.cache_creation_input_tokens / 1000) * base_input_rate * 1.25
    read_cost = (usage.cache_read_input_tokens / 1000) * base_input_rate * 0.1
    uncached_cost = (usage.input_tokens / 1000) * base_input_rate

    return write_cost + read_cost + uncached_cost
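Plugging a pure cache hit into the same rates shows the scale of the savings. The usage numbers below are a made-up stub, not real API output:

```python
# Worked example: a pure cache hit on a 50k-token prefix versus sending
# it uncached. Usage values are a stub; the rate is illustrative.
from types import SimpleNamespace

usage = SimpleNamespace(
    cache_creation_input_tokens=0,   # nothing written: pure hit
    cache_read_input_tokens=50_000,  # prefix served from cache
    input_tokens=200,                # only the new question is uncached
)
rate = 0.003  # $ per 1K input tokens (illustrative)

cached_cost = ((usage.cache_read_input_tokens / 1000) * rate * 0.1
               + (usage.input_tokens / 1000) * rate)
full_cost = ((usage.cache_read_input_tokens + usage.input_tokens) / 1000) * rate

print(f"${cached_cost:.4f} cached vs ${full_cost:.4f} uncached")
```

At these rates the cached request costs roughly a tenth of the uncached one.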

Cache Invalidation

Changes that invalidate cache:

Change                         Impact

Tool definitions               Entire cache invalidated

System prompt                  System + messages invalidated

Any content before breakpoint  That breakpoint + later invalidated
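The table follows from how lookup works: a cached prefix is only reused on an exact match, so any upstream edit produces a new key and a miss. A toy model of that keying (not Anthropic's actual implementation):

```python
# Toy model (not Anthropic's implementation): cache entries are keyed
# on a hash of the full prefix, so editing anything before a breakpoint
# changes the key and misses the cache.
import hashlib

def cache_key(tools, system, messages_prefix):
    prefix = repr((tools, system, messages_prefix))
    return hashlib.sha256(prefix.encode()).hexdigest()

k1 = cache_key(["search_tool"], "You are helpful.", [])
k2 = cache_key(["search_tool"], "You are helpful.", [])     # identical prefix
k3 = cache_key(["search_tool_v2"], "You are helpful.", [])  # edited tool defs
```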

Best Practices

DO:

  • Place breakpoint at END of static content

  • Keep tools/instructions stable across requests

  • Use 1-hour TTL for batch processing

  • Monitor cache_read_input_tokens for savings

DON'T:

  • Place breakpoint in middle of dynamic content

  • Change tool definitions frequently

  • Expect cache to work with <1024 tokens

  • Ignore the 20-block lookback limit

Integration with Extended Thinking

Cache + Extended Thinking

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=16000,
    thinking={"type": "enabled", "budget_tokens": 10000},
    system=[{
        "type": "text",
        "text": large_context,
        "cache_control": {"type": "ephemeral"}
    }],
    messages=[{"role": "user", "content": "Analyze this..."}]
)

See Also

  • [[llm-integration]] - Claude API basics

  • [[extended-thinking]] - Deep reasoning

  • [[batch-processing]] - Bulk processing

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.
