Semantic Caching
Cache LLM responses by semantic similarity.
Redis 8 Note: Redis 8+ includes Search, JSON, TimeSeries, and Bloom modules built-in. No separate Redis Stack installation is required. Use redis:8 in Docker or any Redis 8+ deployment.
Cache Hierarchy
```
Request → L1 (Exact)   → L2 (Semantic) → L3 (Prompt) → L4 (LLM)
           ~1ms           ~10ms           ~2s           ~3s
           100% save      100% save       90% save      Full cost
```
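The hierarchy's payoff can be sketched as a probability-weighted latency estimate. The latencies below come from the diagram; the per-level hit rates are illustrative assumptions, not measurements:

```python
# Sketch: expected latency across the cache hierarchy.
# Latencies match the diagram; hit rates are assumed for illustration.
LEVELS = [
    # (name, latency_s, hit_rate)
    ("L1 exact",    0.001, 0.20),
    ("L2 semantic", 0.010, 0.25),
    ("L3 prompt",   2.000, 0.15),
    ("L4 full LLM", 3.000, 1.00),  # terminal level always "hits"
]

def expected_latency(levels=LEVELS) -> float:
    """Probability-weighted latency: each level is tried only on prior misses."""
    total, p_reach = 0.0, 1.0
    for _name, latency, hit_rate in levels:
        total += p_reach * hit_rate * latency
        p_reach *= (1 - hit_rate)
    return total
```

With these assumed hit rates, expected latency drops from the ~3s of an uncached call to roughly 1.7s; higher L1/L2 hit rates pull it down further.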
Redis Semantic Cache
```python
import json
import time

import numpy as np
from redis import Redis
from redisvl.index import SearchIndex
from redisvl.query import VectorQuery
from redisvl.query.filter import Tag

# embed_text() and hash_content() are application helpers defined elsewhere.

class SemanticCacheService:
    def __init__(self, redis_url: str, index: SearchIndex, threshold: float = 0.92):
        self.client = Redis.from_url(redis_url)
        self.index = index
        self.threshold = threshold

    async def get(self, content: str, agent_type: str) -> dict | None:
        embedding = await embed_text(content[:2000])
        query = VectorQuery(
            vector=embedding,
            vector_field_name="embedding",
            filter_expression=Tag("agent_type") == agent_type,
            return_fields=["response"],
            num_results=1,
        )
        results = self.index.query(query)
        if results:
            distance = float(results[0].get("vector_distance", 1.0))
            if distance <= (1 - self.threshold):  # distance = 1 - similarity
                return json.loads(results[0]["response"])
        return None

    async def set(self, content: str, response: dict, agent_type: str):
        embedding = await embed_text(content[:2000])
        key = f"cache:{agent_type}:{hash_content(content)}"
        self.client.hset(key, mapping={
            "agent_type": agent_type,
            # Store the vector as raw float32 bytes so the index can search it
            "embedding": np.array(embedding, dtype=np.float32).tobytes(),
            "response": json.dumps(response),
            "created_at": time.time(),
        })
        self.client.expire(key, 86400)  # 24h TTL
```
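A matching redisvl index schema for the cache service might look like the following. This is a sketch: the field names mirror the hash fields written by `set()`, while the index name and the 1536-dim size (text-embedding-3-small) are assumptions to adjust for your deployment. The dict is what you would pass to redisvl's `SearchIndex.from_dict(...)` before calling `create()`:

```python
# Hypothetical redisvl index schema matching the cache service's hash fields.
# Pass to redisvl's SearchIndex.from_dict(...) and call create() once at startup.
CACHE_SCHEMA = {
    "index": {
        "name": "llm_cache",
        "prefix": "cache",         # matches the "cache:{agent_type}:{hash}" keys
        "storage_type": "hash",
    },
    "fields": [
        {"name": "agent_type", "type": "tag"},   # enables @agent_type:{...} filters
        {"name": "response", "type": "text"},
        {
            "name": "embedding",
            "type": "vector",
            "attrs": {
                "dims": 1536,                    # text-embedding-3-small (assumed)
                "algorithm": "hnsw",
                "distance_metric": "cosine",     # distance = 1 - cosine similarity
                "datatype": "float32",
            },
        },
    ],
}
```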
Similarity Thresholds
| Threshold | Distance | Use Case |
|-----------|----------|----------|
| 0.98-1.00 | 0.00-0.02 | Nearly identical |
| 0.95-0.98 | 0.02-0.05 | Very similar |
| 0.92-0.95 | 0.05-0.08 | Similar (default) |
| 0.85-0.92 | 0.08-0.15 | Moderately similar |
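The table's two columns are the same cutoff viewed from opposite ends: with cosine similarity, distance = 1 - similarity. A small helper makes the conversion and the hit check explicit (a sketch mirroring the `distance <= (1 - threshold)` check in the cache service):

```python
def max_distance(threshold: float) -> float:
    """Cosine distance cutoff for a similarity threshold: distance = 1 - similarity."""
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be in [0, 1]")
    return round(1.0 - threshold, 6)

def is_cache_hit(distance: float, threshold: float = 0.92) -> bool:
    """A result counts as a hit when its distance is within the cutoff."""
    return distance <= max_distance(threshold)
```

For the default threshold of 0.92, anything within distance 0.08 is served from cache.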
Multi-Level Lookup
```python
async def get_llm_response(query: str, agent_type: str) -> dict:
    # L1: Exact match (in-memory LRU)
    cache_key = hash_content(query)
    if cache_key in lru_cache:
        return lru_cache[cache_key]

    # L2: Semantic similarity (Redis)
    similar = await semantic_cache.get(query, agent_type)
    if similar:
        lru_cache[cache_key] = similar  # Promote to L1
        return similar

    # L3/L4: LLM call with prompt caching
    response = await llm.generate(query)

    # Store in caches
    await semantic_cache.set(query, response, agent_type)
    lru_cache[cache_key] = response
    return response
```
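The `lru_cache` used above needs a size bound (the Key Decisions table suggests 1,000-10,000 entries) or L1 grows without limit. A minimal sketch of a bounded LRU with the dict-style interface the lookup code assumes, built on `collections.OrderedDict`:

```python
from collections import OrderedDict

class BoundedLRU:
    """Minimal L1 cache sketch: bounded in-memory LRU for exact-match hits."""

    def __init__(self, max_entries: int = 10_000):
        self.max_entries = max_entries
        self._data: OrderedDict[str, dict] = OrderedDict()

    def get(self, key: str):
        if key not in self._data:
            return None
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def __contains__(self, key: str) -> bool:
        return key in self._data

    def __getitem__(self, key: str):
        return self.get(key)

    def __setitem__(self, key: str, value: dict) -> None:
        self._data[key] = value
        self._data.move_to_end(key)
        while len(self._data) > self.max_entries:
            self._data.popitem(last=False)  # evict least recently used
```

For multi-worker deployments a per-process LRU like this means each worker warms independently; that is usually acceptable since L2 (Redis) is shared.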
Redis 8.4+ Hybrid Search (FT.HYBRID)
Redis 8.4 introduces native hybrid search combining semantic (vector) and exact (keyword) matching in a single query. This is ideal for caches that need both similarity and metadata filtering.
```python
# Redis 8.4 native hybrid search
result = redis.execute_command(
    "FT.HYBRID", "llm_cache",
    "SEARCH", f"@agent_type:{{{agent_type}}}",   # exact keyword/tag filter
    "VSIM", "@embedding", "$query_vec",          # vector similarity clause
    "KNN", "2", "K", "5",
    "COMBINE", "RRF", "4", "CONSTANT", "60",     # Reciprocal Rank Fusion
    "PARAMS", "2", "query_vec", embedding_bytes,
)
```
Hybrid Search Benefits:
- Single query for keyword + vector matching
- RRF (Reciprocal Rank Fusion) combines scores intelligently
- Better results than sequential filtering
- BM25STD is now the default scorer for keyword matching
When to Use Hybrid:
- Filtering by metadata (agent_type, tenant, category) + semantic similarity
- Multi-tenant caches where an exact tenant match is required
- Combining keyword search with vector similarity
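To make the RRF step concrete, here is a minimal, self-contained sketch of Reciprocal Rank Fusion using the same constant of 60 as the FT.HYBRID example. It is an illustration of the scoring idea, not Redis's implementation:

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Reciprocal Rank Fusion: score(doc) = sum over result lists of 1 / (k + rank).

    Documents ranked highly in several lists (e.g. keyword AND vector) rise to
    the top, without having to normalize incompatible score scales.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

A document that is merely first in one list can be overtaken by one that appears near the top of both lists, which is why fused results usually beat sequential filtering.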
Key Decisions
| Decision | Recommendation |
|----------|----------------|
| Threshold | Start at 0.92, tune based on hit rate |
| TTL | 24h for production |
| Embedding | text-embedding-3-small (fast) |
| L1 size | 1,000-10,000 entries |
| Scorer | BM25STD (Redis 8+ default) |
| Hybrid | Use FT.HYBRID for metadata + vector queries |
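A back-of-envelope ROI estimate helps when tuning these decisions. Every input here is an assumption to replace with your own traffic, hit-rate, and pricing numbers; this is an estimate, not a benchmark:

```python
def monthly_savings(requests: int, hit_rate: float, cost_per_call: float,
                    cache_cost: float = 0.0) -> float:
    """Rough caching ROI: avoided LLM spend minus cache infrastructure cost.

    requests      -- monthly request volume (assumed)
    hit_rate      -- combined L1+L2 hit rate, 0..1 (assumed)
    cost_per_call -- average LLM cost per uncached request, in dollars (assumed)
    cache_cost    -- monthly Redis / embedding spend, in dollars (assumed)
    """
    return requests * hit_rate * cost_per_call - cache_cost
```

For example, 1M requests/month at a 30% hit rate and $0.002 per call saves about $600, so even $50/month of cache infrastructure nets roughly $550.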
Common Mistakes
- Threshold too low (false positives)
- No cache warming (cold start)
- Missing metadata filters
- Not promoting L2 hits to L1
Related Skills
- prompt-caching: Provider-native caching
- embeddings: Vector generation
- cache-cost-tracking: Langfuse integration
Capability Details
redis-vector-cache
Keywords: redis, vector, embedding, similarity, cache
Solves:
- Cache LLM responses by semantic similarity
- Reduce API costs with smart caching
- Implement multi-level cache hierarchy

similarity-threshold
Keywords: threshold, similarity, tuning, cosine
Solves:
- Set an appropriate similarity threshold
- Balance hit rate vs. accuracy
- Tune cache performance

orchestkit-integration
Keywords: orchestkit, integration, roi, cost-savings
Solves:
- Integrate caching with OrchestKit
- Calculate ROI for caching
- Production implementation guide

cache-service
Keywords: service, implementation, template, production
Solves:
- Production cache service template
- Complete implementation example
- Redis integration code

hybrid-search
Keywords: hybrid, ft.hybrid, bm25, rrf, keyword, metadata, filter
Solves:
- Combine semantic and keyword search
- Filter cache by metadata with vector similarity
- Use the Redis 8.4 FT.HYBRID command
- BM25STD scoring for keyword matching