Semantic Caching
Cache LLM responses by semantic similarity.
Redis 8 Note: Redis 8+ includes Search, JSON, TimeSeries, and Bloom modules built-in. No separate Redis Stack installation is required. Use redis:8 in Docker or any Redis 8+ deployment.
Cache Hierarchy
```
Request → L1 (Exact)   → L2 (Semantic) → L3 (Prompt) → L4 (LLM)
           ~1ms           ~10ms           ~2s           ~3s
           100% save      100% save       90% save      Full cost
```
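The hierarchy's payoff can be sketched as a probability-weighted latency estimate. The latencies below come from the diagram; the per-level hit rates are illustrative assumptions, not measurements:

```python
# Sketch: expected latency across the cache hierarchy.
# Latencies match the diagram; hit rates are assumed for illustration.
LEVELS = [
    # (name, latency_s, hit_rate)
    ("L1 exact",    0.001, 0.20),
    ("L2 semantic", 0.010, 0.25),
    ("L3 prompt",   2.000, 0.15),
    ("L4 full LLM", 3.000, 1.00),  # terminal level always "hits"
]

def expected_latency(levels=LEVELS) -> float:
    """Probability-weighted latency: each level is tried only on prior misses."""
    total, p_reach = 0.0, 1.0
    for _name, latency, hit_rate in levels:
        total += p_reach * hit_rate * latency
        p_reach *= (1 - hit_rate)
    return total
```

With these assumed hit rates, expected latency drops from the ~3s of an uncached call to roughly 1.7s; higher L1/L2 hit rates pull it down further.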
Redis Semantic Cache
```python
import json
import time

import numpy as np
from redis import Redis
from redisvl.index import SearchIndex
from redisvl.query import VectorQuery
from redisvl.query.filter import Tag

# embed_text() and hash_content() are application helpers defined elsewhere.

class SemanticCacheService:
    def __init__(self, redis_url: str, index: SearchIndex, threshold: float = 0.92):
        self.client = Redis.from_url(redis_url)
        self.index = index
        self.threshold = threshold

    async def get(self, content: str, agent_type: str) -> dict | None:
        embedding = await embed_text(content[:2000])
        query = VectorQuery(
            vector=embedding,
            vector_field_name="embedding",
            filter_expression=Tag("agent_type") == agent_type,
            return_fields=["response"],
            num_results=1,
        )
        results = self.index.query(query)
        if results:
            distance = float(results[0].get("vector_distance", 1.0))
            if distance <= (1 - self.threshold):  # distance = 1 - similarity
                return json.loads(results[0]["response"])
        return None

    async def set(self, content: str, response: dict, agent_type: str):
        embedding = await embed_text(content[:2000])
        key = f"cache:{agent_type}:{hash_content(content)}"
        self.client.hset(key, mapping={
            "agent_type": agent_type,
            # Store the vector as raw float32 bytes so the index can search it
            "embedding": np.array(embedding, dtype=np.float32).tobytes(),
            "response": json.dumps(response),
            "created_at": time.time(),
        })
        self.client.expire(key, 86400)  # 24h TTL
```
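A matching redisvl index schema for the cache service might look like the following. This is a sketch: the field names mirror the hash fields written by `set()`, while the index name and the 1536-dim size (text-embedding-3-small) are assumptions to adjust for your deployment. The dict is what you would pass to redisvl's `SearchIndex.from_dict(...)` before calling `create()`:

```python
# Hypothetical redisvl index schema matching the cache service's hash fields.
# Pass to redisvl's SearchIndex.from_dict(...) and call create() once at startup.
CACHE_SCHEMA = {
    "index": {
        "name": "llm_cache",
        "prefix": "cache",         # matches the "cache:{agent_type}:{hash}" keys
        "storage_type": "hash",
    },
    "fields": [
        {"name": "agent_type", "type": "tag"},   # enables @agent_type:{...} filters
        {"name": "response", "type": "text"},
        {
            "name": "embedding",
            "type": "vector",
            "attrs": {
                "dims": 1536,                    # text-embedding-3-small (assumed)
                "algorithm": "hnsw",
                "distance_metric": "cosine",     # distance = 1 - cosine similarity
                "datatype": "float32",
            },
        },
    ],
}
```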
Similarity Thresholds
| Threshold | Distance | Use Case |
|-----------|----------|----------|
| 0.98-1.00 | 0.00-0.02 | Nearly identical |
| 0.95-0.98 | 0.02-0.05 | Very similar |
| 0.92-0.95 | 0.05-0.08 | Similar (default) |
| 0.85-0.92 | 0.08-0.15 | Moderately similar |
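The table's two columns are the same cutoff viewed from opposite ends: with cosine similarity, distance = 1 - similarity. A small helper makes the conversion and the hit check explicit (a sketch mirroring the `distance <= (1 - threshold)` check in the cache service):

```python
def max_distance(threshold: float) -> float:
    """Cosine distance cutoff for a similarity threshold: distance = 1 - similarity."""
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be in [0, 1]")
    return round(1.0 - threshold, 6)

def is_cache_hit(distance: float, threshold: float = 0.92) -> bool:
    """A result counts as a hit when its distance is within the cutoff."""
    return distance <= max_distance(threshold)
```

For the default threshold of 0.92, anything within distance 0.08 is served from cache.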
Multi-Level Lookup
```python
async def get_llm_response(query: str, agent_type: str) -> dict:
    # L1: Exact match (in-memory LRU)
    cache_key = hash_content(query)
    if cache_key in lru_cache:
        return lru_cache[cache_key]

    # L2: Semantic similarity (Redis)
    similar = await semantic_cache.get(query, agent_type)
    if similar:
        lru_cache[cache_key] = similar  # Promote to L1
        return similar

    # L3/L4: LLM call with prompt caching
    response = await llm.generate(query)

    # Store in caches
    await semantic_cache.set(query, response, agent_type)
    lru_cache[cache_key] = response
    return response
```
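The `lru_cache` used above needs a size bound (the Key Decisions table suggests 1,000-10,000 entries) or L1 grows without limit. A minimal sketch of a bounded LRU with the dict-style interface the lookup code assumes, built on `collections.OrderedDict`:

```python
from collections import OrderedDict

class BoundedLRU:
    """Minimal L1 cache sketch: bounded in-memory LRU for exact-match hits."""

    def __init__(self, max_entries: int = 10_000):
        self.max_entries = max_entries
        self._data: OrderedDict[str, dict] = OrderedDict()

    def get(self, key: str):
        if key not in self._data:
            return None
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def __contains__(self, key: str) -> bool:
        return key in self._data

    def __getitem__(self, key: str):
        return self.get(key)

    def __setitem__(self, key: str, value: dict) -> None:
        self._data[key] = value
        self._data.move_to_end(key)
        while len(self._data) > self.max_entries:
            self._data.popitem(last=False)  # evict least recently used
```

For multi-worker deployments a per-process LRU like this means each worker warms independently; that is usually acceptable since L2 (Redis) is shared.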
Redis 8.4+ Hybrid Search (FT.HYBRID)
Redis 8.4 introduces native hybrid search combining semantic (vector) and exact (keyword) matching in a single query. This is ideal for caches that need both similarity and metadata filtering.
```python
# Redis 8.4 native hybrid search
result = redis.execute_command(
    "FT.HYBRID", "llm_cache",
    "SEARCH", f"@agent_type:{{{agent_type}}}",   # exact keyword/tag filter
    "VSIM", "@embedding", "$query_vec",          # vector similarity clause
    "KNN", "2", "K", "5",
    "COMBINE", "RRF", "4", "CONSTANT", "60",     # Reciprocal Rank Fusion
    "PARAMS", "2", "query_vec", embedding_bytes,
)
```
Hybrid Search Benefits:
- Single query for keyword + vector matching
- RRF (Reciprocal Rank Fusion) combines scores intelligently
- Better results than sequential filtering
- BM25STD is now the default scorer for keyword matching
When to Use Hybrid:
- Filtering by metadata (agent_type, tenant, category) + semantic similarity
- Multi-tenant caches where an exact tenant match is required
- Combining keyword search with vector similarity
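To make the RRF step concrete, here is a minimal, self-contained sketch of Reciprocal Rank Fusion using the same constant of 60 as the FT.HYBRID example. It is an illustration of the scoring idea, not Redis's implementation:

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Reciprocal Rank Fusion: score(doc) = sum over result lists of 1 / (k + rank).

    Documents ranked highly in several lists (e.g. keyword AND vector) rise to
    the top, without having to normalize incompatible score scales.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

A document that is merely first in one list can be overtaken by one that appears near the top of both lists, which is why fused results usually beat sequential filtering.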
Key Decisions
| Decision | Recommendation |
|----------|----------------|
| Threshold | Start at 0.92, tune based on hit rate |
| TTL | 24h for production |
| Embedding | text-embedding-3-small (fast) |
| L1 size | 1,000-10,000 entries |
| Scorer | BM25STD (Redis 8+ default) |
| Hybrid | Use FT.HYBRID for metadata + vector queries |
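A back-of-envelope ROI estimate helps when tuning these decisions. Every input here is an assumption to replace with your own traffic, hit-rate, and pricing numbers; this is an estimate, not a benchmark:

```python
def monthly_savings(requests: int, hit_rate: float, cost_per_call: float,
                    cache_cost: float = 0.0) -> float:
    """Rough caching ROI: avoided LLM spend minus cache infrastructure cost.

    requests      -- monthly request volume (assumed)
    hit_rate      -- combined L1+L2 hit rate, 0..1 (assumed)
    cost_per_call -- average LLM cost per uncached request, in dollars (assumed)
    cache_cost    -- monthly Redis / embedding spend, in dollars (assumed)
    """
    return requests * hit_rate * cost_per_call - cache_cost
```

For example, 1M requests/month at a 30% hit rate and $0.002 per call saves about $600, so even $50/month of cache infrastructure nets roughly $550.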
Common Mistakes
- Threshold too low (false positives)
- No cache warming (cold start)
- Missing metadata filters
- Not promoting L2 hits to L1
Related Skills
- prompt-caching: Provider-native caching
- embeddings: Vector generation
- cache-cost-tracking: Langfuse integration
Capability Details
redis-vector-cache
Keywords: redis, vector, embedding, similarity, cache
Solves:
- Cache LLM responses by semantic similarity
- Reduce API costs with smart caching
- Implement multi-level cache hierarchy

similarity-threshold
Keywords: threshold, similarity, tuning, cosine
Solves:
- Set an appropriate similarity threshold
- Balance hit rate vs. accuracy
- Tune cache performance

orchestkit-integration
Keywords: orchestkit, integration, roi, cost-savings
Solves:
- Integrate caching with OrchestKit
- Calculate ROI for caching
- Production implementation guide

cache-service
Keywords: service, implementation, template, production
Solves:
- Production cache service template
- Complete implementation example
- Redis integration code

hybrid-search
Keywords: hybrid, ft.hybrid, bm25, rrf, keyword, metadata, filter
Solves:
- Combine semantic and keyword search
- Filter cache by metadata with vector similarity
- Use the Redis 8.4 FT.HYBRID command
- BM25STD scoring for keyword matching