semantic-caching

Cache LLM responses by semantic similarity.

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.


Install skill "semantic-caching" with this command: npx skills add yonatangross/orchestkit/yonatangross-orchestkit-semantic-caching

Semantic Caching

Cache LLM responses by semantic similarity.

Redis 8 Note: Redis 8+ includes Search, JSON, TimeSeries, and Bloom modules built-in. No separate Redis Stack installation is required. Use redis:8 in Docker or any Redis 8+ deployment.

Cache Hierarchy

Request → L1 (Exact) → L2 (Semantic) → L3 (Prompt) → L4 (LLM)

| Level | Latency | Cost |
|-------|---------|------|
| L1 (Exact match) | ~1ms | 100% saved |
| L2 (Semantic) | ~10ms | 100% saved |
| L3 (Prompt cache) | ~2s | ~90% saved |
| L4 (LLM call) | ~3s | Full cost |

Redis Semantic Cache

```python
import json
import time

import numpy as np
from redis import Redis
from redisvl.index import SearchIndex
from redisvl.query import VectorQuery


class SemanticCacheService:
    def __init__(self, redis_url: str, threshold: float = 0.92):
        self.client = Redis.from_url(redis_url)
        # Assumes the "llm_cache" index has already been created in Redis.
        self.index = SearchIndex.from_existing("llm_cache", redis_url=redis_url)
        self.threshold = threshold

    async def get(self, content: str, agent_type: str) -> dict | None:
        embedding = await embed_text(content[:2000])

        query = VectorQuery(
            vector=embedding,
            vector_field_name="embedding",
            filter_expression=f"@agent_type:{{{agent_type}}}",
            num_results=1,
        )

        results = self.index.query(query)

        if results:
            # Redis returns cosine distance (1 - similarity).
            distance = float(results[0].get("vector_distance", 1.0))
            if distance <= (1 - self.threshold):
                return json.loads(results[0]["response"])

        return None

    async def set(self, content: str, response: dict, agent_type: str):
        embedding = await embed_text(content[:2000])
        key = f"cache:{agent_type}:{hash_content(content)}"

        self.client.hset(key, mapping={
            "agent_type": agent_type,
            # Vector fields on hashes must be stored as raw FLOAT32 bytes.
            "embedding": np.array(embedding, dtype=np.float32).tobytes(),
            "response": json.dumps(response),
            "created_at": time.time(),
        })
        self.client.expire(key, 86400)  # 24h TTL
```
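The service relies on two helpers, `embed_text` and `hash_content`, that are not shown. A minimal sketch of `hash_content` follows; the exact implementation is an assumption, and `embed_text` would typically wrap the embedding provider and return a list of floats.

```python
import hashlib

def hash_content(content: str) -> str:
    # Stable key for exact-match lookups; normalizing case and whitespace
    # avoids trivial cache misses on trivially reformatted input.
    normalized = " ".join(content.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:32]
```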

Similarity Thresholds

| Threshold | Distance | Use Case |
|-----------|----------|----------|
| 0.98-1.00 | 0.00-0.02 | Nearly identical |
| 0.95-0.98 | 0.02-0.05 | Very similar |
| 0.92-0.95 | 0.05-0.08 | Similar (default) |
| 0.85-0.92 | 0.08-0.15 | Moderately similar |
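Because Redis reports cosine *distance* rather than similarity, a threshold of 0.92 corresponds to a distance cutoff of 0.08. A small helper makes the conversion explicit (the function name is illustrative):

```python
def is_cache_hit(cosine_distance: float, threshold: float = 0.92) -> bool:
    # similarity >= threshold  <=>  distance <= 1 - threshold
    return cosine_distance <= (1.0 - threshold)
```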

Multi-Level Lookup

```python
async def get_llm_response(query: str, agent_type: str) -> dict:
    # L1: Exact match (in-memory LRU)
    cache_key = hash_content(query)
    if cache_key in lru_cache:
        return lru_cache[cache_key]

    # L2: Semantic similarity (Redis)
    similar = await semantic_cache.get(query, agent_type)
    if similar:
        lru_cache[cache_key] = similar  # Promote to L1
        return similar

    # L3/L4: LLM call with prompt caching
    response = await llm.generate(query)

    # Store in caches
    await semantic_cache.set(query, response, agent_type)
    lru_cache[cache_key] = response

    return response
```
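The `lru_cache` used above is assumed to be a bounded in-memory mapping. One possible sketch using `collections.OrderedDict` (the size is an assumption in the 1000-10000 range recommended later):

```python
from collections import OrderedDict

class LRUCache:
    """Bounded L1 cache: evicts the least-recently-used entry when full."""

    def __init__(self, max_size: int = 5000):
        self.max_size = max_size
        self._data: OrderedDict[str, dict] = OrderedDict()

    def __contains__(self, key: str) -> bool:
        return key in self._data

    def __getitem__(self, key: str) -> dict:
        self._data.move_to_end(key)  # mark as recently used
        return self._data[key]

    def __setitem__(self, key: str, value: dict) -> None:
        if key in self._data:
            self._data.move_to_end(key)
        self._data[key] = value
        if len(self._data) > self.max_size:
            self._data.popitem(last=False)  # evict least recently used

lru_cache = LRUCache(max_size=5000)
```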

Redis 8.4+ Hybrid Search (FT.HYBRID)

Redis 8.4 introduces native hybrid search combining semantic (vector) and exact (keyword) matching in a single query. This is ideal for caches that need both similarity and metadata filtering.

```python
# Redis 8.4 native hybrid search
result = redis.execute_command(
    "FT.HYBRID", "llm_cache",
    "SEARCH", f"@agent_type:{{{agent_type}}}",
    "VSIM", "@embedding", "$query_vec",
    "KNN", "2", "K", "5",
    "COMBINE", "RRF", "4", "CONSTANT", "60",
    "PARAMS", "2", "query_vec", embedding_bytes,
)
```

Hybrid Search Benefits:

  • Single query for keyword + vector matching

  • RRF (Reciprocal Rank Fusion) combines scores intelligently

  • Better results than sequential filtering

  • BM25STD is now the default scorer for keyword matching

When to Use Hybrid:

  • Filtering by metadata (agent_type, tenant, category) + semantic similarity

  • Multi-tenant caches where exact tenant match is required

  • Combining keyword search with vector similarity
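For hybrid (or plain vector) queries to work against the cache, the index must declare both a TAG field for metadata filtering and a VECTOR field. The sketch below builds the `FT.CREATE` arguments, assuming 1536-dimensional embeddings and the `cache:` key prefix used earlier; the index name and schema are assumptions, not the skill's canonical setup.

```python
def llm_cache_schema(dim: int = 1536) -> list[str]:
    # FT.CREATE arguments for an index over cache:* hashes, with a TAG
    # field for exact metadata filtering and a FLOAT32 cosine vector field.
    return [
        "FT.CREATE", "llm_cache",
        "ON", "HASH", "PREFIX", "1", "cache:",
        "SCHEMA",
        "agent_type", "TAG",
        "response", "TEXT",
        "embedding", "VECTOR", "FLAT", "6",
        "TYPE", "FLOAT32", "DIM", str(dim),
        "DISTANCE_METRIC", "COSINE",
    ]

# client.execute_command(*llm_cache_schema())  # requires a Redis 8+ server
```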

Key Decisions

| Decision | Recommendation |
|----------|----------------|
| Threshold | Start at 0.92, tune based on hit rate |
| TTL | 24h for production |
| Embedding | text-embedding-3-small (fast) |
| L1 size | 1000-10000 entries |
| Scorer | BM25STD (Redis 8+ default) |
| Hybrid | Use FT.HYBRID for metadata + vector queries |

Common Mistakes

  • Threshold too low (false positives)

  • No cache warming (cold start)

  • Missing metadata filters

  • Not promoting L2 hits to L1
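To avoid the cold-start problem listed above, the cache can be warmed with known high-frequency queries before serving traffic. A hedged sketch; the seed list and the `semantic_cache`/`llm` interfaces are assumptions matching the earlier examples:

```python
# Hypothetical seed queries; in practice these come from traffic logs.
COMMON_QUERIES = [
    ("What is your refund policy?", "support"),
    ("Summarize this document", "summarizer"),
]

async def warm_cache(semantic_cache, llm) -> int:
    """Pre-populate the semantic cache; returns the number of entries added."""
    warmed = 0
    for query, agent_type in COMMON_QUERIES:
        if await semantic_cache.get(query, agent_type) is None:
            response = await llm.generate(query)
            await semantic_cache.set(query, response, agent_type)
            warmed += 1
    return warmed
```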

Related Skills

  • prompt-caching: Provider-native caching

  • embeddings: Vector generation

  • cache-cost-tracking: Langfuse integration

Capability Details

redis-vector-cache

Keywords: redis, vector, embedding, similarity, cache

Solves:

  • Cache LLM responses by semantic similarity

  • Reduce API costs with smart caching

  • Implement multi-level cache hierarchy

similarity-threshold

Keywords: threshold, similarity, tuning, cosine

Solves:

  • Set appropriate similarity threshold

  • Balance hit rate vs accuracy

  • Tune cache performance

orchestkit-integration

Keywords: orchestkit, integration, roi, cost-savings

Solves:

  • Integrate caching with OrchestKit

  • Calculate ROI for caching

  • Production implementation guide

cache-service

Keywords: service, implementation, template, production

Solves:

  • Production cache service template

  • Complete implementation example

  • Redis integration code

hybrid-search

Keywords: hybrid, ft.hybrid, bm25, rrf, keyword, metadata, filter

Solves:

  • Combine semantic and keyword search

  • Filter cache by metadata with vector similarity

  • Use Redis 8.4 FT.HYBRID command

  • BM25STD scoring for keyword matching

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.
