RAG & Search Engineering — Complete Reference
Build production-grade retrieval systems with hybrid search, grounded generation, and measurable quality.
This skill covers:
- RAG: Chunking, contextual retrieval, grounding, adaptive/self-correcting systems
- Search: BM25, vector search, hybrid fusion, ranking pipelines
- Evaluation: recall@k, nDCG, MRR, groundedness metrics (see the metrics sketch below)
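A minimal sketch of the retrieval metrics named above (recall@k, MRR, nDCG), assuming binary relevance labels keyed by document ID; the function names and input shapes are illustrative, not a fixed API.

```python
import math

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant docs that appear in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)

def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """Reciprocal rank of the first relevant hit; average over queries to get MRR."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Binary-relevance nDCG@k: DCG of this ranking over DCG of the ideal ranking."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, doc_id in enumerate(retrieved[:k], start=1)
              if doc_id in relevant)
    ideal = sum(1.0 / math.log2(rank + 1)
                for rank in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal > 0 else 0.0
```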
Modern Best Practices (Jan 2026):
- Separate retrieval quality from answer quality; evaluate both (RAG: https://arxiv.org/abs/2005.11401).
- Default to hybrid retrieval (sparse + dense) with reranking when precision matters (DPR: https://arxiv.org/abs/2004.04906).
- Use a failure taxonomy to debug systematically (Seven Failure Points in RAG: https://arxiv.org/abs/2401.05856).
- Treat freshness/invalidation as first-class; staleness is a correctness bug, not a UX issue.
- Add grounding gates: answerability checks, citation coverage checks, and refusal-on-missing-context defaults (see the gate sketch below).
- Threat-model RAG: retrieved text is untrusted input (OWASP LLM Top 10: https://owasp.org/www-project-top-10-for-large-language-model-applications/).
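A minimal sketch of the grounding gates described above (answerability, citation coverage, refusal on missing context), assuming answers cite evidence as bracketed chunk IDs like [doc-3]; the citation format and the 0.8 coverage threshold are assumptions, not a standard.

```python
import re
from dataclasses import dataclass

@dataclass
class Chunk:
    id: str
    text: str

REFUSAL = "I don't have enough context to answer that reliably."

def citation_coverage(answer: str, chunks: list[Chunk]) -> float:
    """Fraction of cited IDs (e.g. [doc-3]) that resolve to a retrieved chunk."""
    cited = set(re.findall(r"\[([\w-]+)\]", answer))
    if not cited:
        return 0.0          # uncited answers count as uncovered
    known = {c.id for c in chunks}
    return len(cited & known) / len(cited)

def grounding_gate(answer: str, chunks: list[Chunk], min_coverage: float = 0.8) -> str:
    """Refuse when no evidence was retrieved or citations don't cover the answer."""
    if not chunks:
        return REFUSAL      # answerability gate: nothing to ground on
    if citation_coverage(answer, chunks) < min_coverage:
        return REFUSAL      # citation coverage gate
    return answer
```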
Default posture: deterministic pipeline, bounded context, explicit failure handling, and telemetry for every stage.
Scope note: For prompt structure and output contracts used in the generation phase, see ai-prompt-engineering.
Quick Reference
| Task | Tool/Framework | Command/Pattern | When to Use |
|---|---|---|---|
| Decide RAG vs alternatives | Decision framework | RAG if: freshness + citations + corpus size; else: fine-tune/caching | Avoid unnecessary retrieval latency/complexity |
| Chunking & parsing | Chunker + parser | Start simple; add structure-aware chunking per doc type | Ingestion for docs, code, tables, PDFs |
| Retrieval | Sparse + dense (hybrid) | Fusion (e.g., RRF) + metadata filters + top-k tuning | Mixed query styles; high recall requirements |
| Precision boost | Reranker | Cross-encoder/LLM rerank of top-k candidates | When top-k contains near-misses/noise |
| Grounding | Output contract + citations | Quote/ID citations; answerability gate; refuse on missing evidence | Compliance, trust, and auditability |
| Evaluation | Offline + online eval | Retrieval metrics + answer metrics + regression tests | Prevent silent regressions and staleness failures |
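A minimal sketch of the fusion step referenced in the Retrieval row (reciprocal rank fusion over sparse and dense result lists); k=60 is the conventional RRF smoothing constant, and the document IDs are made up for illustration.

```python
from collections import defaultdict

def rrf_fuse(rankings: list[list[str]], k: int = 60, top_k: int = 10) -> list[str]:
    """Fuse multiple ranked lists (e.g. BM25 + dense) by reciprocal rank."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# Example: documents appearing high in both lists (d1, d3) rise to the top
bm25_hits = ["d3", "d1", "d7"]
dense_hits = ["d1", "d9", "d3"]
print(rrf_fuse([bm25_hits, dense_hits], top_k=3))
```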
Decision Tree: RAG Architecture Selection
Building RAG system: [Architecture Path]
├─ Document type?
│  ├─ Page/section-structured? → Structure-aware chunking (pages/sections + metadata)
│  ├─ Technical docs/code? → Structure-aware + code-aware chunking (symbols, headers)
│  └─ Simple content? → Fixed-size token chunking with overlap (baseline)
│
├─ Retrieval accuracy low?
│  ├─ Query ambiguity? → Query rewriting + multi-query expansion + filters
│  ├─ Noisy results? → Add reranker + better metadata filters
│  └─ Mixed queries? → Hybrid retrieval (sparse + dense) + reranking
│
├─ Dataset size?
│  ├─ <100k chunks? → Flat index (exact search)
│  ├─ 100k-10M? → HNSW (low latency)
│  └─ >10M? → IVF/ScaNN/DiskANN (scalable)
│
└─ Production quality?
   └─ Add: ACLs, freshness/invalidation, eval gates, and telemetry (end-to-end)
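A minimal sketch of the baseline branch above (fixed-size token chunking with overlap), using whitespace splitting as a stand-in for a real tokenizer; the chunk and overlap sizes are illustrative defaults, not recommendations.

```python
def chunk_fixed(text: str, chunk_size: int = 300, overlap: int = 50) -> list[str]:
    """Fixed-size chunking with overlap; swap the split() for a model tokenizer in production."""
    tokens = text.split()                      # stand-in for real tokenization
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        if window:
            chunks.append(" ".join(window))
        if start + chunk_size >= len(tokens):  # last window reached the end
            break
    return chunks
```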
Core Concepts (Vendor-Agnostic)
- Pipeline stages: ingest → chunk → embed → index → retrieve → rerank → pack context → generate → verify (skeleton below).
- Two evaluation planes: retrieval relevance (did we fetch the right evidence?) vs generation fidelity (did we use it correctly?).
- Freshness model: staleness budget, invalidation triggers, and rebuild strategy (incremental vs full).
- Trust boundaries: retrieved content is untrusted; apply the same rigor as user input (OWASP LLM Top 10: https://owasp.org/www-project-top-10-for-large-language-model-applications/).
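A minimal skeleton of the online half of the pipeline above (retrieve → rerank → pack → generate → verify), with each stage injected as a callable so it can be swapped or tested in isolation; the Evidence shape and status strings are assumptions, and the offline stages (ingest/chunk/embed/index) are presumed to have built the retriever already.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Evidence:
    id: str
    text: str
    score: float

def rag_pipeline(
    query: str,
    retrieve: Callable[[str], list[Evidence]],          # sparse/dense/hybrid retrieval
    rerank: Callable[[str, list[Evidence]], list[Evidence]],
    generate: Callable[[str, list[Evidence]], str],      # grounded generation over packed evidence
    verify: Callable[[str, list[Evidence]], bool],       # groundedness / citation check
    top_k: int = 8,
) -> dict:
    """Deterministic request path with explicit failure handling at each stage."""
    candidates = retrieve(query)
    if not candidates:
        return {"answer": None, "evidence": [], "status": "no_evidence"}
    evidence = rerank(query, candidates)[:top_k]          # bounded context: pack only the best chunks
    answer = generate(query, evidence)
    if not verify(answer, evidence):
        return {"answer": None, "evidence": [e.id for e in evidence], "status": "ungrounded"}
    return {"answer": answer, "evidence": [e.id for e in evidence], "status": "ok"}
```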
Implementation Practices (Tooling Examples)
- Use a retrieval API contract: query, filters, top_k, trace_id, and returned evidence IDs (sketch below).
- Instrument each stage with tracing/metrics (OpenTelemetry GenAI semantic conventions: https://opentelemetry.io/docs/specs/semconv/gen-ai/).
- Add caches deliberately: embeddings cache, retrieval cache (query+filters), and response cache (with invalidation).
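A minimal sketch of the retrieval API contract above as typed request/response records; the field set follows the bullet, but the exact names and types are assumptions to adapt to your stack.

```python
from dataclasses import dataclass, field

@dataclass
class RetrievalRequest:
    query: str
    filters: dict[str, str] = field(default_factory=dict)  # e.g. {"tenant": "acme", "lang": "en"}
    top_k: int = 10
    trace_id: str = ""            # propagate through every stage for telemetry

@dataclass
class RetrievedChunk:
    evidence_id: str              # stable ID reused later for citations
    text: str
    score: float
    source_uri: str

@dataclass
class RetrievalResponse:
    request: RetrievalRequest
    chunks: list[RetrievedChunk]
    latency_ms: float
```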
Do / Avoid
Do
- Do keep retrieval deterministic: fixed top_k, stable ranking, explicit filters.
- Do enforce document-level ACLs at retrieval time (not only at generation time); see the filter sketch below.
- Do include citations with stable IDs and verify citation coverage in tests.
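A minimal sketch of document-level ACL enforcement at retrieval time, assuming each chunk's metadata carries an allowed_groups list; the metadata shape and group names are illustrative assumptions.

```python
def acl_filter(chunks: list[dict], user_groups: set[str]) -> list[dict]:
    """Drop chunks the caller is not entitled to see before they reach the prompt."""
    return [c for c in chunks
            if user_groups & set(c["metadata"].get("allowed_groups", []))]

# Usage: apply to retriever output, never only to the final answer
retrieved = [
    {"id": "d1", "text": "...", "metadata": {"allowed_groups": ["eng"]}},
    {"id": "d2", "text": "...", "metadata": {"allowed_groups": ["finance"]}},
]
visible = acl_filter(retrieved, user_groups={"eng", "support"})  # keeps only d1
```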
Avoid
- Avoid shipping RAG without a test set and regression gate.
- Avoid "stuff everything" context packing; it increases cost and can reduce accuracy (see the packing sketch below).
- Avoid mixing corpora without metadata and tenant isolation.
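As a counterpart to the "stuff everything" anti-pattern, a minimal sketch of budgeted context packing, assuming chunks arrive already reranked best-first; the 4-characters-per-token estimate and the 4000-token budget are rough assumptions to replace with your tokenizer and model limits.

```python
def pack_context(chunks: list[str], max_tokens: int = 4000) -> list[str]:
    """Take the best-ranked chunks until the token budget is exhausted."""
    packed, used = [], 0
    for chunk in chunks:
        cost = max(1, len(chunk) // 4)   # rough chars-per-token estimate
        if used + cost > max_tokens:
            break
        packed.append(chunk)
        used += cost
    return packed
```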
When to Use This Skill
Use this skill when the user asks:
- "Help me design a RAG pipeline."
- "How should I chunk this document?"
- "Optimize retrieval for my use case."
- "My RAG system is hallucinating — fix it."
- "Choose the right vector database / index type."
- "Create a RAG evaluation framework."
- "Debug why retrieval gives irrelevant results."
Tool/Model Recommendation Protocol
When users ask for vendor/model/framework recommendations, validate claims against current primary sources.
Triggers
- "What's the best vector database for [use case]?"
- "What should I use for [chunking/embedding/reranking]?"
- "What's the latest in RAG development?"
- "Current best practices for [retrieval/grounding/evaluation]?"
- "Is [Pinecone/Qdrant/Chroma] still relevant in 2026?"
- "[Vector DB A] vs [Vector DB B]?"
- "Best embedding model for [use case]?"
- "What RAG framework should I use?"
Required Checks
- Read data/sources.json and start from sources with "add_as_web_search": true (sketch below).
- Verify 1-2 primary docs per recommendation (release notes, benchmarks, docs).
- If browsing isn't available, state assumptions and give a verification checklist.
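A minimal sketch of the first check, assuming data/sources.json holds a JSON array of source objects with the add_as_web_search flag and a url field; only the flag name comes from the bullet above, the rest of the schema is an assumption to confirm against the file itself.

```python
import json
from pathlib import Path

def searchable_sources(path: str = "data/sources.json") -> list[dict]:
    """Return curated sources flagged for web-search verification."""
    sources = json.loads(Path(path).read_text())
    return [s for s in sources if s.get("add_as_web_search") is True]

for source in searchable_sources():
    print(source.get("url", source))   # start verification from these primary sources
```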
What to Report
After checking, provide:
- Current landscape: What vector DBs/embeddings are popular NOW (not 6 months ago)
- Emerging trends: Techniques gaining traction (late interaction, agentic RAG, graph RAG)
- Deprecated/declining: Approaches or tools losing relevance
- Recommendation: Based on fresh data, not just static knowledge
Example Topics (verify with current sources)
- Vector databases (Pinecone, Qdrant, Weaviate, Milvus, pgvector, LanceDB)
- Embedding models (OpenAI, Cohere, Voyage AI, Jina, Sentence Transformers)
- Reranking (Cohere Rerank, Jina Reranker, FlashRank, RankGPT)
- RAG frameworks (LlamaIndex, LangChain, Haystack, txtai)
- Advanced RAG (contextual retrieval, agentic RAG, graph RAG, CRAG)
- Evaluation (RAGAS, TruLens, DeepEval, BEIR)
Related Skills
For adjacent topics, reference these skills:
- ai-llm - Prompting, fine-tuning, instruction datasets
- ai-agents - Agentic RAG workflows and tool routing
- ai-llm-inference - Serving performance, quantization, batching
- ai-mlops - Deployment, monitoring, security, privacy, and governance
- ai-prompt-engineering - Prompt patterns for RAG generation phase
Templates
System Design (Start Here)
- RAG System Design
Chunking & Ingestion
- Basic Chunking
- Code Chunking
- Long Document Chunking
Embedding & Indexing
- Index Configuration
- Metadata Schema
Retrieval & Reranking
- Retrieval Pipeline
- Hybrid Search
- Reranking
- Ranking Pipeline
- Reranker
Context Packaging & Grounding
- Context Packing
- Grounding
Evaluation
- RAG Evaluation
- RAG Test Set
- Search Evaluation
- Search Test Set
Search Configuration
- BM25 Configuration
- HNSW Configuration
- IVF Configuration
- Hybrid Configuration
Query Rewriting
- Query Rewrite
Navigation
Resources
- references/advanced-rag-patterns.md
- references/agentic-rag-patterns.md
- references/bm25-tuning.md
- references/chunking-patterns.md
- references/chunking-strategies.md
- references/rag-evaluation-guide.md
- references/rag-troubleshooting.md
- references/contextual-retrieval-guide.md
- references/distributed-search-slos.md
- references/grounding-checklists.md
- references/hybrid-fusion-patterns.md
- references/index-selection-guide.md
- references/multilingual-domain-patterns.md
- references/pipeline-architecture.md
- references/query-rewriting-patterns.md
- references/ranking-pipeline-guide.md
- references/retrieval-patterns.md
- references/search-debugging.md
- references/search-evaluation-guide.md
- references/user-feedback-learning.md
- references/vector-search-patterns.md
- references/graph-rag-patterns.md
- references/embedding-model-guide.md
- references/rag-caching-patterns.md
Templates
- assets/context/template-context-packing.md
- assets/context/template-grounding.md
- assets/design/rag-system-design.md
- assets/chunking/template-basic-chunking.md
- assets/chunking/template-code-chunking.md
- assets/chunking/template-long-doc-chunking.md
- assets/retrieval/template-retrieval-pipeline.md
- assets/retrieval/template-hybrid-search.md
- assets/retrieval/template-reranking.md
- assets/eval/template-rag-eval.md
- assets/eval/template-rag-testset.jsonl
- assets/eval/template-search-eval.md
- assets/eval/template-search-testset.jsonl
- assets/indexing/template-index-config.md
- assets/indexing/template-metadata-schema.md
- assets/query/template-query-rewrite.md
- assets/ranking/template-ranking-pipeline.md
- assets/ranking/template-reranker.md
- assets/search/template-bm25-config.md
- assets/search/template-hnsw-config.md
- assets/search/template-ivf-config.md
- assets/search/template-hybrid-config.md
Data
- data/sources.json — Curated external references
Use this skill whenever the user needs retrieval-augmented system design or debugging, not prompt work or deployment.
Fact-Checking
- Use web search/web fetch to verify current external facts, versions, pricing, deadlines, regulations, or platform behavior before final answers.
- Prefer primary sources; report source links and dates for volatile information.
- If web access is unavailable, state the limitation and mark guidance as unverified.