# RAG Pipeline Design
Guides the user through designing a Retrieval-Augmented Generation (RAG) pipeline. Based on "Principles of Building AI Agents" (Bhagwat & Gienow, 2025), Part V: RAG (Chapters 17-20).
## When to use

Use this skill when the user needs to:

- Design a RAG pipeline for an agent
- Choose a vector database
- Configure chunking, embedding, and retrieval
- Evaluate whether RAG is needed at all (vs. alternatives)
- Tune an existing RAG pipeline for better quality

## Instructions
### Step 1: Do You Actually Need RAG?
Before building a pipeline, apply the principle: Start simple, check quality, get complex.
Use AskUserQuestion to assess:
#### RAG Decision Tree

**Step 1: How large is your corpus?**
- < 200 pages → Try full context loading first (Gemini 2M, Claude 200K)
- 200-10,000 pages → Consider agentic RAG (tools that query data) OR traditional RAG
- > 10,000 pages → Traditional RAG pipeline is likely needed
**Step 2: What is the query pattern?**
- Factual lookup ("What is X?") → RAG works well
- Analytical ("Compare X and Y across documents") → Agentic RAG may be better
- Conversational ("Tell me about...") → Either works
**Step 3: How structured is the data?**
- Highly structured (tables, databases) → Use tools/APIs, not RAG
- Semi-structured (markdown, HTML) → RAG with format-specific chunking
- Unstructured (PDFs, free text) → Traditional RAG
Recommended progression:

1. First, load the entire corpus into a large context window
2. Second, write functions that query the dataset, and give them to the agent as tools
3. Only if 1 and 2 fail on quality, build a RAG pipeline
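Option 2, agentic querying, is often enough. A minimal sketch of the idea, using an illustrative in-memory corpus and hypothetical function names (not a real API):

```python
# Sketch of the "query functions as tools" alternative to RAG.
# DOCS, search_docs, and get_doc are illustrative names.
DOCS = [
    {"id": 1, "title": "Refund policy", "text": "Refunds are issued within 14 days."},
    {"id": 2, "title": "Shipping", "text": "Orders ship within 2 business days."},
]

def search_docs(keyword: str) -> list:
    """Tool: return documents whose title or text contains the keyword."""
    kw = keyword.lower()
    return [d for d in DOCS if kw in d["title"].lower() or kw in d["text"].lower()]

def get_doc(doc_id: int):
    """Tool: fetch a single document by id, or None if missing."""
    return next((d for d in DOCS if d["id"] == doc_id), None)
```

Registered as agent tools, functions like these let the model decide when and how to query the data, with no embeddings or vector store involved.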
If the user decides RAG is needed, proceed. Otherwise, recommend the simpler alternative.
### Step 2: Chunking Strategy

Design how documents are split into retrievable pieces:

#### Chunking Strategy

**Method**
| Strategy | Best For | Description |
|---|---|---|
| Recursive | General text | Splits by paragraph, then sentence, then character |
| Token-aware | LLM optimization | Splits by token count, respects model limits |
| Format-specific | Markdown/HTML/JSON | Uses document structure (headers, tags, keys) |
| Semantic | High quality needs | Uses LLM to identify natural topic boundaries |
Selected: [Strategy]
**Parameters**
| Parameter | Value | Rationale |
|---|---|---|
| Chunk size | [256-1024 tokens] | Balance: smaller = more precise, larger = more context |
| Overlap | [50-200 tokens] | Prevents losing context at chunk boundaries |
| Metadata | [title, source, date, section, page] | Enables filtered retrieval |
**Document-Specific Rules**
| Document Type | Chunking Rule |
|---|---|
| [Markdown docs] | Split on ## headers, keep header as metadata |
| [PDFs] | Page-based with overlap, extract title/section |
| [Code files] | Function/class-level chunks |
| [Chat logs] | Message groups of [N] turns |
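As a rough illustration of a recursive-style splitter with overlap (character counts stand in for tokens here; a real splitter would count tokens with the model's tokenizer):

```python
def chunk_text(text: str, chunk_size: int = 512, overlap: int = 64) -> list:
    """Split on paragraph boundaries first; only paragraphs that exceed
    chunk_size fall back to a sliding character window with overlap."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks = []
    for para in paragraphs:
        if len(para) <= chunk_size:
            chunks.append(para)
            continue
        step = chunk_size - overlap  # advance less than a full chunk to overlap
        for start in range(0, len(para), step):
            chunks.append(para[start:start + chunk_size])
            if start + chunk_size >= len(para):
                break
    return chunks
```

The overlap means the tail of each oversized chunk reappears at the head of the next one, so a sentence cut at a boundary still appears whole in at least one chunk.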
### Step 3: Embedding Configuration

Choose how chunks become vectors:

#### Embedding

**Model Selection**
| Model | Dimensions | Quality | Cost | Speed |
|---|---|---|---|---|
| OpenAI text-embedding-3-large | 3072 | High | $0.13/M tokens | Fast |
| OpenAI text-embedding-3-small | 1536 | Good | $0.02/M tokens | Fast |
| Voyage voyage-3 | 1024 | High | $0.06/M tokens | Fast |
| Cohere embed-v3 | 1024 | High | $0.10/M tokens | Fast |
| Local (e5-large, BGE) | 1024 | Good | Free (compute) | Varies |
Selected: [Model]
**Indexing**
| Parameter | Value |
|---|---|
| Dimensions | [From model] |
| Similarity metric | Cosine (most common) |
| Index type | HNSW (default, good balance of speed/accuracy) |
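The similarity metric itself is simple. Cosine similarity between two embedding vectors, written out in plain Python for clarity:

```python
import math

def cosine_similarity(a, b) -> float:
    """Cosine of the angle between two vectors: 1.0 means identical
    direction, 0.0 means orthogonal (unrelated)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

In production the vector DB computes this (usually over an approximate HNSW index), but the metric you configure at index time must match the one you query with.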
### Step 4: Vector Database Selection
Apply the principle: Prevent infra sprawl — vector DB choice is mostly commoditized.
Use AskUserQuestion:
#### Vector Database

**Decision Matrix**
| Option | When to Choose | Pros | Cons |
|---|---|---|---|
| pgvector (Postgres extension) | Already using Postgres | No new infra, familiar SQL, metadata filtering | May need tuning at scale |
| Pinecone (managed) | New project, want simplicity | Fully managed, fast, scalable | Additional service + cost |
| Chroma (open-source) | Local dev, small scale | Free, easy setup | Self-host in production |
| Cloud-native (Cloudflare, DataStax) | Already on that cloud | Integrated billing, low latency | Vendor lock-in |
Selected: [Database]
Rationale: [Why]
### Step 5: Retrieval Configuration

Design how the agent queries the vector store:

#### Retrieval

**Query Strategy**
| Parameter | Value | Rationale |
|---|---|---|
| topK | [3-10] | Number of chunks to retrieve |
| similarityThreshold | [0.7-0.9] | Min relevance to include |
| reranking | [Yes/No] | Post-retrieval quality boost |
**Hybrid Queries**
Combine vector similarity with metadata filters:
| Filter | Type | Example |
|---|---|---|
| Date range | Metadata | Only docs from last 30 days |
| Category | Metadata | Only "technical" documents |
| Source | Metadata | Only from "docs.example.com" |
| User access | Metadata | Only docs user has permission to see |
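A hedged sketch of how filters and similarity compose, over a toy in-memory index (the `hybrid_search` name and the index's record shape are assumptions; real vector DBs apply the filter inside the index query):

```python
def hybrid_search(query_vec, index, top_k=5, threshold=0.7, filters=None):
    """Filter candidates by exact metadata match first, then rank the
    survivors by similarity, dropping anything below the threshold."""
    def dot(a, b):  # similarity; assumes unit-normalized embedding vectors
        return sum(x * y for x, y in zip(a, b))

    filters = filters or {}
    candidates = [
        item for item in index
        if all(item["metadata"].get(key) == value for key, value in filters.items())
    ]
    scored = [(dot(query_vec, item["vector"]), item) for item in candidates]
    scored = [(s, item) for s, item in scored if s >= threshold]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [item for _, item in scored[:top_k]]
```

Filtering before (or inside) the similarity search matters for the access-control case: a chunk the user may not see should never reach the candidate set at all.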
**Reranking (Optional)**
- When to use: Quality matters more than latency
- How: Retrieve topK * 3 candidates, rerank with a cross-encoder, return topK
- Models: Cohere Rerank, bge-reranker, cross-encoder/ms-marco
- Cost: More expensive per query, but runs only on candidates (not full corpus)
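The over-retrieve-then-rerank pattern reduces to a few lines; `search_fn` and `rerank_score_fn` are placeholders for your vector search and cross-encoder scorer:

```python
def retrieve_and_rerank(query, search_fn, rerank_score_fn, top_k=5):
    """Over-retrieve 3x candidates cheaply, then re-score that small set
    with a more expensive scorer and keep only the best top_k."""
    candidates = search_fn(query, top_k * 3)
    ranked = sorted(candidates, key=lambda doc: rerank_score_fn(query, doc), reverse=True)
    return ranked[:top_k]
```

The cost bound follows directly: the reranker scores at most `top_k * 3` documents per query, regardless of corpus size.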
**Query Transformation (Optional)**
- HyDE: Generate a hypothetical answer, use it as the search query
- Multi-query: Generate multiple query variations, merge results
- Step-back: Abstract the query to a higher level, then search
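The merge step of multi-query is the only non-obvious part. A sketch, assuming `search_fn` returns `(score, doc)` pairs and each doc has an `id` (both assumptions, not a fixed API):

```python
def multi_query_retrieve(variants, search_fn, top_k=5):
    """Run each query variation, de-duplicate results by document id,
    keep each document's best score, and return the overall top_k."""
    best = {}
    for query in variants:
        for score, doc in search_fn(query):
            if doc["id"] not in best or score > best[doc["id"]][0]:
                best[doc["id"]] = (score, doc)
    merged = sorted(best.values(), key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in merged[:top_k]]
```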
### Step 6: Pipeline Architecture

Bring it all together:

#### RAG Pipeline

**Ingestion Pipeline**
- Load documents from [source]
- Chunk using [strategy] with [size] tokens, [overlap] overlap
- Enrich metadata: source, date, category, section
- Embed using [model]
- Upsert into [vector DB]
- Schedule: [On change / Nightly / Manual]
**Query Pipeline**
- Receive user query
- Transform query (optional: HyDE, multi-query)
- Embed query using [same model as ingestion]
- Search vector DB: topK=[N], filters=[metadata filters]
- Rerank results (optional)
- Inject top chunks into LLM context as <retrieved_documents>
- Generate response with source attribution
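The context-injection step might format retrieved chunks like this; the tag layout is one common convention, not a requirement:

```python
def build_context(chunks) -> str:
    """Wrap retrieved chunks in tags, carrying source metadata so the
    model can attribute its answer to specific documents."""
    parts = ["<retrieved_documents>"]
    for i, chunk in enumerate(chunks, 1):
        parts.append(f'<document index="{i}" source="{chunk["source"]}">')
        parts.append(chunk["text"])
        parts.append("</document>")
    parts.append("</retrieved_documents>")
    return "\n".join(parts)
```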
**Architecture Diagram**

```mermaid
graph LR
  subgraph Ingestion
    Docs[Documents] --> Chunk[Chunker]
    Chunk --> Embed[Embedder]
    Embed --> Store[(Vector DB)]
  end
  subgraph Query
    User[User Query] --> QEmbed[Query Embedder]
    QEmbed --> Search[Similarity Search]
    Store --> Search
    Search --> Rerank[Reranker]
    Rerank --> LLM[LLM + Context]
    LLM --> Response[Response]
  end
```
### Step 7: Quality Checklist

#### RAG Quality Checklist

**Retrieval Quality**
- [ ] Relevant documents consistently in top-K results
- [ ] Metadata filters working correctly
- [ ] No duplicate chunks in results
- [ ] Chunk size balances precision vs. context
**Generation Quality**

- [ ] Responses are grounded in retrieved documents
- [ ] Source attribution is accurate
- [ ] Agent says "I don't know" when no relevant chunks are found
- [ ] No hallucination beyond retrieved context
**Operational**

- [ ] Ingestion pipeline runs on schedule
- [ ] New documents are available within [SLA]
- [ ] Vector DB latency < [target]ms
- [ ] Embedding costs within budget
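The first checklist item can be spot-checked with a small labeled set. A minimal recall@k helper, assuming one known-relevant document id per query (the data shapes here are illustrative):

```python
def recall_at_k(results_by_query, relevant_by_query, k=5):
    """Fraction of queries whose known-relevant document id appears in
    the top-k retrieved ids: a basic retrieval-quality metric."""
    hits = sum(
        1 for query, relevant_id in relevant_by_query.items()
        if relevant_id in results_by_query.get(query, [])[:k]
    )
    return hits / len(relevant_by_query)
```

Tracking this number while tuning chunk size, topK, or reranking makes "better quality" measurable instead of anecdotal.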
### Step 8: Summarize and Offer Next Steps
Present all findings to the user as a structured summary in the conversation (including the pipeline diagram). Do NOT write to .specs/ — this skill works directly.
Use AskUserQuestion to offer:

- Implement pipeline — scaffold ingestion and query code
- Skip RAG — if the decision tree said RAG isn't needed, help with the alternative (full context or agentic tools)
- Comprehensive design — run agent:design to cover all areas with a spec
## Arguments

- `<args>` — optional description of the knowledge domain, or a path to existing RAG code

Examples:

- `agent:rag documentation search` — design RAG for a docs search agent
- `agent:rag src/rag/` — review and tune an existing RAG pipeline
- `agent:rag` — start fresh