RAG Implementer

Implement retrieval-augmented generation systems. Use when building knowledge-intensive applications, document search, or Q&A systems, or when you need to ground LLM responses in external data. Covers embedding strategy, vector stores, retrieval pipelines, and evaluation.

Safety Notice

This listing is imported from the skills.sh public index metadata. Review the upstream SKILL.md and repository scripts before running anything.


Install skill "RAG Implementer" with this command: npx skills add daffy0208/ai-dev-standards/daffy0208-ai-dev-standards-rag-implementer

RAG Implementer

Build production-ready retrieval-augmented generation systems.

Core Principle

RAG = Retrieval + Context Assembly + Generation

Use RAG when you need LLMs to access fresh, domain-specific, or proprietary knowledge that wasn't in their training data.


⚠️ Prerequisites & Cost Reality Check

STOP: Have You Validated the Need for RAG?

Before implementing RAG, confirm:

  • Problem validated - Completed product-strategist Phase 1 (problem discovery)
  • Users need AI search - Tested with simpler alternatives (see below)
  • ROI justified - Calculated cost vs benefit of RAG vs alternatives

Try These FIRST (Before RAG)

RAG is powerful but expensive. Try cheaper alternatives first:

1. FAQ Page / Documentation (1 day, $0)

  • Create well-organized FAQ or docs
  • Let users find answers with the browser's find-in-page search (Cmd+F / Ctrl+F)
  • Works for: <50 common questions, static content
  • Test: Do users find answers? If yes, stop here.

2. Simple Keyword Search (2-3 days, $0-20/month)

  • Use Algolia, Typesense, or PostgreSQL full-text search (Postgres sketch after this list)
  • Often good enough for the majority of use cases
  • Works for: <100k documents, keyword matching sufficient
  • Test: Do users get relevant results? If yes, stop here.
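A minimal sketch of option 2 using PostgreSQL's built-in full-text search. The connection string and docs table are illustrative assumptions; adjust to your schema:

import psycopg2

conn = psycopg2.connect("dbname=app user=app")  # illustrative connection

def keyword_search(query: str, limit: int = 10):
    """Rank documents with Postgres full-text search (no vectors needed)."""
    sql = """
        SELECT id, title,
               ts_rank(to_tsvector('english', body),
                       websearch_to_tsquery('english', %s)) AS rank
        FROM docs
        WHERE to_tsvector('english', body) @@ websearch_to_tsquery('english', %s)
        ORDER BY rank DESC
        LIMIT %s;
    """
    with conn.cursor() as cur:
        cur.execute(sql, (query, query, limit))
        return cur.fetchall()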

3. Manual Curation (Concierge MVP) (1 week, $0)

  • Manually answer user questions
  • Build FAQ from common questions
  • Works for: <100 users, validating if users want AI
  • Test: Do users value your answers enough to pay? If yes, consider RAG.

4. Simple Semantic Search (1 week, $30-50/month)

  • Use OpenAI embeddings + Postgres pgvector (sketch after this list)
  • Skip complex retrieval, re-ranking, etc.
  • Works for: <50k documents, basic semantic search
  • Test: Are embeddings better than keyword search? If no, stop here.
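A minimal sketch of option 4, assuming the pgvector extension is installed and a chunks table stores embeddings; the table, connection, and model names are illustrative:

import psycopg2
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
conn = psycopg2.connect("dbname=app user=app")  # illustrative connection

def semantic_search(query: str, k: int = 5):
    """Embed the query, then order chunks by cosine distance (<=> in pgvector)."""
    emb = client.embeddings.create(
        model="text-embedding-3-small", input=query
    ).data[0].embedding
    vec = str(emb)  # pgvector accepts '[0.1, 0.2, ...]' literals
    sql = """
        SELECT id, content, 1 - (embedding <=> %s::vector) AS similarity
        FROM chunks
        ORDER BY embedding <=> %s::vector
        LIMIT %s;
    """
    with conn.cursor() as cur:
        cur.execute(sql, (vec, vec, k))
        return cur.fetchall()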

Cost Reality Check

Naive RAG (Prototype):

  • Time: 1-2 weeks
  • Cost: $50-150/month (vector DB + embeddings + API calls)
  • When: Prototype, <10k documents, proof of concept

Advanced RAG (Production):

  • Time: 3-4 weeks
  • Cost: $200-500/month (hybrid search, re-ranking, monitoring)
  • When: Production, 10k-1M documents, validated demand

Modular RAG (Enterprise):

  • Time: 6-8 weeks
  • Cost: $500-2000+/month (multiple KBs, specialized modules)
  • When: Enterprise, 1M+ documents, mission-critical

Decision Tree: Do You Really Need RAG?

Do users need to search your content?
│
├─ No → Don't build RAG ❌
│
└─ Yes
   ├─ <50 items? → FAQ page ✅ ($0)
   │
   └─ >50 items?
      ├─ Keyword search enough? → Use Algolia ✅ ($0-20/mo)
      │
      └─ Need semantic understanding?
         ├─ <50k docs? → Simple semantic (pgvector) ✅ ($30/mo)
         │
         └─ >50k docs?
            ├─ Validated with users? → Build RAG ✅
            └─ Not validated? → Test with Concierge MVP first ⚠️

Validation Checklist

Only proceed with RAG implementation if:

  • Tested simpler alternatives (FAQ, keyword search, manual curation)
  • Users confirmed they need AI-powered search (not just you think they do)
  • Calculated ROI: cost of RAG < value users get
  • Have >50k documents OR complex semantic search requirements
  • Budget: $200-500/month for infrastructure
  • Time: 3-4 weeks for production implementation

If any item above is not satisfied: go back to the product-strategist or mvp-builder skills to validate first.

See also: PLAYBOOKS/validation-first-development.md for step-by-step validation process.


8-Phase RAG Implementation

Phase 1: Knowledge Base Design

Goal: Create well-structured knowledge foundation

Actions:

  • Map data sources (internal: docs, databases, APIs / external: web, feeds)
  • Filter noise and select authoritative content (avoid the "data dump" fallacy of indexing everything)
  • Define chunking strategy: semantic chunking based on structure
  • Add metadata: tags, timestamps, source identifiers, categories

Validation:

  • All data sources catalogued and prioritized
  • Data quality assessed (accuracy, completeness, freshness)
  • Chunking strategy tested with sample documents
  • Metadata schema validated for search effectiveness

Common Chunking Strategies:

  • Fixed-size: 500-1000 tokens with 50-100 token overlap (sketch after this list)
  • Semantic: By paragraph, section headers, or topic boundaries
  • Recursive: Split by structure (markdown headers, code blocks)
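A minimal fixed-size chunker with token overlap, as referenced above. Using tiktoken's cl100k_base encoding is an assumption that matches OpenAI embedding models; swap in your model's tokenizer:

import tiktoken

def chunk_text(text: str, size: int = 800, overlap: int = 100) -> list[str]:
    """Split text into ~size-token chunks, repeating `overlap` tokens for context."""
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    chunks, start = [], 0
    while start < len(tokens):
        chunks.append(enc.decode(tokens[start:start + size]))
        if start + size >= len(tokens):
            break
        start += size - overlap  # step forward by size minus overlap
    return chunks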

Phase 2: Embedding Strategy

Goal: Choose optimal embedding approach for semantic understanding

Actions:

  • Select embedding model: text-embedding-3-large (3072 dimensions) for general use, or a domain-specific model for specialized content
  • Plan multi-modal needs (text, code, images, tables)
  • Decide on fine-tuning: use domain data if general embeddings underperform
  • Establish similarity benchmarks

Validation:

  • Embedding model benchmarked on domain data
  • Retrieval accuracy tested with known query-document pairs
  • Storage and compute costs validated

Model Selection (benchmark sketch after this list):

  • General: OpenAI text-embedding-3-large, text-embedding-3-small
  • Code: a code-specific model such as StarEncoder (OpenAI's code-search-babbage-code-001 is deprecated)
  • Multilingual: multilingual-e5-large
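A quick benchmark sketch for the validation step above: embed known query-document pairs and check that matches outscore non-matches. The model choice and sample strings are illustrative:

import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str], model: str = "text-embedding-3-small") -> np.ndarray:
    resp = client.embeddings.create(model=model, input=texts)
    return np.array([d.embedding for d in resp.data])

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Sanity check: a matching pair should score higher than a mismatched one.
q, pos, neg = embed([
    "How do I reset my password?",
    "To reset your password, open Settings > Security.",
    "Our quarterly revenue grew 12%.",
])
assert cosine(q, pos) > cosine(q, neg)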

Phase 3: Vector Store Architecture

Goal: Implement scalable vector database

Actions:

  • Choose vector DB (Pinecone, Weaviate, Qdrant, Chroma, pgvector)
  • Configure index: HNSW for speed, IVF for scale
  • Plan scalability: data growth and query volume
  • Implement backup, recovery, security

Validation:

  • Vector store benchmarked under expected load
  • Index optimized for retrieval speed and accuracy
  • Backup and recovery tested
  • Security controls implemented

Vector DB Decision:

  • Managed cloud → Pinecone
  • Self-hosted, feature-rich → Weaviate
  • Lightweight, local → Chroma
  • Cost-conscious → pgvector (Postgres extension; index sketch below)
  • High-performance → Qdrant
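If you pick pgvector, an HNSW index is created with plain SQL (requires pgvector 0.5.0+). The m and ef_construction values below are the extension's defaults, shown as a starting point rather than tuned recommendations:

import psycopg2

conn = psycopg2.connect("dbname=app user=app")  # illustrative connection

# m (graph connectivity) and ef_construction (build-time search width)
# trade index size and build time against recall.
with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE INDEX IF NOT EXISTS chunks_embedding_hnsw
        ON chunks USING hnsw (embedding vector_cosine_ops)
        WITH (m = 16, ef_construction = 64);
    """)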

Phase 4: Retrieval Pipeline

Goal: Build sophisticated retrieval beyond simple similarity search

Actions:

  • Implement hybrid retrieval: semantic search + keyword (BM25), fused as in the sketch after this list
  • Add query enhancement: expansion, reformulation, multi-query
  • Apply contextual filtering: metadata, temporal constraints, relevance ranking
  • Design for query types: factual (precision), analytical (breadth), creative (diversity)
  • Handle edge cases: no relevant results found
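A minimal fusion sketch for the hybrid retrieval item above: reciprocal rank fusion (RRF) merges a semantic ranking and a BM25 ranking without needing comparable scores. The document ids are illustrative:

def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal rank fusion: score(d) = sum over rankings of 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

semantic = ["doc3", "doc1", "doc7"]  # from vector search
keyword = ["doc1", "doc9", "doc3"]   # from BM25
fused = rrf_fuse([semantic, keyword])  # documents on both lists rise to the top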

Advanced Techniques:

  • Re-ranking: Use a cross-encoder after initial retrieval (e.g., cross-encoder/ms-marco-MiniLM-L-12-v2; sketch after this list)
  • Query routing: Route different query types to specialized strategies
  • Ensemble methods: Combine multiple retrieval approaches
  • Adaptive retrieval: Adjust top-k based on query complexity
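A re-ranking sketch for the first item above, loading the named cross-encoder via the sentence-transformers library; candidates are assumed to come from the initial hybrid retrieval:

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-12-v2")

def rerank(query: str, candidates: list[str], top_k: int = 5) -> list[str]:
    """Score each (query, passage) pair jointly and keep the best top_k."""
    scores = reranker.predict([(query, passage) for passage in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda p: p[1], reverse=True)
    return [passage for passage, _ in ranked[:top_k]]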

Validation:

  • Retrieval accuracy tested across diverse query types
  • Hybrid retrieval outperforms single-method baselines
  • Query latency meets requirements (<500ms ideal)
  • Edge cases and fallbacks tested

Phase 5: Context Assembly

Goal: Transform retrieved chunks into optimal LLM context

Actions:

  • Rank and select: prioritize by relevance score, recency, source authority
  • Synthesize: merge related chunks, avoid redundancy
  • Compress: use LLMLingua or similar for token optimization
  • Mitigate "lost in the middle": place critical info at the start and end of the context (reordering sketch after this list)
  • Adapt dynamically: adjust context based on conversation history
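A small reordering sketch for the "lost in the middle" item above: alternate relevance-ranked chunks to the front and back so the strongest evidence sits at the edges of the context, where models attend most reliably:

def order_for_context(chunks_by_relevance: list[str]) -> list[str]:
    """Place top-ranked chunks at the start and end, weaker ones in the middle."""
    front, back = [], []
    for i, chunk in enumerate(chunks_by_relevance):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]

# ["c1", "c2", "c3", "c4", "c5"] -> ["c1", "c3", "c5", "c4", "c2"]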

Context Engineering Integration:

  • Blend RAG results with system instructions and user prompts
  • Maintain conversation coherence across multi-turn interactions
  • Implement context persistence for follow-up queries
  • Balance context size vs. information density

Validation:

  • Context relevance validated against human judgments
  • Token optimization maintains accuracy
  • Multi-turn conversations maintain coherence
  • Assembly latency <200ms

Phase 6: Evaluation & Metrics

Goal: Measure RAG system performance comprehensively

Retrieval Quality (metric sketches after this list):

  • Precision@K: Fraction of top-K results that are relevant
  • Recall@K: Fraction of relevant docs in top-K
  • MRR (Mean Reciprocal Rank): Average rank of first relevant result
  • NDCG: Ranking quality with graded relevance
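Minimal reference implementations of the metrics above, given retrieved ids per query and a ground-truth set of relevant ids:

def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    return sum(d in relevant for d in retrieved[:k]) / k

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    return sum(d in relevant for d in retrieved[:k]) / len(relevant) if relevant else 0.0

def mrr(all_retrieved: list[list[str]], all_relevant: list[set[str]]) -> float:
    """Mean of 1/rank of the first relevant result per query (0 if none found)."""
    total = 0.0
    for retrieved, relevant in zip(all_retrieved, all_relevant):
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(all_retrieved)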

Generation Quality:

  • Faithfulness: Generated content accuracy vs. sources
  • Answer Relevance: Response relevance to query
  • Context Utilization: How effectively LLM uses retrieved info
  • Hallucination Rate: Frequency of unsupported claims

System Performance:

  • End-to-End Latency: Query to answer (<3 seconds target)
  • Retrieval Latency: Time to retrieve and rank (<500ms)
  • Token Efficiency: Information density per token
  • Cost Per Query: Combined retrieval + generation costs (arithmetic sketch below)
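The cost arithmetic is simple once you know your token counts; the rates are parameters here because prices change, so plug in current ones:

def cost_per_query(embed_tokens: int, gen_in_tokens: int, gen_out_tokens: int,
                   embed_rate: float, in_rate: float, out_rate: float) -> float:
    """All rates in $ per 1M tokens; returns $ per query (excluding vector DB fees)."""
    return (embed_tokens * embed_rate
            + gen_in_tokens * in_rate
            + gen_out_tokens * out_rate) / 1_000_000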

Validation:

  • Baseline metrics established
  • A/B testing framework for config comparisons
  • Automated evaluation pipeline deployed
  • Human evaluation protocols for ground truth

Phase 7: Production Deployment

Goal: Deploy with enterprise-grade reliability and security

Deployment:

  • Containerize with Docker/Kubernetes
  • Implement load balancing across RAG instances
  • Add caching for frequent queries (sketch after this list)
  • Graceful degradation: fall back to the base model on component failure
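A minimal in-process cache sketch for the caching item above. Exact-match keys miss paraphrased queries (semantic caching is the next step up), and a production system would typically use Redis rather than a module-level dict:

import hashlib
import time

_cache: dict[str, tuple[float, str]] = {}
TTL_SECONDS = 3600  # illustrative; tune to how quickly your knowledge base changes

def cached_answer(query: str, answer_fn) -> str:
    key = hashlib.sha256(query.strip().lower().encode()).hexdigest()
    hit = _cache.get(key)
    if hit and time.time() - hit[0] < TTL_SECONDS:
        return hit[1]  # cache hit: skip retrieval and generation entirely
    answer = answer_fn(query)
    _cache[key] = (time.time(), answer)
    return answer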

Security:

  • Role-based access controls for knowledge base
  • Data masking and PII protection
  • Audit logging for compliance
  • Prompt injection defense

Monitoring:

  • Real-time metrics dashboard (latency, cost, accuracy)
  • Query analysis for patterns and failure modes
  • Cost tracking and optimization alerts
  • Performance profiling for bottlenecks

Validation:

  • Production handles expected traffic
  • Security prevents unauthorized access
  • Monitoring provides actionable insights
  • Incident response procedures tested

Phase 8: Continuous Improvement

Goal: Establish processes for ongoing enhancement

Data Pipeline:

  • Automated knowledge base updates (real-time or scheduled)
  • Quality monitoring: detect data drift and degradation
  • Source diversification: add new data sources
  • Feedback integration: user corrections and preferences

Model Evolution:

  • Evaluate and migrate to improved embeddings
  • Fine-tune on domain data regularly
  • Upgrade architecture: Naive → Advanced → Modular RAG
  • Expand multi-modal support (images, audio, video)

Optimization:

  • Analyze query patterns, optimize for common needs
  • Improve cache hit rates
  • Tune vector indices regularly
  • Balance performance vs. costs

Validation:

  • Automated improvement pipelines functioning
  • Performance trends show improvement
  • User satisfaction increasing
  • System adapts to changing needs

Key RAG Principles

1. Relevance Over Volume

  • Quality curation > massive datasets
  • Remove outdated/low-quality content continuously
  • Prioritize most relevant info to prevent "lost in the middle"

2. Semantic Understanding

  • Use embeddings for true semantic matching, not just keywords
  • Recognize query intent (factual, analytical, creative)
  • Adapt retrieval strategy based on context

3. Multi-Modal Intelligence

  • Handle text, images, code, tables, structured data
  • Enable cross-modal retrieval (text query → image results)
  • Preserve document structure and formatting

4. Temporal Awareness

  • Prioritize recent info for time-sensitive topics
  • Maintain historical access when relevant
  • Integrate real-time data feeds for dynamic domains

5. Transparency & Trust

  • Always provide source citations
  • Indicate confidence levels
  • Explain why specific information was selected

Standard RAG Response Format

{
  "answer": "Generated response incorporating retrieved information",
  "sources": [
    {
      "content": "Retrieved text chunk",
      "source": "Document/URL identifier",
      "relevance_score": 0.95,
      "chunk_id": "unique_identifier"
    }
  ],
  "confidence": 0.87,
  "retrieval_metadata": {
    "chunks_retrieved": 5,
    "retrieval_time_ms": 150,
    "generation_time_ms": 800
  }
}

Critical Success Rules

Non-Negotiable:

  1. ✅ Source attribution for every response
  2. ✅ Validate generated content against sources (prevent hallucination)
  3. ✅ Filter sensitive data before retrieval
  4. ✅ Respond within latency thresholds (<3 seconds)
  5. ✅ Monitor and optimize costs continuously
  6. ✅ Comply with security policies
  7. ✅ Graceful degradation on failures
  8. ✅ Comprehensive testing before production

Quality Gates:

  • Before Production: >85% accuracy on evaluation dataset
  • Ongoing: User satisfaction >4.0/5.0
  • Performance: 95th percentile <5 seconds
  • Reliability: 99.5% uptime
  • Cost: Within 10% of budget

Advanced Patterns

Modular RAG Architecture

  • Search Module: Query understanding and reformulation
  • Memory Module: Long-term conversation persistence
  • Routing Module: Query routing to specialized knowledge bases
  • Predict Module: Anticipatory pre-loading based on context

Hybrid RAG + Fine-tuning

  • RAG for dynamic, frequently changing knowledge
  • Fine-tuning for domain-specific reasoning patterns
  • Combine strengths for maximum effectiveness

Related Resources

Related Skills:

  • multi-agent-architect - For complex RAG orchestration
  • knowledge-graph-builder - For structured knowledge integration
  • performance-optimizer - For RAG system optimization

Related Patterns:

  • META/DECISION-FRAMEWORK.md - Vector DB and embedding selection
  • STANDARDS/architecture-patterns/rag-pattern.md - RAG architecture details (when created)

Related Playbooks:

  • PLAYBOOKS/deploy-rag-system.md - RAG deployment procedure (when created)

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.
