RAG Auditor
Systematic RAG pipeline evaluation across the full retrieval-generation chain: designs evaluation query sets, measures retrieval metrics (Precision@K, Recall@K, MRR), evaluates generation quality (groundedness, completeness, hallucination rate), diagnoses component-level failures, and recommends targeted improvements.
Reference Files
| File | Contents | Load When |
|---|---|---|
| references/retrieval-metrics.md | Precision@K, Recall@K, MRR, NDCG definitions and calculation | Always |
| references/generation-metrics.md | Groundedness, completeness, hallucination detection methods | Generation evaluation needed |
| references/failure-taxonomy.md | RAG failure categories: retrieval, generation, chunking, embedding | Failure diagnosis needed |
| references/diagnostic-queries.md | Designing evaluation query sets, known-answer questions, difficulty levels | Evaluation setup |
Prerequisites
- Access to the RAG pipeline (or its outputs for post-hoc evaluation)
- A set of test queries with known-correct answers
- Understanding of the pipeline components (embedding model, retriever, generator)
Workflow
Phase 1: Pipeline Inventory
Document the RAG pipeline configuration:
- Document source — What documents are indexed? Format, count, size.
- Chunking — Strategy (fixed-size, semantic, paragraph), chunk size, overlap.
- Embedding — Model name and version, dimensionality.
- Vector store — Type (FAISS, Pinecone, Chroma, pgvector), index type.
- Retrieval — Method (similarity, hybrid, reranking), top-K parameter.
- Generation — Model, prompt template, context window usage.
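The inventory above can be captured as a small record so that gaps are visible before evaluation starts. This is a minimal sketch: the class name `PipelineInventory`, its field names, and the helper `missing_fields` are hypothetical and not part of any library.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class PipelineInventory:
    """Hypothetical record of the Phase 1 fields; names are illustrative."""
    document_format: str
    document_count: int
    chunking_strategy: str       # e.g. "fixed-size", "semantic", "paragraph"
    chunk_size_tokens: int
    chunk_overlap_pct: float
    embedding_model: str
    embedding_dimensions: int
    vector_store: str            # e.g. "FAISS", "pgvector"
    retrieval_method: str        # e.g. "similarity", "hybrid", "reranking"
    top_k: int
    generation_model: str
    temperature: float


# Fields that must be strictly positive to be considered documented.
_POSITIVE_FIELDS = {"document_count", "chunk_size_tokens", "embedding_dimensions", "top_k"}


def missing_fields(inventory: PipelineInventory) -> list[str]:
    """Return the names of fields that are empty or non-positive."""
    gaps = []
    for name, value in asdict(inventory).items():
        if value == "" or value is None:
            gaps.append(name)
        elif name in _POSITIVE_FIELDS and value <= 0:
            gaps.append(name)
    return gaps
```

An empty list from `missing_fields` suggests the inventory is complete enough to fill the Pipeline Configuration table of the report.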
Phase 2: Design Evaluation Queries
Create a diverse set of test queries:
| Query Type | Purpose | Count |
|---|---|---|
| Known-answer (factoid) | Measure retrieval + generation accuracy | 10+ |
| Multi-hop | Require combining info from multiple chunks | 5+ |
| Unanswerable | Not in the corpus — should abstain | 3+ |
| Ambiguous | Multiple valid interpretations | 3+ |
| Recent/updated | Test freshness: answers should reflect recently added or changed documents | 2+ |
For each query, document the expected answer and the source chunk(s).
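One way to hold the query set and check it against the minimum counts in the table is sketched below. The `EvalQuery` class, the type labels used as dictionary keys, and `coverage_gaps` are assumptions made for illustration.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Optional

# Minimum counts per query type, taken from the table above.
MIN_COUNTS = {
    "known_answer": 10,
    "multi_hop": 5,
    "unanswerable": 3,
    "ambiguous": 3,
    "recent": 2,
}


@dataclass
class EvalQuery:
    query: str
    query_type: str                    # one of the keys in MIN_COUNTS
    expected_answer: Optional[str]     # None for unanswerable queries
    source_chunk_ids: list[str] = field(default_factory=list)


def coverage_gaps(queries: list[EvalQuery]) -> dict[str, int]:
    """Return how many more queries each type needs to reach its minimum."""
    counts = Counter(q.query_type for q in queries)
    return {
        qtype: minimum - counts[qtype]
        for qtype, minimum in MIN_COUNTS.items()
        if counts[qtype] < minimum
    }
```

An empty dictionary means every query type meets its minimum; otherwise the values show how many queries still need to be written.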
Phase 3: Evaluate Retrieval
For each test query, measure:
- Precision@K — Of the K retrieved chunks, how many are relevant?
- Recall@K — Of all relevant chunks in the corpus, how many were retrieved?
- MRR (Mean Reciprocal Rank) — How high is the first relevant chunk ranked?
- Chunk relevance — Score each retrieved chunk: Relevant, Partially Relevant, Irrelevant.
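The three ranking metrics above can be computed from an ordered list of retrieved chunk IDs and the set of chunk IDs marked relevant in Phase 2. The sketch below uses the common textbook definitions (Precision@K divides by K, and a query with no relevant chunk retrieved contributes 0 to MRR); the function names are illustrative, and `references/retrieval-metrics.md` remains the authority if its definitions differ.

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved chunk IDs that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for chunk_id in retrieved[:k] if chunk_id in relevant)
    return hits / k


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant chunk IDs that appear in the top k."""
    if not relevant:
        raise ValueError("recall is undefined when no chunk is relevant")
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant chunk, or 0.0 if none was retrieved."""
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: list[tuple[list[str], set[str]]]) -> float:
    """Average reciprocal rank over (retrieved, relevant) pairs, one per query."""
    if not runs:
        raise ValueError("at least one query is required")
    return sum(reciprocal_rank(ret, rel) for ret, rel in runs) / len(runs)
```

Unanswerable queries have no relevant chunks, so they should be excluded from Recall@K and MRR rather than scored as zero.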
Phase 4: Evaluate Generation
For each test query with retrieved context:
- Groundedness — Is every claim in the response supported by the retrieved context? Score: 0 (hallucinated) to 1 (fully grounded).
- Completeness — Does the response use all relevant information from the context? Score: 0 (ignored context) to 1 (complete).
- Hallucination detection — Identify specific claims not supported by context.
- Abstention — For unanswerable queries, does the model correctly say "I don't know"?
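Once a reviewer (human or model-based) has labelled each claim in a response as supported or unsupported, the scores above reduce to simple ratios. The sketch below assumes that labelling has already happened; `JudgedResponse` and the specific definitions (hallucination rate as the share of responses with at least one unsupported claim) are illustrative choices, not the only valid ones.

```python
from dataclasses import dataclass


@dataclass
class JudgedResponse:
    answerable: bool               # False for unanswerable test queries
    abstained: bool                # True if the model declined to answer
    claims_supported: list[bool]   # one entry per claim; True if backed by retrieved context


def groundedness(response: JudgedResponse) -> float:
    """Share of claims supported by context; 1.0 when the response makes no claims."""
    if not response.claims_supported:
        return 1.0
    return sum(response.claims_supported) / len(response.claims_supported)


def hallucination_rate(responses: list[JudgedResponse]) -> float:
    """Share of responses that contain at least one unsupported claim."""
    if not responses:
        raise ValueError("at least one response is required")
    flagged = sum(1 for r in responses if not all(r.claims_supported))
    return flagged / len(responses)


def abstention_accuracy(responses: list[JudgedResponse]) -> float:
    """Share of unanswerable queries on which the model abstained."""
    unanswerable = [r for r in responses if not r.answerable]
    if not unanswerable:
        raise ValueError("no unanswerable queries in the set")
    return sum(1 for r in unanswerable if r.abstained) / len(unanswerable)
```

Completeness is left out because it needs a per-query list of the facts the context contains, which depends on how the expected answers were documented in Phase 2.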
Phase 5: Diagnose Failures
For every incorrect or low-quality response, classify the root cause:
| Failure Type | Diagnosis | Indicator |
|---|---|---|
| Retrieval failure | Relevant chunks not retrieved | Low Recall@K |
| Ranking failure | Relevant chunk retrieved but ranked low | Low MRR despite high Recall@K |
| Chunk boundary issue | Answer split across chunk boundaries | Partial matches in multiple chunks |
| Embedding mismatch | Query semantics don't match chunk embeddings | Relevant chunk has low similarity score |
| Generation failure | Correct context but wrong answer | High retrieval scores, low groundedness |
| Hallucination | Model invents facts not in context | Claims not traceable to any chunk |
| Over-abstention | Model refuses to answer when context is sufficient | Unanswered with relevant context present |
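A first-pass triage of failed answerable queries can be automated from per-query indicators, leaving the finer distinctions to manual review. The sketch below is one possible rule ordering, not a definitive classifier: the function name, the `low_rank_cutoff` default, and the choice to separate hallucination from generation failure by the presence of unsupported claims are all assumptions. Chunk boundary issues and embedding mismatch are not covered because they require inspecting chunk text and similarity scores.

```python
from typing import Optional


def classify_failure(
    first_relevant_rank: Optional[int],
    abstained: bool,
    unsupported_claims: int,
    low_rank_cutoff: int = 3,
) -> str:
    """Return a triage label for one failed, answerable query.

    first_relevant_rank: 1-based rank of the first relevant chunk, or None
        if no relevant chunk was retrieved.
    unsupported_claims: number of claims not traceable to any retrieved chunk.
    low_rank_cutoff: ranks above this value count as "ranked low" (assumed default).
    """
    if first_relevant_rank is None:
        return "retrieval failure"       # nothing relevant reached the generator
    if abstained:
        return "over-abstention"         # relevant context was present but unused
    if first_relevant_rank > low_rank_cutoff:
        return "ranking failure"         # relevant chunk retrieved but buried
    if unsupported_claims > 0:
        return "hallucination"           # claims with no supporting chunk
    return "generation failure"          # good context, supported claims, wrong answer
```

The labels are a starting point for the Failure Analysis table; each one should still be confirmed by reading the retrieved chunks and the response.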
Phase 6: Recommendations
Based on failure analysis, recommend specific improvements:
| Failure Pattern | Recommendation |
|---|---|
| Chunk boundary issues | Increase overlap, try semantic chunking |
| Low Precision@K | Reduce K, add reranking stage |
| Low Recall@K | Increase K, try hybrid search |
| Embedding mismatch | Try different embedding model, add query expansion |
| Hallucination | Strengthen grounding instruction in prompt, reduce temperature |
| Over-abstention | Soften abstention criteria in prompt |
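To put recommendations in priority order, count how many observed failures each one would address. The sketch below copies the mapping from the table above; the lower-case pattern keys and the `prioritize` function are illustrative assumptions.

```python
from collections import Counter

# Mapping copied from the recommendation table above.
RECOMMENDATIONS = {
    "chunk boundary issue": "Increase overlap, try semantic chunking",
    "low precision@k": "Reduce K, add reranking stage",
    "low recall@k": "Increase K, try hybrid search",
    "embedding mismatch": "Try different embedding model, add query expansion",
    "hallucination": "Strengthen grounding instruction in prompt, reduce temperature",
    "over-abstention": "Soften abstention criteria in prompt",
}


def prioritize(failure_patterns: list[str]) -> list[tuple[str, int]]:
    """Rank recommendations by the number of observed failures each addresses."""
    counts = Counter(p for p in failure_patterns if p in RECOMMENDATIONS)
    return [(RECOMMENDATIONS[pattern], n) for pattern, n in counts.most_common()]
```

Patterns without an entry in the mapping are skipped by this sketch, so they need a manually written recommendation.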
Output Format
## RAG Audit Report
### Pipeline Configuration
| Component | Value |
|-----------|-------|
| Documents | {N} ({format}) |
| Chunking | {strategy}, {size} tokens, {overlap}% overlap |
| Embedding | {model} ({dimensions}d) |
| Retrieval | {method}, K={N} |
| Generation | {model}, temperature={T} |
### Evaluation Dataset
- **Total queries:** {N}
- **Known-answer:** {N}
- **Multi-hop:** {N}
- **Unanswerable:** {N}
### Retrieval Quality
| Metric | Score | Target | Status |
|--------|-------|--------|--------|
| Precision@{K} | {score} | {target} | {Pass/Fail} |
| Recall@{K} | {score} | {target} | {Pass/Fail} |
| MRR | {score} | {target} | {Pass/Fail} |
### Generation Quality
| Metric | Score | Target | Status |
|--------|-------|--------|--------|
| Groundedness | {score} | {target} | {Pass/Fail} |
| Completeness | {score} | {target} | {Pass/Fail} |
| Hallucination rate | {score} | {target} | {Pass/Fail} |
| Abstention accuracy | {score} | {target} | {Pass/Fail} |
### Failure Analysis
| # | Query | Failure Type | Root Cause | Recommendation |
|---|-------|-------------|------------|----------------|
| 1 | {query} | {type} | {cause} | {fix} |
### Recommendations (Priority Order)
1. **{Recommendation}** — addresses {N} failures, expected impact: {description}
2. **{Recommendation}** — addresses {N} failures, expected impact: {description}
### Sample Failures
#### Query: "{query}"
- **Expected:** {answer}
- **Retrieved chunks:** {chunk summaries with relevance scores}
- **Generated:** {response}
- **Issue:** {diagnosis}
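The Pass/Fail column in the metric tables of the report template depends on the direction of each metric: most pass when the score meets or exceeds the target, while hallucination rate passes when it stays at or below it. A minimal sketch of a row formatter follows; `status_row` and its `higher_is_better` parameter are hypothetical names.

```python
def status_row(metric: str, score: float, target: float, higher_is_better: bool = True) -> str:
    """Format one row of a metric table in the report template."""
    passed = score >= target if higher_is_better else score <= target
    status = "Pass" if passed else "Fail"
    return f"| {metric} | {score:.2f} | {target:.2f} | {status} |"


print(status_row("Precision@5", 0.3, 0.8))
print(status_row("Hallucination rate", 0.05, 0.10, higher_is_better=False))
```

Targets are set per project; the values in the example calls are placeholders rather than recommended thresholds.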
Calibration Rules
- Component isolation. Evaluate retrieval and generation independently. If you check only the end output, a strong retriever paired with a weak generator is indistinguishable from a retrieval failure.
- Known answers first. Start with factoid questions where the correct answer is unambiguous. Multi-hop and ambiguous queries are harder to evaluate.
- Quantify, don't qualify. "Retrieval is bad" is not a finding. "Precision@5 is 0.3 (target: 0.8) with 70% of failures due to chunk boundary splits" is actionable.
- Sample failures deeply. Aggregate metrics identify WHERE the problem is. Individual failure analysis identifies WHY.
Error Handling
| Problem | Resolution |
|---|---|
| No known-answer queries available | Help design them from the document corpus. Pick 10 facts and formulate questions. |
| Pipeline access not available | Work from recorded inputs/outputs. Post-hoc evaluation is possible with query-context-response triples. |
| Corpus is too large to review | Sample-based evaluation. Select representative documents and generate queries from them. |
| Multiple failure types co-exist | Address retrieval failures first. The generator cannot ground an answer in context that was never retrieved. |
When NOT to Audit
Push back if:
- The pipeline hasn't been built yet — design it first, audit after
- The corpus has fewer than 10 documents — too small for meaningful retrieval evaluation
- The user wants to compare embedding models — that's a benchmark task, not an audit