RAG Auditor

Systematic RAG pipeline evaluation across the full retrieval-generation chain: designs evaluation query sets, measures retrieval metrics (Precision@K, Recall@K, MRR), evaluates generation quality (groundedness, completeness, hallucination rate), diagnoses component-level failures, and recommends targeted improvements.

Reference Files

File	Contents	Load When
`references/retrieval-metrics.md`	Precision@K, Recall@K, MRR, NDCG definitions and calculation	Always
`references/generation-metrics.md`	Groundedness, completeness, hallucination detection methods	Generation evaluation needed
`references/failure-taxonomy.md`	RAG failure categories: retrieval, generation, chunking, embedding	Failure diagnosis needed
`references/diagnostic-queries.md`	Designing evaluation query sets, known-answer questions, difficulty levels	Evaluation setup

Prerequisites

Access to the RAG pipeline (or its outputs for post-hoc evaluation)
A set of test queries with known-correct answers
Understanding of the pipeline components (embedding model, retriever, generator)

Workflow

Phase 1: Pipeline Inventory

Document the RAG pipeline configuration:

Document source — What documents are indexed? Format, count, size.
Chunking — Strategy (fixed-size, semantic, paragraph), chunk size, overlap.
Embedding — Model name and version, dimensionality.
Vector store — Type (FAISS, Pinecone, Chroma, pgvector), index type.
Retrieval — Method (similarity, hybrid, reranking), top-K parameter.
Generation — Model, prompt template, context window usage.

Phase 2: Design Evaluation Queries

Create a diverse set of test queries:

Query Type	Purpose	Count
Known-answer (factoid)	Measure retrieval + generation accuracy	10+
Multi-hop	Require combining info from multiple chunks	5+
Unanswerable	Not in the corpus — should abstain	3+
Ambiguous	Multiple valid interpretations	3+
Recent/updated	Test freshness	2+

For each query, document the expected answer and the source chunk(s).

Phase 3: Evaluate Retrieval

For each test query, measure:

Precision@K — Of the K retrieved chunks, how many are relevant?
Recall@K — Of all relevant chunks in the corpus, how many were retrieved?
MRR (Mean Reciprocal Rank) — How high is the first relevant chunk ranked?
Chunk relevance — Score each retrieved chunk: Relevant, Partially Relevant, Irrelevant.

Phase 4: Evaluate Generation

For each test query with retrieved context:

Groundedness — Is every claim in the response supported by the retrieved context? Score: 0 (hallucinated) to 1 (fully grounded).
Completeness — Does the response use all relevant information from the context? Score: 0 (ignored context) to 1 (complete).
Hallucination detection — Identify specific claims not supported by context.
Abstention — For unanswerable queries, does the model correctly say "I don't know"?

Phase 5: Diagnose Failures

For every incorrect or low-quality response, classify the root cause:

Failure Type	Diagnosis	Indicator
Retrieval failure	Relevant chunks not retrieved	Low Recall@K
Ranking failure	Relevant chunk retrieved but ranked low	Low MRR, high Recall
Chunk boundary issue	Answer split across chunk boundaries	Partial matches in multiple chunks
Embedding mismatch	Query semantics don't match chunk embeddings	Relevant chunk has low similarity score
Generation failure	Correct context but wrong answer	High retrieval scores, low groundedness
Hallucination	Model invents facts not in context	Claims not traceable to any chunk
Over-abstention	Model refuses to answer when context is sufficient	Unanswered with relevant context present

Phase 6: Recommendations

Based on failure analysis, recommend specific improvements:

Failure Pattern	Recommendation
Chunk boundary issues	Increase overlap, try semantic chunking
Low Precision@K	Reduce K, add reranking stage
Low Recall@K	Increase K, try hybrid search
Embedding mismatch	Try different embedding model, add query expansion
Hallucination	Strengthen grounding instruction in prompt, reduce temperature
Over-abstention	Soften abstention criteria in prompt

Output Format

## RAG Audit Report

### Pipeline Configuration
| Component | Value |
|-----------|-------|
| Documents | {N} ({format}) |
| Chunking | {strategy}, {size} tokens, {overlap}% overlap |
| Embedding | {model} ({dimensions}d) |
| Retrieval | {method}, K={N} |
| Generation | {model}, temperature={T} |

### Evaluation Dataset
- **Total queries:** {N}
- **Known-answer:** {N}
- **Multi-hop:** {N}
- **Unanswerable:** {N}

### Retrieval Quality

| Metric | Score | Target | Status |
|--------|-------|--------|--------|
| Precision@{K} | {score} | {target} | {Pass/Fail} |
| Recall@{K} | {score} | {target} | {Pass/Fail} |
| MRR | {score} | {target} | {Pass/Fail} |

### Generation Quality

| Metric | Score | Target | Status |
|--------|-------|--------|--------|
| Groundedness | {score} | {target} | {Pass/Fail} |
| Completeness | {score} | {target} | {Pass/Fail} |
| Hallucination rate | {score} | {target} | {Pass/Fail} |
| Abstention accuracy | {score} | {target} | {Pass/Fail} |

### Failure Analysis

| # | Query | Failure Type | Root Cause | Recommendation |
|---|-------|-------------|------------|----------------|
| 1 | {query} | {type} | {cause} | {fix} |

### Recommendations (Priority Order)
1. **{Recommendation}** — addresses {N} failures, expected impact: {description}
2. **{Recommendation}** — addresses {N} failures, expected impact: {description}

### Sample Failures

#### Query: "{query}"
- **Expected:** {answer}
- **Retrieved chunks:** {chunk summaries with relevance scores}
- **Generated:** {response}
- **Issue:** {diagnosis}

Calibration Rules

Component isolation. Evaluate retrieval and generation independently. A great retriever with a bad generator looks like retrieval failure if you only check end output.
Known answers first. Start with factoid questions where the correct answer is unambiguous. Multi-hop and ambiguous queries are harder to evaluate.
Quantify, don't qualify. "Retrieval is bad" is not a finding. "Precision@5 is 0.3 (target: 0.8) with 70% of failures due to chunk boundary splits" is actionable.
Sample failures deeply. Aggregate metrics identify WHERE the problem is. Individual failure analysis identifies WHY.

Error Handling

Problem	Resolution
No known-answer queries available	Help design them from the document corpus. Pick 10 facts and formulate questions.
Pipeline access not available	Work from recorded inputs/outputs. Post-hoc evaluation is possible with query-context-response triples.
Corpus is too large to review	Sample-based evaluation. Select representative documents and generate queries from them.
Multiple failure types co-exist	Address retrieval failures first. Generation quality cannot exceed retrieval quality.

When NOT to Audit

Push back if:

The pipeline hasn't been built yet — design it first, audit after
The corpus has fewer than 10 documents — too small for meaningful retrieval evaluation
The user wants to compare embedding models — that's a benchmark task, not an audit

rag-auditor

Safety Notice

Copy this and send it to your AI assistant to learn