<usage_patterns>
- Architecture Review: Run symbol searches on key interfaces to understand the dependency graph.
- Plan Mode: Use this skill to populate the "Context" section of a Plan Mode artifact.
- Refactoring: Identify all usages of a symbol before renaming or modifying it.
</usage_patterns>
symbols "UserAuthentication"
Semantic Search:
search "authentication middleware logic"
</code_example>
# RAG Evaluation

## Overview

Systematic evaluation of RAG quality using retrieval and end-to-end metrics. Based on Claude Cookbooks patterns.
## Evaluation Metrics

**Retrieval Metrics** (from `.claude/tools/repo-rag/metrics.py`; a computation sketch follows this list):

- **Precision**: Proportion of retrieved chunks that are actually relevant.
  - Formula: Precision = True Positives / Total Retrieved
  - High precision (0.8-1.0): the system retrieves mostly relevant items.
- **Recall**: Completeness of retrieval, i.e. how many of the relevant chunks were found.
  - Formula: Recall = True Positives / Total Correct
  - High recall (0.8-1.0): the system finds most of the relevant items.
- **F1 Score**: Harmonic mean of precision and recall.
  - Formula: F1 = 2 × (Precision × Recall) / (Precision + Recall)
  - Balanced measure when both precision and recall matter.
- **MRR (Mean Reciprocal Rank)**: Measures ranking quality.
  - Formula: MRR = mean over queries of 1 / rank of the first correct item
  - High MRR (0.8-1.0): correct items are ranked first.
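
For illustration, here is a minimal sketch of how these metrics can be computed for a single query. It assumes chunks are identified by file path and is not the actual `metrics.py` implementation; the real module exposes `evaluate_retrieval`, used below under Evaluation Process.

```python
def retrieval_metrics(retrieved: list[str], correct: list[str]) -> dict[str, float]:
    """Score one query: `retrieved` is the ranked list of chunk IDs returned by the
    retriever, `correct` is the ground-truth list from the evaluation dataset."""
    relevant = set(correct)
    true_positives = sum(1 for chunk in retrieved if chunk in relevant)

    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(correct) if correct else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

    # Reciprocal rank: 1 / position of the first relevant chunk, 0.0 if none was found.
    # MRR is the mean of this value over all queries in the dataset.
    reciprocal_rank = 0.0
    for rank, chunk in enumerate(retrieved, start=1):
        if chunk in relevant:
            reciprocal_rank = 1.0 / rank
            break

    return {"precision": precision, "recall": recall, "f1": f1, "mrr": reciprocal_rank}
```

Averaging the per-query values across the whole dataset yields summary numbers like those reported under Expected Performance.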
**End-to-End Metrics** (from `.claude/tools/repo-rag/evaluation.py`; a judging sketch follows this list):

- **Accuracy (LLM-as-Judge)**: Overall correctness, judged by Claude.
  - Compares the generated answer to the correct answer.
  - Focuses on substance and meaning, not exact wording.
  - Checks for completeness and absence of contradictions.
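
A minimal LLM-as-judge sketch using the Anthropic Python SDK is shown below. The prompt wording and model name are assumptions for illustration; the actual `evaluation.py` prompt and return format may differ.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def judge_answer(query: str, generated_answer: str, correct_answer: str) -> bool:
    """Ask Claude whether the generated answer matches the reference in substance."""
    prompt = (
        f"Question: {query}\n\n"
        f"Reference answer: {correct_answer}\n\n"
        f"Candidate answer: {generated_answer}\n\n"
        "Does the candidate answer convey the same substance as the reference answer, "
        "without contradictions or major omissions? Reply with only CORRECT or INCORRECT."
    )
    response = client.messages.create(
        model="claude-sonnet-4-5",  # assumed model name; use whichever model you evaluate with
        max_tokens=10,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text.strip().upper().startswith("CORRECT")
```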
## Evaluation Process

**Create Evaluation Dataset** (a loading sketch follows the example):

```json
{
  "query": "How is user authentication implemented?",
  "correct_chunks": ["src/auth/middleware.ts", "src/auth/types.ts"],
  "correct_answer": "User authentication uses JWT tokens...",
  "category": "authentication"
}
```
**Run Retrieval Evaluation** (using Python directly):

```python
import sys
sys.path.insert(0, ".claude/tools/repo-rag")  # hyphenated directory, so import via sys.path
from metrics import evaluate_retrieval

metrics = evaluate_retrieval(retrieved_chunks, correct_chunks)  # ranked retrieved IDs vs. ground-truth chunks
print(f"Precision: {metrics['precision']}, Recall: {metrics['recall']}, F1: {metrics['f1']}, MRR: {metrics['mrr']}")
```
**Run End-to-End Evaluation** (using Python directly):

```python
import sys
sys.path.insert(0, ".claude/tools/repo-rag")  # hyphenated directory, so import via sys.path
from evaluation import evaluate_end_to_end

result = evaluate_end_to_end(query, generated_answer, correct_answer)  # LLM-as-judge verdict
print(f"Correct: {result['is_correct']}, Explanation: {result['explanation']}")
```
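
Putting the pieces together, a minimal loop over the dataset might look like the sketch below. `retrieve` and `generate_answer` are stand-ins for your RAG pipeline (assumptions, not functions provided by the tools), and the averaging is one plausible way to produce summary numbers, not the exact `evaluation.py` behavior.

```python
from statistics import mean

# Assumes evaluate_retrieval and evaluate_end_to_end are imported as in the steps above.
def run_evaluation(cases: list[dict]) -> dict[str, float]:
    """Evaluate every case and average the per-query results into one summary."""
    per_query = []
    correct_answers = 0
    for case in cases:
        retrieved = retrieve(case["query"])                 # your retriever (assumed helper)
        answer = generate_answer(case["query"], retrieved)  # your generator (assumed helper)

        per_query.append(evaluate_retrieval(retrieved, case["correct_chunks"]))
        verdict = evaluate_end_to_end(case["query"], answer, case["correct_answer"])
        correct_answers += int(verdict["is_correct"])

    return {
        "precision": mean(m["precision"] for m in per_query),
        "recall": mean(m["recall"] for m in per_query),
        "f1": mean(m["f1"] for m in per_query),
        "mrr": mean(m["mrr"] for m in per_query),
        "accuracy": correct_answers / len(cases),
    }
```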
## Expected Performance

Based on Claude Cookbooks results:

| Configuration   | Precision | Recall | F1   | MRR  | Accuracy |
|-----------------|-----------|--------|------|------|----------|
| Basic RAG       | 0.43      | 0.66   | 0.52 | 0.74 | 71%      |
| With Re-ranking | 0.44      | 0.69   | 0.54 | 0.87 | 81%      |
## Best Practices

- **Separate Evaluation**: Evaluate retrieval and end-to-end quality separately.
- **Create Comprehensive Datasets**: Cover common and edge cases.
- **Evaluate Regularly**: Run evaluations after codebase changes.
- **Track Metrics Over Time**: Monitor improvements (a logging sketch follows this list).
- **Use Both Metrics**: Precision/Recall for retrieval, Accuracy for end-to-end.
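
One lightweight way to track metrics over time is to append each run's summary to a log file. The sketch below assumes a hypothetical `rag_eval_history.jsonl` path; adjust it to wherever you keep evaluation artifacts.

```python
import json
import time

def log_evaluation_run(summary: dict, path: str = "rag_eval_history.jsonl") -> None:
    """Append one evaluation summary (e.g. the dict from the evaluation loop) to a JSONL log."""
    record = {"timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"), **summary}
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
```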
## References

- RAG Patterns Guide - Implementation patterns
- Retrieval Metrics - Metric calculations
- End-to-End Evaluation - LLM-as-judge
- Evaluation Guide - Comprehensive evaluation guide
## Memory Protocol (MANDATORY)

Before starting: Read `.claude/context/memory/learnings.md`

After completing:

- New pattern -> `.claude/context/memory/learnings.md`
- Issue found -> `.claude/context/memory/issues.md`
- Decision made -> `.claude/context/memory/decisions.md`

ASSUME INTERRUPTION: If it's not in memory, it didn't happen.