<usage_patterns>
- Architecture Review: Run symbol searches on key interfaces to understand the dependency graph.
- Plan Mode: Use this skill to populate the "Context" section of a Plan Mode artifact.
- Refactoring: Identify all usages of a symbol before renaming or modifying it.
</usage_patterns>
symbols "UserAuthentication"
Semantic Search:
search "authentication middleware logic"
</code_example>
# RAG Evaluation

## Overview

Systematic evaluation of RAG quality using retrieval and end-to-end metrics. Based on Claude Cookbooks patterns.
## Evaluation Metrics

**Retrieval Metrics** (from `.claude/tools/repo-rag/metrics.py`; a computation sketch follows this list):

- **Precision**: Proportion of retrieved chunks that are actually relevant.
  - Formula: Precision = True Positives / Total Retrieved
  - High precision (0.8-1.0): the system retrieves mostly relevant items.
- **Recall**: Completeness of retrieval, i.e. how many of the relevant chunks were found.
  - Formula: Recall = True Positives / Total Correct
  - High recall (0.8-1.0): the system finds most of the relevant items.
- **F1 Score**: Harmonic mean of precision and recall.
  - Formula: F1 = 2 × (Precision × Recall) / (Precision + Recall)
  - Balanced measure when both precision and recall matter.
- **MRR (Mean Reciprocal Rank)**: Measures ranking quality.
  - Formula: MRR = mean over queries of 1 / rank of the first correct item
  - High MRR (0.8-1.0): correct items are ranked first.
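
For illustration, here is a minimal sketch of how these metrics can be computed for a single query. It assumes chunks are identified by file path and is not the actual `metrics.py` implementation; the real module exposes `evaluate_retrieval`, used below under Evaluation Process.

```python
def retrieval_metrics(retrieved: list[str], correct: list[str]) -> dict[str, float]:
    """Score one query: `retrieved` is the ranked list of chunk IDs returned by the
    retriever, `correct` is the ground-truth list from the evaluation dataset."""
    relevant = set(correct)
    true_positives = sum(1 for chunk in retrieved if chunk in relevant)

    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(correct) if correct else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

    # Reciprocal rank: 1 / position of the first relevant chunk, 0.0 if none was found.
    # MRR is the mean of this value over all queries in the dataset.
    reciprocal_rank = 0.0
    for rank, chunk in enumerate(retrieved, start=1):
        if chunk in relevant:
            reciprocal_rank = 1.0 / rank
            break

    return {"precision": precision, "recall": recall, "f1": f1, "mrr": reciprocal_rank}
```

Averaging the per-query values across the whole dataset yields summary numbers like those reported under Expected Performance.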
**End-to-End Metrics** (from `.claude/tools/repo-rag/evaluation.py`; a judging sketch follows this list):

- **Accuracy (LLM-as-Judge)**: Overall correctness, judged by Claude.
  - Compares the generated answer to the correct answer.
  - Focuses on substance and meaning, not exact wording.
  - Checks for completeness and absence of contradictions.
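
A minimal LLM-as-judge sketch using the Anthropic Python SDK is shown below. The prompt wording and model name are assumptions for illustration; the actual `evaluation.py` prompt and return format may differ.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def judge_answer(query: str, generated_answer: str, correct_answer: str) -> bool:
    """Ask Claude whether the generated answer matches the reference in substance."""
    prompt = (
        f"Question: {query}\n\n"
        f"Reference answer: {correct_answer}\n\n"
        f"Candidate answer: {generated_answer}\n\n"
        "Does the candidate answer convey the same substance as the reference answer, "
        "without contradictions or major omissions? Reply with only CORRECT or INCORRECT."
    )
    response = client.messages.create(
        model="claude-sonnet-4-5",  # assumed model name; use whichever model you evaluate with
        max_tokens=10,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text.strip().upper().startswith("CORRECT")
```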
## Evaluation Process

**Create Evaluation Dataset** (a loading sketch follows the example):

```json
{
  "query": "How is user authentication implemented?",
  "correct_chunks": ["src/auth/middleware.ts", "src/auth/types.ts"],
  "correct_answer": "User authentication uses JWT tokens...",
  "category": "authentication"
}
```
**Run Retrieval Evaluation** (using Python directly):

```python
import sys
sys.path.insert(0, ".claude/tools/repo-rag")  # hyphenated directory, so import via sys.path
from metrics import evaluate_retrieval

metrics = evaluate_retrieval(retrieved_chunks, correct_chunks)  # ranked retrieved IDs vs. ground-truth chunks
print(f"Precision: {metrics['precision']}, Recall: {metrics['recall']}, F1: {metrics['f1']}, MRR: {metrics['mrr']}")
```
**Run End-to-End Evaluation** (using Python directly):

```python
import sys
sys.path.insert(0, ".claude/tools/repo-rag")  # hyphenated directory, so import via sys.path
from evaluation import evaluate_end_to_end

result = evaluate_end_to_end(query, generated_answer, correct_answer)  # LLM-as-judge verdict
print(f"Correct: {result['is_correct']}, Explanation: {result['explanation']}")
```
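
Putting the pieces together, a minimal loop over the dataset might look like the sketch below. `retrieve` and `generate_answer` are stand-ins for your RAG pipeline (assumptions, not functions provided by the tools), and the averaging is one plausible way to produce summary numbers, not the exact `evaluation.py` behavior.

```python
from statistics import mean

# Assumes evaluate_retrieval and evaluate_end_to_end are imported as in the steps above.
def run_evaluation(cases: list[dict]) -> dict[str, float]:
    """Evaluate every case and average the per-query results into one summary."""
    per_query = []
    correct_answers = 0
    for case in cases:
        retrieved = retrieve(case["query"])                 # your retriever (assumed helper)
        answer = generate_answer(case["query"], retrieved)  # your generator (assumed helper)

        per_query.append(evaluate_retrieval(retrieved, case["correct_chunks"]))
        verdict = evaluate_end_to_end(case["query"], answer, case["correct_answer"])
        correct_answers += int(verdict["is_correct"])

    return {
        "precision": mean(m["precision"] for m in per_query),
        "recall": mean(m["recall"] for m in per_query),
        "f1": mean(m["f1"] for m in per_query),
        "mrr": mean(m["mrr"] for m in per_query),
        "accuracy": correct_answers / len(cases),
    }
```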
## Expected Performance

Based on Claude Cookbooks results:

| Configuration   | Precision | Recall | F1   | MRR  | Accuracy |
|-----------------|-----------|--------|------|------|----------|
| Basic RAG       | 0.43      | 0.66   | 0.52 | 0.74 | 71%      |
| With Re-ranking | 0.44      | 0.69   | 0.54 | 0.87 | 81%      |
## Best Practices

- **Separate Evaluation**: Evaluate retrieval and end-to-end quality separately.
- **Create Comprehensive Datasets**: Cover common and edge cases.
- **Evaluate Regularly**: Run evaluations after codebase changes.
- **Track Metrics Over Time**: Monitor improvements (a logging sketch follows this list).
- **Use Both Metrics**: Precision/Recall for retrieval, Accuracy for end-to-end.
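
One lightweight way to track metrics over time is to append each run's summary to a log file. The sketch below assumes a hypothetical `rag_eval_history.jsonl` path; adjust it to wherever you keep evaluation artifacts.

```python
import json
import time

def log_evaluation_run(summary: dict, path: str = "rag_eval_history.jsonl") -> None:
    """Append one evaluation summary (e.g. the dict from the evaluation loop) to a JSONL log."""
    record = {"timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"), **summary}
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
```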
## References

- RAG Patterns Guide - Implementation patterns
- Retrieval Metrics - Metric calculations
- End-to-End Evaluation - LLM-as-judge
- Evaluation Guide - Comprehensive evaluation guide
## Memory Protocol (MANDATORY)

Before starting: Read `.claude/context/memory/learnings.md`

After completing:

- New pattern -> `.claude/context/memory/learnings.md`
- Issue found -> `.claude/context/memory/issues.md`
- Decision made -> `.claude/context/memory/decisions.md`

ASSUME INTERRUPTION: If it's not in memory, it didn't happen.