# LLM Evaluation
Evaluate and validate LLM outputs for quality assurance using RAGAS and LLM-as-judge patterns.
## Quick Reference

### LLM-as-Judge Pattern
```python
async def evaluate_quality(input_text: str, output_text: str, dimension: str) -> float:
    response = await llm.chat([{
        "role": "user",
        "content": f"""Evaluate for {dimension}. Score 1-10.
Input: {input_text[:500]}
Output: {output_text[:1000]}
Respond with just the number.""",
    }])
    return int(response.content.strip()) / 10
```
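Usage, assuming `llm` is any async chat client and `source_doc`/`summary` are stand-in variables:

```python
score = await evaluate_quality(source_doc, summary, dimension="faithfulness")
print(f"faithfulness: {score:.2f}")  # e.g. 0.80 if the judge answered "8"
```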
### Quality Gate
```python
QUALITY_THRESHOLD = 0.7

async def quality_gate(state: dict) -> dict:
    scores = await full_quality_assessment(state["input"], state["output"])
    passed = scores["average"] >= QUALITY_THRESHOLD
    return {**state, "quality_passed": passed}
```
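`full_quality_assessment` is not defined in this quick reference; a minimal sketch, assuming it averages the `evaluate_quality` judge above over a few dimensions (the dimension list is illustrative, not prescribed):

```python
DIMENSIONS = ["relevance", "accuracy", "completeness"]  # illustrative choice

async def full_quality_assessment(input_text: str, output_text: str) -> dict:
    # One judge call per dimension; "average" is what the quality gate reads
    scores = {d: await evaluate_quality(input_text, output_text, d) for d in DIMENSIONS}
    scores["average"] = sum(scores.values()) / len(DIMENSIONS)
    return scores
```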
### Hallucination Detection
```python
async def detect_hallucination(context: str, output: str) -> dict:
    # Check if the output contains claims not supported by the context
    unsupported = await find_unsupported_claims(context, output)  # sketched below
    return {"has_hallucinations": len(unsupported) > 0, "unsupported_claims": unsupported}
```
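A minimal sketch of `find_unsupported_claims` (a hypothetical helper named here for illustration), assuming the same placeholder `llm` judge client as above and a model that can follow a JSON-only instruction:

```python
import json

async def find_unsupported_claims(context: str, output: str) -> list[str]:
    # Ask the judge to list claims in the output that the context does not support
    response = await llm.chat([{
        "role": "user",
        "content": f"""List every factual claim in the Output that is NOT supported by the Context.
Context: {context[:2000]}
Output: {output[:1000]}
Respond with a JSON array of strings. Respond with [] if all claims are supported.""",
    }])
    try:
        return json.loads(response.content)
    except json.JSONDecodeError:
        return []  # treat unparseable judge output as "no verdict", not as hallucination
```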
## RAGAS Metrics
| Metric | Use Case | Threshold |
|---|---|---|
| Faithfulness | RAG grounding | ≥ 0.8 |
| Answer Relevancy | Q&A systems | ≥ 0.7 |
| Context Precision | Retrieval quality | ≥ 0.7 |
| Context Recall | Retrieval completeness | ≥ 0.7 |
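These metrics map directly onto the `ragas` library; a minimal sketch using its classic `evaluate` API (import paths and expected column names differ across ragas versions, and the default judge needs an OpenAI key configured, so treat this as illustrative):

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, context_precision, context_recall, faithfulness

# One row per evaluated sample; ground_truth is required by context_recall
data = Dataset.from_dict({
    "question": ["What is RAGAS?"],
    "answer": ["RAGAS is a framework for evaluating RAG pipelines."],
    "contexts": [["RAGAS is an open-source framework for evaluating retrieval-augmented generation."]],
    "ground_truth": ["RAGAS is a framework for evaluating RAG pipelines."],
})

result = evaluate(data, metrics=[faithfulness, answer_relevancy, context_precision, context_recall])
print(result)  # e.g. {'faithfulness': 1.00, 'answer_relevancy': 0.97, ...}
```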
## Anti-Patterns (FORBIDDEN)
❌ NEVER use the same model as both generator and judge

```python
output = await gpt4.complete(prompt)
score = await gpt4.evaluate(output)  # Same model!
```
❌ NEVER rely on a single dimension

```python
if relevance_score > 0.7:  # Only checking one thing
    return "pass"
```
❌ NEVER set the threshold too high

```python
THRESHOLD = 0.95  # Blocks most content
```
✅ ALWAYS use a different judge model

```python
score = await gpt4_mini.evaluate(claude_output)
```
✅ ALWAYS use multiple dimensions
```python
scores = await evaluate_all_dimensions(output)
if scores["average"] > 0.7:
    return "pass"
```
## Key Decisions

| Decision | Recommendation |
|---|---|
| Judge model | GPT-5.2-mini or Claude Haiku 4.5 |
| Threshold | 0.7 for production, 0.6 for drafts |
| Dimensions | 3-5 most relevant to use case |
| Sample size | 50+ for reliable metrics |
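These decisions can be captured in a single config object; a minimal sketch (names and defaults are illustrative, not a prescribed API):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EvalConfig:
    judge_model: str = "gpt-5.2-mini"   # must differ from the evaluated model
    threshold: float = 0.7              # lower to 0.6 for draft-quality gates
    dimensions: tuple[str, ...] = ("relevance", "accuracy", "completeness")
    min_samples: int = 50               # below this, aggregate metrics are noisy
```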
## Detailed Documentation

| Resource | Description |
|---|---|
| `references/evaluation-metrics.md` | RAGAS & LLM-as-judge metrics |
| `examples/evaluation-patterns.md` | Complete evaluation examples |
| `checklists/evaluation-checklist.md` | Setup and review checklists |
| `scripts/evaluator-template.py` | Starter evaluation template |
## Related Skills

- `quality-gates`: Workflow quality control
- `langfuse-observability`: Tracking evaluation scores
- `agent-loops`: Self-correcting with evaluation
## Capability Details

### llm-as-judge

Keywords: LLM judge, judge model, evaluation model, grader LLM

Solves:
- Use an LLM to evaluate other LLM outputs
- Implement judge prompts for quality
- Configure evaluation criteria
### ragas-metrics

Keywords: RAGAS, faithfulness, answer relevancy, context precision

Solves:
- Evaluate RAG with RAGAS metrics
- Measure faithfulness and relevancy
- Assess context precision and recall
### hallucination-detection

Keywords: hallucination, factuality, grounded, verify facts

Solves:
- Detect hallucinations in LLM output
- Verify factual accuracy
- Implement grounding checks
### quality-gates

Keywords: quality gate, threshold, pass/fail, evaluation gate

Solves:
- Implement quality thresholds
- Block low-quality outputs
- Configure multi-metric gates
### batch-evaluation

Keywords: batch eval, dataset evaluation, bulk scoring, eval suite

Solves:
- Evaluate over golden datasets
- Run batch evaluation pipelines (see the sketch after this list)
- Generate evaluation reports
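A minimal batch-evaluation sketch over a golden dataset, reusing `full_quality_assessment` and `QUALITY_THRESHOLD` from the Quick Reference (the JSONL schema with `input`/`output` fields is an assumption, not a fixed format):

```python
import asyncio
import json

async def run_batch_eval(path: str) -> dict:
    # Each JSONL row: {"input": ..., "output": ...} (schema is an assumption)
    with open(path, encoding="utf-8") as f:
        rows = [json.loads(line) for line in f if line.strip()]
    results = await asyncio.gather(
        *(full_quality_assessment(row["input"], row["output"]) for row in rows)
    )
    averages = [r["average"] for r in results]
    return {
        "n": len(averages),
        "mean_score": sum(averages) / len(averages),
        "pass_rate": sum(a >= QUALITY_THRESHOLD for a in averages) / len(averages),
    }
```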
### pairwise-comparison

Keywords: pairwise, A/B comparison, side-by-side, preference

Solves:
- Compare two model outputs
- Implement preference ranking
- Run A/B evaluations (see the sketch after this list)
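A minimal pairwise-judge sketch, assuming the same placeholder `llm` client as above; it swaps positions on a second call to counter the judge's known position bias:

```python
async def pairwise_judge(prompt: str, output_a: str, output_b: str) -> str:
    # Ask for a single-letter verdict; positions are swapped on the second call
    async def ask(first: str, second: str) -> str:
        response = await llm.chat([{
            "role": "user",
            "content": f"""Which response better answers the prompt? Reply with only A or B.
Prompt: {prompt[:500]}
A: {first[:1000]}
B: {second[:1000]}""",
        }])
        return response.content.strip().upper()

    v1 = await ask(output_a, output_b)  # here A = output_a
    v2 = await ask(output_b, output_a)  # here A = output_b (swapped)
    if v1 == "A" and v2 == "B":
        return "a"
    if v1 == "B" and v2 == "A":
        return "b"
    return "tie"  # disagreement across orderings suggests position bias
```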