LLM Evaluation
Master comprehensive evaluation strategies for LLM applications, from automated metrics to human evaluation and A/B testing.
When to Use This Skill
- Measuring LLM application performance systematically
- Comparing different models or prompts
- Detecting performance regressions before deployment
- Validating improvements from prompt changes
- Building confidence in production systems
- Establishing baselines and tracking progress over time
- Debugging unexpected model behavior
Core Evaluation Types
Automated Metrics
Fast, repeatable, scalable evaluation using computed scores.
Text Generation:
- BLEU: N-gram overlap (translation)
- ROUGE: Recall-oriented n-gram overlap (summarization)
- METEOR: Unigram matching with stemming and synonym support
- BERTScore: Embedding-based similarity
- Perplexity: How well a language model predicts the text (lower is better)
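As a small worked illustration of perplexity, the sketch below computes it as the exponential of the negative mean log-probability per token. It assumes you already have natural-log token probabilities from your model; the function name `perplexity` is illustrative, not part of any library.

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(-mean natural-log probability per token)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# A model that assigns probability 0.25 to each of four tokens
# is as uncertain as a uniform choice among four options.
uniform_logprobs = [math.log(0.25)] * 4
ppl = perplexity(uniform_logprobs)  # approximately 4.0
```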
Classification:
- Accuracy: Percentage correct
- Precision/Recall/F1: Class-specific performance
- Confusion Matrix: Error patterns
- AUC-ROC: Ranking quality
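To make the class-specific metrics concrete, here is a minimal pure-Python sketch that derives precision, recall, and F1 for one class from paired labels. The helper name `precision_recall_f1` and the spam/ham labels are illustrative assumptions; in practice a library such as scikit-learn would usually do this.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Compute precision, recall, and F1 for one class from paired labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

y_true = ["spam", "spam", "ham", "ham", "spam"]
y_pred = ["spam", "ham", "ham", "spam", "spam"]
scores = precision_recall_f1(y_true, y_pred, positive="spam")
# Here tp=2, fp=1, fn=1, so precision, recall, and F1 are all 2/3.
```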
Retrieval (RAG):
- MRR: Mean Reciprocal Rank
- NDCG: Normalized Discounted Cumulative Gain
- Precision@K: Fraction of the top K results that are relevant
- Recall@K: Fraction of all relevant items found in the top K
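The retrieval metrics above can be sketched in a few lines of pure Python. This version assumes binary relevance (a document is either relevant or not) and uses hypothetical document IDs; the function names are illustrative.

```python
import math

def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant item, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k results that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(1 for d in ranked_ids[:k] if d in relevant_ids) / k

def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of all relevant items that appear in the top k."""
    if not relevant_ids:
        return 0.0
    return sum(1 for d in ranked_ids[:k] if d in relevant_ids) / len(relevant_ids)

def ndcg_at_k(ranked_ids, relevant_ids, k):
    """NDCG with binary relevance: DCG of this ranking / DCG of an ideal one."""
    dcg = sum(
        1.0 / math.log2(i + 1)
        for i, d in enumerate(ranked_ids[:k], start=1)
        if d in relevant_ids
    )
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(i + 1) for i in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0

ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}

# MRR is the mean of the reciprocal rank over a set of queries.
queries = [(ranked, relevant), (["d5", "d9"], {"d5"})]
mrr = sum(reciprocal_rank(r, rel) for r, rel in queries) / len(queries)
```

In this example the first relevant document for the first query sits at rank 2 and for the second query at rank 1, so the MRR is the mean of 0.5 and 1.0.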
Human Evaluation
Manual assessment for quality aspects difficult to automate.
Dimensions:
- Accuracy: Factual correctness
- Coherence: Logical flow
- Relevance: Answers the question
- Fluency: Natural language quality
- Safety: No harmful content
- Helpfulness: Useful to the user
LLM-as-Judge
Use stronger LLMs to evaluate weaker model outputs.
Approaches:
- Pointwise: Score individual responses
- Pairwise: Compare two responses
- Reference-based: Compare to gold standard
- Reference-free: Judge without ground truth
Quick Start
from llm_eval import EvaluationSuite, Metric

# Define evaluation suite
suite = EvaluationSuite([
    Metric.accuracy(),
    Metric.bleu(),
    Metric.bertscore(),
    Metric.custom(name="groundedness", fn=check_groundedness)
])

# Prepare test cases
test_cases = [
    {
        "input": "What is the capital of France?",
        "expected": "Paris",
        "context": "France is a country in Europe. Paris is its capital."
    },
    # ... more test cases
]

# Run evaluation
results = suite.evaluate(
    model=your_model,
    test_cases=test_cases
)

print(f"Overall Accuracy: {results.metrics['accuracy']}")
print(f"BLEU Score: {results.metrics['bleu']}")
Automated Metrics Implementation
BLEU Score
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def calculate_bleu(reference, hypothesis):
    """Calculate BLEU score between reference and hypothesis."""
    smoothie = SmoothingFunction().method4

    return sentence_bleu(
        [reference.split()],
        hypothesis.split(),
        smoothing_function=smoothie
    )

# Usage
bleu = calculate_bleu(
    reference="The cat sat on the mat",
    hypothesis="A cat is sitting on the mat"
)
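BLEU combines clipped n-gram precisions (typically for n = 1 to 4) with a brevity penalty. To show just the first ingredient, the sketch below computes the clipped unigram precision for the sentence pair used above. It is not a full BLEU implementation, and lowercasing the text is an assumption of this sketch rather than something NLTK does for you.

```python
from collections import Counter

def clipped_unigram_precision(reference, hypothesis):
    """Modified unigram precision: each hypothesis word counts at most
    as often as it appears in the reference."""
    ref_counts = Counter(reference.lower().split())
    hyp_counts = Counter(hypothesis.lower().split())
    overlap = sum(min(count, ref_counts[word]) for word, count in hyp_counts.items())
    total = sum(hyp_counts.values())
    return overlap / total if total else 0.0

p1 = clipped_unigram_precision(
    "The cat sat on the mat",
    "A cat is sitting on the mat",
)
# 4 of the 7 hypothesis words (cat, on, the, mat) appear in the reference.
```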
ROUGE Score
from rouge_score import rouge_scorer

def calculate_rouge(reference, hypothesis):
    """Calculate ROUGE scores."""
    scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
    scores = scorer.score(reference, hypothesis)

    return {
        'rouge1': scores['rouge1'].fmeasure,
        'rouge2': scores['rouge2'].fmeasure,
        'rougeL': scores['rougeL'].fmeasure
    }
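For intuition about what ROUGE-L measures, the following sketch computes an F-measure from the longest common subsequence (LCS) of the two token sequences. It is a simplified illustration with whitespace tokenization and no stemming, so its numbers will not always match the `rouge_score` package; the function names are illustrative.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l_f1(reference, hypothesis):
    """F-measure of LCS-based precision and recall."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    lcs = lcs_length(ref, hyp)
    if lcs == 0:
        return 0.0
    precision = lcs / len(hyp)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

score = rouge_l_f1("The cat sat on the mat", "A cat is sitting on the mat")
# The LCS is "cat on the mat" (4 tokens): precision 4/7, recall 4/6.
```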
BERTScore
from bert_score import score

def calculate_bertscore(references, hypotheses):
    """Calculate BERTScore using a pre-trained transformer encoder."""
    P, R, F1 = score(
        hypotheses,
        references,
        lang='en',
        model_type='microsoft/deberta-xlarge-mnli'
    )

    return {
        'precision': P.mean().item(),
        'recall': R.mean().item(),
        'f1': F1.mean().item()
    }
Custom Metrics
def calculate_groundedness(response, context):
    """Check if response is grounded in provided context."""
    # Use NLI model to check entailment
    # (in real use, create the pipeline once and reuse it across calls)
    from transformers import pipeline

    nli = pipeline("text-classification", model="microsoft/deberta-large-mnli")
    # Pass premise and hypothesis as a pair so the tokenizer adds the model's
    # own separator tokens, rather than relying on a literal "[SEP]" in the text
    result = nli({"text": context, "text_pair": response})
    if isinstance(result, list):  # return shape can differ across transformers versions
        result = result[0]

    # Return confidence that response is entailed by context
    return result['score'] if result['label'].upper() == 'ENTAILMENT' else 0.0

def calculate_toxicity(text):
    """Measure toxicity in generated text."""
    from detoxify import Detoxify

    results = Detoxify('original').predict(text)
    return max(results.values())  # Return highest toxicity score

def calculate_factuality(claim, knowledge_base):
    """Verify factual claims against knowledge base."""
    # Implementation depends on your knowledge base
    # Could use retrieval + NLI, or fact-checking API
    pass
LLM-as-Judge Patterns
Single Output Evaluation
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def llm_judge_quality(response, question):
    """Use GPT-5 to judge response quality."""
    prompt = f"""Rate the following response on a scale of 1-10 for:
- Accuracy (factually correct)
- Helpfulness (answers the question)
- Clarity (well-written and understandable)

Question: {question}
Response: {response}

Provide ratings in JSON format:
{{
  "accuracy": <1-10>,
  "helpfulness": <1-10>,
  "clarity": <1-10>,
  "reasoning": "<brief explanation>"
}}
"""

    # openai>=1.0 client call; the older openai.ChatCompletion.create was removed
    result = client.chat.completions.create(
        model="gpt-5",
        messages=[{"role": "user", "content": prompt}],
        temperature=0  # some models accept only the default; drop this if the API rejects it
    )

    return json.loads(result.choices[0].message.content)
Pairwise Comparison
def compare_responses(question, response_a, response_b):
    """Compare two responses using LLM judge."""
    prompt = f"""Compare these two responses to the question and determine which is better.

Question: {question}

Response A: {response_a}

Response B: {response_b}

Which response is better and why? Consider accuracy, helpfulness, and clarity.

Answer with JSON:
{{
  "winner": "A" or "B" or "tie",
  "reasoning": "<explanation>",
  "confidence": <1-10>
}}
"""

    # Uses the `client` and `json` import defined above
    result = client.chat.completions.create(
        model="gpt-5",
        messages=[{"role": "user", "content": prompt}],
        temperature=0  # some models accept only the default; drop this if the API rejects it
    )

    return json.loads(result.choices[0].message.content)
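Both judge functions call `json.loads` directly on the model's reply, which raises if the judge adds commentary around the JSON object. A minimal defensive sketch, assuming the reply contains at most one JSON object, might look like this; `parse_judge_json` is a hypothetical helper, not part of any SDK.

```python
import json
import re

def parse_judge_json(raw_text, required_keys):
    """Extract a JSON object from judge output and check its keys.

    Returns the parsed dict, or None if no valid object with all
    required keys is found.
    """
    # Take everything from the first "{" to the last "}" so that
    # leading or trailing commentary around the object is ignored.
    match = re.search(r"\{.*\}", raw_text, flags=re.DOTALL)
    if match is None:
        return None
    try:
        parsed = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    if not isinstance(parsed, dict):
        return None
    if not all(key in parsed for key in required_keys):
        return None
    return parsed

raw = 'Here is my verdict: {"winner": "A", "reasoning": "More accurate.", "confidence": 8} Hope that helps.'
verdict = parse_judge_json(raw, required_keys=("winner", "reasoning", "confidence"))
```

Returning `None` instead of raising lets the caller decide whether to retry the judge call or record the example as unscored.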
Human Evaluation Frameworks
Annotation Guidelines
class AnnotationTask:
    """Structure for human annotation task."""

    def __init__(self, response, question, context=None):
        self.response = response
        self.question = question
        self.context = context

    def get_annotation_form(self):
        return {
            "question": self.question,
            "context": self.context,
            "response": self.response,
            "ratings": {
                "accuracy": {
                    "scale": "1-5",
                    "description": "Is the response factually correct?"
                },
                "relevance": {
                    "scale": "1-5",
                    "description": "Does it answer the question?"
                },
                "coherence": {
                    "scale": "1-5",
                    "description": "Is it logically consistent?"
                }
            },
            "issues": {
                "factual_error": False,
                "hallucination": False,
                "off_topic": False,
                "unsafe_content": False
            },
            "feedback": ""
        }
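Once several annotators have filled in the form for the same response, their ratings need to be combined. The sketch below averages each rating dimension across annotators; the flat `{"accuracy": 4, ...}` input shape and the helper name `aggregate_ratings` are assumptions for illustration, not part of the form definition above.

```python
from statistics import mean

def aggregate_ratings(annotations):
    """Average each rating dimension across annotators for one response.

    `annotations` is a list of dicts such as
    {"accuracy": 4, "relevance": 5, "coherence": 4}.
    """
    dimensions = sorted({dim for ann in annotations for dim in ann})
    return {
        dim: mean(ann[dim] for ann in annotations if dim in ann)
        for dim in dimensions
    }

completed = [
    {"accuracy": 4, "relevance": 5, "coherence": 4},
    {"accuracy": 5, "relevance": 5, "coherence": 3},
    {"accuracy": 3, "relevance": 4, "coherence": 5},
]
summary = aggregate_ratings(completed)
```

A mean hides disagreement, so it is worth reporting it alongside an agreement statistic rather than on its own.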
Inter-Rater Agreement
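from sklearn.metrics import cohen_kappa_score

def calculate_agreement(rater1_scores, rater2_scores):
    """Calculate inter-rater agreement."""
    kappa = cohen_kappa_score(rater1_scores, rater2_scores)

    # Walk the thresholds in order and stop at the first band containing kappa.
    # (A dict keyed on the comparisons would collapse to the keys True and False.)
    if kappa < 0:
        interpretation = "Poor"
    elif kappa < 0.2:
        interpretation = "Slight"
    elif kappa < 0.4:
        interpretation = "Fair"
    elif kappa < 0.6:
        interpretation = "Moderate"
    elif kappa < 0.8:
        interpretation = "Substantial"
    else:
        interpretation = "Almost Perfect"

    return {
        "kappa": kappa,
        "interpretation": interpretation
    }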
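To see what the kappa statistic corrects for, here is a hand-rolled sketch of the formula kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the agreement expected by chance from each rater's label frequencies. The function name and the toy labels are illustrative; for real use, prefer the scikit-learn function.

```python
from collections import Counter

def cohen_kappa(rater1, rater2):
    """Cohen's kappa = (p_o - p_e) / (1 - p_e) for two raters' labels."""
    if len(rater1) != len(rater2) or not rater1:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater1)
    # Observed agreement: share of items both raters labeled identically
    p_o = sum(a == b for a, b in zip(rater1, rater2)) / n
    # Expected chance agreement from each rater's label frequencies
    c1, c2 = Counter(rater1), Counter(rater2)
    p_e = sum(c1[label] * c2[label] for label in c1) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_o - p_e) / (1 - p_e)

r1 = [1, 1, 0, 0, 1, 0, 1, 1, 0, 0]
r2 = [1, 0, 0, 0, 1, 0, 1, 1, 1, 0]
kappa = cohen_kappa(r1, r2)
# The raters agree on 8 of 10 items (p_o = 0.8) and chance agreement
# is p_e = 0.5, so kappa is (0.8 - 0.5) / (1 - 0.5), about 0.6.
```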
A/B Testing
Statistical Testing Framework
from scipy import stats
import numpy as np

class ABTest:
    def __init__(self, variant_a_name="A", variant_b_name="B"):
        self.variant_a = {"name": variant_a_name, "scores": []}
        self.variant_b = {"name": variant_b_name, "scores": []}

    def add_result(self, variant, score):
        """Add evaluation result for a variant."""
        if variant == "A":
            self.variant_a["scores"].append(score)
        else:
            self.variant_b["scores"].append(score)

    def analyze(self, alpha=0.05):
        """Perform statistical analysis."""
        a_scores = self.variant_a["scores"]
        b_scores = self.variant_b["scores"]

        # T-test
        t_stat, p_value = stats.ttest_ind(a_scores, b_scores)

        # Effect size (Cohen's d), using sample standard deviations (ddof=1);
        # this simple pooling assumes the two groups are of similar size
        pooled_std = np.sqrt(
            (np.std(a_scores, ddof=1)**2 + np.std(b_scores, ddof=1)**2) / 2
        )
        cohens_d = (np.mean(b_scores) - np.mean(a_scores)) / pooled_std

        return {
            "variant_a_mean": np.mean(a_scores),
            "variant_b_mean": np.mean(b_scores),
            "difference": np.mean(b_scores) - np.mean(a_scores),
            "relative_improvement": (np.mean(b_scores) - np.mean(a_scores)) / np.mean(a_scores),
            "p_value": p_value,
            "statistically_significant": p_value < alpha,
            "cohens_d": cohens_d,
            "effect_size": self.interpret_cohens_d(cohens_d),
            # Only meaningful when statistically_significant is True
            "winner": "B" if np.mean(b_scores) > np.mean(a_scores) else "A"
        }

    @staticmethod
    def interpret_cohens_d(d):
        """Interpret Cohen's d effect size."""
        abs_d = abs(d)
        if abs_d < 0.2:
            return "negligible"
        elif abs_d < 0.5:
            return "small"
        elif abs_d < 0.8:
            return "medium"
        else:
            return "large"
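A p-value alone does not say how large the difference between variants might plausibly be. As a complement to the t-test, the sketch below estimates a percentile bootstrap confidence interval for mean(B) minus mean(A) using only the standard library. This is a swapped-in technique, not part of the `ABTest` class; the function name, resample count, and scores are illustrative assumptions.

```python
import random
from statistics import mean

def bootstrap_diff_ci(a_scores, b_scores, n_resamples=2000, confidence=0.95, seed=0):
    """Percentile bootstrap confidence interval for mean(B) - mean(A)."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    diffs = []
    for _ in range(n_resamples):
        # Resample each variant with replacement, keeping its original size
        a_sample = rng.choices(a_scores, k=len(a_scores))
        b_sample = rng.choices(b_scores, k=len(b_scores))
        diffs.append(mean(b_sample) - mean(a_sample))
    diffs.sort()
    tail = (1 - confidence) / 2
    lower = diffs[int(tail * n_resamples)]
    upper = diffs[min(n_resamples - 1, int((1 - tail) * n_resamples))]
    return lower, upper

a = [0.60, 0.62, 0.58, 0.61, 0.59]
b = [0.80, 0.82, 0.78, 0.81, 0.79]
lower, upper = bootstrap_diff_ci(a, b)
```

If the interval excludes zero, the direction of the difference is reasonably well supported; a wide interval is a sign that more test cases are needed.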
Regression Testing
Regression Detection
class RegressionDetector:
    def __init__(self, baseline_results, threshold=0.05):
        self.baseline = baseline_results
        self.threshold = threshold

    def check_for_regression(self, new_results):
        """Detect if new results show regression (assumes higher scores are better)."""
        regressions = []

        for metric in self.baseline.keys():
            baseline_score = self.baseline[metric]
            new_score = new_results.get(metric)

            # Skip metrics that are missing or have a zero baseline
            # (relative change is undefined when the baseline is 0)
            if new_score is None or baseline_score == 0:
                continue

            # Calculate relative change
            relative_change = (new_score - baseline_score) / baseline_score

            # Flag if significant decrease
            if relative_change < -self.threshold:
                regressions.append({
                    "metric": metric,
                    "baseline": baseline_score,
                    "current": new_score,
                    "change": relative_change
                })

        return {
            "has_regression": len(regressions) > 0,
            "regressions": regressions
        }
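The detector above treats any relative decrease as a regression, which only fits metrics where higher is better. For metrics such as toxicity or latency, an increase is the regression. One possible way to handle both directions is sketched below; the metric names, the `LOWER_IS_BETTER` set, and the function names are hypothetical.

```python
# Hypothetical set of metrics where a lower value is the better outcome
LOWER_IS_BETTER = {"toxicity", "latency_ms"}

def relative_regression(baseline, current, higher_is_better=True):
    """Return how much worse `current` is than `baseline`, as a fraction.

    Positive values mean the metric got worse. Returns None when the
    baseline is 0, because the relative change is undefined.
    """
    if baseline == 0:
        return None
    change = (current - baseline) / abs(baseline)
    return -change if higher_is_better else change

def find_regressions(baseline, current, threshold=0.05):
    """Map each regressed metric name to the fraction by which it got worse."""
    flagged = {}
    for name, base_value in baseline.items():
        if name not in current:
            continue
        worse_by = relative_regression(
            base_value, current[name], higher_is_better=name not in LOWER_IS_BETTER
        )
        if worse_by is not None and worse_by > threshold:
            flagged[name] = worse_by
    return flagged

baseline = {"accuracy": 0.80, "toxicity": 0.10}
current = {"accuracy": 0.79, "toxicity": 0.13}
flagged = find_regressions(baseline, current)
# Accuracy fell by about 1%, under the 5% threshold; toxicity rose by about 30%.
```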
Benchmarking
Running Benchmarks
import numpy as np

class BenchmarkRunner:
    def __init__(self, benchmark_dataset):
        self.dataset = benchmark_dataset

    def run_benchmark(self, model, metrics):
        """Run model on benchmark and calculate metrics."""
        results = {metric.name: [] for metric in metrics}

        for example in self.dataset:
            # Generate prediction
            prediction = model.predict(example["input"])

            # Calculate each metric
            for metric in metrics:
                score = metric.calculate(
                    prediction=prediction,
                    reference=example["reference"],
                    context=example.get("context")
                )
                results[metric.name].append(score)

        # Aggregate results
        return {
            metric: {
                "mean": np.mean(scores),
                "std": np.std(scores),
                "min": min(scores),
                "max": max(scores)
            }
            for metric, scores in results.items()
        }
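`BenchmarkRunner` expects each metric to expose a `name` attribute and a `calculate(prediction, reference, context)` method, and each model to expose `predict(input)`. The sketch below shows one way objects satisfying that interface could look; `ExactMatch` and `EchoModel` are hypothetical stand-ins, and the loop mirrors what the runner does per example.

```python
class ExactMatch:
    """Hypothetical metric: 1.0 when prediction equals reference after
    trimming whitespace and lowercasing, else 0.0."""
    name = "exact_match"

    def calculate(self, prediction, reference, context=None):
        return float(prediction.strip().lower() == reference.strip().lower())

class EchoModel:
    """Stand-in model that returns canned answers keyed by input text."""

    def __init__(self, answers):
        self.answers = answers

    def predict(self, text):
        return self.answers.get(text, "")

dataset = [
    {"input": "What is the capital of France?", "reference": "Paris"},
    {"input": "What is 2 + 2?", "reference": "4"},
]
model = EchoModel({
    "What is the capital of France?": " paris ",
    "What is 2 + 2?": "5",
})
metric = ExactMatch()

# Same per-example call pattern that the runner uses
scores = [
    metric.calculate(
        prediction=model.predict(example["input"]),
        reference=example["reference"],
        context=example.get("context"),
    )
    for example in dataset
]
```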
Resources
- references/metrics.md: Comprehensive metric guide
- references/human-evaluation.md: Annotation best practices
- references/benchmarking.md: Standard benchmarks
- references/a-b-testing.md: Statistical testing guide
- references/regression-testing.md: CI/CD integration
- assets/evaluation-framework.py: Complete evaluation harness
- assets/benchmark-dataset.jsonl: Example datasets
- scripts/evaluate-model.py: Automated evaluation runner
Best Practices
- Multiple Metrics: Use diverse metrics for a comprehensive view
- Representative Data: Test on real-world, diverse examples
- Baselines: Always compare against baseline performance
- Statistical Rigor: Use proper statistical tests for comparisons
- Continuous Evaluation: Integrate into CI/CD pipeline
- Human Validation: Combine automated metrics with human judgment
- Error Analysis: Investigate failures to understand weaknesses
- Version Control: Track evaluation results over time
Common Pitfalls
- Single Metric Obsession: Optimizing for one metric at the expense of others
- Small Sample Size: Drawing conclusions from too few examples
- Data Contamination: Testing on training data
- Ignoring Variance: Not accounting for statistical uncertainty
- Metric Mismatch: Using metrics not aligned with business goals
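Of these pitfalls, data contamination is one that a simple script can help screen for when you have access to the training or fine-tuning text. The sketch below flags test examples that share a word n-gram with that text. It is a rough heuristic under that assumption, it will miss paraphrased overlap, and the function names and n-gram size are illustrative choices.

```python
def word_ngrams(text, n):
    """Set of lowercase word n-grams in a text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contamination_rate(test_texts, training_texts, n=5):
    """Fraction of test examples sharing at least one word n-gram
    with the training texts."""
    if not test_texts:
        return 0.0
    train_ngrams = set()
    for text in training_texts:
        train_ngrams |= word_ngrams(text, n)
    flagged = sum(1 for text in test_texts if word_ngrams(text, n) & train_ngrams)
    return flagged / len(test_texts)

train = ["the quick brown fox jumps over the lazy dog near the river"]
test = [
    "a quick brown fox jumps over the fence",
    "what is the capital of france",
]
rate = contamination_rate(test, train, n=4)
# The first test example shares "quick brown fox jumps" with the training
# text and the second shares nothing, so half of the test set is flagged.
```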