Advanced Evaluation
Production-grade techniques for evaluating LLM outputs using LLMs as judges.
Evaluation Taxonomy
Direct Scoring
Single LLM rates one response on a defined scale.
- Best for: Objective criteria (factual accuracy, instruction following)
- Reliability: Moderate to high for well-defined criteria
- Failure mode: Score calibration drift
Pairwise Comparison
LLM compares two responses and selects the better one.
- Best for: Subjective preferences (tone, style, persuasiveness)
- Reliability: Higher than direct scoring for preferences
- Failure mode: Position bias, length bias
The Bias Landscape
| Bias | Description | Mitigation |
|------|-------------|------------|
| Position | First-position responses favored | Swap positions, majority vote |
| Length | Longer = higher rating | Explicit prompting to ignore length |
| Self-Enhancement | Models rate own outputs higher | Use different model for evaluation |
| Verbosity | Detailed explanations favored | Criteria-specific rubrics |
| Authority | Confident tone rated higher | Require evidence citation |
Direct Scoring Implementation
You are an expert evaluator assessing response quality.
Task
Evaluate the following response against each criterion.
Original Prompt
{prompt}
Response to Evaluate
{response}
Criteria
{criteria with descriptions and weights}
Instructions
For each criterion:
- Find specific evidence in the response
- Score according to the rubric (1-{max} scale)
- Justify your score with evidence
- Suggest one specific improvement
Output Format
Respond with structured JSON containing scores, justifications, and summary.
Critical: Always require the justification BEFORE the score. Justification-first ordering improves reliability by 15-25%.
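The template above can be wired up as follows. This is a minimal sketch: the criteria, weights, and JSON schema are illustrative assumptions, and a stubbed judge reply stands in for a real model call.

```python
import json

# Hypothetical criteria for illustration: name -> (description, weight).
CRITERIA = {
    "accuracy": ("Claims are factually correct", 0.6),
    "clarity": ("Explanation is easy to follow", 0.4),
}

def build_scoring_prompt(prompt: str, response: str, max_score: int = 5) -> str:
    """Assemble the direct-scoring prompt from the template above."""
    criteria_text = "\n".join(
        f"- {name} (weight {weight}): {desc}"
        for name, (desc, weight) in CRITERIA.items()
    )
    return (
        "You are an expert evaluator assessing response quality.\n\n"
        f"Original Prompt:\n{prompt}\n\n"
        f"Response to Evaluate:\n{response}\n\n"
        f"Criteria:\n{criteria_text}\n\n"
        "For each criterion: cite evidence, justify, then score "
        f"(1-{max_score}). Respond as JSON: "
        '{"scores": {criterion: {"justification": str, "score": int}}}'
    )

def weighted_score(judge_json: str) -> float:
    """Parse the judge's JSON reply and compute the weighted overall score."""
    parsed = json.loads(judge_json)
    return sum(
        parsed["scores"][name]["score"] * weight
        for name, (_desc, weight) in CRITERIA.items()
    )

# Stubbed judge reply; in practice this comes from the judge model.
# Note the justification precedes the score, matching the guidance above.
fake_reply = json.dumps({"scores": {
    "accuracy": {"justification": "All claims verified.", "score": 5},
    "clarity": {"justification": "Some jargon left unexplained.", "score": 3},
}})
print(weighted_score(fake_reply))
```

Keeping the weights in one place means the prompt and the aggregation can never drift out of sync.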
Pairwise Comparison Implementation
Position Bias Mitigation Protocol:
- First pass: A in first position, B in second
- Second pass: B in first position, A in second
- Consistency check: If passes disagree, return TIE
- Final verdict: Consistent winner with averaged confidence
Critical Instructions
- Do NOT prefer responses because they are longer
- Do NOT prefer responses based on position (first vs second)
- Focus ONLY on quality according to specified criteria
- Ties are acceptable when genuinely equivalent
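The swap-and-check protocol above can be sketched as follows. The judge functions here are deliberately biased stubs standing in for real LLM calls, which makes it easy to see how the consistency check catches position bias.

```python
def pairwise_verdict(response_a, response_b, judge):
    """Run two passes with positions swapped; disagreement yields a tie.

    `judge(first, second)` is assumed to return ("first" or "second",
    confidence in [0, 1]) -- a hypothetical interface, not a fixed API.
    """
    # Pass 1: A in first position, B in second.
    winner1, conf1 = judge(response_a, response_b)
    # Pass 2: B in first position, A in second.
    winner2, conf2 = judge(response_b, response_a)

    # Map positional winners back to the labels A/B.
    pass1 = "A" if winner1 == "first" else "B"
    pass2 = "B" if winner2 == "first" else "A"

    if pass1 != pass2:  # Passes disagree: likely position bias.
        return "TIE", 0.0
    return pass1, (conf1 + conf2) / 2  # Consistent winner, averaged confidence.

# Stub judges for demonstration: one biased toward longer responses,
# one biased toward the first position. Real judges would be LLM calls.
length_judge = lambda first, second: (
    ("first", 0.9) if len(first) > len(second) else ("second", 0.8)
)
position_judge = lambda first, second: ("first", 0.9)

print(pairwise_verdict("short", "a much longer answer", length_judge))
print(pairwise_verdict("x", "y", position_judge))  # ('TIE', 0.0) - bias caught
```

A purely position-biased judge contradicts itself across the two passes and is forced into a tie; a consistently biased judge still produces a consistent verdict, which is why length bias also needs explicit prompting.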
Rubric Generation
Components:
- Level descriptions with clear boundaries
- Observable characteristics for each level
- Examples for each level
- Edge case guidance
- General scoring principles
Strictness levels:
- Lenient: Lower bar, encourages iteration
- Balanced: Typical production use
- Strict: High-stakes or safety-critical
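One way to represent these components in code, as a minimal sketch; the class and field names are illustrative, not taken from any library.

```python
from dataclasses import dataclass

@dataclass
class RubricLevel:
    score: int
    description: str        # Clear boundary for this level
    characteristics: list   # Observable markers a judge can point to
    example: str            # Anchor example for this level

@dataclass
class Rubric:
    criterion: str
    levels: list
    edge_cases: str = ""
    strictness: str = "balanced"  # lenient | balanced | strict

    def passing_score(self) -> int:
        """Map strictness to the minimum acceptable score on a 1-5 scale."""
        return {"lenient": 2, "balanced": 3, "strict": 4}[self.strictness]

level = RubricLevel(
    score=5,
    description="Fully accurate; every claim is supported",
    characteristics=["no factual errors", "claims cite evidence"],
    example="A response that verifies each statistic against its source.",
)
rubric = Rubric("accuracy", levels=[level], strictness="strict")
print(rubric.passing_score())  # 4
```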
Decision Framework
Is there objective ground truth?
├── Yes → Direct Scoring
│         (factual accuracy, instruction following)
└── No → Is it a preference judgment?
    ├── Yes → Pairwise Comparison
    │         (tone, style, persuasiveness)
    └── No → Reference-based evaluation
              (summarization, translation)
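The decision tree reduces to a small routing function; the returned method names are illustrative labels.

```python
def choose_method(has_ground_truth: bool, is_preference: bool) -> str:
    """Route an evaluation task per the decision framework above."""
    if has_ground_truth:
        return "direct_scoring"       # factual accuracy, instruction following
    if is_preference:
        return "pairwise_comparison"  # tone, style, persuasiveness
    return "reference_based"          # summarization, translation

print(choose_method(True, False))   # direct_scoring
print(choose_method(False, True))   # pairwise_comparison
print(choose_method(False, False))  # reference_based
```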
Scaling Evaluation
| Approach | Use Case | Trade-off |
|----------|----------|-----------|
| Panel of LLMs | High-stakes decisions | More expensive, more reliable |
| Hierarchical | Large volumes | Fast screening + careful edge cases |
| Human-in-loop | Critical applications | Best reliability, feedback loop |
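A panel-of-LLMs majority vote can be sketched as follows. The stub judge functions stand in for calls to different models, and the escalation label is a hypothetical placeholder for whatever human-in-loop path exists downstream.

```python
from collections import Counter

def panel_verdict(prompt, response, judges):
    """Majority vote across a panel of judge functions.

    Each judge returns a verdict label; without a strict majority,
    the case is escalated rather than decided by a coin flip.
    """
    votes = [judge(prompt, response) for judge in judges]
    (top, top_n), = Counter(votes).most_common(1)
    if top_n <= len(votes) / 2:       # No strict majority: escalate.
        return "escalate_to_human"
    return top

# Stub judges standing in for three different judge models.
judges = [lambda p, r: "pass", lambda p, r: "pass", lambda p, r: "fail"]
print(panel_verdict("p", "r", judges))  # pass
```

Using an odd-sized panel avoids most ties; even-sized panels route split votes straight to the escalation path, which is usually what a high-stakes deployment wants anyway.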
Guidelines
- Always require justification before scores
- Always swap positions in pairwise comparison
- Match scale granularity to rubric specificity
- Separate objective and subjective criteria
- Include confidence scores calibrated to consistency
- Define edge cases explicitly
- Validate against human judgments