Advanced Evaluation
Production-grade techniques for evaluating LLM outputs using LLMs as judges.
Evaluation Taxonomy
Direct Scoring
Single LLM rates one response on a defined scale.
- Best for: Objective criteria (factual accuracy, instruction following)
- Reliability: Moderate to high for well-defined criteria
- Failure mode: Score calibration drift
Pairwise Comparison
LLM compares two responses and selects the better one.
- Best for: Subjective preferences (tone, style, persuasiveness)
- Reliability: Higher than direct scoring for preferences
- Failure mode: Position bias, length bias
The Bias Landscape
| Bias | Description | Mitigation |
|------|-------------|------------|
| Position | First-position responses favored | Swap positions, majority vote |
| Length | Longer = higher rating | Explicit prompting to ignore length |
| Self-Enhancement | Models rate own outputs higher | Use different model for evaluation |
| Verbosity | Detailed explanations favored | Criteria-specific rubrics |
| Authority | Confident tone rated higher | Require evidence citation |
Direct Scoring Implementation
You are an expert evaluator assessing response quality.
Task
Evaluate the following response against each criterion.
Original Prompt
{prompt}
Response to Evaluate
{response}
Criteria
{criteria with descriptions and weights}
Instructions
For each criterion:
- Find specific evidence in the response
- Score according to the rubric (1-{max} scale)
- Justify your score with evidence
- Suggest one specific improvement
Output Format
Respond with structured JSON containing scores, justifications, and summary.
Critical: Always require the justification BEFORE the score. Justification-first ordering improves reliability by 15-25%.
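The template above can be wired up as follows. This is a minimal sketch: the criteria, weights, and JSON schema are illustrative assumptions, and a stubbed judge reply stands in for a real model call.

```python
import json

# Hypothetical criteria for illustration: name -> (description, weight).
CRITERIA = {
    "accuracy": ("Claims are factually correct", 0.6),
    "clarity": ("Explanation is easy to follow", 0.4),
}

def build_scoring_prompt(prompt: str, response: str, max_score: int = 5) -> str:
    """Assemble the direct-scoring prompt from the template above."""
    criteria_text = "\n".join(
        f"- {name} (weight {weight}): {desc}"
        for name, (desc, weight) in CRITERIA.items()
    )
    return (
        "You are an expert evaluator assessing response quality.\n\n"
        f"Original Prompt:\n{prompt}\n\n"
        f"Response to Evaluate:\n{response}\n\n"
        f"Criteria:\n{criteria_text}\n\n"
        "For each criterion: cite evidence, justify, then score "
        f"(1-{max_score}). Respond as JSON: "
        '{"scores": {criterion: {"justification": str, "score": int}}}'
    )

def weighted_score(judge_json: str) -> float:
    """Parse the judge's JSON reply and compute the weighted overall score."""
    parsed = json.loads(judge_json)
    return sum(
        parsed["scores"][name]["score"] * weight
        for name, (_desc, weight) in CRITERIA.items()
    )

# Stubbed judge reply; in practice this comes from the judge model.
# Note the justification precedes the score, matching the guidance above.
fake_reply = json.dumps({"scores": {
    "accuracy": {"justification": "All claims verified.", "score": 5},
    "clarity": {"justification": "Some jargon left unexplained.", "score": 3},
}})
print(weighted_score(fake_reply))
```

Keeping the weights in one place means the prompt and the aggregation can never drift out of sync.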
Pairwise Comparison Implementation
Position Bias Mitigation Protocol:
- First pass: A in first position, B in second
- Second pass: B in first position, A in second
- Consistency check: If passes disagree, return TIE
- Final verdict: Consistent winner with averaged confidence
Critical Instructions
- Do NOT prefer responses because they are longer
- Do NOT prefer responses based on position (first vs second)
- Focus ONLY on quality according to specified criteria
- Ties are acceptable when genuinely equivalent
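The swap-and-check protocol above can be sketched as follows. The judge functions here are deliberately biased stubs standing in for real LLM calls, which makes it easy to see how the consistency check catches position bias.

```python
def pairwise_verdict(response_a, response_b, judge):
    """Run two passes with positions swapped; disagreement yields a tie.

    `judge(first, second)` is assumed to return ("first" or "second",
    confidence in [0, 1]) -- a hypothetical interface, not a fixed API.
    """
    # Pass 1: A in first position, B in second.
    winner1, conf1 = judge(response_a, response_b)
    # Pass 2: B in first position, A in second.
    winner2, conf2 = judge(response_b, response_a)

    # Map positional winners back to the labels A/B.
    pass1 = "A" if winner1 == "first" else "B"
    pass2 = "B" if winner2 == "first" else "A"

    if pass1 != pass2:  # Passes disagree: likely position bias.
        return "TIE", 0.0
    return pass1, (conf1 + conf2) / 2  # Consistent winner, averaged confidence.

# Stub judges for demonstration: one biased toward longer responses,
# one biased toward the first position. Real judges would be LLM calls.
length_judge = lambda first, second: (
    ("first", 0.9) if len(first) > len(second) else ("second", 0.8)
)
position_judge = lambda first, second: ("first", 0.9)

print(pairwise_verdict("short", "a much longer answer", length_judge))
print(pairwise_verdict("x", "y", position_judge))  # ('TIE', 0.0) - bias caught
```

A purely position-biased judge contradicts itself across the two passes and is forced into a tie; a consistently biased judge still produces a consistent verdict, which is why length bias also needs explicit prompting.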
Rubric Generation
Components:
- Level descriptions with clear boundaries
- Observable characteristics for each level
- Examples for each level
- Edge case guidance
- General scoring principles
Strictness levels:
- Lenient: Lower bar, encourages iteration
- Balanced: Typical production use
- Strict: High-stakes or safety-critical
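One way to represent these components in code, as a minimal sketch; the class and field names are illustrative, not taken from any library.

```python
from dataclasses import dataclass

@dataclass
class RubricLevel:
    score: int
    description: str        # Clear boundary for this level
    characteristics: list   # Observable markers a judge can point to
    example: str            # Anchor example for this level

@dataclass
class Rubric:
    criterion: str
    levels: list
    edge_cases: str = ""
    strictness: str = "balanced"  # lenient | balanced | strict

    def passing_score(self) -> int:
        """Map strictness to the minimum acceptable score on a 1-5 scale."""
        return {"lenient": 2, "balanced": 3, "strict": 4}[self.strictness]

level = RubricLevel(
    score=5,
    description="Fully accurate; every claim is supported",
    characteristics=["no factual errors", "claims cite evidence"],
    example="A response that verifies each statistic against its source.",
)
rubric = Rubric("accuracy", levels=[level], strictness="strict")
print(rubric.passing_score())  # 4
```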
Decision Framework
Is there objective ground truth?
├── Yes → Direct Scoring
│         (factual accuracy, instruction following)
└── No → Is it a preference judgment?
    ├── Yes → Pairwise Comparison
    │         (tone, style, persuasiveness)
    └── No → Reference-based evaluation
              (summarization, translation)
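The decision tree reduces to a small routing function; the returned method names are illustrative labels.

```python
def choose_method(has_ground_truth: bool, is_preference: bool) -> str:
    """Route an evaluation task per the decision framework above."""
    if has_ground_truth:
        return "direct_scoring"       # factual accuracy, instruction following
    if is_preference:
        return "pairwise_comparison"  # tone, style, persuasiveness
    return "reference_based"          # summarization, translation

print(choose_method(True, False))   # direct_scoring
print(choose_method(False, True))   # pairwise_comparison
print(choose_method(False, False))  # reference_based
```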
Scaling Evaluation
| Approach | Use Case | Trade-off |
|----------|----------|-----------|
| Panel of LLMs | High-stakes decisions | More expensive, more reliable |
| Hierarchical | Large volumes | Fast screening + careful edge cases |
| Human-in-loop | Critical applications | Best reliability, feedback loop |
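A panel-of-LLMs majority vote can be sketched as follows. The stub judge functions stand in for calls to different models, and the escalation label is a hypothetical placeholder for whatever human-in-loop path exists downstream.

```python
from collections import Counter

def panel_verdict(prompt, response, judges):
    """Majority vote across a panel of judge functions.

    Each judge returns a verdict label; without a strict majority,
    the case is escalated rather than decided by a coin flip.
    """
    votes = [judge(prompt, response) for judge in judges]
    (top, top_n), = Counter(votes).most_common(1)
    if top_n <= len(votes) / 2:       # No strict majority: escalate.
        return "escalate_to_human"
    return top

# Stub judges standing in for three different judge models.
judges = [lambda p, r: "pass", lambda p, r: "pass", lambda p, r: "fail"]
print(panel_verdict("p", "r", judges))  # pass
```

Using an odd-sized panel avoids most ties; even-sized panels route split votes straight to the escalation path, which is usually what a high-stakes deployment wants anyway.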
Guidelines
- Always require justification before scores
- Always swap positions in pairwise comparison
- Match scale granularity to rubric specificity
- Separate objective and subjective criteria
- Include confidence scores calibrated to consistency
- Define edge cases explicitly
- Validate against human judgments