advanced-evaluation

Master LLM-as-a-Judge evaluation techniques including direct scoring, pairwise comparison, rubric generation, and bias mitigation. Use when building evaluation systems, comparing model outputs, or establishing quality standards for AI-generated content.


Install skill "advanced-evaluation" with this command: npx skills add shipshitdev/library/shipshitdev-library-advanced-evaluation

Advanced Evaluation

LLM-as-a-Judge techniques for evaluating AI outputs. This is not a single technique but a family of approaches; choosing the right one and mitigating its biases is the core competency.

When to Activate

  • Building automated evaluation pipelines for LLM outputs
  • Comparing multiple model responses to select the best one
  • Establishing consistent quality standards
  • Debugging inconsistent evaluation results
  • Designing A/B tests for prompt or model changes
  • Creating rubrics for human or automated evaluation

Core Concepts

Evaluation Taxonomy

Direct Scoring: Single LLM rates one response on a defined scale.

  • Best for: Objective criteria (factual accuracy, instruction following, toxicity)
  • Reliability: Moderate to high for well-defined criteria

Pairwise Comparison: LLM compares two responses and selects the better one.

  • Best for: Subjective preferences (tone, style, persuasiveness)
  • Reliability: Higher than direct scoring for preferences

Known Biases

| Bias | Description | Mitigation |
| --- | --- | --- |
| Position | First-position preference | Swap positions, check consistency |
| Length | Longer responses score higher | Explicit prompting, length-normalized scoring |
| Self-Enhancement | Models rate their own outputs higher | Use a different model for evaluation |
| Verbosity | Unnecessary detail rated higher | Criteria-specific rubrics |
| Authority | Confident tone rated higher | Require evidence citation |
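Length-normalized scoring can be as simple as discounting raw judge scores for responses that run far past a reference length. The function below is a minimal sketch; the reference length and penalty weight are illustrative assumptions, not values prescribed by this skill.

```python
def length_normalized_score(raw_score: float, length_tokens: int,
                            reference_length: int = 200,
                            penalty_per_100_tokens: float = 0.1) -> float:
    """Discount a 1-5 judge score for responses much longer than a reference.

    Hypothetical mitigation for length bias: subtract a small penalty per
    100 tokens of excess length, floored at the bottom of the scale.
    """
    excess = max(0, length_tokens - reference_length)
    penalty = (excess / 100) * penalty_per_100_tokens
    return max(1.0, raw_score - penalty)
```

For example, a 500-token response rated 4.5 would be normalized to 4.2 under these assumed parameters, while a response at or under the reference length keeps its raw score.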

Decision Framework

Is there an objective ground truth?
├── Yes → Direct Scoring (factual accuracy, format compliance)
└── No → Pairwise Comparison (tone, style, creativity)
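The decision tree above can be encoded directly. In this sketch the set of objective criteria is drawn from the examples in this skill and is an assumption; extend it for your own pipeline.

```python
# Criteria with an objective ground truth (examples from this skill).
OBJECTIVE_CRITERIA = {
    "factual_accuracy",
    "format_compliance",
    "instruction_following",
    "toxicity",
}

def choose_method(criterion: str) -> str:
    """Objective ground truth -> direct scoring; otherwise pairwise comparison."""
    return "direct_scoring" if criterion in OBJECTIVE_CRITERIA else "pairwise_comparison"
```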

Quick Reference

Direct Scoring Requirements

  1. Clear criteria definitions
  2. Calibrated scale (1-5 recommended)
  3. Chain-of-thought: justification BEFORE score (improves reliability 15-25%)
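The three requirements above can be combined into a single prompt template plus a strict parser. The template text and the `Score: N` output convention below are illustrative assumptions; the key structural point from this skill is that the justification is requested before the score.

```python
import re

# Hypothetical direct-scoring template: defined criterion, calibrated 1-5
# scale, and chain-of-thought justification BEFORE the score.
SCORING_PROMPT = """\
Evaluate the response below against this criterion: {criterion}.

Scale: 1 (fails the criterion) to 5 (fully satisfies the criterion).

Response:
{response}

First write a brief justification, then give the score on its own
final line in the form "Score: N".
"""

def parse_score(judge_output: str) -> int:
    """Extract the trailing 'Score: N' line; reject malformed outputs."""
    match = re.search(r"Score:\s*([1-5])\s*$", judge_output.strip())
    if match is None:
        raise ValueError("no score found in judge output")
    return int(match.group(1))
```

Requiring the score on the final line keeps parsing reliable while still eliciting the justification first.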

Pairwise Comparison Protocol

  1. First pass: A in first position
  2. Second pass: B in first position (swap)
  3. Consistency check: If passes disagree → TIE
  4. Final verdict: Consistent winner with averaged confidence
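The four-step protocol above can be sketched as a wrapper around any judge callable. Here `judge(x, y)` is assumed to return a positional verdict (`"first"`, `"second"`, or `"tie"`) plus a confidence; the wrapper maps verdicts back to the original labels, ties on disagreement, and averages confidence on agreement.

```python
from typing import Callable, Tuple

Judge = Callable[[str, str], Tuple[str, float]]

def pairwise_verdict(judge: Judge, a: str, b: str) -> Tuple[str, float]:
    """Position-swapped pairwise comparison with a consistency check."""
    v1, c1 = judge(a, b)  # pass 1: A in first position
    v2, c2 = judge(b, a)  # pass 2: B in first position (swapped)
    # Map positional verdicts back to the original A/B labels.
    w1 = {"first": "A", "second": "B", "tie": "TIE"}[v1]
    w2 = {"first": "B", "second": "A", "tie": "TIE"}[v2]
    if w1 == w2 and w1 != "TIE":
        return w1, (c1 + c2) / 2  # consistent winner, averaged confidence
    return "TIE", min(c1, c2)     # passes disagree -> TIE
```

A position-biased judge that always prefers the first slot will contradict itself across the two passes and be forced to a TIE, which is exactly the mitigation listed for position bias.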

Rubric Components

  • Level descriptions with clear boundaries
  • Observable characteristics per level
  • Edge case guidance
  • Strictness calibration (lenient/balanced/strict)

Integration

Works with:

  • context-fundamentals - Effective context structure
  • tool-design - Evaluation tool schemas
  • evaluation (foundational) - Core evaluation concepts

For detailed implementation patterns, prompt templates, examples, and metrics: references/full-guide.md

See also: references/implementation-patterns.md, references/bias-mitigation.md, references/metrics-guide.md

