llm-evaluation

Evaluate and validate LLM outputs for quality assurance using RAGAS and LLM-as-judge patterns.

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.


Install skill "llm-evaluation" with this command: `npx skills add yonatangross/orchestkit/yonatangross-orchestkit-llm-evaluation`

LLM Evaluation


Quick Reference

LLM-as-Judge Pattern

```python
async def evaluate_quality(input_text: str, output_text: str, dimension: str) -> float:
    response = await llm.chat([{
        "role": "user",
        "content": f"""Evaluate for {dimension}. Score 1-10.
Input: {input_text[:500]}
Output: {output_text[:1000]}
Respond with just the number."""
    }])
    return int(response.content.strip()) / 10
```

Quality Gate

```python
QUALITY_THRESHOLD = 0.7

async def quality_gate(state: dict) -> dict:
    scores = await full_quality_assessment(state["input"], state["output"])
    passed = scores["average"] >= QUALITY_THRESHOLD
    return {**state, "quality_passed": passed}
```
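The gate depends on a `full_quality_assessment` helper that is not shown. A minimal runnable sketch, in which the dimension names and the stubbed `judge_score` are assumptions for illustration rather than the skill's actual code:

```python
import asyncio

# Hypothetical backing for the quality gate: score each dimension with a
# judge call, then attach the average used by the gate.
DIMENSIONS = ["relevance", "coherence", "completeness"]

async def judge_score(input_text: str, output_text: str, dimension: str) -> float:
    # Placeholder: a real implementation would prompt a separate judge model.
    return 0.8

async def full_quality_assessment(input_text: str, output_text: str) -> dict:
    scores = {d: await judge_score(input_text, output_text, d) for d in DIMENSIONS}
    scores["average"] = sum(scores[d] for d in DIMENSIONS) / len(DIMENSIONS)
    return scores
```

Keeping the per-dimension scores alongside the average lets the gate report *why* an output failed, not just that it failed.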

Hallucination Detection

```python
async def detect_hallucination(context: str, output: str) -> dict:
    # Stub: check whether the output makes claims not supported by the context
    return {"has_hallucinations": False, "unsupported_claims": []}
```
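One cheap, dependency-free way to approximate this check is lexical grounding: flag sentences whose content words rarely appear in the retrieval context. This is an illustrative heuristic, not the skill's actual detector, and it will miss paraphrased hallucinations:

```python
import re

# Heuristic grounding check: a sentence is flagged as unsupported when
# fewer than half of its content words (length > 3) occur in the context.
def keyword_grounding_check(context: str, output: str, min_overlap: float = 0.5) -> dict:
    context_words = set(re.findall(r"\w+", context.lower()))
    unsupported = []
    for sentence in re.split(r"(?<=[.!?])\s+", output.strip()):
        words = [w for w in re.findall(r"\w+", sentence.lower()) if len(w) > 3]
        if not words:
            continue
        overlap = sum(w in context_words for w in words) / len(words)
        if overlap < min_overlap:
            unsupported.append(sentence)
    return {"has_hallucinations": bool(unsupported), "unsupported_claims": unsupported}
```

In practice this is a pre-filter; sentences it flags can then be sent to an LLM judge for a more expensive claim-level verification.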

RAGAS Metrics

| Metric | Use Case | Threshold |
|---|---|---|
| Faithfulness | RAG grounding | ≥ 0.8 |
| Answer Relevancy | Q&A systems | ≥ 0.7 |
| Context Precision | Retrieval quality | ≥ 0.7 |
| Context Recall | Retrieval completeness | ≥ 0.7 |
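Applying these thresholds to a metric report is mechanical; a minimal sketch of such a gate, where the metric key names are assumptions chosen to match the table rows:

```python
# Per-metric thresholds from the table above.
THRESHOLDS = {
    "faithfulness": 0.8,
    "answer_relevancy": 0.7,
    "context_precision": 0.7,
    "context_recall": 0.7,
}

def check_rag_metrics(scores: dict) -> dict:
    # Collect every metric that falls below its threshold.
    failures = {m: s for m, s in scores.items()
                if m in THRESHOLDS and s < THRESHOLDS[m]}
    return {"passed": not failures, "failures": failures}
```

Returning the failing metrics (not just a boolean) makes the gate's decision auditable in evaluation reports.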

Anti-Patterns (FORBIDDEN)

❌ NEVER use the same model as both judge and evaluated model

```python
output = await gpt4.complete(prompt)
score = await gpt4.evaluate(output)  # Same model!
```

❌ NEVER rely on a single dimension

```python
if relevance_score > 0.7:  # Only checking one thing
    return "pass"
```

❌ NEVER set threshold too high

```python
THRESHOLD = 0.95  # Blocks most content
```

✅ ALWAYS use a different judge model

```python
score = await gpt4_mini.evaluate(claude_output)
```

✅ ALWAYS use multiple dimensions

```python
scores = await evaluate_all_dimensions(output)
if scores["average"] > 0.7:
    return "pass"
```

Key Decisions

| Decision | Recommendation |
|---|---|
| Judge model | GPT-5.2-mini or Claude Haiku 4.5 |
| Threshold | 0.7 for production, 0.6 for drafts |
| Dimensions | 3-5 most relevant to use case |
| Sample size | 50+ for reliable metrics |
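The sample-size recommendation follows from the standard error of the mean, which shrinks with √n: small eval sets give noisy pass/fail decisions. A small stdlib helper makes the noise visible:

```python
import statistics

# Mean score plus its standard error; with n >= 50 the standard error of
# a bounded 0-1 score is typically small enough for stable gating.
def mean_with_stderr(scores: list[float]) -> tuple[float, float]:
    mean = statistics.fmean(scores)
    stderr = statistics.stdev(scores) / len(scores) ** 0.5
    return mean, stderr
```

If the reported standard error is larger than the gap between your mean and your threshold, the gate's verdict is effectively a coin flip and more samples are needed.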

Detailed Documentation

| Resource | Description |
|---|---|
| references/evaluation-metrics.md | RAGAS & LLM-as-judge metrics |
| examples/evaluation-patterns.md | Complete evaluation examples |
| checklists/evaluation-checklist.md | Setup and review checklists |
| scripts/evaluator-template.py | Starter evaluation template |

Related Skills

  • quality-gates - Workflow quality control

  • langfuse-observability - Tracking evaluation scores

  • agent-loops - Self-correcting with evaluation

Capability Details

llm-as-judge

Keywords: LLM judge, judge model, evaluation model, grader LLM

Solves:

  • Use LLM to evaluate other LLM outputs

  • Implement judge prompts for quality

  • Configure evaluation criteria

ragas-metrics

Keywords: RAGAS, faithfulness, answer relevancy, context precision

Solves:

  • Evaluate RAG with RAGAS metrics

  • Measure faithfulness and relevancy

  • Assess context precision and recall

hallucination-detection

Keywords: hallucination, factuality, grounded, verify facts

Solves:

  • Detect hallucinations in LLM output

  • Verify factual accuracy

  • Implement grounding checks

quality-gates

Keywords: quality gate, threshold, pass/fail, evaluation gate

Solves:

  • Implement quality thresholds

  • Block low-quality outputs

  • Configure multi-metric gates

batch-evaluation

Keywords: batch eval, dataset evaluation, bulk scoring, eval suite

Solves:

  • Evaluate over golden datasets

  • Run batch evaluation pipelines

  • Generate evaluation reports
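A batch pipeline of this shape can be sketched with `asyncio.gather` over a golden dataset. `evaluate_fn`, the dataset keys, and the 0.7 pass threshold are assumptions; `evaluate_fn` stands in for any async scorer such as an LLM-as-judge call:

```python
import asyncio

# Hypothetical batch runner: score every row of a golden dataset
# concurrently and summarize into a small report.
async def run_batch_eval(dataset: list[dict], evaluate_fn) -> dict:
    scores = await asyncio.gather(
        *(evaluate_fn(row["input"], row["expected"]) for row in dataset)
    )
    passed = sum(s >= 0.7 for s in scores)
    return {
        "n": len(scores),
        "mean": sum(scores) / len(scores) if scores else 0.0,
        "pass_rate": passed / len(scores) if scores else 0.0,
    }
```

When the scorer is a real API call, wrap it in an `asyncio.Semaphore` to cap concurrency and avoid rate limits.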

pairwise-comparison

Keywords: pairwise, A/B comparison, side-by-side, preference

Solves:

  • Compare two model outputs

  • Implement preference ranking

  • Run A/B evaluations
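Pairwise judges are sensitive to position bias, so a common pattern is to query both orderings and only declare a winner when they agree. In this sketch `judge` is a stand-in for the real judge-model call, assumed to answer "first" or "second":

```python
import asyncio

# Position-bias-aware pairwise comparison: ask the judge twice with the
# candidates swapped, and call it a tie when the verdicts conflict.
async def pairwise_compare(prompt: str, out_a: str, out_b: str, judge) -> str:
    first = await judge(prompt, out_a, out_b)   # A shown first
    second = await judge(prompt, out_b, out_a)  # order swapped
    a_wins = (first == "first") + (second == "second")
    b_wins = (first == "second") + (second == "first")
    if a_wins > b_wins:
        return "A"
    if b_wins > a_wins:
        return "B"
    return "tie"
```

Treating disagreement between the two orderings as a tie is a conservative choice; an alternative is to repeat the query and take a majority vote.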

