LLM Evaluator

LLM-as-a-Judge evaluation system with Langfuse integration

Safety Notice

This listing is from the official public ClawHub registry. Review SKILL.md and referenced scripts before running.


Install skill "LLM Evaluator" with this command: npx skills add aiwithabidi/agxntsix-llm-evaluator

LLM Evaluator ⚖️

LLM-as-a-Judge evaluation system using Langfuse. Score AI outputs on relevance, accuracy, hallucination, and helpfulness. Backfill scoring on historical traces. Uses GPT-5-nano for cost-efficient judging.

Usage

# Test with sample cases
python3 scripts/evaluator.py test

# Score a specific Langfuse trace
python3 scripts/evaluator.py score <trace_id>

# Score with a single evaluator
python3 scripts/evaluator.py score <trace_id> --evaluators relevance

# Backfill scores on recent unscored traces
python3 scripts/evaluator.py backfill --limit 20
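
The `score` command's core loop can be sketched as: ask the judge model via OpenRouter, then attach the resulting number to the trace through Langfuse's public scores endpoint. This is a minimal sketch, not the skill's actual implementation — the model slug `openai/gpt-5-nano` and the function names are assumptions; the endpoints shown are the standard OpenRouter chat-completions and Langfuse `POST /api/public/scores` APIs.

```python
import os
import requests

OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"

def judge(prompt: str, model: str = "openai/gpt-5-nano") -> str:
    """Ask the judge model via OpenRouter; return the raw completion text.
    The model slug is an assumption based on the skill description."""
    resp = requests.post(
        OPENROUTER_URL,
        headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
        json={"model": model, "messages": [{"role": "user", "content": prompt}]},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

def score_payload(trace_id: str, name: str, value: float) -> dict:
    """JSON body for POST {LANGFUSE_HOST}/api/public/scores."""
    return {"traceId": trace_id, "name": name, "value": value}

def post_score(trace_id: str, name: str, value: float) -> None:
    """Attach one numeric score to an existing Langfuse trace,
    authenticating with the public/secret key pair via basic auth."""
    resp = requests.post(
        f"{os.environ['LANGFUSE_HOST']}/api/public/scores",
        auth=(os.environ["LANGFUSE_PUBLIC_KEY"], os.environ["LANGFUSE_SECRET_KEY"]),
        json=score_payload(trace_id, name, value),
        timeout=30,
    )
    resp.raise_for_status()
```

Backfill is then just this pair of calls repeated over recent traces that have no scores yet.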

Evaluators

  • relevance (0-1) — How relevant is the response to the query?
  • accuracy (0-1) — Is the response factually correct?
  • hallucination (0-1) — Does the response contain fabricated information?
  • helpfulness (0-1) — How useful is the response?
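
A judge with these four dimensions can be sketched as a rubric of prompts plus a score parser. The prompt wording and helper names below are illustrative assumptions, not the skill's actual prompts:

```python
import re

# Hypothetical rubric wording for the four evaluators; illustrative only.
RUBRICS = {
    "relevance": "Rate 0-1 how relevant the response is to the query.",
    "accuracy": "Rate 0-1 how factually correct the response is.",
    "hallucination": "Rate 0-1 how much fabricated information the response contains.",
    "helpfulness": "Rate 0-1 how useful the response is.",
}

def build_judge_prompt(evaluator: str, query: str, response: str) -> str:
    """Compose the instruction the judge model receives for one evaluator."""
    return (
        f"{RUBRICS[evaluator]}\n\n"
        f"Query:\n{query}\n\nResponse:\n{response}\n\n"
        "Reply with a single number between 0 and 1."
    )

def parse_score(raw: str) -> float:
    """Extract the first number from the judge's reply, clamped to [0, 1]."""
    match = re.search(r"\d*\.?\d+", raw)
    if match is None:
        raise ValueError(f"no score found in judge output: {raw!r}")
    return min(1.0, max(0.0, float(match.group())))
```

Clamping matters in practice: judge models occasionally reply with prose around the number or a value slightly outside the range.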

Requirements

  • OPENROUTER_API_KEY environment variable (for GPT-5-nano judge)
  • LANGFUSE_PUBLIC_KEY and LANGFUSE_SECRET_KEY environment variables
  • LANGFUSE_HOST — your Langfuse instance URL
  • Python 3.10+
  • langfuse, requests packages
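
Since every command depends on the four variables above, a fail-fast check at startup is a reasonable pattern. A minimal sketch, using only the variable names listed in this section:

```python
import os

# Variable names taken from the Requirements list above.
REQUIRED_VARS = (
    "OPENROUTER_API_KEY",
    "LANGFUSE_PUBLIC_KEY",
    "LANGFUSE_SECRET_KEY",
    "LANGFUSE_HOST",
)

def missing_vars(env=os.environ) -> list:
    """Return the required variable names that are unset or empty."""
    return [name for name in REQUIRED_VARS if not env.get(name)]
```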

Credits

Built by AgxntSix — AI ops agent by M. Abidi 🌐 agxntsix.ai | Part of the AgxntSix Skill Suite for OpenClaw agents

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.

Related Skills

Related by shared tags or category signals.

General

Novel Chapter Review

Chapter-by-chapter quality assessment system for completed novels. Audits each chapter for payoff density, narrative quality, setting consistency, and character plausibility, outputting a score plus an issue list and revision suggestions. Trigger scenarios: manuscript review, revision, quality retrospectives.

General

Multi-Skill-Eval | Integrated Skill Evaluation System

Integrated multi-method skill evaluation system. Combines static analysis (skill-assessment), rubric quality scoring (skill-evaluator), and autonomous benchmarking (skill-eval). Used to comprehensively evaluate, compare, audit, or improve OpenClaw skills. Covers documentation completeness, code quality, 25-item rubric scoring, and multi-model benchmarking. Trigger words (Chinese): 评估技...

General

LLM Evaluator Pro

LLM-as-a-Judge evaluator via Langfuse. Scores traces on relevance, accuracy, hallucination, and helpfulness using GPT-5-nano as judge. Supports single trace...

Automation

Score Agent Response Quality

Score an AI agent response 0-100 across 6 quality dimensions (depth, recommendations, citations, formatting, trust, monetization-readiness) with improvement...
