LLM Evaluator ⚖️
An LLM-as-a-Judge evaluation system built on Langfuse. Scores AI outputs on relevance, accuracy, hallucination, and helpfulness, and can backfill scores onto historical traces. Uses GPT-5-nano as the judge model for cost-efficient evaluation.
Usage
# Test with sample cases
python3 scripts/evaluator.py test
# Score a specific Langfuse trace
python3 scripts/evaluator.py score <trace_id>
# Score with a single evaluator
python3 scripts/evaluator.py score <trace_id> --evaluators relevance
# Backfill scores on recent unscored traces
python3 scripts/evaluator.py backfill --limit 20
Evaluators
- relevance (0-1) — How relevant is the response to the query?
- accuracy (0-1) — Is the response factually correct?
- hallucination (0-1) — Does the response contain fabricated information?
- helpfulness (0-1) — How useful is the response?
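Each evaluator asks the judge model one of the questions above and maps its reply onto a 0-1 score. A sketch of that mapping, assuming the judge replies with a free-text answer containing a numeric score (the prompt wording in `scripts/evaluator.py` may differ):

```python
import re

# Judge rubrics, taken from the evaluator list above.
RUBRICS = {
    "relevance": "How relevant is the response to the query?",
    "accuracy": "Is the response factually correct?",
    "hallucination": "Does the response contain fabricated information?",
    "helpfulness": "How useful is the response?",
}


def parse_score(reply: str) -> float:
    """Extract the first number from a judge reply and clamp it to [0, 1]."""
    match = re.search(r"\d+(?:\.\d+)?", reply)
    if match is None:
        raise ValueError(f"no score found in judge reply: {reply!r}")
    return min(1.0, max(0.0, float(match.group())))
```

Clamping guards against a judge that answers outside the 0-1 scale (e.g. "8/10" or "1.5").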
Requirements
- `OPENROUTER_API_KEY` environment variable (for the GPT-5-nano judge)
- `LANGFUSE_PUBLIC_KEY` and `LANGFUSE_SECRET_KEY` environment variables
- `LANGFUSE_HOST` — your Langfuse instance URL
- Python 3.10+
- `langfuse` and `requests` packages
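A quick preflight check for the required environment variables can save a failed run. A minimal sketch (the variable names come from the list above; the `missing_env` helper itself is hypothetical, not part of `evaluator.py`):

```python
import os

REQUIRED_VARS = (
    "OPENROUTER_API_KEY",
    "LANGFUSE_PUBLIC_KEY",
    "LANGFUSE_SECRET_KEY",
    "LANGFUSE_HOST",
)


def missing_env(env=None) -> list[str]:
    """Return the required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

Calling `missing_env()` before scoring lets the tool fail fast with a list of what is missing instead of a mid-run API error.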
Credits
Built by AgxntSix — AI ops agent by M. Abidi 🌐 agxntsix.ai | Part of the AgxntSix Skill Suite for OpenClaw agents