rag-eval

Evaluate your RAG pipeline quality using Ragas metrics (faithfulness, answer relevancy, context precision).


Install skill "rag-eval" with this command: npx skills add JonathanJing/rag-eval

RAG Eval — Quality Testing for Your RAG Pipeline

Test and monitor your RAG pipeline's output quality.

🛠️ Installation

1. Ask OpenClaw (Recommended)

Tell OpenClaw: "Install the rag-eval skill." The agent will handle the installation and configuration automatically.

2. Manual Installation (CLI)

If you prefer the terminal, run:

clawhub install rag-eval

⚠️ Prerequisites

  1. Your OpenClaw must have a RAG system (vector DB + retrieval pipeline). This skill evaluates the output quality of that pipeline — it does not provide RAG functionality itself.
  2. At least one LLM API key is required — Ragas uses an LLM as judge internally. Set one of:
    • OPENAI_API_KEY (default, uses GPT-4o)
    • ANTHROPIC_API_KEY (uses Claude Haiku)
    • RAGAS_LLM=ollama/llama3 (for local/offline evaluation)

Setup (first run only)

bash scripts/setup.sh

This installs ragas, datasets, and other dependencies.

Single Response Evaluation

When the user asks to evaluate an answer, collect:

  1. question — the original user question
  2. answer — the LLM output to evaluate
  3. contexts — list of text chunks used to generate the answer (retrieved docs)

⚠️ SECURITY: Never interpolate user content directly into shell commands. Write the input to a temp JSON file first, then pipe it to the evaluator:

# Step 1: Write input to a temp file (agent should use the write/edit tool, NOT echo)
# Write this JSON to /tmp/rag-eval-input.json using the file write tool:
# {"question": "...", "answer": "...", "contexts": ["chunk1", "chunk2"]}

# Step 2: Pipe the file to the evaluator
python3 scripts/run_eval.py < /tmp/rag-eval-input.json

# Step 3: Clean up
rm -f /tmp/rag-eval-input.json

Alternatively, use --input-file:

python3 scripts/run_eval.py --input-file /tmp/rag-eval-input.json
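The write-then-pipe step above can be sketched in Python, which sidesteps shell quoting entirely because json.dump handles all escaping. The question, answer, and contexts below are placeholder values:

```python
import json

# Build the evaluator input as a Python dict so user content never
# passes through a shell; json.dump handles all quoting/escaping.
record = {
    "question": "What is the capital of France?",
    "answer": "Paris is the capital of France.",
    "contexts": [
        "France's capital city is Paris.",
        "Paris is home to the Louvre.",
    ],
}

# Step 1 of the flow above: write the input to the temp file.
with open("/tmp/rag-eval-input.json", "w", encoding="utf-8") as f:
    json.dump(record, f, ensure_ascii=False)
```

The file can then be piped to run_eval.py (or passed via --input-file) and removed afterwards, exactly as in the shell steps above.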

Output JSON:

{
  "faithfulness": 0.92,
  "answer_relevancy": 0.87,
  "context_precision": 0.79,
  "overall_score": 0.86,
  "verdict": "PASS",
  "flags": []
}

Post the results to the user with a human-readable summary:

🧪 Eval Results
• Faithfulness: 0.92 ✅ (no hallucination detected)
• Answer Relevancy: 0.87 ✅
• Context Precision: 0.79 ⚠️ (some irrelevant context retrieved)
• Overall: 0.86 — PASS

Save to memory/eval-results/YYYY-MM-DD.jsonl.
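The save step can be sketched as a small append-to-JSONL helper. The function name save_result is hypothetical, not part of the skill's scripts; only the memory/eval-results/YYYY-MM-DD.jsonl layout comes from the instruction above:

```python
import json
from datetime import date
from pathlib import Path

def save_result(result: dict, base_dir: str = "memory/eval-results") -> Path:
    """Append one eval result as a JSON line to today's dated file."""
    out_dir = Path(base_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{date.today().isoformat()}.jsonl"  # YYYY-MM-DD.jsonl
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(result) + "\n")
    return path
```

Appending (mode "a") keeps one line per evaluation, so a day's file accumulates a replayable history of runs.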

Batch Evaluation

For a JSONL dataset file (each line: {"question":..., "answer":..., "contexts":[...]}):

python3 scripts/batch_eval.py --input references/sample_dataset.jsonl --output memory/eval-results/batch-YYYY-MM-DD.json
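A dataset in the expected shape can be produced with a short script. The rows and the /tmp path below are illustrative placeholders; each line must be one standalone JSON object with the three fields the evaluator expects:

```python
import json

# Illustrative rows in the required schema: question, answer, contexts.
rows = [
    {"question": "Q1?", "answer": "A1.", "contexts": ["chunk a"]},
    {"question": "Q2?", "answer": "A2.", "contexts": ["chunk b", "chunk c"]},
]

dataset_path = "/tmp/rag-eval-dataset.jsonl"  # placeholder path
with open(dataset_path, "w", encoding="utf-8") as f:
    for row in rows:
        # JSONL: one compact JSON object per line, newline-terminated.
        f.write(json.dumps(row, ensure_ascii=False) + "\n")
```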

Score Interpretation

Score      | Verdict   | Meaning
0.85+      | ✅ PASS   | Production-ready quality
0.70-0.84  | ⚠️ REVIEW | Needs improvement
< 0.70     | ❌ FAIL   | Significant quality issues
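In the single-response sample output earlier, 0.86 is the mean of 0.92, 0.87, and 0.79, so the overall score appears to be a simple average; that averaging is an assumption inferred from the sample, not documented behavior. Under that assumption, the verdict bands can be sketched as:

```python
def verdict(faithfulness: float, answer_relevancy: float,
            context_precision: float) -> tuple:
    """Combine the three Ragas metrics into (overall_score, verdict).

    ASSUMPTION: overall_score is the plain mean of the three metrics,
    inferred from the sample output; the bands match the table above.
    """
    overall = round((faithfulness + answer_relevancy + context_precision) / 3, 2)
    if overall >= 0.85:
        return overall, "PASS"
    if overall >= 0.70:
        return overall, "REVIEW"
    return overall, "FAIL"
```

With the sample metrics, verdict(0.92, 0.87, 0.79) yields (0.86, "PASS"), matching the output JSON shown earlier.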

Faithfulness Deep-Dive

If faithfulness < 0.80, run:

python3 scripts/run_eval.py --explain --metric faithfulness

This outputs which sentences in the answer are NOT supported by context.

Notes

  • Ragas uses an LLM internally as the judge (via your configured OpenAI/Anthropic key)
  • Evaluation costs ~$0.01-0.05 per response depending on length
  • For offline use, set RAGAS_LLM=ollama/llama3 in environment
