rag-eval

Evaluate your RAG pipeline quality using Ragas metrics (faithfulness, answer relevancy, context precision).

Safety Notice

This listing is from the official public ClawHub registry. Review SKILL.md and referenced scripts before running.

Install skill "rag-eval" with this command: npx skills add jonathanjing/rag-eval

RAG Eval — Quality Testing for Your RAG Pipeline

Test and monitor your RAG pipeline's output quality.

🛠️ Installation

1. Ask OpenClaw (Recommended)

Tell OpenClaw: "Install the rag-eval skill." The agent will handle the installation and configuration automatically.

2. Manual Installation (CLI)

If you prefer the terminal, run:

clawhub install rag-eval

⚠️ Prerequisites

  1. Your OpenClaw setup must already have a RAG system (vector DB + retrieval pipeline). This skill evaluates the output quality of that pipeline; it does not provide RAG functionality itself.
  2. A judge LLM must be configured, because Ragas uses an LLM as a judge internally. Set one of:
    • OPENAI_API_KEY (default, uses GPT-4o)
    • ANTHROPIC_API_KEY (uses Claude Haiku)
    • RAGAS_LLM=ollama/llama3 (for local/offline evaluation)
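
Before running an evaluation, it can help to confirm that one of the three variables above is actually set. The following is a minimal sketch; the helper name `pick_judge_backend` is hypothetical, and the precedence order (RAGAS_LLM first, then OpenAI, then Anthropic) is an assumption, not documented behavior of the skill.

```python
import os


def pick_judge_backend(env=None):
    """Return a label for the judge LLM the environment appears to configure.

    The precedence used here is an assumption for illustration only.
    Returns None when none of the three variables is set.
    """
    env = os.environ if env is None else env
    if env.get("RAGAS_LLM"):
        # e.g. "ollama/llama3" for local/offline evaluation
        return env["RAGAS_LLM"]
    if env.get("OPENAI_API_KEY"):
        return "openai"
    if env.get("ANTHROPIC_API_KEY"):
        return "anthropic"
    return None


if __name__ == "__main__":
    backend = pick_judge_backend()
    if backend is None:
        print("No judge LLM configured: set OPENAI_API_KEY, ANTHROPIC_API_KEY, or RAGAS_LLM")
    else:
        print(f"Judge LLM: {backend}")
```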

Setup (first run only)

bash scripts/setup.sh

This installs ragas, datasets, and other dependencies.

Single Response Evaluation

When the user asks to evaluate an answer, collect:

  1. question — the original user question
  2. answer — the LLM output to evaluate
  3. contexts — list of text chunks used to generate the answer (retrieved docs)

⚠️ SECURITY: Never interpolate user content directly into shell commands. Write the input to a temp JSON file first, then pipe it to the evaluator:

# Step 1: Write input to a temp file (agent should use the write/edit tool, NOT echo)
# Write this JSON to /tmp/rag-eval-input.json using the file write tool:
# {"question": "...", "answer": "...", "contexts": ["chunk1", "chunk2"]}

# Step 2: Pipe the file to the evaluator
python3 scripts/run_eval.py < /tmp/rag-eval-input.json

# Step 3: Clean up
rm -f /tmp/rag-eval-input.json

Alternatively, use --input-file:

python3 scripts/run_eval.py --input-file /tmp/rag-eval-input.json
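
If the evaluator is driven from Python rather than from an agent's file-write tool, the same "no shell interpolation" rule can be followed by serializing the input with `json.dump` and passing an argument list to `subprocess.run`. This is a sketch under assumptions: the helper names are hypothetical, and it assumes `run_eval.py` prints only the result JSON to stdout.

```python
import json
import os
import subprocess
import sys
import tempfile


def write_eval_input(question, answer, contexts):
    """Serialize one sample to a private temp file and return its path."""
    payload = {"question": question, "answer": answer, "contexts": list(contexts)}
    # mkstemp creates the file with owner-only permissions and a unique name
    fd, path = tempfile.mkstemp(prefix="rag-eval-", suffix=".json")
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        json.dump(payload, fh, ensure_ascii=False)
    return path


def build_eval_command(path, script="scripts/run_eval.py"):
    """Build an argument list; no shell is involved, so nothing is interpolated."""
    return [sys.executable, script, "--input-file", path]


def run_eval(question, answer, contexts):
    """Run the evaluator on one sample and always clean up the temp file."""
    path = write_eval_input(question, answer, contexts)
    try:
        proc = subprocess.run(
            build_eval_command(path), capture_output=True, text=True, check=True
        )
        # Assumption: stdout contains only the result JSON
        return json.loads(proc.stdout)
    finally:
        os.remove(path)
```

User text containing quotes, `$(...)`, or semicolons ends up only inside the JSON file, never on a command line.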

Output JSON:

{
  "faithfulness": 0.92,
  "answer_relevancy": 0.87,
  "context_precision": 0.79,
  "overall_score": 0.86,
  "verdict": "PASS",
  "flags": []
}
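
In the sample output above, overall_score matches the unweighted mean of the three metrics rounded to two decimals. The sketch below checks that arithmetic; whether `run_eval.py` actually computes the score this way is an assumption, and `mean_of_metrics` is a hypothetical helper.

```python
import json

METRIC_KEYS = ("faithfulness", "answer_relevancy", "context_precision")


def mean_of_metrics(result):
    """Unweighted mean of the three metric scores, rounded to two decimals."""
    return round(sum(result[k] for k in METRIC_KEYS) / len(METRIC_KEYS), 2)


sample = json.loads(
    '{"faithfulness": 0.92, "answer_relevancy": 0.87, "context_precision": 0.79,'
    ' "overall_score": 0.86, "verdict": "PASS", "flags": []}'
)

# (0.92 + 0.87 + 0.79) / 3 = 0.86, which agrees with the sample overall_score
print(mean_of_metrics(sample), sample["overall_score"])
```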

Post the results to the user with a human-readable summary:

🧪 Eval Results
• Faithfulness: 0.92 ✅ (no hallucination detected)
• Answer Relevancy: 0.87 ✅
• Context Precision: 0.79 ⚠️ (some irrelevant context retrieved)
• Overall: 0.86 — PASS

Save the result to memory/eval-results/YYYY-MM-DD.jsonl.
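
One way to write the dated file is to append each result as a single JSON line. This is a sketch only: `append_result` is a hypothetical helper, and appending (rather than overwriting) is an assumption based on the .jsonl extension.

```python
import json
from datetime import date
from pathlib import Path


def append_result(result, root="memory/eval-results", day=None):
    """Append one eval result as a JSON line to <root>/YYYY-MM-DD.jsonl."""
    day = day or date.today()
    path = Path(root) / f"{day.isoformat()}.jsonl"
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as fh:
        # One compact JSON object per line keeps the file valid JSONL
        fh.write(json.dumps(result, ensure_ascii=False) + "\n")
    return path
```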

Batch Evaluation

For a JSONL dataset file (each line: {"question":..., "answer":..., "contexts":[...]}):

python3 scripts/batch_eval.py --input references/sample_dataset.jsonl --output memory/eval-results/batch-YYYY-MM-DD.json
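
Because each batch run calls the judge LLM, it may be worth checking the dataset's shape first. The following sketch validates the line format described above; `validate_jsonl_lines` is a hypothetical helper and is not part of the skill's scripts.

```python
import json

REQUIRED_KEYS = ("question", "answer", "contexts")


def validate_jsonl_lines(lines):
    """Return (line_number, problem) tuples; an empty list means no problems found."""
    problems = []
    for n, raw in enumerate(lines, start=1):
        if not raw.strip():
            continue  # tolerate blank lines
        try:
            rec = json.loads(raw)
        except json.JSONDecodeError as exc:
            problems.append((n, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(rec, dict):
            problems.append((n, "record is not a JSON object"))
            continue
        for key in REQUIRED_KEYS:
            if key not in rec:
                problems.append((n, f"missing key: {key}"))
        ctx = rec.get("contexts")
        if "contexts" in rec and not (
            isinstance(ctx, list) and all(isinstance(c, str) for c in ctx)
        ):
            problems.append((n, "contexts must be a list of strings"))
    return problems


if __name__ == "__main__":
    with open("references/sample_dataset.jsonl", encoding="utf-8") as fh:
        for line_no, problem in validate_jsonl_lines(fh):
            print(f"line {line_no}: {problem}")
```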

Score Interpretation

Score        Verdict      Meaning
0.85+        ✅ PASS      Production-ready quality
0.70-0.84    ⚠️ REVIEW    Needs improvement
< 0.70       ❌ FAIL      Significant quality issues
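
The thresholds above can be expressed as a small mapping function. This is an illustrative sketch, not the skill's own code: `verdict_for` is a hypothetical name, and treating scores between 0.84 and 0.85 as REVIEW is an assumption, since the listed ranges do not cover that gap.

```python
def verdict_for(score):
    """Map a score in [0, 1] to the verdict labels from the score ranges."""
    if score >= 0.85:
        return "PASS"
    if score >= 0.70:
        # Assumption: anything from 0.70 up to (but not including) 0.85 is REVIEW
        return "REVIEW"
    return "FAIL"


for s in (0.92, 0.79, 0.55):
    print(s, verdict_for(s))
```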

Faithfulness Deep-Dive

If faithfulness < 0.80, run:

python3 scripts/run_eval.py --explain --metric faithfulness

This lists the sentences in the answer that are not supported by the retrieved contexts.
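
The 0.80 trigger can be automated with a small gate on the result JSON. A minimal sketch, assuming the result has the shape shown under "Output JSON"; `explain_command` is a hypothetical helper, and how the input sample reaches the script in explain mode (stdin or a file) is not specified here, so the command is built exactly as written above.

```python
def explain_command(result, threshold=0.80):
    """Return the deep-dive command as an argument list, or None if not needed."""
    if result["faithfulness"] < threshold:
        return ["python3", "scripts/run_eval.py", "--explain", "--metric", "faithfulness"]
    return None


print(explain_command({"faithfulness": 0.74}))
print(explain_command({"faithfulness": 0.92}))
```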

Notes

  • Ragas uses an LLM internally as the judge, via your configured OpenAI or Anthropic key
  • Evaluation costs ~$0.01-0.05 per response depending on length
  • For offline use, set RAGAS_LLM=ollama/llama3 in environment
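
The per-response cost figure above scales linearly with dataset size, which gives a rough budget for a batch run. The sketch below only multiplies the quoted range; `estimate_cost_range` is a hypothetical helper, and actual costs depend on the judge model and response length.

```python
def estimate_cost_range(n_responses, low=0.01, high=0.05):
    """Return (min, max) estimated USD cost for n_responses evaluations."""
    return (round(n_responses * low, 2), round(n_responses * high, 2))


lo_cost, hi_cost = estimate_cost_range(200)
print(f"200 responses: roughly ${lo_cost:.2f} to ${hi_cost:.2f}")
```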

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.
