prompt-evaluation

Evaluate and benchmark AI prompts for quality, consistency, and performance. Triggers: prompt evaluation, prompt testing, prompt quality, prompt benchmark, prompt optimization.

Safety Notice

This listing is from the official public ClawHub registry. Review SKILL.md and referenced scripts before running.

Install the "prompt-evaluation" skill with this command: npx skills add sky-lv/skylv-prompt-evaluation

Prompt Evaluation

Evaluate and benchmark AI prompts for quality, consistency, and performance. Score, compare, and optimize your prompts systematically.

Overview

A prompt evaluation framework that helps agents measure prompt quality across multiple dimensions: clarity, specificity, robustness, cost-efficiency, and output consistency. It compares prompt variants to find the optimal version.

Capabilities

1. Quality Scoring

node evaluate.js score --prompt "Summarize the article" --dimensions clarity,specificity,robustness
node evaluate.js score --prompt-file ./prompts/ --output scores.json

Scores prompts on clarity (0-10), specificity (0-10), robustness (0-10), and cost-efficiency (0-10).
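
The scores.json written by the second command might look like the sketch below; the field names are illustrative assumptions, not a schema documented by the skill.

{
  "prompts": [
    {
      "prompt": "Summarize the article",
      "scores": { "clarity": 7, "specificity": 5, "robustness": 6, "cost": 9 },
      "overall": 6.75
    }
  ]
}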

2. A/B Comparison

node evaluate.js compare --prompt-a "Summarize" --prompt-b "Write a 3-bullet summary" --trials 50
node evaluate.js compare --config ab-test-config.json

Runs statistical A/B tests between prompt variants with significance analysis.
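
An ab-test-config.json could mirror the flags of the first command; the shape below is a hypothetical sketch, with field names chosen to echo the Configuration section rather than a documented schema.

{
  "promptA": "Summarize",
  "promptB": "Write a 3-bullet summary",
  "trials": 50,
  "significanceLevel": 0.05
}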

3. Consistency Check

node evaluate.js consistency --prompt "Translate to French" --runs 100 --variance-threshold 0.15
node evaluate.js consistency --temperature 0.7 --top-p 0.9

Measures output consistency across multiple runs to find the most stable prompts.
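
One way to read the variance threshold: collect a similarity score for each repeated run and flag the prompt as unstable when the variance of those scores exceeds the threshold. The snippet below is a minimal Node sketch of that check under those assumptions; it is not taken from evaluate.js.

// consistency-check-sketch.js (hypothetical, not part of evaluate.js)
// One similarity score per run, in [0, 1]; higher means closer to the typical output.
const scores = [0.92, 0.88, 0.95, 0.61, 0.90];
const mean = scores.reduce((a, b) => a + b, 0) / scores.length;
const variance = scores.reduce((a, b) => a + (b - mean) ** 2, 0) / scores.length;
// Compare against the --variance-threshold value (0.15 in the example above).
console.log(variance <= 0.15 ? "stable" : "unstable", variance.toFixed(4));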

4. Regression Testing

node evaluate.js regression --baseline v1.0 --current v1.1 --test-suite golden-set.jsonl
node evaluate.js regression --fail-on-degradation 5%

Detects quality regressions between prompt versions using golden test sets.
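
A golden set in JSONL format typically holds one test case per line; the field names below are an assumed shape for illustration, since the exact schema is not documented here.

{"id": "case-001", "input": "Summarize: The quarterly report shows...", "expected": "Revenue grew 12% quarter over quarter."}
{"id": "case-002", "input": "Translate to French: Good morning", "expected": "Bonjour"}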

5. Cost Analysis

node evaluate.js cost --prompt "Long prompt..." --model gpt-4 --estimate-tokens
node evaluate.js cost --compare-prompts --output cost-report.csv

Estimates token usage and costs for different prompt variants and models.
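
As a rough worked example, at an illustrative rate of $0.03 per 1,000 input tokens, a 1,200-token prompt costs about 1,200 / 1,000 × $0.03 ≈ $0.036 per call, or roughly $36 per 1,000 calls; trimming the same prompt to 800 tokens would bring that down to about $24. Actual rates depend on the model and provider.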

Configuration

{
  "evaluation": {
    "dimensions": ["clarity", "specificity", "robustness", "cost"],
    "scoringModel": "gpt-4",
    "abTest": {
      "trials": 50,
      "significanceLevel": 0.05
    },
    "consistency": {
      "runs": 100,
      "varianceThreshold": 0.15
    },
    "regression": {
      "degradationThreshold": "5%",
      "goldenSet": "./golden-set.jsonl"
    }
  }
}

Use Cases

  • Prompt Engineering: Systematically improve prompt quality
  • Quality Assurance: Ensure prompts meet quality standards before production
  • Cost Optimization: Find prompts that achieve goals with fewer tokens
  • Version Control: Track prompt quality across versions
  • Agent Tuning: Optimize agent system prompts for consistency

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.

Related Skills

Related by shared tags or category signals.

LLM Evaluator Pro (General)

LLM-as-a-Judge evaluator via Langfuse. Scores traces on relevance, accuracy, hallucination, and helpfulness using GPT-5-nano as judge. Supports single trace...

Multi-Skill-Eval | Integrated Skill Evaluation System (General)

An integrated, multi-method skill evaluation system. Combines static analysis (skill-assessment), rubric-based quality scoring (skill-evaluator), and autonomous benchmarking (skill-eval). Use it to comprehensively evaluate, compare, audit, or improve OpenClaw skills. Covers documentation completeness, code quality, 25-item rubric scoring, and multi-model benchmarking. Trigger words (Chinese): 评估技...

Evalpal (Automation)

Run AI agent evaluations via EvalPal — trigger eval runs, check results, and list available evaluations

Sprint Contract (Coding)

Multi-agent development workflow with Sprint Contracts and independent QA evaluation. Use when building features, fixing complex bugs, or any task that invol...
