prompt-evaluator

A comprehensive framework for evaluating, comparing, and debugging system prompts for AI assistants.


Install the "prompt-evaluator" skill with this command: npx skills add sunhao25/prompt-evaluator-skill/sunhao25-prompt-evaluator-skill-prompt-evaluator


CAPABILITIES:

  • Single prompt scoring: Evaluate a prompt across 15 dimensions with detailed scores, issues, and specific fix recommendations

  • Multi-prompt comparison: Compare 2-5 prompts side-by-side with ranking table and improvement roadmap

  • Feedback analysis: Analyze user feedback, bug reports, or improvement suggestions to determine if issues are prompt-related and provide actionable fixes

TRIGGERS - Use this skill when:

  • User asks to "evaluate", "score", "rate", or "review" a prompt

  • User asks to "compare prompts", "which prompt is better", or "A/B test prompts"

  • User provides feedback/complaints about AI behavior and asks if it's a prompt issue

  • User asks "what's wrong with this prompt", "how to improve this prompt"

  • User mentions "system prompt", "inner prompt", "built-in prompt", "assistant prompt"

  • User asks about prompt quality, prompt optimization, or prompt debugging

  • Keywords: prompt evaluation, prompt scoring, prompt comparison, prompt analysis, prompt review

OUTPUT: Structured evaluation report with scores, issues, fixes, and priority recommendations.

Quick Start

Mode 1: Single Prompt Evaluation

Input: One prompt (text or file)

Output: 15-dimension score card + prioritized fix recommendations

Mode 2: Multi-Prompt Comparison

Input: 2-5 prompts to compare

Output: Comparison table + winner analysis + improvement roadmap

Mode 3: Feedback Analysis

Input: User feedback/complaints + current prompt

Output: Root cause analysis + prompt-specific fixes (if applicable)

Evaluation Framework

15 Evaluation Dimensions

Load references/dimensions.md for the complete scoring rubric. Summary:

| # | Dimension | Weight | Key Question |
|---|-----------|--------|--------------|
| 1 | Role Definition | 6% | Is the AI's identity and persona clearly defined? |
| 2 | Task Clarity | 10% | Is the primary task unambiguous? |
| 3 | Constraint Completeness | 10% | Are DO/DON'T rules comprehensive? |
| 4 | Output Format | 8% | Is the expected output format specified? |
| 5 | Example Quality | 8% | Are examples concrete and representative? |
| 6 | Edge Case Handling | 10% | Are boundary conditions addressed? |
| 7 | Business Alignment | 10% | Does it serve business goals? |
| 8 | User Experience | 8% | Does it create good UX? |
| 9 | Safety & Ethics | 8% | Are safety guardrails in place? |
| 10 | Maintainability | 5% | Is it modular and easy to update? |
| 11 | Token Efficiency | 4% | Is context used efficiently? |
| 12 | Robustness | 5% | Is it resistant to misuse/injection? |
| 13 | Consistency | 3% | Are rules internally consistent? |
| 14 | Internationalization | 2% | Does it handle multiple languages? |
| 15 | Measurability | 3% | Can outcomes be measured? |
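The weights above sum to exactly 100%, which is what makes the weighted total land on a 0-100 scale. A minimal sketch of how the rubric could be held in code (variable names are illustrative; the authoritative rubric lives in references/dimensions.md):

```python
# Dimension weights from the summary table, expressed as fractions.
DIMENSION_WEIGHTS = {
    "Role Definition": 0.06,
    "Task Clarity": 0.10,
    "Constraint Completeness": 0.10,
    "Output Format": 0.08,
    "Example Quality": 0.08,
    "Edge Case Handling": 0.10,
    "Business Alignment": 0.10,
    "User Experience": 0.08,
    "Safety & Ethics": 0.08,
    "Maintainability": 0.05,
    "Token Efficiency": 0.04,
    "Robustness": 0.05,
    "Consistency": 0.03,
    "Internationalization": 0.02,
    "Measurability": 0.03,
}

# Weights must sum to 1.0 so a perfect prompt (all 10s) scores 100.
assert abs(sum(DIMENSION_WEIGHTS.values()) - 1.0) < 1e-9
```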

Mode 1: Single Prompt Evaluation

Input Requirements

  • The prompt to evaluate (as text, file, or pasted content)

  • Optional: Context about the product/use case

  • Optional: Known issues or user complaints

Evaluation Process

STEP 1: Initial Scan

  • Count total lines and estimate tokens

  • Identify structural elements (sections, headers, examples)

  • Detect language(s) used
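The initial scan can be sketched in a few lines. This is a minimal illustration, assuming the common ~4 characters/token heuristic (an approximation, not an exact tokenizer) and markdown-style headers as a proxy for structural sections:

```python
import re

def initial_scan(prompt: str) -> dict:
    """Rough STEP 1 metrics: line count, token estimate, section count."""
    lines = prompt.splitlines()
    return {
        "lines": len(lines),
        # ~4 chars per token is a coarse but serviceable estimate.
        "est_tokens": len(prompt) // 4,
        # Markdown headers ("#", "##", ...) stand in for structural sections.
        "sections": sum(1 for l in lines if re.match(r"^#{1,6}\s", l)),
    }

print(initial_scan("# Role\nYou are a support agent.\n\n# Rules\n- Be concise."))
```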

STEP 2: Dimension-by-Dimension Scoring

For each of the 15 dimensions:

  • Score 1-10 based on rubric in references/dimensions.md

  • Identify specific issues (quote line numbers)

  • Classify severity: 🔴 Critical / 🟡 Warning / 🟢 Minor

  • Provide concrete fix recommendation

STEP 3: Calculate Weighted Score

Total Score = Σ (dimension_score × weight) × 10
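With dimension scores of 1-10 and fractional weights summing to 1.0, the formula maps to a 0-100 total. A sketch with a toy two-dimension rubric (dimension names here are illustrative):

```python
def weighted_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    # Sigma(dimension_score * weight) yields at most 10; * 10 maps to 0-100.
    return sum(scores[d] * weights[d] for d in weights) * 10

weights = {"Task Clarity": 0.6, "Output Format": 0.4}  # toy rubric
scores = {"Task Clarity": 9, "Output Format": 7}
print(round(weighted_score(scores, weights), 1))
```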

STEP 4: Generate Report

Output format:

Prompt Evaluation Report

Summary

| Metric | Value |
|--------|-------|
| Total Score | XX/100 (Grade) |
| Lines | XXX |
| Est. Tokens | ~X,XXX |
| Critical Issues | X |
| Warnings | X |

Score Breakdown

[15-dimension table with scores and brief notes]

Critical Issues (🔴)

[Detailed issues with line references and fixes]

Warnings (🟡)

[Detailed issues with line references and fixes]

Top 3 Priority Fixes

  1. [Most impactful fix with before/after example]
  2. [Second priority fix]
  3. [Third priority fix]

Improvement Roadmap

[Ordered list of all recommended changes]

Mode 2: Multi-Prompt Comparison

Input Requirements

  • 2-5 prompts to compare (labeled A, B, C, etc.)

  • Optional: Evaluation focus (e.g., "focus on safety" or "focus on UX")

Comparison Process

STEP 1: Individual Evaluation

Evaluate each prompt using Mode 1 (abbreviated).

STEP 2: Head-to-Head Comparison

For each dimension, compare all prompts and identify:

  • Winner for that dimension

  • Key differentiator

STEP 3: Generate Comparison Report

Output format:

Prompt Comparison Report

Overall Ranking

| Rank | Prompt | Score | Strengths | Weaknesses |
|------|--------|-------|-----------|------------|
| 1 | [Name] | XX/100 | ... | ... |
| 2 | [Name] | XX/100 | ... | ... |

Dimension-by-Dimension Comparison

| Dimension | Prompt A | Prompt B | Prompt C | Winner |
|-----------|----------|----------|----------|--------|
| Role Definition | 7 | 8 | 6 | B |
| Task Clarity | 9 | 7 | 8 | A |
| ... | ... | ... | ... | ... |

Key Differentiators

[What makes the winner better in specific areas]

Synthesis Recommendation

[How to combine the best elements of each prompt]

Next Version Roadmap

[Prioritized improvements for the winning prompt]
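Producing the overall ranking is straightforward once each prompt has a Mode 1 score. A minimal sketch (the scores below are made up for illustration):

```python
# Hypothetical Mode 1 totals for three prompts under comparison.
results = {"Prompt A": 84, "Prompt B": 78, "Prompt C": 91}

# Sort descending by score to build the ranking table.
ranking = sorted(results.items(), key=lambda kv: kv[1], reverse=True)
for rank, (name, score) in enumerate(ranking, start=1):
    print(f"{rank}. {name}: {score}/100")
```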

Mode 3: Feedback Analysis

Input Requirements

  • User feedback, complaints, or bug reports

  • Current prompt being used

  • Optional: Conversation logs showing the issue

Analysis Process

STEP 1: Classify Feedback

Determine if the issue is:

  • 🎯 Prompt Issue: Fixable by modifying the prompt

  • ⚙️ Backend Issue: Requires code/data/infrastructure changes

  • 🔄 Hybrid Issue: Needs both prompt and backend fixes

  • ❌ Not an Issue: User misunderstanding or expected behavior
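The four categories can be captured as a simple enum. This is purely illustrative (the names and two-flag heuristic below are not part of the skill itself):

```python
from enum import Enum

class IssueClass(Enum):
    PROMPT = "Prompt Issue"          # fixable by modifying the prompt
    BACKEND = "Backend Issue"        # requires code/data/infrastructure changes
    HYBRID = "Hybrid Issue"          # needs both prompt and backend fixes
    NOT_AN_ISSUE = "Not an Issue"    # misunderstanding or expected behavior

def classify(prompt_related: bool, backend_related: bool) -> IssueClass:
    if prompt_related and backend_related:
        return IssueClass.HYBRID
    if prompt_related:
        return IssueClass.PROMPT
    if backend_related:
        return IssueClass.BACKEND
    return IssueClass.NOT_AN_ISSUE

print(classify(prompt_related=True, backend_related=False).value)
```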

STEP 2: Root Cause Analysis

If prompt-related:

  • Identify which dimension(s) are failing

  • Locate specific rule gaps or conflicts

  • Trace the failure path

STEP 3: Generate Analysis Report

Output format:

Feedback Analysis Report

Feedback Summary

| Item | Details |
|------|---------|
| Feedback Type | [Complaint/Bug/Suggestion] |
| Issue Classification | [Prompt/Backend/Hybrid/Not an Issue] |
| Confidence | [High/Medium/Low] |

Root Cause Analysis

[Detailed explanation of why the issue occurs]

Is This a Prompt Issue?

[YES/NO/PARTIAL with reasoning]

Affected Dimensions

| Dimension | Current Score | Impact |
|-----------|---------------|--------|
| [Dimension] | X/10 | [How feedback relates] |

Recommended Prompt Fixes

[If prompt-related, provide specific fixes with before/after]

Backend Recommendations

[If backend-related, describe what needs to change]

Validation Criteria

[How to verify the fix works]

Grading Scale

| Score Range | Grade | Description |
|-------------|-------|-------------|
| 90-100 | A | Production-ready, minimal issues |
| 80-89 | B+ | Good quality, minor improvements needed |
| 70-79 | B | Functional, several improvements recommended |
| 60-69 | C | Needs significant work |
| 50-59 | D | Major issues, not recommended for production |
| <50 | F | Fundamental problems, requires rewrite |
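The scale above is a straight banded lookup, which can be sketched directly:

```python
def grade(score: float) -> str:
    """Map a 0-100 weighted total to the grading scale."""
    if score >= 90:
        return "A"
    if score >= 80:
        return "B+"
    if score >= 70:
        return "B"
    if score >= 60:
        return "C"
    if score >= 50:
        return "D"
    return "F"

print(grade(82))  # a score of 82 falls in the 80-89 band: "B+"
```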

Common Anti-Patterns

When evaluating, watch for these red flags:

  • Vague Role: "You are a helpful assistant" (no specificity)

  • Missing Constraints: No DO NOT rules

  • No Examples: Abstract rules without concrete demonstrations

  • Contradictory Rules: Conflicting instructions

  • Token Bloat: Unnecessary repetition or verbose explanations

  • No Safety Rails: Missing content/behavior boundaries

  • Hardcoded Values: Values that should be dynamic

  • No Fallback: Missing error/edge case handling

  • Monolithic Structure: No modular sections

  • Injection Vulnerability: No protection against prompt attacks
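Some of these red flags lend themselves to mechanical checks. A toy lint pass for two of them, vague role and missing constraints (the patterns below are illustrative heuristics, not the checks from references/anti-patterns.md):

```python
import re

def lint(prompt: str) -> list[str]:
    """Return a list of anti-pattern flags found in the prompt text."""
    flags = []
    # Vague Role: the generic "helpful assistant" phrasing with no specificity.
    if re.search(r"\bhelpful assistant\b", prompt, re.IGNORECASE):
        flags.append("Vague Role: 'helpful assistant' with no specificity")
    # Missing Constraints: no DO NOT rules anywhere in the prompt.
    if not re.search(r"\b(do not|don't|never)\b", prompt, re.IGNORECASE):
        flags.append("Missing Constraints: no DO NOT rules found")
    return flags

print(lint("You are a helpful assistant."))
```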

See references/anti-patterns.md for detailed examples and fixes.

Output Quality Standards

All evaluation outputs must:

  • Be Actionable: Every issue must have a specific fix recommendation

  • Include Evidence: Quote specific lines/sections from the prompt

  • Prioritize: Rank issues by impact (Critical > Warning > Minor)

  • Provide Before/After: Show concrete examples of recommended changes

  • Be Measurable: Include expected improvement metrics where possible

