# Prompt Evaluator

A comprehensive framework for evaluating, comparing, and debugging system prompts for AI assistants.
CAPABILITIES:

- Single prompt scoring: Evaluate a prompt across 15 dimensions with detailed scores, issues, and specific fix recommendations
- Multi-prompt comparison: Compare 2-5 prompts side-by-side with a ranking table and an improvement roadmap
- Feedback analysis: Analyze user feedback, bug reports, or improvement suggestions to determine whether an issue is prompt-related, and provide actionable fixes
TRIGGERS - Use this skill when:

- User asks to "evaluate", "score", "rate", or "review" a prompt
- User asks to "compare prompts", "which prompt is better", or "A/B test prompts"
- User provides feedback or complaints about AI behavior and asks whether it is a prompt issue
- User asks "what's wrong with this prompt" or "how to improve this prompt"
- User mentions "system prompt", "inner prompt", "built-in prompt", or "assistant prompt"
- User asks about prompt quality, prompt optimization, or prompt debugging
- Keywords: prompt evaluation, prompt scoring, prompt comparison, prompt analysis, prompt review
OUTPUT: Structured evaluation report with scores, issues, fixes, and priority recommendations.
## Quick Start

### Mode 1: Single Prompt Evaluation

- Input: One prompt (text or file)
- Output: 15-dimension score card + prioritized fix recommendations

### Mode 2: Multi-Prompt Comparison

- Input: 2-5 prompts to compare
- Output: Comparison table + winner analysis + improvement roadmap

### Mode 3: Feedback Analysis

- Input: User feedback/complaints + current prompt
- Output: Root cause analysis + prompt-specific fixes (if applicable)
## Evaluation Framework

### 15 Evaluation Dimensions

Load references/dimensions.md for the complete scoring rubric. Summary:

| # | Dimension | Weight | Key Question |
|---|---|---|---|
| 1 | Role Definition | 6% | Is the AI's identity and persona clearly defined? |
| 2 | Task Clarity | 10% | Is the primary task unambiguous? |
| 3 | Constraint Completeness | 10% | Are DO/DON'T rules comprehensive? |
| 4 | Output Format | 8% | Is the expected output format specified? |
| 5 | Example Quality | 8% | Are examples concrete and representative? |
| 6 | Edge Case Handling | 10% | Are boundary conditions addressed? |
| 7 | Business Alignment | 10% | Does it serve business goals? |
| 8 | User Experience | 8% | Does it create good UX? |
| 9 | Safety & Ethics | 8% | Are safety guardrails in place? |
| 10 | Maintainability | 5% | Is it modular and easy to update? |
| 11 | Token Efficiency | 4% | Is context used efficiently? |
| 12 | Robustness | 5% | Is it resistant to misuse/injection? |
| 13 | Consistency | 3% | Are rules internally consistent? |
| 14 | Internationalization | 2% | Does it handle multiple languages? |
| 15 | Measurability | 3% | Can outcomes be measured? |
## Mode 1: Single Prompt Evaluation

### Input Requirements

- The prompt to evaluate (as text, file, or pasted content)
- Optional: Context about the product/use case
- Optional: Known issues or user complaints
### Evaluation Process

STEP 1: Initial Scan

- Count total lines and estimate tokens
- Identify structural elements (sections, headers, examples)
- Detect language(s) used
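The initial scan could be sketched as a small helper; the ~4 characters-per-token heuristic and the header-detection regex are assumptions, not an exact tokenizer or parser:

```python
import re

def initial_scan(prompt: str) -> dict:
    """Rough structural scan of a prompt: line count, token estimate, headers."""
    lines = prompt.splitlines()
    # Heuristic: ~4 characters per token for English text (assumption, not exact).
    est_tokens = len(prompt) // 4
    # Treat markdown headers and ALL-CAPS section labels as structural elements.
    headers = [ln for ln in lines if re.match(r"^(#{1,6}\s|[A-Z][A-Z &/-]+:)", ln)]
    return {"lines": len(lines), "est_tokens": est_tokens, "headers": headers}

report = initial_scan("# Role\nYou are a support agent.\nRULES:\n- Be concise.")
# report["headers"] picks up "# Role" and "RULES:"
```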
STEP 2: Dimension-by-Dimension Scoring

For each of the 15 dimensions:

- Score 1-10 based on the rubric in references/dimensions.md
- Identify specific issues (quote line numbers)
- Classify severity: 🔴 Critical / 🟡 Warning / 🟢 Minor
- Provide a concrete fix recommendation
STEP 3: Calculate Weighted Score

Total Score = Σ (dimension_score × weight) × 10
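A minimal sketch of the weighted total, using the weights from the dimension table (dimension scores are 1-10; weights are fractions summing to 1.0, so a perfect prompt scores 100):

```python
# Weights from the 15-dimension table, expressed as fractions of 1.0.
WEIGHTS = {
    "role_definition": 0.06, "task_clarity": 0.10, "constraint_completeness": 0.10,
    "output_format": 0.08, "example_quality": 0.08, "edge_case_handling": 0.10,
    "business_alignment": 0.10, "user_experience": 0.08, "safety_ethics": 0.08,
    "maintainability": 0.05, "token_efficiency": 0.04, "robustness": 0.05,
    "consistency": 0.03, "internationalization": 0.02, "measurability": 0.03,
}

def total_score(scores: dict) -> float:
    """Total Score = sum(dimension_score * weight) * 10, on a 0-100 scale."""
    assert set(scores) == set(WEIGHTS), "score every dimension exactly once"
    return round(sum(scores[d] * WEIGHTS[d] for d in WEIGHTS) * 10, 1)

perfect = total_score({d: 10 for d in WEIGHTS})  # → 100.0
```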
STEP 4: Generate Report

Output format:

Prompt Evaluation Report

Summary

| Metric | Value |
|---|---|
| Total Score | XX/100 (Grade) |
| Lines | XXX |
| Est. Tokens | ~X,XXX |
| Critical Issues | X |
| Warnings | X |

Score Breakdown

[15-dimension table with scores and brief notes]

Critical Issues (🔴)

[Detailed issues with line references and fixes]

Warnings (🟡)

[Detailed issues with line references and fixes]

Top 3 Priority Fixes

1. [Most impactful fix with before/after example]
2. [Second priority fix]
3. [Third priority fix]

Improvement Roadmap

[Ordered list of all recommended changes]
## Mode 2: Multi-Prompt Comparison

### Input Requirements

- 2-5 prompts to compare (labeled A, B, C, etc.)
- Optional: Evaluation focus (e.g., "focus on safety" or "focus on UX")
### Comparison Process

STEP 1: Individual Evaluation

Evaluate each prompt using Mode 1 (abbreviated).

STEP 2: Head-to-Head Comparison

For each dimension, compare all prompts and identify:

- The winner for that dimension
- The key differentiator
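The head-to-head step amounts to picking the top scorer per dimension. A sketch, where the input shape (label → per-dimension scores) is an assumption and ties are reported explicitly:

```python
def head_to_head(scores: dict) -> dict:
    """For each dimension, name the prompt with the highest score, or 'tie'.

    `scores` maps a prompt label (e.g. "A") to {dimension: score 1-10}.
    """
    dimensions = next(iter(scores.values())).keys()
    winners = {}
    for dim in dimensions:
        best = max(scores, key=lambda label: scores[label][dim])
        top = scores[best][dim]
        tied = [label for label in scores if scores[label][dim] == top]
        winners[dim] = best if len(tied) == 1 else "tie"
    return winners

winners = head_to_head({
    "A": {"role_definition": 7, "task_clarity": 9},
    "B": {"role_definition": 8, "task_clarity": 7},
})
# → {"role_definition": "B", "task_clarity": "A"}
```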
STEP 3: Generate Comparison Report

Output format:

Prompt Comparison Report

Overall Ranking

| Rank | Prompt | Score | Strengths | Weaknesses |
|---|---|---|---|---|
| 1 | [Name] | XX/100 | ... | ... |
| 2 | [Name] | XX/100 | ... | ... |

Dimension-by-Dimension Comparison

| Dimension | Prompt A | Prompt B | Prompt C | Winner |
|---|---|---|---|---|
| Role Definition | 7 | 8 | 6 | B |
| Task Clarity | 9 | 7 | 8 | A |
| ... | ... | ... | ... | ... |

Key Differentiators

[What makes the winner better in specific areas]

Synthesis Recommendation

[How to combine the best elements of each prompt]

Next Version Roadmap

[Prioritized improvements for the winning prompt]
## Mode 3: Feedback Analysis

### Input Requirements

- User feedback, complaints, or bug reports
- The current prompt being used
- Optional: Conversation logs showing the issue
### Analysis Process

STEP 1: Classify Feedback

Determine whether the issue is:

- 🎯 Prompt Issue: Fixable by modifying the prompt
- ⚙️ Backend Issue: Requires code/data/infrastructure changes
- 🔄 Hybrid Issue: Needs both prompt and backend fixes
- ❌ Not an Issue: User misunderstanding or expected behavior

STEP 2: Root Cause Analysis

If prompt-related:

- Identify which dimension(s) are failing
- Locate specific rule gaps or conflicts
- Trace the failure path
STEP 3: Generate Analysis Report

Output format:

Feedback Analysis Report

Feedback Summary

| Item | Details |
|---|---|
| Feedback Type | [Complaint/Bug/Suggestion] |
| Issue Classification | [Prompt/Backend/Hybrid/Not an Issue] |
| Confidence | [High/Medium/Low] |

Root Cause Analysis

[Detailed explanation of why the issue occurs]

Is This a Prompt Issue?

[YES/NO/PARTIAL with reasoning]

Affected Dimensions

| Dimension | Current Score | Impact |
|---|---|---|
| [Dimension] | X/10 | [How the feedback relates] |

Recommended Prompt Fixes

[If prompt-related, provide specific fixes with before/after]

Backend Recommendations

[If backend-related, describe what needs to change]

Validation Criteria

[How to verify the fix works]
## Grading Scale

| Score Range | Grade | Description |
|---|---|---|
| 90-100 | A | Production-ready, minimal issues |
| 80-89 | B+ | Good quality, minor improvements needed |
| 70-79 | B | Functional, several improvements recommended |
| 60-69 | C | Needs significant work |
| 50-59 | D | Major issues, not recommended for production |
| <50 | F | Fundamental problems, requires rewrite |
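The grade bands above translate directly into a lookup; a minimal sketch:

```python
def grade(score: float) -> str:
    """Map a 0-100 total score to a letter grade per the grading scale."""
    bands = [(90, "A"), (80, "B+"), (70, "B"), (60, "C"), (50, "D")]
    for floor, letter in bands:
        if score >= floor:
            return letter
    return "F"  # below 50: fundamental problems, requires rewrite

grade(85)  # → "B+"
```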
## Common Anti-Patterns

When evaluating, watch for these red flags:

- Vague Role: "You are a helpful assistant" (no specificity)
- Missing Constraints: No DO NOT rules
- No Examples: Abstract rules without concrete demonstrations
- Contradictory Rules: Conflicting instructions
- Token Bloat: Unnecessary repetition or verbose explanations
- No Safety Rails: Missing content/behavior boundaries
- Hardcoded Values: Values that should be dynamic
- No Fallback: Missing error/edge case handling
- Monolithic Structure: No modular sections
- Injection Vulnerability: No protection against prompt attacks
See references/anti-patterns.md for detailed examples and fixes.
## Output Quality Standards

All evaluation outputs must:

- Be Actionable: Every issue must have a specific fix recommendation
- Include Evidence: Quote specific lines/sections from the prompt
- Prioritize: Rank issues by impact (Critical > Warning > Minor)
- Provide Before/After: Show concrete examples of recommended changes
- Be Measurable: Include expected improvement metrics where possible