prompt-testing

Skill for testing, comparing, and measuring prompt performance.

Install with:

npx skills add fusengine/agents/fusengine-agents-prompt-testing


Documentation

  • metrics.md - Performance metrics definition

  • methodology.md - A/B testing protocol

Testing Workflow

  1. DEFINE
     - Test objective
     - Metrics to measure
     - Success criteria

  2. PREPARE
     - Variants A and B
     - Test dataset
     - Baseline (if existing)

  3. EXECUTE
     - Run on dataset
     - Collect results
     - Document observations

  4. ANALYZE
     - Calculate metrics
     - Compare variants
     - Identify patterns

  5. DECIDE
     - Recommendation
     - Statistical confidence
     - Next iterations
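As a sketch, the EXECUTE and ANALYZE steps can be wired into a small harness. Everything below (function names, the `call_model` callable, the result shape) is illustrative, not part of the skill's actual tooling:

```python
# Minimal A/B harness: run both variants over a dataset and record
# per-case correctness. `call_model` is a stand-in for whatever client
# actually sends the prompt to a model.

def run_ab_test(prompt_a, prompt_b, dataset, call_model):
    """EXECUTE both variants on every case; return per-case results."""
    results = {"A": [], "B": []}
    for case in dataset:
        for label, prompt in (("A", prompt_a), ("B", prompt_b)):
            output = call_model(prompt, case["input"])
            results[label].append({
                "id": case["id"],
                "correct": output == case["expected"],  # exact-match scoring
            })
    return results
```

Exact-match scoring is the simplest choice; fuzzy matching or judge-model scoring slots into the same `correct` field.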

Performance Metrics

Quality

| Metric | Description | Calculation |
|---|---|---|
| Accuracy | Correct responses | Correct / Total |
| Compliance | Format adherence | Compliant / Total |
| Consistency | Response stability | 1 - Variance |
| Relevance | Meeting the need | Average score (1-5) |
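With boolean per-case flags and scores normalized to [0, 1], the quality formulas above reduce to a few lines. A sketch; the helper names are mine, not the skill's:

```python
from statistics import pvariance

def accuracy(correct_flags):
    """Accuracy = Correct / Total."""
    return sum(correct_flags) / len(correct_flags)

def compliance(format_flags):
    """Compliance = Compliant / Total."""
    return sum(format_flags) / len(format_flags)

def consistency(scores):
    """Consistency = 1 - Variance; assumes scores already lie in [0, 1]."""
    return 1 - pvariance(scores)
```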

Efficiency

| Metric | Description | Calculation |
|---|---|---|
| Tokens Input | Prompt size | Token count |
| Tokens Output | Response size | Token count |
| Latency | Response time | ms |
| Cost | Price per request | Tokens × Price |
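The Cost row (Tokens × Price) usually splits into separate input and output rates. A sketch, assuming per-1K-token prices:

```python
def request_cost(tokens_in, tokens_out, price_in_per_1k, price_out_per_1k):
    """Cost = Tokens x Price, with separate input and output rates."""
    return (tokens_in / 1000) * price_in_per_1k \
         + (tokens_out / 1000) * price_out_per_1k
```

For example, 1,000 input and 500 output tokens at $0.01 / $0.03 per 1K come to $0.025 per request.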

Robustness

| Metric | Description | Calculation |
|---|---|---|
| Edge Cases | Edge case handling | Passed / Total |
| Jailbreak Resistance | Bypass resistance | Blocked / Attempts |
| Error Recovery | Recovery after failures | Recovered / Errors |
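Each robustness metric is a pass rate over one slice of the dataset. A sketch keyed on the dataset's `type` field; the `passed` flag is an assumed per-case result, not something the skill defines:

```python
def pass_rate(results, case_type):
    """Passed / Total restricted to cases of one type, e.g. 'edge_case'."""
    subset = [r for r in results if r["type"] == case_type]
    if not subset:
        return None  # no cases of this type in the run
    return sum(r["passed"] for r in subset) / len(subset)
```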

Test Format

Test Dataset

```json
{
  "name": "Test Dataset v1",
  "description": "Dataset for testing prompt XYZ",
  "cases": [
    {
      "id": "case_001",
      "type": "standard",
      "input": "Test input",
      "expected": "Expected output",
      "tags": ["basic", "format"]
    },
    {
      "id": "case_002",
      "type": "edge_case",
      "input": "Edge input",
      "expected": "Expected behavior",
      "tags": ["edge", "error"]
    }
  ]
}
```
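A minimal loader for datasets in this shape; the required fields are taken from the example above, while the function name and validation behavior are my own sketch:

```python
import json

REQUIRED_FIELDS = {"id", "type", "input", "expected"}

def load_dataset(path):
    """Load a test dataset and reject cases with missing fields."""
    with open(path) as f:
        data = json.load(f)
    for case in data["cases"]:
        missing = REQUIRED_FIELDS - case.keys()
        if missing:
            raise ValueError(f"case {case.get('id', '?')} is missing {sorted(missing)}")
    return data
```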

Test Report

A/B Test Report: {{TEST_NAME}}

Configuration

| Parameter | Value |
|---|---|
| Date | {{DATE}} |
| Dataset | {{DATASET}} |
| Cases tested | {{N_CASES}} |
| Model | {{MODEL}} |

Tested Variants

Variant A (Baseline)

[Description or link to prompt A]

Variant B (Challenger)

[Description or link to prompt B]

Results

Overall Scores

| Metric | A | B | Delta | Winner |
|---|---|---|---|---|
| Accuracy | X% | Y% | +/-Z% | A/B |
| Compliance | X% | Y% | +/-Z% | A/B |
| Tokens | X | Y | +/-Z | A/B |
| Latency | X ms | Y ms | +/-Z ms | A/B |

Detail by Case Type

| Type | A | B | Notes |
|---|---|---|---|
| Standard | X% | Y% | |
| Edge cases | X% | Y% | |
| Error cases | X% | Y% | |

Problematic Cases

| Case ID | Expected | A | B | Analysis |
|---|---|---|---|---|
| case_XXX | ... | | | [Explanation] |

Analysis

B's Strengths

  • [Improvement 1]
  • [Improvement 2]

B's Weaknesses

  • [Regression 1]

Observations

[Qualitative insights]

Recommendation

Verdict: ✅ Adopt B / ⚠️ Iterate / ❌ Keep A

Confidence: High / Medium / Low

Justification: [Explanation of recommendation]

Next Steps

  1. [Action 1]
  2. [Action 2]

Commands

Create a test

/prompt test create --name "Test v1" --dataset tests.json

Run an A/B test

/prompt test run --a prompt_a.md --b prompt_b.md --dataset tests.json

View results

/prompt test results --id test_001

Compare two tests

/prompt test compare --tests test_001,test_002

Decision Criteria

When to adopt variant B?

IF:

  - Accuracy B >= Accuracy A
  - AND (Tokens B <= Tokens A × 1.1 OR accuracy improvement > 5%)
  - AND no regression on edge cases

THEN: → Adopt B

ELSE IF:

  - Accuracy improvement > 10%
  - AND token regression < 20%

THEN: → Consider B (acceptable trade-off)

ELSE: → Keep A or iterate
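The criteria above translate directly into code. A sketch with accuracies as fractions (0.05 = 5%) and the thresholds copied verbatim; the function name is mine:

```python
def decide(acc_a, acc_b, tokens_a, tokens_b, edge_regression):
    """Apply the adoption criteria; returns the verdict as a string."""
    if (acc_b >= acc_a
            and (tokens_b <= tokens_a * 1.1 or acc_b - acc_a > 0.05)
            and not edge_regression):
        return "Adopt B"
    if acc_b - acc_a > 0.10 and (tokens_b - tokens_a) / tokens_a < 0.20:
        return "Consider B (acceptable trade-off)"
    return "Keep A or iterate"
```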

Best Practices

  • Minimum 20 test cases for significance

  • Include edge cases (15-20% of dataset)

  • Test multiple runs for consistency

  • Document hypotheses before testing

  • Version the prompts being tested
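To put a number on "statistical confidence" with datasets this small, a two-proportion z-test on accuracy is one standard option. A sketch using only the standard library; the skill itself does not prescribe a specific test:

```python
from math import erf, sqrt

def ab_significance(correct_a, n_a, correct_b, n_b):
    """Two-proportion z-test on accuracy; returns (z, two-sided p-value)."""
    p_a, p_b = correct_a / n_a, correct_b / n_b
    pooled = (correct_a + correct_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0, 1.0  # degenerate samples (all pass or all fail)
    z = (p_b - p_a) / se
    # Normal CDF via the error function; no SciPy dependency.
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value
```

With only 20 cases per variant, accuracy gaps need to be large before p < 0.05, which is why the minimum-dataset guideline above matters.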
