# Prompt Testing
Skill for testing, comparing, and measuring prompt performance.
## Documentation

- `metrics.md` - Performance metrics definitions
- `methodology.md` - A/B testing protocol
## Testing Workflow

```
1. DEFINE
   ├── Test objective
   ├── Metrics to measure
   └── Success criteria

2. PREPARE
   ├── Variants A and B
   ├── Test dataset
   └── Baseline (if one exists)

3. EXECUTE
   ├── Run on the dataset
   ├── Collect results
   └── Document observations

4. ANALYZE
   ├── Calculate metrics
   ├── Compare variants
   └── Identify patterns

5. DECIDE
   ├── Recommendation
   ├── Statistical confidence
   └── Next iterations
```
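The EXECUTE step can be driven by a small harness that runs a variant over every case and records raw outputs plus latency. A minimal sketch, assuming the dataset format defined below; `call_model` is a hypothetical adapter you would wire to your own model API:

```python
import json
import time

def call_model(prompt: str, case_input: str) -> str:
    """Hypothetical adapter: send the prompt variant plus case input to your model API."""
    raise NotImplementedError("wire this to your model API")

def run_variant(prompt: str, dataset_path: str) -> list[dict]:
    """EXECUTE step: run one prompt variant over every case and collect raw results."""
    with open(dataset_path) as f:
        dataset = json.load(f)
    results = []
    for case in dataset["cases"]:
        start = time.perf_counter()
        output = call_model(prompt, case["input"])
        latency_ms = (time.perf_counter() - start) * 1000
        results.append({
            "id": case["id"],
            "type": case["type"],
            "expected": case["expected"],
            "output": output,
            "latency_ms": latency_ms,
        })
    return results
```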
## Performance Metrics

### Quality

| Metric | Description | Calculation |
|---|---|---|
| Accuracy | Correct responses | Correct / Total |
| Compliance | Format adherence | Compliant / Total |
| Consistency | Response stability | 1 - Variance |
| Relevance | Meeting the need | Average score (1-5) |
### Efficiency

| Metric | Description | Calculation |
|---|---|---|
| Input tokens | Prompt size | Token count |
| Output tokens | Response size | Token count |
| Latency | Response time | Milliseconds (ms) |
| Cost | Price per request | Tokens × Price |
### Robustness

| Metric | Description | Calculation |
|---|---|---|
| Edge cases | Edge-case handling | Passed / Total |
| Jailbreak resistance | Resistance to bypass attempts | Blocked / Attempts |
| Error recovery | Recovery after a failed response | Recovered / Errors |
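As a worked example of the quality formulas above, here is a minimal aggregation sketch. Every field name (`correct`, `compliant`, `relevance`, `score`) is an assumption about how your scoring step labels results, not something this skill defines; the consistency term assumes `score` values in [0, 1] collected from repeated runs of the same input:

```python
from statistics import mean, pvariance

def compute_metrics(results: list[dict]) -> dict:
    """Aggregate the quality metrics above from scored results (field names illustrative)."""
    n = len(results)
    scores = [r["score"] for r in results]  # per-run scores in [0, 1] for repeated runs
    return {
        "accuracy": sum(r["correct"] for r in results) / n,      # Correct / Total
        "compliance": sum(r["compliant"] for r in results) / n,  # Compliant / Total
        "consistency": 1 - pvariance(scores),                    # 1 - Variance
        "relevance": mean(r["relevance"] for r in results),      # average 1-5 score
    }
```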
## Test Format

### Test Dataset

```json
{
  "name": "Test Dataset v1",
  "description": "Dataset for testing prompt XYZ",
  "cases": [
    {
      "id": "case_001",
      "type": "standard",
      "input": "Test input",
      "expected": "Expected output",
      "tags": ["basic", "format"]
    },
    {
      "id": "case_002",
      "type": "edge_case",
      "input": "Edge input",
      "expected": "Expected behavior",
      "tags": ["edge", "error"]
    }
  ]
}
```
### Test Report

```markdown
# A/B Test Report: {{TEST_NAME}}

## Configuration

| Parameter | Value |
|---|---|
| Date | {{DATE}} |
| Dataset | {{DATASET}} |
| Cases tested | {{N_CASES}} |
| Model | {{MODEL}} |

## Tested Variants

### Variant A (Baseline)
[Description or link to prompt A]

### Variant B (Challenger)
[Description or link to prompt B]

## Results

### Overall Scores

| Metric | A | B | Delta | Winner |
|---|---|---|---|---|
| Accuracy | X% | Y% | +/-Z% | A/B |
| Compliance | X% | Y% | +/-Z% | A/B |
| Tokens | X | Y | +/-Z | A/B |
| Latency | X ms | Y ms | +/-Z ms | A/B |

### Detail by Case Type

| Type | A | B | Notes |
|---|---|---|---|
| Standard | X% | Y% | |
| Edge cases | X% | Y% | |
| Error cases | X% | Y% | |

### Problematic Cases

| Case ID | Expected | A | B | Analysis |
|---|---|---|---|---|
| case_XXX | ... | ❌ | ✅ | [Explanation] |

## Analysis

### B's Strengths
- [Improvement 1]
- [Improvement 2]

### B's Weaknesses
- [Regression 1]

### Observations
[Qualitative insights]

## Recommendation

**Verdict:** ✅ Adopt B / ⚠️ Iterate / ❌ Keep A

**Confidence:** High / Medium / Low

**Justification:** [Explanation of the recommendation]

## Next Steps
- [Action 1]
- [Action 2]
```
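Since the `{{PLACEHOLDER}}` markers are plain text, the report skeleton can be filled mechanically once metrics are computed. A minimal sketch using simple string substitution; the `fill_report` helper and all sample values are illustrative only:

```python
def fill_report(template: str, values: dict[str, str]) -> str:
    """Replace {{KEY}} placeholders in the report template with concrete values."""
    for key, value in values.items():
        template = template.replace("{{" + key + "}}", value)
    return template

# Illustrative usage with placeholder values:
report = fill_report(open("report_template.md").read(), {
    "TEST_NAME": "Prompt XYZ v1 vs v2",
    "DATE": "2024-05-01",
    "DATASET": "tests.json",
    "N_CASES": "24",
    "MODEL": "model-name",
})
```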
## Commands

### Create a test

```bash
/prompt test create --name "Test v1" --dataset tests.json
```

### Run an A/B test

```bash
/prompt test run --a prompt_a.md --b prompt_b.md --dataset tests.json
```

### View results

```bash
/prompt test results --id test_001
```

### Compare two tests

```bash
/prompt test compare --tests test_001,test_002
```
## Decision Criteria

### When to adopt variant B?

```
IF:
  - Accuracy B >= Accuracy A
  - AND (Tokens B <= Tokens A × 1.1 OR accuracy improvement > 5%)
  - AND no regression on edge cases
THEN:
  → Adopt B

ELSE IF:
  - Accuracy improvement > 10%
  - AND token regression < 20%
THEN:
  → Consider B (acceptable trade-off)

ELSE:
  → Keep A or iterate
```
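These criteria can be encoded directly so the DECIDE step is reproducible. A minimal sketch operating on per-variant metric dicts; the `accuracy`/`tokens` field names and the `edge_regression` flag are assumptions about your results format:

```python
def decide(a: dict, b: dict, edge_regression: bool) -> str:
    """Apply the decision criteria above.

    a/b carry 'accuracy' (0-1) and 'tokens' (avg tokens per request);
    edge_regression flags any regression on edge cases.
    """
    acc_gain = b["accuracy"] - a["accuracy"]
    token_ratio = b["tokens"] / a["tokens"]
    if (b["accuracy"] >= a["accuracy"]
            and (token_ratio <= 1.1 or acc_gain > 0.05)
            and not edge_regression):
        return "Adopt B"
    if acc_gain > 0.10 and token_ratio < 1.20:
        return "Consider B (acceptable trade-off)"
    return "Keep A or iterate"
```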
## Best Practices

- Use at least 20 test cases for statistical significance (see the sketch below)
- Include edge cases (15-20% of the dataset)
- Run each case multiple times to check consistency
- Document hypotheses before testing
- Version the prompts being tested
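On the significance point, a two-proportion z-test is one common way to attach a p-value to an accuracy delta; it is a standard statistical choice, not something this skill prescribes. A minimal standard-library sketch:

```python
import math

def two_proportion_p_value(correct_a: int, correct_b: int, n: int) -> float:
    """Two-sided p-value for the difference between two accuracy proportions
    measured on the same number of cases n."""
    p_a, p_b = correct_a / n, correct_b / n
    pooled = (correct_a + correct_b) / (2 * n)
    se = math.sqrt(2 * pooled * (1 - pooled) / n)
    if se == 0:
        return 1.0
    z = abs(p_a - p_b) / se
    return math.erfc(z / math.sqrt(2))  # P(|Z| >= z) under a standard normal

# Even a 10-point accuracy gap on 20 cases is far from significant on its own,
# which is why 20 cases is a floor, not a target:
print(two_proportion_p_value(14, 16, 20))  # ≈ 0.47
```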