prompt-lab

Systematic LLM prompt engineering: analyzes existing prompts for failure modes, generates structured variants (direct, few-shot, chain-of-thought), designs evaluation rubrics with weighted criteria, and produces test case suites for comparing prompt performance. Triggers on: "prompt engineering", "prompt lab", "generate prompt variants", "A/B test prompts", "evaluate prompt", "optimize prompt", "write a better prompt", "prompt design", "prompt iteration", "few-shot examples", "chain-of-thought prompt", "prompt failure modes", "improve this prompt". Use this skill when designing, improving, or evaluating LLM prompts specifically. NOT for evaluating Claude Code skills or SKILL.md files — use skill-evaluator instead.


Prompt Lab

Replaces trial-and-error prompt engineering with structured methodology: objective definition, current prompt analysis, variant generation (instruction clarity, example strategies, output format specification), evaluation rubric design, test case creation, and failure mode identification.

Reference Files

| File | Contents | Load When |
|------|----------|-----------|
| references/prompt-patterns.md | Prompt structure catalog: zero-shot, few-shot, CoT, persona, structured output | Always |
| references/evaluation-metrics.md | Quality metrics (accuracy, format compliance, completeness), rubric design | Evaluation needed |
| references/failure-modes.md | Common prompt failure taxonomy, detection strategies, mitigations | Failure analysis requested |
| references/output-constraints.md | Techniques for constraining LLM output format, JSON mode, schema enforcement | Format control needed |

Prerequisites

  • Clear objective: what should the prompt accomplish?
  • Target model (GPT-4, Claude, open-source) — prompting techniques vary by model
  • Current prompt (if improving) or task description (if creating)

Workflow

Phase 1: Define Objective

  1. Task specification — What should the LLM produce? Be specific: "Classify customer support tickets into 5 categories" not "Handle support tickets."
  2. Success criteria — How do you know the output is correct? Define measurable criteria before writing any prompt.
  3. Failure modes — What does a bad output look like? Missing information? Wrong format? Hallucinated content? Refusal to answer?

Phase 2: Analyze Current Prompt

If an existing prompt is provided:

  1. Structure assessment — Is the instruction clear? Are examples provided? Is the output format specified?
  2. Ambiguity detection — Where could the model misinterpret the instruction?
  3. Missing components — What's not specified that should be? (output format, tone, length constraints, edge case handling)
  4. Failure mode mapping — Which known failure patterns (see references/failure-modes.md) apply to this prompt?
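The ambiguity and missing-components checks above can be partially automated. Below is a toy heuristic lint for Phase 2 — the keyword lists are illustrative assumptions, not the taxonomy from references/failure-modes.md, and a keyword miss only flags a component for human review:

```python
# Toy heuristic lint for Phase 2: flags components a prompt may be missing.
# The keyword lists are illustrative assumptions, not an exhaustive check.
CHECKS = {
    "output format": ["format", "json", "csv", "markdown", "respond with"],
    "examples": ["example", "e.g.", "for instance"],
    "length constraint": ["word", "sentence", "concise", "at most", "no more than"],
}

def missing_components(prompt: str) -> list[str]:
    """Return the names of components no keyword hints at."""
    text = prompt.lower()
    return [name for name, keywords in CHECKS.items()
            if not any(k in text for k in keywords)]

print(missing_components("Classify the ticket into one of 5 categories."))
# flags all three components as unspecified
```

A clean pass from this lint does not mean the prompt is unambiguous — it only means the obvious surface markers are present.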

Phase 3: Generate Variants

Create 2-4 prompt variants, each testing a different hypothesis:

| Variant Type | Hypothesis | When to Use |
|--------------|------------|-------------|
| Direct instruction | Clear instruction is sufficient | Simple tasks, capable models |
| Few-shot | Examples improve output consistency | Pattern-following tasks |
| Chain-of-thought | Reasoning improves accuracy | Multi-step logic, math, analysis |
| Persona/role | Role framing improves tone/expertise | Domain-specific tasks |
| Structured output | Format specification prevents errors | JSON, CSV, specific templates |

For each variant:

  • State the hypothesis (why this variant might work)
  • Identify the risk (what could go wrong)
  • Provide the complete prompt text
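As one concrete sketch of variant generation, a few-shot variant can be assembled mechanically from labeled examples. The `Input:`/`Output:` delimiter style below is one common convention, not a fixed requirement of this skill:

```python
# Sketch of assembling a few-shot variant (Phase 3) from labeled examples.
# The "Input:/Output:" delimiter style is one common convention, not the
# only valid one; adjust to the target model's preferences.
def build_few_shot(instruction: str,
                   examples: list[tuple[str, str]],
                   query: str) -> str:
    shots = "\n\n".join(f"Input: {inp}\nOutput: {out}"
                        for inp, out in examples)
    return f"{instruction}\n\n{shots}\n\nInput: {query}\nOutput:"

prompt = build_few_shot(
    "Classify the support ticket as BILLING, BUG, or OTHER.",
    [("I was charged twice this month.", "BILLING"),
     ("The app crashes on login.", "BUG")],
    "How do I export my data?",
)
```

Generating variants programmatically keeps the "one variable per variant" rule honest: the instruction and query stay fixed while only the examples change.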

Phase 4: Design Evaluation

  1. Rubric — Define weighted criteria:

    | Criterion | What It Measures | Typical Weight |
    |-----------|------------------|----------------|
    | Correctness | Output matches expected answer | 30-50% |
    | Format compliance | Follows specified structure | 15-25% |
    | Completeness | All required elements present | 15-25% |
    | Conciseness | No unnecessary content | 5-15% |
    | Tone/style | Matches requested voice | 5-10% |
  2. Test cases — Minimum 5 cases covering:

    • Happy path (standard input)
    • Edge cases (unusual but valid input)
    • Adversarial cases (inputs designed to confuse)
    • Boundary cases (minimum/maximum input)
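The weighted rubric above reduces to simple arithmetic. A minimal sketch, assuming each criterion is scored on the 0-3 scale and the weights sum to 1.0 (both assumptions — adapt to whatever scale your rubric specifies):

```python
# Weighted rubric scoring (Phase 4), assuming a 0-3 scale per criterion
# and weights that sum to 1.0. Returns an overall score out of 100.
def rubric_score(scores: dict[str, int], weights: dict[str, float]) -> float:
    assert abs(sum(weights.values()) - 1.0) < 1e-9, "weights must sum to 1"
    return 100 * sum(weights[c] * scores[c] / 3 for c in weights)

weights = {"correctness": 0.4, "format": 0.2, "completeness": 0.2,
           "conciseness": 0.1, "tone": 0.1}
scores = {"correctness": 3, "format": 3, "completeness": 2,
          "conciseness": 3, "tone": 2}
print(rubric_score(scores, weights))  # -> 90.0 (within float rounding)
```

Averaging this score across the test cases gives one number per variant, which is what Phase 5 compares.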

Phase 5: Output

Present variants, rubric, and test cases in a structured format ready for execution.

Output Format

## Prompt Lab: {Task Name}

### Objective
{What the prompt should achieve — specific and measurable}

### Success Criteria
- [ ] {Criterion 1 — measurable}
- [ ] {Criterion 2 — measurable}

### Current Prompt Analysis
{If existing prompt provided}
- **Strengths:** {what works}
- **Weaknesses:** {what fails or is ambiguous}
- **Missing:** {what's not specified}

### Variants

#### Variant A: {Strategy Name}

{Complete prompt text}

**Hypothesis:** {Why this approach might work}
**Risk:** {What could go wrong}

#### Variant B: {Strategy Name}

{Complete prompt text}

**Hypothesis:** {Why this approach might work}
**Risk:** {What could go wrong}

#### Variant C: {Strategy Name}

{Complete prompt text}

**Hypothesis:** {Why this approach might work}
**Risk:** {What could go wrong}

### Evaluation Rubric

| Criterion | Weight | Scoring |
|-----------|--------|---------|
| {criterion} | {%} | {how to score: 0-3 scale or pass/fail} |

### Test Cases

| # | Input | Expected Output | Tests Criteria |
|---|-------|-----------------|---------------|
| 1 | {standard input} | {expected} | Correctness, Format |
| 2 | {edge case} | {expected} | Completeness |
| 3 | {adversarial} | {expected} | Robustness |

### Failure Modes to Monitor
- {Failure mode 1}: {detection method}
- {Failure mode 2}: {detection method}

### Recommended Next Steps
1. Run all variants against the test suite
2. Score using the rubric
3. Select the highest-scoring variant
4. Iterate on the winner with targeted improvements
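The comparison loop in steps 1-3 of the template's next steps can be sketched as below. `run_model` is a stub standing in for whatever LLM call you use, and `score_fn` is any scorer (exact match, rubric-based) returning a number per case — both are illustrative assumptions:

```python
# Sketch of the variant-comparison loop: run every variant on every test
# case, score each output, and pick the highest-scoring variant.
# `run_model` is a placeholder stub, not a real LLM client.
def run_model(prompt: str, case_input: str) -> str:
    return "BILLING"  # placeholder reply for illustration only

def compare_variants(variants: dict, test_cases: list, score_fn):
    results = {}
    for name, prompt in variants.items():
        scores = [score_fn(run_model(prompt, case["input"]), case)
                  for case in test_cases]
        results[name] = sum(scores) / len(scores)  # mean score per variant
    best = max(results, key=results.get)
    return best, results
```

Because each variant changes one variable, the score gap between any two variants can be attributed to that variable alone.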

Calibration Rules

  1. One variable per variant. Each variant should change ONE thing from the baseline. Changing instruction style AND examples AND format simultaneously makes results uninterpretable.
  2. Test before declaring success. A prompt that works on 3 examples may fail on the 4th. Minimum 5 diverse test cases before concluding a variant works.
  3. Failure modes are more valuable than successes. Understanding WHY a prompt fails guides improvement more than confirming it works.
  4. Model-specific optimization. A prompt optimized for GPT-4 may not work for Claude or Llama. Always note the target model.
  5. Simplest effective prompt wins. If a zero-shot prompt scores as well as a few-shot prompt, use the zero-shot. Fewer tokens = lower cost + latency.

Error Handling

| Problem | Resolution |
|---------|------------|
| No clear objective | Ask the user to define what "good output" looks like with 2-3 examples. |
| Prompt is for a task LLMs are bad at (math, counting) | Flag the limitation. Suggest tool-augmented approaches or pre/post-processing. |
| Too many variables to test | Focus on the highest-impact variable first. Iterative refinement beats combinatorial testing. |
| No existing prompt to analyze | Start with the simplest possible prompt. The first variant IS the baseline. |
| Output format requirements are strict | Use structured output mode (JSON mode, function calling) instead of prompt-only constraints. |
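Even with structured output mode, a post-hoc validation step catches the remaining format failures. A minimal sketch using only the standard library — the key names are illustrative, not part of this skill:

```python
import json

# Minimal post-hoc check for strict format requirements: parse the model's
# reply as JSON and verify required keys. Useful as a fallback when the
# provider's structured-output mode is unavailable. Key names below are
# illustrative placeholders.
REQUIRED_KEYS = {"category", "confidence"}

def parse_or_flag(reply: str):
    """Return (data, None) on success or (None, reason) on a format failure."""
    try:
        data = json.loads(reply)
    except json.JSONDecodeError:
        return None, "not valid JSON"
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        return None, f"missing keys: {sorted(missing)}"
    return data, None
```

The failure reasons double as detection signals for the "Failure Modes to Monitor" section of the output.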

When NOT to Use

Push back if:

  • The task doesn't need an LLM (deterministic rules, regex, SQL) — use the right tool
  • The user wants prompt execution, not design — this skill designs and evaluates, it doesn't run prompts
  • The prompt is for safety-critical decisions without human review — LLM output should not be the sole input

