Prompt Engineering

Overview

Untested prompts in production are bugs you haven't found yet. Vibes-based prompt tuning is not engineering.

Core principle: EVERY prompt is versioned, tested, and evaluated against ground truth before deployment.

Violating the letter of this process is violating the spirit of LLM engineering.

The Iron Law

EVERY PROMPT IS VERSIONED, TESTED, AND EVALUATED AGAINST GROUND TRUTH

If you haven't evaluated it on a test set, it's not ready for production. "It looked good in the playground" is not evaluation.

When to Use

Use for ANY LLM integration work:

Designing new prompts for applications
Modifying existing prompts
Building RAG pipelines
Implementing tool use / function calling
Optimizing token cost or latency
Migrating between models
Evaluating model outputs

Use this ESPECIALLY when:

Prompt "works most of the time"
You're tuning prompts by hand in a playground
Someone says "just tweak the prompt a bit"
Deploying prompt changes without evaluation
Switching models and assuming prompts transfer

Don't skip when:

The prompt is "simple" (simple prompts fail on edge cases)
You're "just fixing a typo" (typos change model behavior)
It's an internal tool (internal users deserve quality too)

The Five Phases

You MUST complete each phase before proceeding to the next.

Phase 1: Prompt Design

BEFORE writing ANY prompt:

Define the Task Precisely

What exactly should the model do?
What are valid outputs?
What are invalid outputs?
What edge cases exist?
Write these down. They become your evaluation criteria.

Select the Right Pattern

Pattern When to Use Example

Zero-shot Simple, well-defined tasks Classification, extraction

Few-shot Task needs examples to clarify format/behavior Structured data extraction, style matching

Chain-of-thought Reasoning, analysis, multi-step logic Math, code review, complex classification

System/User/Assistant roles Conversational applications Chatbots, assistants

Tool use Model needs to take actions or access data API calls, database queries, calculations

Structure the Prompt

Rules:

Report only confirmed issues, not style preferences
Include file path and line number for each issue
Classify severity as: critical, warning, info

Design Output Format

Specify exactly what the output should look like
Use JSON mode or tool use for structured output
Include examples of expected output in the prompt
Constrain the model: what it MUST include, what it MUST NOT include

Phase 2: Anthropic-Specific Best Practices

When using Claude models:

XML Tags for Structure

<document> {{document_content}} </document>

<instructions> Summarize the document above in 3 bullet points. Focus on actionable insights only. </instructions>

XML tags reduce ambiguity between instructions and content
Use them to separate input data from instructions
Use them to delineate sections of complex prompts

Prefilling for Format Control

Assistant: {"analysis": [

Start the assistant response to lock in format
Prevents preamble ("Sure, I'd be happy to...")
Forces specific output structure

Prompt Caching

Place stable content (system prompt, reference docs) first
Place variable content (user input) last
Use cache breakpoints for long static contexts
Measure cost savings: cached tokens are significantly cheaper

Extended Thinking

Enable for complex reasoning tasks
Budget thinking tokens appropriately
Don't enable for simple extraction/classification (waste of tokens)

Phase 3: RAG Design Patterns

When building retrieval-augmented generation:

Retrieval Quality First

Bad retrieval = bad generation, regardless of prompt quality
Test retrieval independently before testing generation
Measure retrieval recall: are relevant documents being found?

Context Window Management

<retrieved_documents> <document index="1" source="{{source_1}}" relevance_score="{{score_1}}"> {{content_1}} </document> <document index="2" source="{{source_2}}" relevance_score="{{score_2}}"> {{content_2}} </document> </retrieved_documents>

<instructions> Answer the user's question using ONLY the documents above. If the answer is not in the documents, say "I don't have enough information." Cite document numbers for each claim. </instructions>

Grounding and Attribution

Require citations to source documents
Instruct the model to say "I don't know" when information is missing
Test for hallucination: ask questions NOT in the context
Verify the model doesn't fabricate sources

Chunking Strategy

Chunk size affects retrieval quality
Too small: loses context
Too large: dilutes relevance
Test different chunk sizes and measure retrieval recall

Phase 4: Testing and Evaluation

BEFORE deploying ANY prompt:

Build an Evaluation Dataset

Minimum 20-50 examples for basic evaluation
Cover happy paths AND edge cases
Include adversarial inputs
Include ground truth (expected outputs)
Version your eval dataset alongside your prompts

Define Metrics

Task Type Metrics

Classification Accuracy, precision, recall, F1

Extraction Exact match, partial match, field-level accuracy

Generation LLM-as-judge, human eval, ROUGE/BLEU (limited)

RAG Faithfulness, relevance, citation accuracy

Run Evaluations Systematically

Every prompt change triggers evaluation

results = evaluate( prompt=prompt_v2, dataset=eval_dataset, metrics=[accuracy, faithfulness, latency], )

Compare against previous version

assert results.accuracy >= baseline.accuracy - REGRESSION_THRESHOLD assert results.faithfulness >= 0.95

Test for Failure Modes

Prompt injection attempts
Extremely long inputs
Empty or malformed inputs
Inputs in unexpected languages
Adversarial edge cases designed to break the prompt

LLM-as-Judge for Generation Quality

Use a separate LLM call to evaluate output quality
Define rubrics: what makes a good vs. bad output
Calibrate judge against human evaluations
Don't use the same model to judge itself when possible

Phase 5: Versioning and Operations

Every prompt in production follows these rules:

Version Control

prompts/ ├── code-review/ │ ├── v1.0.0.txt # Initial version │ ├── v1.1.0.txt # Added severity classification │ ├── v2.0.0.txt # Restructured for tool use │ ├── eval_dataset.jsonl # Test cases │ └── CHANGELOG.md # What changed and why

Semantic versioning: major.minor.patch
Major: behavior change. Minor: improvement. Patch: typo/formatting.
Every version has evaluation results recorded

A/B Testing

Route traffic between prompt versions
Measure real-world performance
Statistical significance before declaring winner
Don't declare "better" from 10 examples

Cost Optimization

Measure tokens per request (input and output)
Choose the right model for the task (don't use the largest model for simple classification)
Use prompt caching for repeated contexts
Batch requests where possible
Monitor cost per request in production

Security

Input sanitization before prompt injection
Output validation before returning to users
Rate limiting on LLM endpoints
Never expose system prompts to end users
Test for jailbreak and extraction attacks

Red Flags - STOP and Follow Process

If you catch yourself thinking:

"It works in the playground, ship it"
"Just tweak the wording a bit"
"We don't need an eval set for this"
"The prompt is simple enough"
"We'll add evaluation later"
"Same prompt works across models"
"Users won't try to break it"
"Cost doesn't matter, use the biggest model"
"Just add more examples to fix it"
"The model should figure it out"

ALL of these mean: STOP. Return to Phase 1.

Common Rationalizations

Excuse Reality

"Works in the playground" Playground tests 3-5 cases. Production sees thousands of edge cases.

"Simple prompt, no eval needed" Simple prompts fail on edge cases you haven't imagined. Evaluate.

"We'll add tests later" Later means after the first production incident. Test now.

"Same prompt works across models" Models have different behaviors. Re-evaluate on every model change.

"Just add more few-shot examples" More examples without evaluation is guess-and-check. Measure first.

"Users won't try to break it" Users will absolutely try to break it. Test adversarial inputs.

"Cost doesn't matter" Cost scales with traffic. A 2x token reduction saves thousands.

"Bigger model fixes everything" Bigger model with a bad prompt is still bad. Fix the prompt.

"LLM evaluation is unreliable" LLM-as-judge with good rubrics correlates well with human eval. Calibrate it.

"Prompt engineering isn't real engineering" Untested prompts are untested code. Same discipline applies.

Anti-Patterns

Anti-Pattern Consequence Correct Approach

Untested prompts in production Silent failures, inconsistent outputs, user complaints Evaluation dataset, automated testing

No evaluation metrics Can't measure improvement, can't detect regression Define metrics per task type, track over time

Prompt injection vulnerabilities Data leaks, unauthorized actions, system prompt exposure Input sanitization, output validation, adversarial testing

Vibes-based tuning Fixes one case, breaks three others Systematic evaluation, regression testing

No versioning Can't rollback, can't compare, can't reproduce Version control prompts like code

Model coupling Prompt breaks on model update or migration Test across model versions, abstract model-specific syntax

Quick Reference

Phase Key Activities Success Criteria

Design Define task, select pattern, structure prompt, design output Clear prompt with explicit constraints
Anthropic XML tags, prefilling, caching, extended thinking Model-specific optimizations applied
RAG Retrieval testing, context management, grounding Faithful, cited, hallucination-resistant
Evaluation Build eval set, define metrics, test failure modes Meets accuracy targets, handles edge cases
Operations Version, A/B test, optimize cost, secure Versioned, monitored, cost-efficient, secure

Verification Checklist

Before deploying any prompt to production:

Task precisely defined with valid/invalid output examples
Prompt pattern selected with justification
Output format specified and constrained
Evaluation dataset created (minimum 20-50 examples)
Metrics defined and measured
Evaluation results meet defined thresholds
Adversarial inputs tested (prompt injection, edge cases)
Prompt versioned in source control
Token usage and cost measured
Model-specific optimizations applied (caching, prefilling)
Security review completed (no prompt leakage, input sanitized)
Rollback plan in place (previous prompt version ready)

Can't check all boxes? You're not ready to deploy.

Integration with Other Skills

This skill requires using:

test-driven-development - REQUIRED for building evaluation datasets and writing automated prompt tests

Complementary skills:

documentation-generation - Document prompt design decisions, evaluation results, and versioning strategy
systematic-debugging - Use when prompt behavior is inconsistent or outputs are unexpected

Final Rule

No eval dataset → no production deployment No metrics → no "improvement" No version control → no prompt changes

Design. Test. Evaluate. Version. Deploy. Monitor. In that order. Always.

prompt-engineering

Safety Notice

Copy this and send it to your AI assistant to learn

Every prompt change triggers evaluation

Compare against previous version

Source Transparency

Related Skills

repository-analyzer

prompt-engineering

executing-plans