Prompt Engineering
Overview
Untested prompts in production are bugs you haven't found yet. Vibes-based prompt tuning is not engineering.
Core principle: EVERY prompt is versioned, tested, and evaluated against ground truth before deployment.
Violating the letter of this process is violating the spirit of LLM engineering.
The Iron Law
EVERY PROMPT IS VERSIONED, TESTED, AND EVALUATED AGAINST GROUND TRUTH
If you haven't evaluated it on a test set, it's not ready for production. "It looked good in the playground" is not evaluation.
When to Use
Use for ANY LLM integration work:
-
Designing new prompts for applications
-
Modifying existing prompts
-
Building RAG pipelines
-
Implementing tool use / function calling
-
Optimizing token cost or latency
-
Migrating between models
-
Evaluating model outputs
Use this ESPECIALLY when:
-
Prompt "works most of the time"
-
You're tuning prompts by hand in a playground
-
Someone says "just tweak the prompt a bit"
-
Deploying prompt changes without evaluation
-
Switching models and assuming prompts transfer
Don't skip when:
-
The prompt is "simple" (simple prompts fail on edge cases)
-
You're "just fixing a typo" (typos change model behavior)
-
It's an internal tool (internal users deserve quality too)
The Five Phases
You MUST complete each phase before proceeding to the next.
Phase 1: Prompt Design
BEFORE writing ANY prompt:
Define the Task Precisely
-
What exactly should the model do?
-
What are valid outputs?
-
What are invalid outputs?
-
What edge cases exist?
-
Write these down. They become your evaluation criteria.
Select the Right Pattern
Pattern When to Use Example
Zero-shot Simple, well-defined tasks Classification, extraction
Few-shot Task needs examples to clarify format/behavior Structured data extraction, style matching
Chain-of-thought Reasoning, analysis, multi-step logic Math, code review, complex classification
System/User/Assistant roles Conversational applications Chatbots, assistants
Tool use Model needs to take actions or access data API calls, database queries, calculations
Structure the Prompt
Rules:
-
Report only confirmed issues, not style preferences
-
Include file path and line number for each issue
-
Classify severity as: critical, warning, info
Design Output Format
-
Specify exactly what the output should look like
-
Use JSON mode or tool use for structured output
-
Include examples of expected output in the prompt
-
Constrain the model: what it MUST include, what it MUST NOT include
Phase 2: Anthropic-Specific Best Practices
When using Claude models:
XML Tags for Structure
<document> {{document_content}} </document>
<instructions> Summarize the document above in 3 bullet points. Focus on actionable insights only. </instructions>
-
XML tags reduce ambiguity between instructions and content
-
Use them to separate input data from instructions
-
Use them to delineate sections of complex prompts
Prefilling for Format Control
Assistant: {"analysis": [
-
Start the assistant response to lock in format
-
Prevents preamble ("Sure, I'd be happy to...")
-
Forces specific output structure
Prompt Caching
-
Place stable content (system prompt, reference docs) first
-
Place variable content (user input) last
-
Use cache breakpoints for long static contexts
-
Measure cost savings: cached tokens are significantly cheaper
Extended Thinking
-
Enable for complex reasoning tasks
-
Budget thinking tokens appropriately
-
Don't enable for simple extraction/classification (waste of tokens)
Phase 3: RAG Design Patterns
When building retrieval-augmented generation:
Retrieval Quality First
-
Bad retrieval = bad generation, regardless of prompt quality
-
Test retrieval independently before testing generation
-
Measure retrieval recall: are relevant documents being found?
Context Window Management
<retrieved_documents> <document index="1" source="{{source_1}}" relevance_score="{{score_1}}"> {{content_1}} </document> <document index="2" source="{{source_2}}" relevance_score="{{score_2}}"> {{content_2}} </document> </retrieved_documents>
<instructions> Answer the user's question using ONLY the documents above. If the answer is not in the documents, say "I don't have enough information." Cite document numbers for each claim. </instructions>
Grounding and Attribution
-
Require citations to source documents
-
Instruct the model to say "I don't know" when information is missing
-
Test for hallucination: ask questions NOT in the context
-
Verify the model doesn't fabricate sources
Chunking Strategy
-
Chunk size affects retrieval quality
-
Too small: loses context
-
Too large: dilutes relevance
-
Test different chunk sizes and measure retrieval recall
Phase 4: Testing and Evaluation
BEFORE deploying ANY prompt:
Build an Evaluation Dataset
-
Minimum 20-50 examples for basic evaluation
-
Cover happy paths AND edge cases
-
Include adversarial inputs
-
Include ground truth (expected outputs)
-
Version your eval dataset alongside your prompts
Define Metrics
Task Type Metrics
Classification Accuracy, precision, recall, F1
Extraction Exact match, partial match, field-level accuracy
Generation LLM-as-judge, human eval, ROUGE/BLEU (limited)
RAG Faithfulness, relevance, citation accuracy
Run Evaluations Systematically
Every prompt change triggers evaluation
results = evaluate( prompt=prompt_v2, dataset=eval_dataset, metrics=[accuracy, faithfulness, latency], )
Compare against previous version
assert results.accuracy >= baseline.accuracy - REGRESSION_THRESHOLD assert results.faithfulness >= 0.95
Test for Failure Modes
-
Prompt injection attempts
-
Extremely long inputs
-
Empty or malformed inputs
-
Inputs in unexpected languages
-
Adversarial edge cases designed to break the prompt
LLM-as-Judge for Generation Quality
-
Use a separate LLM call to evaluate output quality
-
Define rubrics: what makes a good vs. bad output
-
Calibrate judge against human evaluations
-
Don't use the same model to judge itself when possible
Phase 5: Versioning and Operations
Every prompt in production follows these rules:
Version Control
prompts/ ├── code-review/ │ ├── v1.0.0.txt # Initial version │ ├── v1.1.0.txt # Added severity classification │ ├── v2.0.0.txt # Restructured for tool use │ ├── eval_dataset.jsonl # Test cases │ └── CHANGELOG.md # What changed and why
-
Semantic versioning: major.minor.patch
-
Major: behavior change. Minor: improvement. Patch: typo/formatting.
-
Every version has evaluation results recorded
A/B Testing
-
Route traffic between prompt versions
-
Measure real-world performance
-
Statistical significance before declaring winner
-
Don't declare "better" from 10 examples
Cost Optimization
-
Measure tokens per request (input and output)
-
Choose the right model for the task (don't use the largest model for simple classification)
-
Use prompt caching for repeated contexts
-
Batch requests where possible
-
Monitor cost per request in production
Security
-
Input sanitization before prompt injection
-
Output validation before returning to users
-
Rate limiting on LLM endpoints
-
Never expose system prompts to end users
-
Test for jailbreak and extraction attacks
Red Flags - STOP and Follow Process
If you catch yourself thinking:
-
"It works in the playground, ship it"
-
"Just tweak the wording a bit"
-
"We don't need an eval set for this"
-
"The prompt is simple enough"
-
"We'll add evaluation later"
-
"Same prompt works across models"
-
"Users won't try to break it"
-
"Cost doesn't matter, use the biggest model"
-
"Just add more examples to fix it"
-
"The model should figure it out"
ALL of these mean: STOP. Return to Phase 1.
Common Rationalizations
Excuse Reality
"Works in the playground" Playground tests 3-5 cases. Production sees thousands of edge cases.
"Simple prompt, no eval needed" Simple prompts fail on edge cases you haven't imagined. Evaluate.
"We'll add tests later" Later means after the first production incident. Test now.
"Same prompt works across models" Models have different behaviors. Re-evaluate on every model change.
"Just add more few-shot examples" More examples without evaluation is guess-and-check. Measure first.
"Users won't try to break it" Users will absolutely try to break it. Test adversarial inputs.
"Cost doesn't matter" Cost scales with traffic. A 2x token reduction saves thousands.
"Bigger model fixes everything" Bigger model with a bad prompt is still bad. Fix the prompt.
"LLM evaluation is unreliable" LLM-as-judge with good rubrics correlates well with human eval. Calibrate it.
"Prompt engineering isn't real engineering" Untested prompts are untested code. Same discipline applies.
Anti-Patterns
Anti-Pattern Consequence Correct Approach
Untested prompts in production Silent failures, inconsistent outputs, user complaints Evaluation dataset, automated testing
No evaluation metrics Can't measure improvement, can't detect regression Define metrics per task type, track over time
Prompt injection vulnerabilities Data leaks, unauthorized actions, system prompt exposure Input sanitization, output validation, adversarial testing
Vibes-based tuning Fixes one case, breaks three others Systematic evaluation, regression testing
No versioning Can't rollback, can't compare, can't reproduce Version control prompts like code
Model coupling Prompt breaks on model update or migration Test across model versions, abstract model-specific syntax
Quick Reference
Phase Key Activities Success Criteria
-
Design Define task, select pattern, structure prompt, design output Clear prompt with explicit constraints
-
Anthropic XML tags, prefilling, caching, extended thinking Model-specific optimizations applied
-
RAG Retrieval testing, context management, grounding Faithful, cited, hallucination-resistant
-
Evaluation Build eval set, define metrics, test failure modes Meets accuracy targets, handles edge cases
-
Operations Version, A/B test, optimize cost, secure Versioned, monitored, cost-efficient, secure
Verification Checklist
Before deploying any prompt to production:
-
Task precisely defined with valid/invalid output examples
-
Prompt pattern selected with justification
-
Output format specified and constrained
-
Evaluation dataset created (minimum 20-50 examples)
-
Metrics defined and measured
-
Evaluation results meet defined thresholds
-
Adversarial inputs tested (prompt injection, edge cases)
-
Prompt versioned in source control
-
Token usage and cost measured
-
Model-specific optimizations applied (caching, prefilling)
-
Security review completed (no prompt leakage, input sanitized)
-
Rollback plan in place (previous prompt version ready)
Can't check all boxes? You're not ready to deploy.
Integration with Other Skills
This skill requires using:
- test-driven-development - REQUIRED for building evaluation datasets and writing automated prompt tests
Complementary skills:
-
documentation-generation - Document prompt design decisions, evaluation results, and versioning strategy
-
systematic-debugging - Use when prompt behavior is inconsistent or outputs are unexpected
Final Rule
No eval dataset → no production deployment No metrics → no "improvement" No version control → no prompt changes
Design. Test. Evaluate. Version. Deploy. Monitor. In that order. Always.