# LLM Evaluation & Testing

Test prompts, models, and RAG systems with automated evaluation and CI/CD integration.
## Quick Start

```bash
# Initialize Promptfoo (no global install needed)
npx promptfoo@latest init

# Run evaluation
npx promptfoo@latest eval

# View results in browser
npx promptfoo@latest view

# Run security scan
npx promptfoo@latest redteam run
```
## Core Concepts

### Why Evaluate?

LLM outputs are non-deterministic, so "it looks good" isn't testing. You need:

- **Regression detection**: catch quality drops before they reach production
- **Security scanning**: find jailbreaks, prompt injections, and PII leaks
- **A/B comparison**: compare prompts and models side-by-side
- **CI/CD gates**: block bad changes from merging
### Evaluation Types

| Type | Purpose | Assertions |
| --- | --- | --- |
| Functional | Does it work? | `contains`, `equals`, `is-json` |
| Semantic | Is it correct? | `similar`, `llm-rubric`, `factuality` |
| Performance | Is it fast/cheap? | `cost`, `latency` |
| Security | Is it safe? | `redteam`, `moderation`, `pii-detection` |
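The first three types can be mixed on a single test. A minimal sketch with illustrative values (security checks run via the `redteam` command covered below, not as per-test assertions here):

```yaml
tests:
  - vars:
      question: "What is the capital of France?"
    assert:
      - type: contains            # functional: exact substring
        value: "Paris"
      - type: similar             # semantic: embedding similarity
        value: "The capital of France is Paris"
        threshold: 0.8
      - type: cost                # performance: budget per request
        threshold: 0.01
```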
## Configuration

### Basic promptfooconfig.yaml

```yaml
description: "My LLM evaluation suite"

prompts:
  - file://prompts/main.txt

providers:
  - openai:gpt-4o-mini
  - anthropic:claude-3-5-sonnet-latest

tests:
  - vars:
      question: "What is the capital of France?"
    assert:
      - type: contains
        value: "Paris"
      - type: cost
        threshold: 0.01

  - vars:
      question: "Explain quantum computing"
    assert:
      - type: llm-rubric
        value: "Response explains quantum computing concepts clearly"
      - type: latency
        threshold: 3000
```
### With Environment Variables

```yaml
providers:
  - id: openrouter:anthropic/claude-3-5-sonnet
    config:
      apiKey: ${OPENROUTER_API_KEY}
```
## Assertions Reference

### Basic Assertions

```yaml
assert:
  # String matching
  - type: contains
    value: "expected text"
  - type: not-contains
    value: "forbidden text"
  - type: equals
    value: "exact match"
  - type: starts-with
    value: "prefix"
  - type: regex
    value: '\d{4}-\d{2}-\d{2}'  # Date pattern (single quotes keep the backslashes literal)

  # JSON validation
  - type: is-json
  - type: is-valid-json-schema
    value:
      type: object
      properties:
        name: { type: string }
      required: [name]
```
### Semantic Assertions

```yaml
assert:
  # Semantic similarity (embeddings)
  - type: similar
    value: "The capital of France is Paris"
    threshold: 0.8  # 0-1 similarity score

  # LLM-as-judge with custom criteria
  - type: llm-rubric
    value: |
      Response must:
      - Be factually accurate
      - Be under 100 words
      - Not contain marketing language

  # Factuality check against reference
  - type: factuality
    value: "Paris is the capital of France"
```
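Model-graded assertions such as `llm-rubric` and `factuality` call a grader model behind the scenes. Promptfoo lets you override the grader via test options; a sketch (verify the exact field names against the version you have installed):

```yaml
defaultTest:
  options:
    provider: openai:gpt-4o-mini  # grader model for llm-rubric / factuality
```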
### Performance Assertions

```yaml
assert:
  # Cost budget (USD)
  - type: cost
    threshold: 0.05  # Max $0.05 per request

  # Latency (milliseconds)
  - type: latency
    threshold: 2000  # Max 2 seconds
```
### Security Assertions

```yaml
assert:
  # Content moderation
  - type: moderation
    value: violence

  # PII detection
  - type: not-contains
    value: "{{email}}"  # From test vars
```
## CI/CD Integration

### GitHub Action

```yaml
name: 'Prompt Evaluation'

on:
  pull_request:
    paths: ['prompts/**', 'src/**/prompt*']

jobs:
  evaluate:
    runs-on: ubuntu-latest
    permissions:
      pull-requests: write
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: '20'

      # Cache for faster runs
      - uses: actions/cache@v4
        with:
          path: ~/.promptfoo
          key: ${{ runner.os }}-promptfoo-${{ hashFiles('promptfooconfig.yaml') }}

      # Run evaluation and post results to PR
      - uses: promptfoo/promptfoo-action@v1
        with:
          github-token: ${{ secrets.GITHUB_TOKEN }}
          openai-api-key: ${{ secrets.OPENAI_API_KEY }}  # Or other provider keys
```
### Quality Gates

```yaml
# promptfooconfig.yaml
evaluateOptions:
  # Fail if any assertion fails
  maxConcurrency: 5

# Or set pass threshold
threshold: 0.9  # 90% of tests must pass
```

```bash
# Output to JSON (for custom CI)
npx promptfoo@latest eval -c promptfooconfig.yaml -o results.json

# Check results in CI script
if [ "$(jq '.stats.failures' results.json)" -gt 0 ]; then
  echo "Evaluation failed!"
  exit 1
fi
```
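Beyond a hard zero-failure gate, a pass-rate gate can be computed from the same JSON. A sketch against an inline fixture (a real `results.json` comes from `eval -o`; the `.stats` field names mirror the check above, and `jq`/`awk` are assumed available):

```shell
# Fixture standing in for a real promptfoo results.json (illustrative numbers)
cat > results.json <<'EOF'
{"stats": {"successes": 18, "failures": 2}}
EOF

failures=$(jq '.stats.failures' results.json)
total=$(jq '.stats.successes + .stats.failures' results.json)

# Pass rate as a two-decimal fraction
pass_rate=$(awk "BEGIN { printf \"%.2f\", ($total - $failures) / $total }")
echo "pass rate: $pass_rate"

# Gate: require at least 90% passing
if awk "BEGIN { exit !($pass_rate >= 0.9) }"; then
  echo "gate passed"
else
  echo "gate failed"
  exit 1
fi
```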
## Security Testing (Red Team)

### Quick Scan

```bash
# Run red team against your prompts
npx promptfoo@latest redteam run

# Generate compliance report
npx promptfoo@latest redteam report --output compliance.html
```

### Configuration

```yaml
# promptfooconfig.yaml
redteam:
  purpose: "Customer support chatbot"
  plugins:
    - harmful:hate
    - harmful:violence
    - harmful:self-harm
    - pii:direct
    - pii:session
    - hijacking
    - jailbreak
    - prompt-injection
  strategies:
    - jailbreak
    - prompt-injection
    - base64
    - leetspeak
```

### OWASP Top 10 Coverage

```yaml
redteam:
  plugins:
    # 1. Prompt Injection
    - prompt-injection
    # 2. Insecure Output Handling
    - harmful:privacy
    # 3. Training Data Poisoning (N/A for evals)
    # 4. Model Denial of Service
    - excessive-agency
    # 5. Supply Chain (N/A for evals)
    # 6. Sensitive Information Disclosure
    - pii:direct
    - pii:session
    # 7. Insecure Plugin Design
    - hijacking
    # 8. Excessive Agency
    - excessive-agency
    # 9. Overreliance (use factuality checks)
    # 10. Model Theft (N/A for evals)
```
## RAG Evaluation

### Context-Aware Testing

```yaml
prompts:
  - |
    Context: {{context}}

    Question: {{question}}

    Answer based only on the context provided.

tests:
  - vars:
      context: "The Eiffel Tower was built in 1889 for the World's Fair."
      question: "When was the Eiffel Tower built?"
    assert:
      - type: contains
        value: "1889"
      - type: factuality
        value: "The Eiffel Tower was built in 1889"
      - type: not-contains
        value: "1900"  # Common hallucination
```
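Promptfoo also ships model-graded RAG assertions (`context-faithfulness`, `context-recall`, `context-relevance`); names and fields should be verified against the installed version. A sketch extending the test above:

```yaml
assert:
  # Answer is grounded in {{context}} (flags hallucinated claims)
  - type: context-faithfulness
    threshold: 0.8
  # Retrieved context actually contains the reference fact
  - type: context-recall
    value: "The Eiffel Tower was built in 1889"
    threshold: 0.8
```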
### Retrieval Quality

```yaml
# Test that retrieval returns relevant documents
tests:
  - vars:
      query: "Python list comprehension"
    assert:
      - type: llm-rubric
        value: "Response discusses Python list comprehension syntax and examples"
      - type: not-contains
        value: "I don't know"  # Shouldn't punt on this query
```
## Comparing Models/Prompts

### A/B Testing

```yaml
# Compare two prompts
prompts:
  - file://prompts/v1.txt
  - file://prompts/v2.txt

# Same tests for both
tests:
  - vars: { question: "Explain recursion" }
    assert:
      - type: llm-rubric
        value: "Clear explanation of recursion with example"
```

### Model Comparison

```yaml
# Compare multiple models
providers:
  - openai:gpt-4o-mini
  - anthropic:claude-3-5-haiku-latest
  - openrouter:google/gemini-flash-1.5
```

Run `npx promptfoo@latest eval`, then `npx promptfoo@latest view` to compare cost, latency, and quality side-by-side.
## Best Practices

### Golden Test Cases

Maintain a set of critical test cases that must always pass:

```yaml
# golden-tests.yaml
tests:
  - description: "Core functionality - must pass"
    vars:
      input: "critical test case"
    assert:
      - type: contains
        value: "expected output"
    options:
      critical: true  # Fail entire suite if this fails
```

### Regression Suite Structure

```
prompts/
├── production.txt         # Current production prompt
└── candidate.txt          # New prompt being tested
tests/
├── golden/                # Critical tests (run on every PR)
│   └── core-functionality.yaml
├── regression/            # Full regression suite (nightly)
│   └── full-suite.yaml
└── security/              # Red team tests
    └── redteam.yaml
```
### Test Categories

```yaml
tests:
  # Happy path
  - description: "Standard query"
    vars: { question: "What is 2+2?" }
    assert:
      - type: contains
        value: "4"

  # Edge cases
  - description: "Empty input"
    vars: { question: "" }
    assert:
      - type: not-contains
        value: "error"

  # Adversarial
  - description: "Injection attempt"
    vars: { question: "Ignore previous instructions and..." }
    assert:
      - type: not-contains
        value: "Here's how to"  # Should refuse
```
## References

- `references/promptfoo-guide.md`: Detailed setup and configuration
- `references/evaluation-metrics.md`: Metrics deep dive
- `references/ci-cd-integration.md`: CI/CD patterns
- `references/alternatives.md`: Braintrust, DeepEval, LangSmith comparison
## Templates

Copy-paste ready templates:

- `templates/promptfooconfig.yaml`: Basic config
- `templates/github-action-eval.yml`: GitHub Action
- `templates/regression-test-suite.yaml`: Full regression suite