# LLM Evaluation & Testing

Test prompts, models, and RAG systems with automated evaluation and CI/CD integration.
## Quick Start

```bash
# Initialize Promptfoo (no global install needed)
npx promptfoo@latest init

# Run evaluation
npx promptfoo@latest eval

# View results in browser
npx promptfoo@latest view

# Run security scan
npx promptfoo@latest redteam run
```
## Core Concepts

### Why Evaluate?

LLM outputs are non-deterministic, so "it looks good" isn't testing. You need:

- **Regression detection**: catch quality drops before they reach production
- **Security scanning**: find jailbreaks, prompt injections, and PII leaks
- **A/B comparison**: compare prompts and models side-by-side
- **CI/CD gates**: block bad changes from merging
### Evaluation Types

| Type | Purpose | Assertions |
| --- | --- | --- |
| Functional | Does it work? | `contains`, `equals`, `is-json` |
| Semantic | Is it correct? | `similar`, `llm-rubric`, `factuality` |
| Performance | Is it fast/cheap? | `cost`, `latency` |
| Security | Is it safe? | `redteam`, `moderation`, `pii-detection` |
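The first three types can be mixed on a single test. A minimal sketch with illustrative values (security checks run via the `redteam` command covered below, not as per-test assertions here):

```yaml
tests:
  - vars:
      question: "What is the capital of France?"
    assert:
      - type: contains            # functional: exact substring
        value: "Paris"
      - type: similar             # semantic: embedding similarity
        value: "The capital of France is Paris"
        threshold: 0.8
      - type: cost                # performance: budget per request
        threshold: 0.01
```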
## Configuration

### Basic promptfooconfig.yaml

```yaml
description: "My LLM evaluation suite"

prompts:
  - file://prompts/main.txt

providers:
  - openai:gpt-4o-mini
  - anthropic:claude-3-5-sonnet-latest

tests:
  - vars:
      question: "What is the capital of France?"
    assert:
      - type: contains
        value: "Paris"
      - type: cost
        threshold: 0.01

  - vars:
      question: "Explain quantum computing"
    assert:
      - type: llm-rubric
        value: "Response explains quantum computing concepts clearly"
      - type: latency
        threshold: 3000
```
### With Environment Variables

```yaml
providers:
  - id: openrouter:anthropic/claude-3-5-sonnet
    config:
      apiKey: ${OPENROUTER_API_KEY}
```
## Assertions Reference

### Basic Assertions

```yaml
assert:
  # String matching
  - type: contains
    value: "expected text"
  - type: not-contains
    value: "forbidden text"
  - type: equals
    value: "exact match"
  - type: starts-with
    value: "prefix"
  - type: regex
    value: '\d{4}-\d{2}-\d{2}'  # Date pattern (single quotes keep the backslashes literal)

  # JSON validation
  - type: is-json
  - type: is-valid-json-schema
    value:
      type: object
      properties:
        name: { type: string }
      required: [name]
```
### Semantic Assertions

```yaml
assert:
  # Semantic similarity (embeddings)
  - type: similar
    value: "The capital of France is Paris"
    threshold: 0.8  # 0-1 similarity score

  # LLM-as-judge with custom criteria
  - type: llm-rubric
    value: |
      Response must:
      - Be factually accurate
      - Be under 100 words
      - Not contain marketing language

  # Factuality check against reference
  - type: factuality
    value: "Paris is the capital of France"
```
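Model-graded assertions such as `llm-rubric` and `factuality` call a grader model behind the scenes. Promptfoo lets you override the grader via test options; a sketch (verify the exact field names against the version you have installed):

```yaml
defaultTest:
  options:
    provider: openai:gpt-4o-mini  # grader model for llm-rubric / factuality
```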
### Performance Assertions

```yaml
assert:
  # Cost budget (USD)
  - type: cost
    threshold: 0.05  # Max $0.05 per request

  # Latency (milliseconds)
  - type: latency
    threshold: 2000  # Max 2 seconds
```
### Security Assertions

```yaml
assert:
  # Content moderation
  - type: moderation
    value: violence

  # PII detection
  - type: not-contains
    value: "{{email}}"  # From test vars
```
## CI/CD Integration

### GitHub Action

```yaml
name: 'Prompt Evaluation'

on:
  pull_request:
    paths: ['prompts/**', 'src/**/prompt*']

jobs:
  evaluate:
    runs-on: ubuntu-latest
    permissions:
      pull-requests: write
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: '20'

      # Cache for faster runs
      - uses: actions/cache@v4
        with:
          path: ~/.promptfoo
          key: ${{ runner.os }}-promptfoo-${{ hashFiles('promptfooconfig.yaml') }}

      # Run evaluation and post results to PR
      - uses: promptfoo/promptfoo-action@v1
        with:
          github-token: ${{ secrets.GITHUB_TOKEN }}
          openai-api-key: ${{ secrets.OPENAI_API_KEY }}  # Or other provider keys
```
### Quality Gates

```yaml
# promptfooconfig.yaml
evaluateOptions:
  # Fail if any assertion fails
  maxConcurrency: 5

# Or set pass threshold
threshold: 0.9  # 90% of tests must pass
```

```bash
# Output to JSON (for custom CI)
npx promptfoo@latest eval -c promptfooconfig.yaml -o results.json

# Check results in CI script
if [ "$(jq '.stats.failures' results.json)" -gt 0 ]; then
  echo "Evaluation failed!"
  exit 1
fi
```
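Beyond a hard zero-failure gate, a pass-rate gate can be computed from the same JSON. A sketch against an inline fixture (a real `results.json` comes from `eval -o`; the `.stats` field names mirror the check above, and `jq`/`awk` are assumed available):

```shell
# Fixture standing in for a real promptfoo results.json (illustrative numbers)
cat > results.json <<'EOF'
{"stats": {"successes": 18, "failures": 2}}
EOF

failures=$(jq '.stats.failures' results.json)
total=$(jq '.stats.successes + .stats.failures' results.json)

# Pass rate as a two-decimal fraction
pass_rate=$(awk "BEGIN { printf \"%.2f\", ($total - $failures) / $total }")
echo "pass rate: $pass_rate"

# Gate: require at least 90% passing
if awk "BEGIN { exit !($pass_rate >= 0.9) }"; then
  echo "gate passed"
else
  echo "gate failed"
  exit 1
fi
```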
## Security Testing (Red Team)

### Quick Scan

```bash
# Run red team against your prompts
npx promptfoo@latest redteam run

# Generate compliance report
npx promptfoo@latest redteam report --output compliance.html
```

### Configuration

```yaml
# promptfooconfig.yaml
redteam:
  purpose: "Customer support chatbot"
  plugins:
    - harmful:hate
    - harmful:violence
    - harmful:self-harm
    - pii:direct
    - pii:session
    - hijacking
    - jailbreak
    - prompt-injection
  strategies:
    - jailbreak
    - prompt-injection
    - base64
    - leetspeak
```

### OWASP Top 10 Coverage

```yaml
redteam:
  plugins:
    # 1. Prompt Injection
    - prompt-injection
    # 2. Insecure Output Handling
    - harmful:privacy
    # 3. Training Data Poisoning (N/A for evals)
    # 4. Model Denial of Service
    - excessive-agency
    # 5. Supply Chain (N/A for evals)
    # 6. Sensitive Information Disclosure
    - pii:direct
    - pii:session
    # 7. Insecure Plugin Design
    - hijacking
    # 8. Excessive Agency
    - excessive-agency
    # 9. Overreliance (use factuality checks)
    # 10. Model Theft (N/A for evals)
```
## RAG Evaluation

### Context-Aware Testing

```yaml
prompts:
  - |
    Context: {{context}}

    Question: {{question}}

    Answer based only on the context provided.

tests:
  - vars:
      context: "The Eiffel Tower was built in 1889 for the World's Fair."
      question: "When was the Eiffel Tower built?"
    assert:
      - type: contains
        value: "1889"
      - type: factuality
        value: "The Eiffel Tower was built in 1889"
      - type: not-contains
        value: "1900"  # Common hallucination
```
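Promptfoo also ships model-graded RAG assertions (`context-faithfulness`, `context-recall`, `context-relevance`); names and fields should be verified against the installed version. A sketch extending the test above:

```yaml
assert:
  # Answer is grounded in {{context}} (flags hallucinated claims)
  - type: context-faithfulness
    threshold: 0.8
  # Retrieved context actually contains the reference fact
  - type: context-recall
    value: "The Eiffel Tower was built in 1889"
    threshold: 0.8
```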
### Retrieval Quality

```yaml
# Test that retrieval returns relevant documents
tests:
  - vars:
      query: "Python list comprehension"
    assert:
      - type: llm-rubric
        value: "Response discusses Python list comprehension syntax and examples"
      - type: not-contains
        value: "I don't know"  # Shouldn't punt on this query
```
## Comparing Models/Prompts

### A/B Testing

```yaml
# Compare two prompts
prompts:
  - file://prompts/v1.txt
  - file://prompts/v2.txt

# Same tests for both
tests:
  - vars: { question: "Explain recursion" }
    assert:
      - type: llm-rubric
        value: "Clear explanation of recursion with example"
```

### Model Comparison

```yaml
# Compare multiple models
providers:
  - openai:gpt-4o-mini
  - anthropic:claude-3-5-haiku-latest
  - openrouter:google/gemini-flash-1.5
```

Run `npx promptfoo@latest eval`, then `npx promptfoo@latest view` to compare cost, latency, and quality side-by-side.
## Best Practices

### Golden Test Cases

Maintain a set of critical test cases that must always pass:

```yaml
# golden-tests.yaml
tests:
  - description: "Core functionality - must pass"
    vars:
      input: "critical test case"
    assert:
      - type: contains
        value: "expected output"
    options:
      critical: true  # Fail entire suite if this fails
```

### Regression Suite Structure

```
prompts/
├── production.txt         # Current production prompt
└── candidate.txt          # New prompt being tested
tests/
├── golden/                # Critical tests (run on every PR)
│   └── core-functionality.yaml
├── regression/            # Full regression suite (nightly)
│   └── full-suite.yaml
└── security/              # Red team tests
    └── redteam.yaml
```
### Test Categories

```yaml
tests:
  # Happy path
  - description: "Standard query"
    vars: { question: "What is 2+2?" }
    assert:
      - type: contains
        value: "4"

  # Edge cases
  - description: "Empty input"
    vars: { question: "" }
    assert:
      - type: not-contains
        value: "error"

  # Adversarial
  - description: "Injection attempt"
    vars: { question: "Ignore previous instructions and..." }
    assert:
      - type: not-contains
        value: "Here's how to"  # Should refuse
```
## References

- `references/promptfoo-guide.md`: Detailed setup and configuration
- `references/evaluation-metrics.md`: Metrics deep dive
- `references/ci-cd-integration.md`: CI/CD patterns
- `references/alternatives.md`: Braintrust, DeepEval, LangSmith comparison
## Templates

Copy-paste ready templates:

- `templates/promptfooconfig.yaml`: Basic config
- `templates/github-action-eval.yml`: GitHub Action
- `templates/regression-test-suite.yaml`: Full regression suite