Eval Harness Skill
A formal evaluation framework for Claude Code sessions, implementing eval-driven development (EDD) principles.
Philosophy
Eval-Driven Development treats evals as the "unit tests of AI development":
- Define expected behavior BEFORE implementation
- Run evals continuously during development
- Track regressions with each change
- Use pass@k metrics for reliability measurement
Eval Types
Capability Evals
Test if Claude can do something it couldn't before:
[CAPABILITY EVAL: feature-name]
Task: Description of what Claude should accomplish
Success Criteria:
- Criterion 1
- Criterion 2
- Criterion 3
Expected Output: Description of expected result
Regression Evals
Ensure changes don't break existing functionality:
[REGRESSION EVAL: feature-name]
Baseline: SHA or checkpoint name
Tests:
- existing-test-1: PASS/FAIL
- existing-test-2: PASS/FAIL
- existing-test-3: PASS/FAIL
Result: X/Y passed (previously Y/Y)
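The "X/Y passed (previously Y/Y)" line can be produced by comparing current results against the recorded baseline. The following is a minimal sketch, not the skill's actual implementation: the `name=STATUS` line format and the `regression_summary` helper are assumptions made for illustration.

```shell
# Sketch: compare current regression results against a baseline.
# The "name=STATUS" line format is an assumption made for this example.

regression_summary() {
  local baseline="$1" current="$2"
  local total=0 passed=0 regressed=""
  local name status now
  while IFS='=' read -r name status; do
    [ -z "$name" ] && continue
    total=$((total + 1))
    # Look up the same test in the current results.
    now=$(printf '%s\n' "$current" | awk -F= -v n="$name" '$1 == n { print $2 }')
    if [ "$now" = "PASS" ]; then
      passed=$((passed + 1))
    elif [ "$status" = "PASS" ]; then
      # Passed in the baseline but not now: a regression.
      regressed="$regressed $name"
    fi
  done <<EOF
$baseline
EOF
  echo "Result: $passed/$total passed"
  [ -n "$regressed" ] && echo "Regressed:$regressed"
  return 0
}

baseline="existing-test-1=PASS
existing-test-2=PASS
existing-test-3=PASS"
current="existing-test-1=PASS
existing-test-2=FAIL
existing-test-3=PASS"
regression_summary "$baseline" "$current"
```

With the sample data, this prints `Result: 2/3 passed` followed by `Regressed: existing-test-2`.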
Grader Types
- Code-Based Grader
Deterministic checks using code:
# Check if file contains expected pattern
grep -q "export function handleAuth" src/auth.ts && echo "PASS" || echo "FAIL"
# Check if tests pass
npm test -- --testPathPattern="auth" && echo "PASS" || echo "FAIL"
# Check if build succeeds
npm run build && echo "PASS" || echo "FAIL"
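When several checks share the same PASS/FAIL pattern, a small wrapper can keep the output uniform and preserve the exit status for later aggregation. This is a sketch under assumptions: `grade` is a hypothetical helper name, and the demo uses `true` and `false` as stand-ins so it runs outside any project.

```shell
# Sketch: a reusable code-based grader. `grade` is a hypothetical helper name.

grade() {
  local label="$1"
  shift
  # Run the check quietly; only the verdict is printed.
  if "$@" > /dev/null 2>&1; then
    echo "$label: PASS"
  else
    echo "$label: FAIL"
    return 1
  fi
}

# Stand-in checks so the sketch runs anywhere. In a real project these would be
# commands such as `grep -q "export function handleAuth" src/auth.ts`
# or `npm run build`.
grade "always-passes" true
grade "always-fails" false || true
```

Using an explicit `if` also avoids a corner case of `cmd && echo "PASS" || echo "FAIL"`, where the `||` branch runs if the `echo "PASS"` itself fails.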
- Model-Based Grader
Use Claude to evaluate open-ended outputs:
[MODEL GRADER PROMPT]
Evaluate the following code change:
- Does it solve the stated problem?
- Is it well-structured?
- Are edge cases handled?
- Is error handling appropriate?
Score: 1-5 (1=poor, 5=excellent)
Reasoning: [explanation]
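A model grader's reply still has to be turned into a verdict before it can feed a report. The sketch below parses the `Score:` line from a saved reply. The pass threshold of 4, the `score_verdict` helper, and the `UNPARSEABLE` outcome are assumptions for illustration; the skill itself does not specify them.

```shell
# Sketch: turn a model grader reply into PASS/FAIL.
# The threshold of 4 is an assumption chosen for this example.

score_verdict() {
  local reply="$1" threshold="${2:-4}"
  local score
  # Take the first "Score: N" line, where N is a single digit from 1 to 5.
  score=$(printf '%s\n' "$reply" | sed -n 's/^Score:[[:space:]]*\([1-5]\).*$/\1/p' | head -n 1)
  if [ -z "$score" ]; then
    echo "UNPARSEABLE"   # no score found: route to a human instead of guessing
  elif [ "$score" -ge "$threshold" ]; then
    echo "PASS"
  else
    echo "FAIL"
  fi
}

reply="Score: 4
Reasoning: Solves the problem; one edge case is untested."
score_verdict "$reply"
```

For the sample reply this prints `PASS`. Treating a missing score as its own outcome, rather than as a failure, keeps malformed grader output visible.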
- Human Grader
Flag for manual review:
[HUMAN REVIEW REQUIRED]
Change: Description of what changed
Reason: Why human review is needed
Risk Level: LOW/MEDIUM/HIGH
Metrics
pass@k
"At least one success in k attempts"
- pass@1: First attempt success rate
- pass@3: Success within 3 attempts
- Typical target: pass@3 > 90%
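For a single eval, pass@k can be decided directly from its recorded attempts. A minimal sketch, assuming attempt outcomes are passed in order as `PASS` or `FAIL` arguments (`pass_at_k` is a hypothetical helper):

```shell
# Sketch: pass@k for one eval, given its attempt outcomes in order.

pass_at_k() {
  local k="$1"
  shift
  local i=0 outcome
  for outcome in "$@"; do
    i=$((i + 1))
    [ "$i" -gt "$k" ] && break
    # One success within the first k attempts is enough.
    [ "$outcome" = "PASS" ] && { echo "PASS"; return 0; }
  done
  echo "FAIL"
  return 1
}

pass_at_k 1 FAIL PASS PASS || true   # only the first attempt counts
pass_at_k 3 FAIL PASS PASS           # any of the first three counts
```

The first call prints `FAIL` and the second prints `PASS`: the same attempt history fails pass@1 but satisfies pass@3.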
pass^k
"All k trials succeed"
- Higher bar for reliability
- pass^3: 3 consecutive successes
- Use for critical paths
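The difference from pass@k is easiest to see side by side. The sketch below checks pass^k for one eval and then compares the two metrics under an idealized assumption of independent attempts with a fixed per-attempt success rate `p`, in which case pass@k = 1 - (1 - p)^k and pass^k = p^k. The `pass_hat_k` helper and the value `p=0.7` are illustrative only.

```shell
# Sketch: pass^k for one eval, plus the expected gap between pass@k and pass^k.

pass_hat_k() {
  local k="$1"
  shift
  # Fewer than k recorded attempts cannot demonstrate k successes.
  [ "$#" -lt "$k" ] && { echo "FAIL"; return 1; }
  local i=0 outcome
  for outcome in "$@"; do
    i=$((i + 1))
    [ "$i" -gt "$k" ] && break
    # A single failure within the first k attempts fails the whole trial.
    [ "$outcome" != "PASS" ] && { echo "FAIL"; return 1; }
  done
  echo "PASS"
}

pass_hat_k 3 PASS PASS PASS
pass_hat_k 3 PASS FAIL PASS || true

# Assuming independent attempts with per-attempt success rate p (an
# idealization), pass@k = 1 - (1 - p)^k and pass^k = p^k.
awk -v p=0.7 -v k=3 'BEGIN { printf "pass@%d=%.3f pass^%d=%.3f\n", k, 1 - (1 - p)^k, k, p^k }'
```

With `p=0.7` the last line prints `pass@3=0.973 pass^3=0.343`, which shows why pass^k is the stricter bar for critical paths.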
Eval Workflow
- Define (Before Coding)
EVAL DEFINITION: feature-xyz
Capability Evals
- Can create new user account
- Can validate email format
- Can hash password securely
Regression Evals
- Existing login still works
- Session management unchanged
- Logout flow intact
Success Metrics
- pass@3 > 90% for capability evals
- pass^3 = 100% for regression evals
- Implement
Write code to pass the defined evals.
- Evaluate
# Run capability evals
[Run each capability eval, record PASS/FAIL]
# Run regression evals
npm test -- --testPathPattern="existing"
# Generate report
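Recording PASS/FAIL per eval can be automated with a small retry loop that notes the attempt on which a check first succeeded. This is a sketch under assumptions: `run_eval` and `flaky_check` are hypothetical names, and in a real session each attempt would rerun the agent before the check rather than simply repeat the command.

```shell
# Sketch: run one eval check up to max_attempts times and record the outcome.
# `run_eval` is a hypothetical helper; the checks below are stand-ins.

run_eval() {
  local name="$1" max_attempts="$2"
  shift 2
  local attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    if "$@" > /dev/null 2>&1; then
      # Report the attempt on which the check first succeeded.
      echo "$name: PASS (pass@$attempt)"
      return 0
    fi
    attempt=$((attempt + 1))
  done
  echo "$name: FAIL"
  return 1
}

# Stand-in check that fails once, then succeeds, to mimic a flaky eval.
marker=$(mktemp)
flaky_check() {
  if [ -s "$marker" ]; then return 0; fi
  echo "seen" > "$marker"
  return 1
}

run_eval "create-user" 3 true
run_eval "validate-email" 3 flaky_check
run_eval "hash-password" 3 false || true
rm -f "$marker"
```

The three calls print `create-user: PASS (pass@1)`, `validate-email: PASS (pass@2)`, and `hash-password: FAIL`, matching the per-eval line format used in the report.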
- Report
EVAL REPORT: feature-xyz
Capability Evals:
  create-user: PASS (pass@1)
  validate-email: PASS (pass@2)
  hash-password: PASS (pass@1)
  Overall: 3/3 passed
Regression Evals:
  login-flow: PASS
  session-mgmt: PASS
  logout-flow: PASS
  Overall: 3/3 passed
Metrics:
  pass@1: 67% (2/3)
  pass@3: 100% (3/3)
Status: READY FOR REVIEW
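The metrics in this report follow from the per-eval results: two of three evals passed on the first attempt, and all three passed within three attempts. A minimal sketch of that arithmetic, assuming a hypothetical `name:first_passing_attempt` record format (with `0` meaning the eval never passed):

```shell
# Sketch: derive the report's pass@1 and pass@3 rates from per-eval records.
# Input format "name:first_passing_attempt" (0 = never passed) is an assumption.

summarize() {
  local total="$#" k record attempt hits
  [ "$total" -eq 0 ] && { echo "no evals recorded"; return 1; }
  for k in 1 3; do
    hits=0
    for record in "$@"; do
      attempt="${record##*:}"
      # An eval counts toward pass@k if it first passed on attempt 1..k.
      if [ "$attempt" -ge 1 ] && [ "$attempt" -le "$k" ]; then
        hits=$((hits + 1))
      fi
    done
    # Integer percentage, rounded to the nearest whole number.
    echo "pass@$k: $(( (100 * hits + total / 2) / total ))% ($hits/$total)"
  done
}

summarize "create-user:1" "validate-email:2" "hash-password:1"
```

For the sample records this prints `pass@1: 67% (2/3)` and `pass@3: 100% (3/3)`.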
Integration Patterns
Pre-Implementation
/eval define feature-name
Creates eval definition file at .claude/evals/feature-name.md
During Implementation
/eval check feature-name
Runs current evals and reports status
Post-Implementation
/eval report feature-name
Generates full eval report
Eval Storage
Store evals in project:
.claude/
  evals/
    feature-xyz.md      # Eval definition
    feature-xyz.log     # Eval run history
    baseline.json       # Regression baselines
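The layout can be scaffolded with a few shell commands. This is only a sketch: `init_eval_storage` is a hypothetical helper, not part of the `/eval` command, and the initial file contents (a title line and an empty JSON object) are placeholders, since the skill does not define a schema for them.

```shell
# Sketch: create the eval storage layout for one feature.
# `init_eval_storage` is a hypothetical helper, not part of the /eval command.

init_eval_storage() {
  local root="$1" feature="$2"
  local dir="$root/.claude/evals"
  mkdir -p "$dir"
  # Create each file only if it is missing, so existing history is preserved.
  [ -e "$dir/$feature.md" ]   || printf 'EVAL DEFINITION: %s\n' "$feature" > "$dir/$feature.md"
  [ -e "$dir/$feature.log" ]  || : > "$dir/$feature.log"
  [ -e "$dir/baseline.json" ] || printf '{}\n' > "$dir/baseline.json"   # placeholder content
  echo "$dir"
}

# Demonstrated in a temporary directory to avoid touching a real project.
demo_root=$(mktemp -d)
init_eval_storage "$demo_root" "feature-xyz"
ls "$demo_root/.claude/evals"
rm -rf "$demo_root"
```

Keeping these files inside the repository is what allows evals to be versioned alongside the code they cover.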
Best Practices
- Define evals BEFORE coding - Forces clear thinking about success criteria
- Run evals frequently - Catch regressions early
- Track pass@k over time - Monitor reliability trends
- Use code graders when possible - Deterministic > probabilistic
- Human review for security - Never fully automate security checks
- Keep evals fast - Slow evals don't get run
- Version evals with code - Evals are first-class artifacts
Example: Adding Authentication
EVAL: add-authentication
Phase 1: Define (10 min)
Capability Evals:
- User can register with email/password
- User can login with valid credentials
- Invalid credentials rejected with proper error
- Sessions persist across page reloads
- Logout clears session
Regression Evals:
- Public routes still accessible
- API responses unchanged
- Database schema compatible
Phase 2: Implement (varies)
[Write code]
Phase 3: Evaluate
Run: /eval check add-authentication
Phase 4: Report
EVAL REPORT: add-authentication
Capability: 5/5 passed (pass@3: 100%)
Regression: 3/3 passed (pass^3: 100%)
Status: SHIP IT