# Evaluate Presets

## Overview
Systematically test all hat collection presets using shell scripts. Direct CLI invocation—no meta-orchestration complexity.
## When to Use

- Testing preset configurations after changes
- Auditing the preset library for quality
- Validating new presets work correctly
- After modifying hat routing logic
## Quick Start

Evaluate a single preset:

```bash
./tools/evaluate-preset.sh tdd-red-green claude
```

Evaluate all presets:

```bash
./tools/evaluate-all-presets.sh claude
```

Arguments:

- First arg: preset name (without the `.yml` extension)
- Second arg: backend (`claude` or `kiro`, defaults to `claude`)
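For example, to target the `kiro` backend explicitly, or to rely on the default backend (preset names come from the table below):

```bash
# Evaluate a preset against the kiro backend
./tools/evaluate-preset.sh adversarial-review kiro

# Omit the backend argument to use the default (claude)
./tools/evaluate-preset.sh spec-driven
```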
## Bash Tool Configuration

IMPORTANT: When invoking these scripts via the Bash tool, use these settings:

- Single preset evaluation: Use `timeout: 600000` (10 minutes max) and `run_in_background: true`
- All presets evaluation: Use `timeout: 600000` (10 minutes max) and `run_in_background: true`

Since preset evaluations can run for hours (especially the full suite), always run in background mode and use the TaskOutput tool to check progress periodically.

Example invocation pattern:

```
Bash tool with:
  command: "./tools/evaluate-preset.sh tdd-red-green claude"
  timeout: 600000
  run_in_background: true
```

After launching, use TaskOutput with `block: false` to check status without waiting for completion.
## What the Scripts Do

### evaluate-preset.sh

- Loads the test task from `tools/preset-test-tasks.yml` (if `yq` is available)
- Creates a merged config with evaluation settings
- Runs Ralph with `--record-session` for metrics capture
- Captures output logs, exit codes, and timing
- Extracts metrics: iterations, hats activated, events published
Output structure:

```
.eval/
├── logs/<preset>/<timestamp>/
│   ├── output.log          # Full stdout/stderr
│   ├── session.jsonl       # Recorded session
│   ├── metrics.json        # Extracted metrics
│   ├── environment.json    # Runtime environment
│   └── merged-config.yml   # Config used
└── logs/<preset>/latest -> <timestamp>
```
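The `latest` symlink makes it easy to inspect the most recent run without hunting for the timestamp (a quick sketch, assuming `jq` is installed):

```bash
# Inspect artifacts from the most recent run of a preset
ls .eval/logs/tdd-red-green/latest/
jq . .eval/logs/tdd-red-green/latest/metrics.json
tail -n 50 .eval/logs/tdd-red-green/latest/output.log
```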
### evaluate-all-presets.sh

Runs all 12 presets sequentially and generates a summary:

```
.eval/results/<suite-id>/
├── SUMMARY.md           # Markdown report
├── <preset>.json        # Per-preset metrics
└── latest -> <suite-id>
```
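To review the most recent suite run (a sketch; the preset name is just an example, and `jq` is assumed to be installed):

```bash
# Read the latest suite's summary report
cat .eval/results/latest/SUMMARY.md

# Drill into one preset's metrics
jq . .eval/results/latest/tdd-red-green.json
```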
## Presets Under Evaluation

| Preset | Test Task |
| --- | --- |
| `tdd-red-green` | Add `is_palindrome()` function |
| `adversarial-review` | Review user input handler for security |
| `socratic-learning` | Understand `HatRegistry` |
| `spec-driven` | Specify and implement `StringUtils::truncate()` |
| `mob-programming` | Implement a Stack data structure |
| `scientific-method` | Debug failing mock test assertion |
| `code-archaeology` | Understand history of `config.rs` |
| `performance-optimization` | Profile hat matching |
| `api-design` | Design a `Cache` trait |
| `documentation-first` | Document `RateLimiter` |
| `incident-response` | Respond to "tests failing in CI" |
| `migration-safety` | Plan v1 to v2 config migration |
## Interpreting Results

Exit codes from `evaluate-preset.sh`:

- `0` — Success (LOOP_COMPLETE reached)
- `124` — Timeout (preset hung or took too long)
- Other — Failure (check `output.log`)
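These codes are easy to act on from a wrapper script or CI step (a sketch; the preset and backend are just examples):

```bash
#!/usr/bin/env bash
# Run one preset and report based on the documented exit codes
./tools/evaluate-preset.sh tdd-red-green claude
status=$?

case "$status" in
  0)   echo "Success: LOOP_COMPLETE reached" ;;
  124) echo "Timeout: preset hung or took too long" ;;
  *)   echo "Failure (exit $status): check output.log" ;;
esac
```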
Metrics in `metrics.json`:

- `iterations` — How many event loop cycles
- `hats_activated` — Which hats were triggered
- `events_published` — Total events emitted
- `completed` — Whether completion promise was reached
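To pull these fields out of the latest run (a sketch, assuming `jq` is installed and that `hats_activated` is a JSON array):

```bash
# Summarize the latest run for a preset
jq -r '"iterations: \(.iterations)  events: \(.events_published)  completed: \(.completed)"' \
  .eval/logs/<preset>/latest/metrics.json

# List which hats were triggered (assumes hats_activated is an array)
jq -r '.hats_activated[]' .eval/logs/<preset>/latest/metrics.json
```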
## Hat Routing Performance

Critical: Validate that hats get fresh context per Tenet #1 ("Fresh Context Is Reliability").

### What Good Looks Like

Each hat should execute in its own iteration:

```
Iter 1: Ralph → publishes starting event → STOPS
Iter 2: Hat A → does work → publishes next event → STOPS
Iter 3: Hat B → does work → publishes next event → STOPS
Iter 4: Hat C → does work → LOOP_COMPLETE
```

### Red Flags (Same-Iteration Hat Switching)

BAD: Multiple hat personas in one iteration:

```
Iter 2: Ralph does Blue Team + Red Team + Fixer work
        ^^^ All in one bloated context!
```
### How to Check

- Count iterations vs events in `session.jsonl`:

  ```bash
  # Count iterations
  grep -cE "_meta.loop_start|ITERATION" .eval/logs/<preset>/latest/output.log

  # Count events published
  grep -c "bus.publish" .eval/logs/<preset>/latest/session.jsonl
  ```

  Expected: iterations ≈ events published (one event per iteration).
  Bad sign: 2-3 iterations but 5+ events (all work in a single iteration). A combined check is sketched after this list.

- Check for same-iteration hat switching in `output.log`:

  ```bash
  grep -E "ITERATION|Now I need to perform|Let me put on|I'll switch to" \
    .eval/logs/<preset>/latest/output.log
  ```

  Red flag: hat-switching phrases WITHOUT an ITERATION separator between them.

- Check event timestamps in `session.jsonl`:

  ```bash
  jq -r '.ts' .eval/logs/<preset>/latest/session.jsonl
  ```

  Red flag: multiple events with identical timestamps (published in the same iteration).
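A small helper that rolls the first check into one pass (a sketch using the paths and patterns shown above; save it under any name, e.g. `check-routing.sh`, and pass the preset name as the argument):

```bash
#!/usr/bin/env bash
# Usage: ./check-routing.sh <preset>
# Compares iteration count against published events for the preset's latest run
preset="$1"
log_dir=".eval/logs/${preset}/latest"

iters=$(grep -cE "_meta.loop_start|ITERATION" "${log_dir}/output.log")
events=$(grep -c "bus.publish" "${log_dir}/session.jsonl")

echo "iterations=${iters} events=${events}"
if [ "$events" -gt "$iters" ]; then
  echo "WARNING: more events than iterations (possible same-iteration hat switching)"
fi
```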
### Routing Performance Triage

| Pattern | Diagnosis | Action |
| --- | --- | --- |
| iterations ≈ events | ✅ Good | Hat routing working |
| iterations << events | ⚠️ Same-iteration switching | Check prompt has STOP instruction |
| iterations >> events | ⚠️ Recovery loops | Agent not publishing required events |
| 0 events | ❌ Broken | Events not being read from JSONL |
### Root Cause Checklist

If hat routing is broken:

1. Check the workflow prompt in `hatless_ralph.rs`:
   - Does it say "CRITICAL: STOP after publishing"?
   - Is the DELEGATE section clear about yielding control?
2. Check hat instructions propagation:
   - Does `HatInfo` include an `instructions` field?
   - Are instructions rendered in the `## HATS` section?
3. Check events context:
   - Is `build_prompt(context)` using the `context` parameter?
   - Does the prompt include a `## PENDING EVENTS` section?
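A quick way to spot-check these strings from the repo root (a sketch; the exact location of `hatless_ralph.rs` isn't given here, so it is found dynamically):

```bash
# Locate the file, then confirm the key prompt strings are present
file=$(find . -name "hatless_ralph.rs" -not -path "./target/*" | head -n 1)
grep -n "CRITICAL: STOP after publishing" "$file"
grep -n "DELEGATE" "$file"
grep -n "PENDING EVENTS" "$file"
```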
## Autonomous Fix Workflow

After evaluation, delegate fixes to subagents:

### Step 1: Triage Results

Read `.eval/results/latest/SUMMARY.md` and identify:

- ❌ FAIL → Create code tasks for fixes
- ⏱️ TIMEOUT → Investigate infinite loops
- ⚠️ PARTIAL → Check for edge cases
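One way to pull the problem rows out of the summary (a sketch, assuming the status words appear literally in SUMMARY.md):

```bash
# List presets marked FAIL, TIMEOUT, or PARTIAL in the latest suite summary
grep -E "FAIL|TIMEOUT|PARTIAL" .eval/results/latest/SUMMARY.md
```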
### Step 2: Dispatch Task Creation

For each issue, spawn a Task agent:

```
Use /code-task-generator to create a task for fixing: [issue from evaluation]
Output to: tasks/preset-fixes/
```

### Step 3: Dispatch Implementation

For each created task:

```
Use /code-assist to implement: tasks/preset-fixes/[task-file].code-task.md
Mode: auto
```

### Step 4: Re-evaluate

```bash
./tools/evaluate-preset.sh <fixed-preset> claude
```
## Prerequisites

- `yq` (optional): For loading test tasks from YAML. Install: `brew install yq`
- Cargo: Must be able to build Ralph
## Related Files

- `tools/evaluate-preset.sh` — Single preset evaluation
- `tools/evaluate-all-presets.sh` — Full suite evaluation
- `tools/preset-test-tasks.yml` — Test task definitions
- `tools/preset-evaluation-findings.md` — Manual findings doc
- `presets/` — The preset collection being evaluated