# Evaluate Presets

## Overview
Systematically test all hat collection presets using shell scripts. Direct CLI invocation—no meta-orchestration complexity.
## When to Use

- Testing preset configurations after changes
- Auditing the preset library for quality
- Validating new presets work correctly
- After modifying hat routing logic
## Quick Start

Evaluate a single preset:

```bash
./tools/evaluate-preset.sh tdd-red-green claude
```

Evaluate all presets:

```bash
./tools/evaluate-all-presets.sh claude
```

Arguments:

- First arg: preset name (without the `.yml` extension)
- Second arg: backend (`claude` or `kiro`, defaults to `claude`)
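For example, to target the `kiro` backend explicitly, or to rely on the default backend (preset names come from the table below):

```bash
# Evaluate a preset against the kiro backend
./tools/evaluate-preset.sh adversarial-review kiro

# Omit the backend argument to use the default (claude)
./tools/evaluate-preset.sh spec-driven
```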
## Bash Tool Configuration

IMPORTANT: When invoking these scripts via the Bash tool, use these settings:

- Single preset evaluation: Use `timeout: 600000` (10 minutes max) and `run_in_background: true`
- All presets evaluation: Use `timeout: 600000` (10 minutes max) and `run_in_background: true`

Since preset evaluations can run for hours (especially the full suite), always run in background mode and use the TaskOutput tool to check progress periodically.

Example invocation pattern:

```
Bash tool with:
  command: "./tools/evaluate-preset.sh tdd-red-green claude"
  timeout: 600000
  run_in_background: true
```

After launching, use TaskOutput with `block: false` to check status without waiting for completion.
## What the Scripts Do

### evaluate-preset.sh

- Loads the test task from `tools/preset-test-tasks.yml` (if `yq` is available)
- Creates a merged config with evaluation settings
- Runs Ralph with `--record-session` for metrics capture
- Captures output logs, exit codes, and timing
- Extracts metrics: iterations, hats activated, events published
Output structure:

```
.eval/
├── logs/<preset>/<timestamp>/
│   ├── output.log          # Full stdout/stderr
│   ├── session.jsonl       # Recorded session
│   ├── metrics.json        # Extracted metrics
│   ├── environment.json    # Runtime environment
│   └── merged-config.yml   # Config used
└── logs/<preset>/latest -> <timestamp>
```
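The `latest` symlink makes it easy to inspect the most recent run without hunting for the timestamp (a quick sketch, assuming `jq` is installed):

```bash
# Inspect artifacts from the most recent run of a preset
ls .eval/logs/tdd-red-green/latest/
jq . .eval/logs/tdd-red-green/latest/metrics.json
tail -n 50 .eval/logs/tdd-red-green/latest/output.log
```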
### evaluate-all-presets.sh

Runs all 12 presets sequentially and generates a summary:

```
.eval/results/<suite-id>/
├── SUMMARY.md           # Markdown report
├── <preset>.json        # Per-preset metrics
└── latest -> <suite-id>
```
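To review the most recent suite run (a sketch; the preset name is just an example, and `jq` is assumed to be installed):

```bash
# Read the latest suite's summary report
cat .eval/results/latest/SUMMARY.md

# Drill into one preset's metrics
jq . .eval/results/latest/tdd-red-green.json
```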
## Presets Under Evaluation

| Preset | Test Task |
| --- | --- |
| `tdd-red-green` | Add `is_palindrome()` function |
| `adversarial-review` | Review user input handler for security |
| `socratic-learning` | Understand `HatRegistry` |
| `spec-driven` | Specify and implement `StringUtils::truncate()` |
| `mob-programming` | Implement a Stack data structure |
| `scientific-method` | Debug failing mock test assertion |
| `code-archaeology` | Understand history of `config.rs` |
| `performance-optimization` | Profile hat matching |
| `api-design` | Design a `Cache` trait |
| `documentation-first` | Document `RateLimiter` |
| `incident-response` | Respond to "tests failing in CI" |
| `migration-safety` | Plan v1 to v2 config migration |
## Interpreting Results

Exit codes from `evaluate-preset.sh`:

- `0` — Success (LOOP_COMPLETE reached)
- `124` — Timeout (preset hung or took too long)
- Other — Failure (check `output.log`)
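These codes are easy to act on from a wrapper script or CI step (a sketch; the preset and backend are just examples):

```bash
#!/usr/bin/env bash
# Run one preset and report based on the documented exit codes
./tools/evaluate-preset.sh tdd-red-green claude
status=$?

case "$status" in
  0)   echo "Success: LOOP_COMPLETE reached" ;;
  124) echo "Timeout: preset hung or took too long" ;;
  *)   echo "Failure (exit $status): check output.log" ;;
esac
```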
Metrics in `metrics.json`:

- `iterations` — How many event loop cycles
- `hats_activated` — Which hats were triggered
- `events_published` — Total events emitted
- `completed` — Whether completion promise was reached
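To pull these fields out of the latest run (a sketch, assuming `jq` is installed and that `hats_activated` is a JSON array):

```bash
# Summarize the latest run for a preset
jq -r '"iterations: \(.iterations)  events: \(.events_published)  completed: \(.completed)"' \
  .eval/logs/<preset>/latest/metrics.json

# List which hats were triggered (assumes hats_activated is an array)
jq -r '.hats_activated[]' .eval/logs/<preset>/latest/metrics.json
```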
## Hat Routing Performance

Critical: Validate that hats get fresh context per Tenet #1 ("Fresh Context Is Reliability").

### What Good Looks Like

Each hat should execute in its own iteration:

```
Iter 1: Ralph → publishes starting event → STOPS
Iter 2: Hat A → does work → publishes next event → STOPS
Iter 3: Hat B → does work → publishes next event → STOPS
Iter 4: Hat C → does work → LOOP_COMPLETE
```

### Red Flags (Same-Iteration Hat Switching)

BAD: Multiple hat personas in one iteration:

```
Iter 2: Ralph does Blue Team + Red Team + Fixer work
        ^^^ All in one bloated context!
```
### How to Check

- Count iterations vs events in `session.jsonl`:

  ```bash
  # Count iterations
  grep -cE "_meta.loop_start|ITERATION" .eval/logs/<preset>/latest/output.log

  # Count events published
  grep -c "bus.publish" .eval/logs/<preset>/latest/session.jsonl
  ```

  Expected: iterations ≈ events published (one event per iteration).
  Bad sign: 2-3 iterations but 5+ events (all work in a single iteration). A combined check is sketched after this list.

- Check for same-iteration hat switching in `output.log`:

  ```bash
  grep -E "ITERATION|Now I need to perform|Let me put on|I'll switch to" \
    .eval/logs/<preset>/latest/output.log
  ```

  Red flag: hat-switching phrases WITHOUT an ITERATION separator between them.

- Check event timestamps in `session.jsonl`:

  ```bash
  jq -r '.ts' .eval/logs/<preset>/latest/session.jsonl
  ```

  Red flag: multiple events with identical timestamps (published in the same iteration).
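A small helper that rolls the first check into one pass (a sketch using the paths and patterns shown above; save it under any name, e.g. `check-routing.sh`, and pass the preset name as the argument):

```bash
#!/usr/bin/env bash
# Usage: ./check-routing.sh <preset>
# Compares iteration count against published events for the preset's latest run
preset="$1"
log_dir=".eval/logs/${preset}/latest"

iters=$(grep -cE "_meta.loop_start|ITERATION" "${log_dir}/output.log")
events=$(grep -c "bus.publish" "${log_dir}/session.jsonl")

echo "iterations=${iters} events=${events}"
if [ "$events" -gt "$iters" ]; then
  echo "WARNING: more events than iterations (possible same-iteration hat switching)"
fi
```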
### Routing Performance Triage

| Pattern | Diagnosis | Action |
| --- | --- | --- |
| iterations ≈ events | ✅ Good | Hat routing working |
| iterations << events | ⚠️ Same-iteration switching | Check prompt has STOP instruction |
| iterations >> events | ⚠️ Recovery loops | Agent not publishing required events |
| 0 events | ❌ Broken | Events not being read from JSONL |
### Root Cause Checklist

If hat routing is broken:

1. Check the workflow prompt in `hatless_ralph.rs`:
   - Does it say "CRITICAL: STOP after publishing"?
   - Is the DELEGATE section clear about yielding control?
2. Check hat instructions propagation:
   - Does `HatInfo` include an `instructions` field?
   - Are instructions rendered in the `## HATS` section?
3. Check events context:
   - Is `build_prompt(context)` using the `context` parameter?
   - Does the prompt include a `## PENDING EVENTS` section?
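A quick way to spot-check these strings from the repo root (a sketch; the exact location of `hatless_ralph.rs` isn't given here, so it is found dynamically):

```bash
# Locate the file, then confirm the key prompt strings are present
file=$(find . -name "hatless_ralph.rs" -not -path "./target/*" | head -n 1)
grep -n "CRITICAL: STOP after publishing" "$file"
grep -n "DELEGATE" "$file"
grep -n "PENDING EVENTS" "$file"
```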
## Autonomous Fix Workflow

After evaluation, delegate fixes to subagents:

### Step 1: Triage Results

Read `.eval/results/latest/SUMMARY.md` and identify:

- ❌ FAIL → Create code tasks for fixes
- ⏱️ TIMEOUT → Investigate infinite loops
- ⚠️ PARTIAL → Check for edge cases
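One way to pull the problem rows out of the summary (a sketch, assuming the status words appear literally in SUMMARY.md):

```bash
# List presets marked FAIL, TIMEOUT, or PARTIAL in the latest suite summary
grep -E "FAIL|TIMEOUT|PARTIAL" .eval/results/latest/SUMMARY.md
```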
### Step 2: Dispatch Task Creation

For each issue, spawn a Task agent:

```
Use /code-task-generator to create a task for fixing: [issue from evaluation]
Output to: tasks/preset-fixes/
```

### Step 3: Dispatch Implementation

For each created task:

```
Use /code-assist to implement: tasks/preset-fixes/[task-file].code-task.md
Mode: auto
```

### Step 4: Re-evaluate

```bash
./tools/evaluate-preset.sh <fixed-preset> claude
```
## Prerequisites

- `yq` (optional): For loading test tasks from YAML. Install: `brew install yq`
- Cargo: Must be able to build Ralph
## Related Files

- `tools/evaluate-preset.sh` — Single preset evaluation
- `tools/evaluate-all-presets.sh` — Full suite evaluation
- `tools/preset-test-tasks.yml` — Test task definitions
- `tools/preset-evaluation-findings.md` — Manual findings doc
- `presets/` — The preset collection being evaluated