
Agent Eval Harness

Purpose

CLI tool for capturing trajectories from headless CLI agents, optimized for TypeScript/JavaScript projects using Bun.

The harness captures. You score.

| Harness Provides | You Provide |
| --- | --- |
| Prompt execution via headless adapters | Scoring logic (Braintrust, custom scripts) |
| Full trajectory capture (thoughts, tools, plans) | Pass/fail determination via graders |
| Structured JSONL output | LLM-as-judge prompts |
| Reproducible execution environment | CI integration, golden file comparison |

Use this when:

  • Capturing trajectories for downstream evaluation

  • Generating training data (SFT/DPO) with full context

  • Building regression test fixtures for agent behavior

  • Comparing agent responses across configurations

Installation

Run without installing (recommended)

bunx @plaited/agent-eval-harness capture prompts.jsonl --schema ./claude.json -o results.jsonl

Or install as project dependency

bun add @plaited/agent-eval-harness

Core Principle: Capture Once, Derive Many Views

flowchart LR
  Prompts["prompts.jsonl"] --> Capture["capture/trials"]
  Schema["headless schema"] --> Capture
  Capture --> Results["results.jsonl (full trajectory)"]
  Results --> Summarize["summarize"]
  Results --> Calibrate["calibrate"]
  Results --> Custom["(your tools)"]
  Summarize --> Views["summary.jsonl / .md"]
  Calibrate --> Report["calibration.md"]
  Custom --> Pipeline["any scoring platform"]

  • Single output format: Full trajectory JSONL (always)

  • No --format flag: Derive views with separate commands

  • Schema exports: Zod schemas + JSON Schema for any tooling

Commands

Core Commands

| Command | Input | Output | Purpose |
| --- | --- | --- | --- |
| capture | prompts.jsonl + schema | results.jsonl | Trajectory capture (full) |
| trials | prompts.jsonl + schema | trials.jsonl | Multi-run + optional metrics |
| summarize | results.jsonl | summary.jsonl or .md | Derive compact views |
| calibrate | results.jsonl | calibration.md | Sample failures for review |
| validate-refs | prompts.jsonl | validation.jsonl | Check reference solutions |
| balance | prompts.jsonl | balance.json | Analyze test set coverage |
| schemas | (none) | JSON Schema | Export schemas for non-TS users |

Pipeline Commands (Unix-style)

| Command | Input | Output | Purpose |
| --- | --- | --- | --- |
| run | prompts.jsonl + schema | raw.jsonl | Execute prompts, raw output |
| extract | raw.jsonl + schema | extracted.jsonl | Parse trajectories |
| grade | extracted.jsonl + grader | graded.jsonl | Apply grader scoring |
| format | results.jsonl | jsonl/markdown/csv | Convert output format |
| compare | multiple results.jsonl | comparison.json | Compare runs (aggregate report) |

All commands support optional --grader ./grader.ts for scoring.

Capture Command

Basic Usage

bunx @plaited/agent-eval-harness capture <prompts.jsonl> --schema <schema.json> [options]

Arguments

| Argument/Flag | Description | Default |
| --- | --- | --- |
| prompts.jsonl | Input file with prompts to execute | Required |
| -s, --schema | Path to headless adapter schema | Required |
| -o, --output | Output file/path | stdout |
| -c, --cwd | Working directory for agent | current |
| -t, --timeout | Request timeout in ms | 60000 |
| -j, --concurrency | Number of concurrent workers | 1 |
| --workspace-dir | Base directory for per-prompt workspace isolation | none |
| --progress | Show progress to stderr | false |
| --append | Append to output file | false |
| -g, --grader | Path to grader module | none |
| --debug | Show detailed CLI output for debugging | false |

Examples

Basic capture

bunx @plaited/agent-eval-harness capture prompts.jsonl --schema ./claude.json -o results.jsonl

Parallel execution (up to 4x faster with 4 workers)

bunx @plaited/agent-eval-harness capture prompts.jsonl --schema ./claude.json -j 4 -o results.jsonl

With workspace isolation for code generation tasks

bunx @plaited/agent-eval-harness capture prompts.jsonl --schema ./claude.json \
  -j 4 --workspace-dir ./workspaces -o results.jsonl

Using a local adapter script

bunx @plaited/agent-eval-harness capture prompts.jsonl bun ./my-adapter.ts -o results.jsonl

With grader (adds score to each result)

bunx @plaited/agent-eval-harness capture prompts.jsonl --schema ./claude.json --grader ./grader.ts -o results.jsonl

Trials Command

Run each prompt multiple times for pass@k/pass^k analysis.

Capture only (no grader)

bunx @plaited/agent-eval-harness trials prompts.jsonl --schema ./claude.json -k 5 -o trials.jsonl

With grader (computes pass@k, pass^k)

bunx @plaited/agent-eval-harness trials prompts.jsonl --schema ./claude.json -k 5 --grader ./grader.ts -o trials.jsonl

Parallel execution (4 prompts' trials run concurrently)

bunx @plaited/agent-eval-harness trials prompts.jsonl --schema ./claude.json -k 5 -j 4 -o trials.jsonl

With workspace isolation (each trial gets its own directory)

bunx @plaited/agent-eval-harness trials prompts.jsonl --schema ./claude.json -k 5 -j 4 \
  --workspace-dir ./workspaces -o trials.jsonl

Parallelization notes:

  • -j/--concurrency parallelizes across prompts (not trials within a prompt)

  • Each prompt's k trials still run sequentially (required for aggregation)

  • With 151 prompts and -j 4, you get 4 prompts running trials concurrently

  • --workspace-dir creates {workspace-dir}/prompt-{id}-trial-{n}/ for each trial

  • Progress logging shows aggregate completion (e.g., 12/50 prompts completed)

Workspace cleanup: Directories persist after completion for debugging. Clean up manually:

After capture

rm -rf ./workspaces

In CI (add as post-step)

- run: rm -rf ./workspaces
  if: always()

Output

Without grader:

{"id":"search-001","input":"Find the CEO","k":5,"trials":[{"trialNum":1,"output":"...","trajectory":[...],"duration":1234},...]}

With grader:

{"id":"search-001","input":"Find the CEO","k":5,"passRate":0.8,"passAtK":0.99,"passExpK":0.33,"trials":[{"trialNum":1,"output":"...","pass":true,"score":1.0},...]}

Summarize Command

Derive compact views from full trajectory results.

Summary JSONL (for jq analysis)

bunx @plaited/agent-eval-harness summarize results.jsonl -o summary.jsonl

Markdown (for LLM-as-judge)

bunx @plaited/agent-eval-harness summarize results.jsonl --markdown -o results.md

Calibrate Command

Sample failures for grader review. Calibration helps you distinguish between agent failures (agent did wrong thing) and grader bugs (agent was correct, grader too strict).

Sample failures for human review

bunx @plaited/agent-eval-harness calibrate results.jsonl --sample 10 -o calibration.md

Re-score with different grader to compare

bunx @plaited/agent-eval-harness calibrate results.jsonl --grader ./loose-grader.ts --sample 10 -o comparison.md

See eval-concepts.md for why calibration matters.

Validate-Refs Command

Check that reference solutions pass your grader before evaluating agents.

Validate reference solutions

bunx @plaited/agent-eval-harness validate-refs prompts.jsonl --grader ./grader.ts -o validation.jsonl

Check for failures

cat validation.jsonl | jq 'select(.pass == false)'

Why Use This?

If your reference solution fails your own grader:

  • The task definition is ambiguous

  • The grader is too strict

  • The hint is wrong

Fix the eval before evaluating the agent.

Input Format

Prompts must include a reference field:

{"id":"test-001","input":"Create a button component","hint":"<button>","reference":"export const Button = () => <button>Click</button>"}

Output Format

{"id":"test-001","input":"Create a button component","reference":"export const Button = () => <button>Click</button>","pass":true,"score":1.0,"reasoning":"Contains hint content"}

Balance Command

Analyze test set coverage to ensure balanced evaluation.

Analyze prompt distribution

bunx @plaited/agent-eval-harness balance prompts.jsonl -o balance.json

Pretty print

bunx @plaited/agent-eval-harness balance prompts.jsonl | jq .

Why Use This?

An eval with only "make X work" misses "don't break Y". Balance analysis shows:

  • Category distribution (from metadata.category)

  • Positive/negative case ratio

  • Coverage gaps

Output Format

{ "totalCases": 50, "categories": [ { "name": "ui", "count": 20, "percentage": 40 }, { "name": "logic", "count": 15, "percentage": 30 }, { "name": "api", "count": 10, "percentage": 20 }, { "name": "edge-case", "count": 5, "percentage": 10 } ], "underrepresented": ["edge-case"], "suggestions": ["Consider adding more test cases for: edge-case"] }

Balanced Eval Design

Include both positive and negative cases:

| Type | Example | Purpose |
| --- | --- | --- |
| Positive | "Add a login button" | Agent should succeed |
| Negative | "Add a button without breaking tests" | Agent should not break things |
| Edge case | "Handle empty input gracefully" | Agent should be robust |

See eval-concepts.md for more on balanced test sets.

Pipeline Workflow

The pipeline commands enable Unix-style composition for flexible evaluation workflows.

Full Pipeline Example

Execute → Extract → Grade → Format in one pipeline

cat prompts.jsonl |
bunx @plaited/agent-eval-harness run -s claude.json |
bunx @plaited/agent-eval-harness extract -s claude.json |
bunx @plaited/agent-eval-harness grade -g ./grader.ts |
bunx @plaited/agent-eval-harness format -f markdown > report.md

Run Command

Execute prompts and output raw results. Three modes available:

Schema mode (recommended)

bunx @plaited/agent-eval-harness run prompts.jsonl --schema claude.json

Simple mode: {} placeholder substitution

bunx @plaited/agent-eval-harness run prompts.jsonl --simple "claude -p {} --output-format stream-json"

Shell mode: $PROMPT environment variable

bunx @plaited/agent-eval-harness run prompts.jsonl --shell 'claude -p "$PROMPT" --output-format stream-json'

⚠️ Security Warning: The --simple and --shell modes execute prompts via shell commands. Prompts are escaped, but do not use untrusted prompt content with these modes: malicious prompt text could potentially escape the quoting and execute arbitrary commands. Use --schema mode (headless adapter) for untrusted inputs.

Extract Command

Parse raw output into structured trajectories:

From file

bunx @plaited/agent-eval-harness extract raw.jsonl --schema claude.json -o extracted.jsonl

Piped from run

bunx @plaited/agent-eval-harness run prompts.jsonl -s claude.json |
bunx @plaited/agent-eval-harness extract -s claude.json

Grade Command

Apply grader to extracted results:

bunx @plaited/agent-eval-harness grade extracted.jsonl --grader ./grader.ts -o graded.jsonl

Format Command

Convert results to different output formats:

Markdown report

bunx @plaited/agent-eval-harness format results.jsonl --style markdown -o report.md

CSV for spreadsheets

bunx @plaited/agent-eval-harness format results.jsonl --style csv -o results.csv

JSONL (pass-through, default)

bunx @plaited/agent-eval-harness format results.jsonl --style jsonl

Compare Command

Compare multiple runs of the same prompts. Supports both CaptureResult (single-run) and TrialResult (multi-run reliability) formats with auto-detection.

Default: auto-detect format, weighted strategy, JSON output

bunx @plaited/agent-eval-harness compare run1.jsonl run2.jsonl -o comparison.json

Statistical significance strategy

bunx @plaited/agent-eval-harness compare run1.jsonl run2.jsonl --strategy statistical -o comparison.json

Custom weights via environment variables (CaptureResult)

COMPARE_QUALITY=0.7 COMPARE_LATENCY=0.2 COMPARE_RELIABILITY=0.1 \
  bunx @plaited/agent-eval-harness compare run1.jsonl run2.jsonl -o comparison.json

Markdown report format

bunx @plaited/agent-eval-harness compare run1.jsonl run2.jsonl --format markdown -o report.md

Custom grader (LLM-as-Judge)

bunx @plaited/agent-eval-harness compare run1.jsonl run2.jsonl \
  --strategy custom --grader ./my-llm-judge.ts -o comparison.json

With explicit labels

bunx @plaited/agent-eval-harness compare \
  --run "with-mcp:results-mcp.jsonl" \
  --run "vanilla:results-vanilla.jsonl" \
  -o comparison.json

Use cases for compare:

  • Same agent, different MCP servers

  • Same agent, different skills enabled

  • Same agent, different model versions

  • Different agents entirely

Trials Comparison (pass@k Analysis)

Compare TrialResult files for reliability analysis:

Auto-detect trials format

bunx @plaited/agent-eval-harness compare trials1.jsonl trials2.jsonl -o comparison.json

Explicit format (skip auto-detection)

bunx @plaited/agent-eval-harness compare trials1.jsonl trials2.jsonl --input-format trials -o comparison.json

Custom weights for trials comparison

COMPARE_CAPABILITY=0.5 COMPARE_RELIABILITY=0.3 COMPARE_CONSISTENCY=0.2 \
  bunx @plaited/agent-eval-harness compare trials1.jsonl trials2.jsonl -o comparison.json

Trials metrics:

| Metric | Description | Formula |
| --- | --- | --- |
| Capability (passAtK) | Can solve at least once in K tries | 1 - (1-p)^k |
| Reliability (passExpK) | Solves consistently every time | p^k |
| Flakiness | Gap between capability and reliability | passAtK - passExpK |
| Quality (scores) | Aggregate grader scores across trials | avg/median/p25/p75 (only with grader) |
| Performance (latency) | Aggregate trial durations | p50/p90/p99/mean/min/max (always present) |
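To make the formulas concrete, here is an illustrative sketch (not the harness's internal code) that derives the capability and reliability metrics from a list of per-trial pass booleans:

const passAtK = (passes: boolean[]): number => {
  const k = passes.length
  const p = passes.filter(Boolean).length / k // empirical per-trial pass rate
  return 1 - Math.pow(1 - p, k) // chance of at least one success in k tries
}

const passExpK = (passes: boolean[]): number => {
  const k = passes.length
  const p = passes.filter(Boolean).length / k
  return Math.pow(p, k) // chance of succeeding all k times
}

// Example: 4 passes out of 5 trials (p = 0.8)
const outcomes = [true, true, false, true, true]
console.log(passAtK(outcomes)) // ≈ 0.9997
console.log(passExpK(outcomes)) // ≈ 0.3277
console.log(passAtK(outcomes) - passExpK(outcomes)) // flakiness gap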

Built-in Comparison Strategies

For CaptureResult (single-run):

| Strategy | Description | Env Vars |
| --- | --- | --- |
| weighted (default) | Quality, latency, reliability | COMPARE_QUALITY, COMPARE_LATENCY, COMPARE_RELIABILITY |
| statistical | Bootstrap for confidence intervals | COMPARE_BOOTSTRAP_ITERATIONS |
| custom | Your own grader | --grader path |

For TrialResult (multi-run):

| Strategy | Description | Env Vars |
| --- | --- | --- |
| weighted (default) | Capability, reliability, consistency | COMPARE_CAPABILITY, COMPARE_RELIABILITY, COMPARE_CONSISTENCY |
| statistical | Bootstrap passAtK confidence intervals | COMPARE_BOOTSTRAP_ITERATIONS |
| custom | Your own grader | --grader path |

Comparison Report Output

CaptureResult format outputs ComparisonReport:

{
  "meta": { "generatedAt": "...", "runs": ["baseline", "variant"], "promptCount": 100 },
  "quality": { "baseline": { "avgScore": 0.85, "passRate": 0.82 }, "variant": { ... } },
  "performance": { "baseline": { "latency": { "p50": 1200, "p90": 3400 } }, ... },
  "reliability": { "baseline": { "type": "run", "toolErrors": 5, "completionRate": 0.99 }, ... },
  "headToHead": { "pairwise": [{ "runA": "baseline", "runB": "variant", "aWins": 35, "bWins": 55 }] }
}

With --strategy statistical , quality and performance metrics include 95% confidence intervals:

{ "quality": { "baseline": { "avgScore": 0.85, "passRate": 0.82, "confidenceIntervals": { "avgScore": [0.82, 0.88], "passRate": [0.79, 0.85] } } }, "performance": { "baseline": { "latency": { "p50": 1200, "mean": 1350 }, "confidenceIntervals": { "latencyMean": [1280, 1420] } } } }

TrialResult format outputs TrialsComparisonReport:

{
  "meta": { "generatedAt": "...", "runs": ["claude", "gemini"], "promptCount": 50, "trialsPerPrompt": 5, "inputFormat": "trials" },
  "capability": { "claude": { "avgPassAtK": 0.92, "medianPassAtK": 0.95 }, "gemini": { "...": "..." } },
  "reliability": { "claude": { "type": "trial", "avgPassExpK": 0.78, "medianPassExpK": 0.82 }, "gemini": { "...": "..." } },
  "flakiness": { "claude": { "avgFlakiness": 0.14, "flakyPromptCount": 12 }, "gemini": { "...": "..." } },
  "quality": { "claude": { "avgScore": 0.85, "medianScore": 0.90, "p25Score": 0.75, "p75Score": 0.95 }, "gemini": { "...": "..." } },
  "performance": { "claude": { "latency": { "p50": 1200, "p90": 3400, "p99": 5100, "mean": 1500, "min": 800, "max": 5200 }, "totalDuration": 375000 }, "gemini": { "...": "..." } },
  "headToHead": {
    "capability": [{ "runA": "claude", "runB": "gemini", "aWins": 28, "bWins": 18, "ties": 4 }],
    "reliability": ["..."],
    "overall": ["..."]
  }
}

Notes:

  • quality is only present when a grader was used (trials have score fields)

  • performance is always present (every trial has duration)

With --strategy statistical , capability, reliability, quality, and performance metrics include 95% confidence intervals:

{ "capability": { "claude": { "avgPassAtK": 0.92, "confidenceIntervals": { "avgPassAtK": [0.88, 0.95] } } }, "reliability": { "claude": { "type": "trial", "avgPassExpK": 0.78, "confidenceIntervals": { "avgPassExpK": [0.72, 0.84] } } }, "quality": { "claude": { "avgScore": 0.85, "confidenceIntervals": { "avgScore": [0.82, 0.88] } } }, "performance": { "claude": { "latency": { "mean": 1500 }, "confidenceIntervals": { "latencyMean": [1380, 1620] } } } }

See comparison-graders.md for complete comparison grader documentation including LLM-as-Judge patterns.

Comparison Grader Interface

CaptureResult grader:

import type { ComparisonGrader } from '@plaited/agent-eval-harness/pipeline'

export const grade: ComparisonGrader = async ({ id, input, hint, runs }) => {
  // runs is Record<string, { output, trajectory?, score?, duration?, toolErrors? }>
  return {
    rankings: [
      { run: 'with-mcp', rank: 1, score: 0.9 },
      { run: 'vanilla', rank: 2, score: 0.7 },
    ],
    reasoning: 'MCP run produced more accurate output',
  }
}

TrialResult grader:

import type { TrialsComparisonGrader } from '@plaited/agent-eval-harness/pipeline'

export const grade: TrialsComparisonGrader = async ({ id, input, hint, runs }) => {
  // runs is Record<string, { passAtK?, passExpK?, k, trials }>
  // Each trial in trials has: { duration, score?, pass?, output, trajectory }
  return {
    rankings: [
      { run: 'claude', rank: 1, score: 0.92 },
      { run: 'gemini', rank: 2, score: 0.85 },
    ],
    reasoning: 'Claude has higher reliability with lower flakiness',
  }
}

Pipeline Workflow Diagram

flowchart LR
  Prompts["prompts.jsonl"] --> Run["run"]
  Schema["headless schema"] --> Run
  Run --> Raw["raw.jsonl"]
  Raw --> Extract["extract"]
  Schema --> Extract
  Extract --> Extracted["extracted.jsonl"]
  Extracted --> Grade["grade"]
  Grader["grader.ts"] --> Grade
  Grade --> Graded["graded.jsonl"]
  Graded --> Format["format"]
  Format --> Output["report.md / .csv / .jsonl"]
  Graded --> Compare["compare"]
  Results2["other runs..."] --> Compare
  CompareGrader["compare-grader.ts"] --> Compare
  Compare --> Comparison["comparison.jsonl"]

Schemas Command

Export JSON schemas for non-TypeScript tools.

List available schemas

bunx @plaited/agent-eval-harness schemas

Export all schemas as JSON

bunx @plaited/agent-eval-harness schemas --json -o schemas.json

Export specific schema

bunx @plaited/agent-eval-harness schemas CaptureResult --json
bunx @plaited/agent-eval-harness schemas TrialResult --json
bunx @plaited/agent-eval-harness schemas GraderResult --json

Available Schemas

| Schema | Description |
| --- | --- |
| CaptureResult | Single capture output (id, input, output, trajectory, timing) |
| TrialResult | Multi-run trial output (includes passAtK, passExpK) |
| GraderResult | Grader return value (pass, score, reasoning) |
| PromptInput | Input prompt format |
| TrajectoryStep | Single step in trajectory array |
| SummaryResult | Compact summary format |

Usage in Other Languages

Export schemas for validation in Python, Go, etc.:

Export all schemas

bunx @plaited/agent-eval-harness schemas --json -o schemas.json

Use in Python with jsonschema

python -c "
import json
from jsonschema import validate

with open('schemas.json') as f:
    schemas = json.load(f)

with open('results.jsonl') as f:
    for line in f:
        result = json.loads(line)
        validate(result, schemas['CaptureResult'])
        print(f'{result[\"id\"]}: valid')
"

Grader Interface

Graders provide semantic pass/fail scoring for captured trajectories. The harness supports graders written in any language.

Git-Based Grading (Recommended for Coding Tasks)

Grade outcomes, not paths. Use the optional cwd parameter to detect environmental changes with git:

// git-grader.ts
import type { Grader } from '@plaited/agent-eval-harness/schemas'

export const grade: Grader = async ({ output, hint, cwd }) => {
  if (!cwd) return { pass: false, score: 0, reasoning: 'No cwd' }

  // Detect file changes
  const status = await Bun.$`git -C ${cwd} status --porcelain`.text()
  const filesCreated = status
    .split('\n')
    .filter(line => line.startsWith('??'))
    .map(line => line.slice(3).trim())

  // Verify tests pass
  const testResult = await Bun.$`cd ${cwd} && bun test`.nothrow()

  return {
    pass: filesCreated.length > 0 && testResult.exitCode === 0,
    score: testResult.exitCode === 0 ? 1 : 0,
    reasoning: `Files: ${filesCreated.join(', ')}. Tests: ${testResult.exitCode === 0 ? 'pass' : 'fail'}`,
    outcome: {
      // Optional: structured data for analysis
      filesCreated,
      testsPassed: testResult.exitCode === 0,
      type: 'file_creation_with_tests',
    },
  }
}

See inline-graders.md for comprehensive git-based grading patterns.

Output-Based Grading (General Purpose)

// my-grader.ts
import type { Grader } from '@plaited/agent-eval-harness/schemas'

export const grade: Grader = async ({ input, output, hint, trajectory }) => {
  const pass = output.toLowerCase().includes(hint?.toLowerCase() ?? '')
  return {
    pass,
    score: pass ? 1 : 0,
    reasoning: pass ? 'Contains hint content' : 'Missing hint content',
  }
}

Note: input can be string (single turn) or string[] (multi-turn). The hint field provides grader context (renamed from expected).

Python/Executable Graders

Any executable can be a grader using stdin/stdout JSON protocol:

#!/usr/bin/env python3
import json, sys

data = json.load(sys.stdin)
output = data.get("output", "").lower()
hint = (data.get("hint") or "").lower()

pass_result = hint in output if hint else True
print(json.dumps({
    "pass": pass_result,
    "score": 1.0 if pass_result else 0.0,
    "reasoning": "Contains hint" if pass_result else "Missing hint"
}))

chmod +x ./grader.py
bunx @plaited/agent-eval-harness capture prompts.jsonl --schema ./claude.json --grader ./grader.py -o results.jsonl

See inline-graders.md for complete grader documentation including LLM-as-Judge patterns.

Input Format

Each line in prompts.jsonl:

{"id":"test-001","input":"Create a button","hint":"should contain <button>"}
{"id":"test-002","input":["Create a button","Make it blue"],"metadata":{"category":"ui"}}

| Field | Required | Description |
| --- | --- | --- |
| id | Yes | Unique identifier |
| input | Yes | Single prompt (string) or conversation turns (string[]) |
| hint | No | Grader context - what to look for (not strict match) |
| reference | No | Reference solution (for validate-refs) |
| metadata | No | Tags, category, difficulty for filtering |
| timeout | No | Override default timeout for this prompt |

Session behavior: Each JSONL entry = 1 fresh session

  • input: string → 1 session, 1 prompt

  • input: string[] → 1 session, N prompts (sequential turns)
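Prompt files can also be generated programmatically. A minimal sketch (the PromptInput shape below is inferred from the field table above; export the authoritative schema with the schemas command):

// make-prompts.ts (illustrative)
type PromptInput = {
  id: string
  input: string | string[]
  hint?: string
  metadata?: Record<string, unknown>
}

const prompts: PromptInput[] = [
  { id: 'ui-001', input: 'Create a button component', hint: '<button>', metadata: { category: 'ui' } },
  { id: 'ctx-001', input: ['Remember this number: 42', 'What number did I ask you to remember?'], hint: '42' },
]

await Bun.write('prompts.jsonl', prompts.map(p => JSON.stringify(p)).join('\n') + '\n')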

Output Format

Full trajectory JSONL (always):

{ "id": "test-001", "input": "Find the CEO of Anthropic", "output": "The CEO of Anthropic is Dario Amodei.", "hint": "should mention Dario Amodei", "trajectory": [ {"type": "thought", "content": "I'll search for this...", "timestamp": 100}, {"type": "tool_call", "name": "WebSearch", "status": "completed", "input": {...}, "output": {...}, "duration": 500}, {"type": "message", "content": "The CEO of Anthropic is Dario Amodei.", "timestamp": 700} ], "metadata": { "category": "search", "agent": "--schema ./claude.json", "trajectoryRichness": "full", "turnCount": 1 }, "timing": { "start": 1704067200000, "end": 1704067201234, "firstResponse": 100, "sessionCreation": 234, "total": 1234, "inputTokens": 150, "outputTokens": 85 }, "toolErrors": false }

Output Fields

| Field | Description |
| --- | --- |
| input | Original prompt (string or string[] for multi-turn) |
| hint | Grader context hint (if provided) |
| metadata.trajectoryRichness | "full" \| "messages-only" \| "minimal" |
| metadata.turnCount | Number of conversation turns (1 for string, N for array) |
| timing.sessionCreation | Time to create session (ms) |
| timing.total | Total duration (end - start) |
| timing.inputTokens | Input tokens consumed (if available from adapter) |
| timing.outputTokens | Output tokens generated (if available from adapter) |
| toolErrors | Whether any tool calls failed |

Note: toolErrors replaces the misleading status: 'passed'|'failed'. Real pass/fail comes from YOUR grader.

Trust Boundary

Trajectory data contains content from the agent's execution, including tool results from external sources (web searches, file reads, API calls). The tool_call.input and tool_call.output fields preserve raw values from CLI output without sanitization.

Graders and analysis scripts should treat trajectory content as untrusted:

  • Validate cwd paths before using in shell commands (see the sketch after this list and inline-graders.md)

  • Do not execute trajectory content as code

  • Use injection-aware prompting when passing trajectory content to LLM-as-judge graders
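As one example of the first point, a defensive check a grader might run before interpolating cwd into a shell command (a sketch only; workspaceRoot is whatever you passed to --workspace-dir):

import { resolve, sep } from 'node:path'

// Reject cwd values that resolve outside the expected workspace root.
const assertInsideWorkspace = (cwd: string, workspaceRoot: string): string => {
  const resolved = resolve(cwd)
  const root = resolve(workspaceRoot)
  if (resolved !== root && !resolved.startsWith(root + sep)) {
    throw new Error(`Refusing to run outside workspace: ${resolved}`)
  }
  return resolved
}

// Usage inside a grader:
// const safeCwd = assertInsideWorkspace(cwd, './workspaces')
// const status = await Bun.$`git -C ${safeCwd} status --porcelain`.text()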

Schema Exports

Consumers can import Zod schemas directly:

import { CaptureResultSchema, TrialResultSchema } from '@plaited/agent-eval-harness/schemas'

// Validate external data
const result = CaptureResultSchema.parse(jsonData)

// Generate JSON Schema (Zod 4 native)
import { z } from 'zod'
const jsonSchema = z.toJSONSchema(CaptureResultSchema)

Discriminated Unions for Reliability Metrics

Reliability metrics include a type discriminator for type-safe parsing:

import { z } from 'zod'
import {
  ReliabilityMetricsSchema,       // type: 'run'
  TrialsReliabilityMetricsSchema, // type: 'trial'
} from '@plaited/agent-eval-harness/schemas'

// Create a unified schema for both metric types
const UnifiedReliabilitySchema = z.discriminatedUnion('type', [
  ReliabilityMetricsSchema,
  TrialsReliabilityMetricsSchema,
])

// Type-safe parsing with automatic narrowing
const metrics = UnifiedReliabilitySchema.parse(data)
if (metrics.type === 'run') {
  // TypeScript knows: ReliabilityMetrics
  console.log(metrics.toolErrors, metrics.completionRate)
} else {
  // TypeScript knows: TrialsReliabilityMetrics
  console.log(metrics.avgPassExpK, metrics.medianPassExpK)
}

Or export JSON schemas for non-TypeScript tools:

bunx @plaited/agent-eval-harness schemas --json -o schemas.json
bunx @plaited/agent-eval-harness schemas CaptureResult --json

Execution Environment

Recommendation: Run the harness in Docker containers for consistent, isolated execution.

Run integration tests via Docker

docker compose -f docker-compose.test.yml run --rm test

Or with explicit API keys

ANTHROPIC_API_KEY=sk-... GEMINI_API_KEY=... docker compose -f docker-compose.test.yml run --rm test

Docker Requirements

| Requirement | Reason |
| --- | --- |
| Node.js 24+ | Gemini CLI uses modern JS features (optional chaining) |
| Non-root user | Claude CLI blocks --dangerously-skip-permissions as root |
| Gemini API key | Pass GEMINI_API_KEY for Gemini CLI |

See docker-evals.md for complete Docker setup guide, debugging tips, and CI integration patterns.

Multi-turn Conversations

Use input: string[] to execute multi-turn conversations within a single session:

{"id":"context-001","input":["Remember this number: 42","What number did I ask you to remember?"],"hint":"42"} {"id":"context-002","input":["My name is Alice","What is my name?"],"hint":"Alice"}

Run with the headless adapter:

Using Claude Code via headless adapter

bunx @plaited/agent-eval-harness capture multi-turn.jsonl \
  bunx @plaited/agent-eval-harness headless --schema ./claude-headless.json \
  -o results.jsonl

Using Gemini CLI via headless adapter

GEMINI_API_KEY=... bunx @plaited/agent-eval-harness capture multi-turn.jsonl \
  bunx @plaited/agent-eval-harness headless --schema ./gemini-headless.json \
  -o results.jsonl

Key points:

  • Each JSONL entry = 1 fresh session

  • input: string[] sends sequential turns to the same session

  • Works with both stream mode (Claude) and iterative mode (Gemini)

  • The adapter handles context preservation automatically

Downstream Integration

The harness outputs standard JSONL that pipes to any tool:

Filter with jq

cat results.jsonl | jq 'select(.metadata.category == "ui")'

Count tool usage

cat results.jsonl | jq -s 'map(.trajectory | map(select(.type == "tool_call")) | length) | add'

Summarize for quick analysis

bunx @plaited/agent-eval-harness summarize results.jsonl -o summary.jsonl

Compare runs with built-in strategies

bunx @plaited/agent-eval-harness compare run1.jsonl run2.jsonl -o comparison.json
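The same files can be consumed directly from TypeScript. A minimal sketch (field names follow the Output Format section above) that tallies tool calls and average duration across a run:

// analyze-results.ts (illustrative)
type TrajectoryStep = { type: string }
type CaptureResult = { id: string; trajectory?: TrajectoryStep[]; timing?: { total?: number } }

const lines = (await Bun.file('results.jsonl').text()).split('\n').filter(Boolean)
const results = lines.map(line => JSON.parse(line) as CaptureResult)

const toolCalls = results
  .flatMap(r => r.trajectory ?? [])
  .filter(step => step.type === 'tool_call').length

const avgTotalMs =
  results.reduce((sum, r) => sum + (r.timing?.total ?? 0), 0) / Math.max(results.length, 1)

console.log(`results: ${results.length}, tool calls: ${toolCalls}, avg total: ${avgTotalMs.toFixed(0)}ms`)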

Quick Reference

| Resource | Description |
| --- | --- |
| bunx @plaited/agent-eval-harness | CLI help |
| output-formats.md | JSONL schemas, command details |
| inline-graders.md | Single input/output graders (TypeScript, Python, shell) |
| comparison-graders.md | Comparison strategies (weighted, statistical, LLM-as-Judge) |
| calibration.md | Grader calibration workflow |
| eval-concepts.md | Evaluation concepts (pass@k, pass^k) |
| docker-evals.md | Docker setup, debugging, CI integration |

Related

  • headless-adapters skill - Schema-driven adapters for headless CLI agents
