agent:eval

Agent Evaluation System


Install skill "agent:eval" with this command: npx skills add ikatsuba/skills/ikatsuba-skills-agent-eval


Guides the user through building a comprehensive evaluation system for their AI agent. Applies patterns 10-17 from "Patterns for Building AI Agents" (Bhagwat & Gienow, 2025): failure mode taxonomy, business metrics, cross-referencing, iterating against evals, test suites, SME labeling, production datasets, and live evaluation.

When to use

Use this skill when the user needs to:

  • Define what "good" looks like for an AI agent

  • Create a failure mode taxonomy

  • Set up business metrics for agent performance

  • Build an evaluation test suite

  • Design SME labeling workflows

  • Plan production data evaluation pipelines

Instructions

Step 1: Understand the Agent

Use the AskUserQuestion tool to gather context:

  • What does the agent do? (domain, tasks, outputs)

  • Who are the end users?

  • What are the consequences of wrong outputs? (low = inconvenience, high = financial/legal/safety)

  • Is there an existing agent design? (check .specs/<spec-name>/)

  • Do you have existing test data or production logs?

Read any existing spec documents before proceeding.

Step 2: List Failure Modes (Pattern 10)

Build a classification of failure reasons. LLM outputs are nondeterministic — you need to understand not just WHAT fails, but WHY.

Use AskUserQuestion to explore failure categories with the user. Start with these common categories and adapt to the domain:

| Category | Description | Example |
| --- | --- | --- |
| Data Quality | Agent received wrong, incomplete, or ambiguous input | Missing fields, contradictory data |
| Reasoning Failure | Agent had correct data but drew wrong conclusions | Incorrect logic chain, hallucinated facts |
| Rule Misapplication | Agent misapplied domain-specific rules or policies | Wrong insurance code, incorrect legal precedent |
| Tool Failure | External tool/API call failed or returned unexpected results | Timeout, wrong API response format |
| Context Failure | Agent lost track of important context | Forgot earlier constraint, ignored user correction |
| Output Format | Correct answer but wrong format or structure | Missing required fields, wrong data types |

Ask the user to identify domain-specific failure modes.

Output:

Failure Mode Taxonomy

| ID | Category | Failure Mode | Description | Severity |
| --- | --- | --- | --- | --- |
| F1 | Reasoning | [Name] | [Description] | Critical / High / Medium / Low |
| F2 | Data Quality | [Name] | [Description] | Critical / High / Medium / Low |
| F3 | [Domain] | [Name] | [Description] | Critical / High / Medium / Low |
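A taxonomy like this can be kept in code so that test cases and SME labels reference failure modes by ID. A minimal Python sketch; the two entries below are illustrative placeholders, not part of the pattern:

```python
from dataclasses import dataclass
from enum import Enum

class Severity(Enum):
    CRITICAL = "critical"
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"

@dataclass(frozen=True)
class FailureMode:
    id: str           # e.g. "F1"
    category: str     # e.g. "Reasoning"
    name: str
    description: str
    severity: Severity

# Illustrative entries only; replace with your domain's taxonomy.
TAXONOMY = {
    fm.id: fm
    for fm in [
        FailureMode("F1", "Reasoning", "Hallucinated fact",
                    "Agent asserts information not present in its inputs",
                    Severity.CRITICAL),
        FailureMode("F2", "Data Quality", "Missing input field",
                    "Required field absent from the agent's input",
                    Severity.HIGH),
    ]
}
```

Keeping the taxonomy as data lets later steps (labeling schema, test case `failure_modes` tags) validate IDs against a single source of truth.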

Step 3: List Critical Business Metrics (Pattern 11)

Define metrics that connect agent performance to business value. Use AskUserQuestion to identify metrics in three categories:

  1. Accuracy metrics (baseline):
     • False positive rate
     • False negative rate
     • Overall accuracy / F1 score
  2. Domain-specific outcome metrics:
     • What domain-specific outcomes matter? (e.g., missed critical terms in legal, dollar loss in finance, resolution time in support)
  3. Human team metrics:
     • How does the equivalent human team perform?
     • What is the target agent performance vs. the human baseline?

Ask the user to identify the north star metric — the single most important metric.

Output:

Business Metrics

North Star Metric

[Metric name]: [Description and why it matters most]
Current baseline: [Human performance or current agent performance]
Target: [Goal]

Accuracy Metrics

| Metric | Current | Target | Measurement |
| --- | --- | --- | --- |
| False positive rate | [X%] | [Y%] | [How measured] |
| False negative rate | [X%] | [Y%] | [How measured] |
| Overall accuracy | [X%] | [Y%] | [How measured] |

Domain-Specific Metrics

| Metric | Current | Target | Business Impact |
| --- | --- | --- | --- |
| [Metric 1] | [X] | [Y] | [Why it matters] |
| [Metric 2] | [X] | [Y] | [Why it matters] |
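The baseline accuracy metrics fall out of a confusion matrix. A small sketch (the counts in the example call are made up):

```python
def accuracy_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Baseline accuracy metrics from confusion-matrix counts."""
    total = tp + fp + tn + fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {
        "false_positive_rate": fp / (fp + tn) if fp + tn else 0.0,
        "false_negative_rate": fn / (fn + tp) if fn + tp else 0.0,
        "accuracy": (tp + tn) / total if total else 0.0,
        "f1": (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0),
    }

# Example with invented counts:
m = accuracy_metrics(tp=80, fp=10, tn=90, fn=20)
```

These feed the Accuracy Metrics table directly; domain-specific metrics need their own per-domain measurement code.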

Step 4: Cross-Reference Failure Modes and Metrics (Pattern 12)

Map which failure modes drive which metrics. This turns metrics into actionable improvement work.

Failure Mode → Metric Impact Matrix

| Failure Mode | North Star Impact | Other Metrics Affected | Priority |
| --- | --- | --- | --- |
| F1: [Name] | HIGH — directly causes [metric] regression | [Other metrics] | P0 |
| F2: [Name] | MEDIUM — contributes to [metric] | [Other metrics] | P1 |
| F3: [Name] | LOW — rare but severe | [Other metrics] | P2 |
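The matrix can be expressed as a mapping so the prioritization is reproducible; the failure-mode IDs and metric names below are placeholders:

```python
# Hypothetical failure-mode -> metric impact mapping.
IMPACT = {
    "F1": {"north_star": "high", "metrics": ["accuracy", "false_negative_rate"]},
    "F2": {"north_star": "medium", "metrics": ["accuracy"]},
    "F3": {"north_star": "low", "metrics": ["latency"]},
}

RANK = {"high": 0, "medium": 1, "low": 2}

def prioritize(impact: dict) -> list[str]:
    """Order failure modes by north-star impact (P0 first)."""
    return sorted(impact, key=lambda fm: RANK[impact[fm]["north_star"]])
```

A real version would also weight by frequency and severity; this sketch only encodes the north-star ordering shown in the matrix.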

Define the improvement cycle:

Improvement Cycle

  1. SME Review — Domain experts review agent outputs, classify failure modes
  2. PM Prioritization — Cross-reference metrics + failure modes, set next target
    • Current: [X%] → Next target: [Y%]
  3. Engineering — Experiment with fixes using failure-mode-specific datasets
  4. Validation — Test against past production data, decide go/no-go

Step 5: Design Eval Test Suite (Patterns 13-14)

Help the user build an evaluation test suite.

Use AskUserQuestion to determine data sources:

  • Synthetic data — Use LLM to generate test cases (fastest to start)

  • Internal user data — Real data from internal testing

  • SME golden answers — Expert-created input/output pairs (highest quality)

  • Production data — Real user interactions (most realistic, available later)

Test suite structure:

Eval Test Suite

Suite Metadata

  • Total test cases: [N]
  • Data sources: [Synthetic / Internal / SME / Production]
  • Evaluation method: [LLM-as-judge / Exact match / Human review]
  • CI integration: [Yes/No — run on every code change]

Evaluation Criteria

| Criterion | Weight | Scoring | Description |
| --- | --- | --- | --- |
| Accuracy | 40% | Binary (pass/fail) | Factually correct output |
| Completeness | 25% | Binary | All required information present |
| Relevance | 20% | Binary | Focused on the user's actual question |
| Format | 15% | Binary | Correct structure and data types |

Regression Policy

  • Merge blocker: Any change that reduces overall accuracy below [X%]
  • Review required: Any change that regresses accuracy by > [Y%]
  • Paired improvements: If a regression in one area is necessary, pair with offsetting improvements elsewhere
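The regression policy reduces to a small gate function that CI can call after an eval run; the default thresholds here are illustrative, not recommendations:

```python
def merge_gate(baseline: float, candidate: float,
               floor: float = 0.90, review_delta: float = 0.02) -> str:
    """Regression policy sketch. Thresholds are placeholders.

    "block"  - candidate accuracy fell below the hard floor [X%]
    "review" - candidate regressed vs. baseline by more than [Y%]
    "pass"   - otherwise
    """
    if candidate < floor:
        return "block"
    if baseline - candidate > review_delta:
        return "review"
    return "pass"
```

Wiring this into CI (run suite, compute accuracy, call the gate, fail the build on "block") is what makes the eval suite a merge blocker rather than a dashboard.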

Test Case Template

| Field | Description |
| --- | --- |
| id | Unique test case identifier |
| input | The user input / agent prompt |
| expected_output | The correct or ideal response |
| failure_modes | Which failure modes this tests (F1, F2, ...) |
| metadata | Source, date added, domain category |

Scoring recommendation: Use binary (pass/fail) or categorical (good/fair/poor) scoring. Avoid numerical scales (1-10) — LLMs are better at categorical than numerical judgment.
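With binary per-criterion scoring, the weighted criteria table collapses to a weighted sum. A sketch using the example weights above:

```python
# Weights from the example criteria table; adjust per domain.
WEIGHTS = {"accuracy": 0.40, "completeness": 0.25,
           "relevance": 0.20, "format": 0.15}

def score(results: dict[str, bool]) -> float:
    """Combine per-criterion pass/fail results into a score in [0, 1]."""
    return sum(w for name, w in WEIGHTS.items() if results[name])

# Example: everything passes except relevance.
s = score({"accuracy": True, "completeness": True,
           "relevance": False, "format": True})
```

Note this keeps each criterion binary, as recommended; only the aggregate is numeric.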

Step 6: SME Labeling Plan (Pattern 15)

Design how subject matter experts will validate agent outputs.

Use AskUserQuestion to understand:

  • Who are the domain experts? (role, availability)

  • What tools will they use for labeling? (custom UI, spreadsheet, observability tool)

  • How many annotators per data point? (recommend 2+ for inter-rater reliability)

SME Labeling Plan

Annotators

RoleCountDomainAvailability
[Role 1][N][Domain area][Hours/week]

Labeling Schema

Each review includes:

  1. Overall grade: Pass / Partial / Fail
  2. Category tags: [List of failure mode IDs that apply]
  3. Subjective feedback: Free-text explanation (optional)

Labeling Workflow

  1. Agent generates output → logged to observability tool
  2. Automated flags trigger review (guardrail violations, CI failures, low-confidence outputs)
  3. Random sampling of unflagged outputs ([X%] sample rate)
  4. SME reviews full trace: user input → tool calls → reasoning → output
  5. SME labels using schema above
  6. Labels feed back into eval test suite
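Steps 2-3 of the workflow (flag-triggered review plus random sampling of unflagged outputs) can be sketched as follows; the 10% rate and fixed seed are illustrative:

```python
import math
import random

def select_for_review(traces: list[dict], sample_rate: float = 0.10,
                      seed: int = 0) -> list[dict]:
    """All flagged traces plus a random sample of unflagged ones.

    Each trace is assumed to carry a boolean "flagged" key; the
    sample rate and seed are placeholders for real config.
    """
    flagged = [t for t in traces if t["flagged"]]
    unflagged = [t for t in traces if not t["flagged"]]
    k = math.ceil(sample_rate * len(unflagged))
    return flagged + random.Random(seed).sample(unflagged, k)
```

Sampling unflagged outputs matters: without it, SMEs only ever see failures the automated flags already know about.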

Inter-Rater Reliability

  • Metric: Cohen's Kappa / Fleiss' Kappa
  • Target: > 0.7 (substantial agreement)
  • Calibration: Weekly sync to align on edge cases
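Cohen's kappa for two annotators can be computed directly from their label lists; a minimal sketch:

```python
from collections import Counter

def cohens_kappa(a: list[str], b: list[str]) -> float:
    """Cohen's kappa for two annotators labeling the same items."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    # Chance agreement from each annotator's label distribution.
    expected = sum(ca[lbl] * cb[lbl] for lbl in ca.keys() | cb.keys()) / n**2
    return (observed - expected) / (1 - expected) if expected != 1 else 1.0
```

For three or more annotators, switch to Fleiss' kappa as noted above; libraries such as scikit-learn also provide `cohen_kappa_score` if you prefer not to hand-roll it.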

Step 7: Production Data Pipeline (Patterns 16-17)

Design how production data flows into the evaluation system.

Production Data Pipeline

Data Collection

  • Observability tool: [Tool name — e.g., LangSmith, Braintrust, custom]
  • Logged fields: Input, output, tool calls, latency, token usage, model version
  • Storage: [Where datasets are stored; use a versioned store, not loose JSONL files]

Live Evaluation

  • Method: LLM-as-judge with defined evaluation prompt
  • Scoring: [Binary / Categorical] — strongly recommended over numerical
  • Sampling: Evaluate [X%] of production responses
  • Frequency: [Real-time / Hourly / Daily batch]

Evaluation Prompt Template

You are evaluating an AI agent's response.

User input: {input}
Agent output: {output}
Expected behavior: {criteria}

Grade the response as PASS or FAIL. Explain your reasoning in one sentence.
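A judge built on this template might look like the sketch below; `call_llm` is a placeholder for whatever model client the team uses, and the PASS/FAIL parsing is deliberately simple:

```python
PROMPT = """You are evaluating an AI agent's response.

User input: {input}
Agent output: {output}
Expected behavior: {criteria}

Grade the response as PASS or FAIL. Explain your reasoning in one sentence."""

def build_judge_prompt(record: dict, criteria: str) -> str:
    """Fill the evaluation prompt from a logged production record."""
    return PROMPT.format(input=record["input"],
                         output=record["output"],
                         criteria=criteria)

def judge(record: dict, criteria: str, call_llm) -> bool:
    """call_llm is a stand-in for your model client; it should take a
    prompt string and return the judge model's raw text."""
    verdict = call_llm(build_judge_prompt(record, criteria))
    return verdict.strip().upper().startswith("PASS")
```

Keeping the verdict binary matches the scoring recommendation in Step 5 and makes sampled production scores directly comparable to the offline suite.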

Dataset Versioning

  • Version datasets when: new failure modes discovered, distribution shift detected
  • Store: inputs, expected outputs, metadata (source, date, failure mode tags)
  • Review cadence: [Weekly / Monthly] — check if synthetic data still matches production reality

Feedback Loop

Production data → SME review → New test cases → Eval suite update → CI regression check

Step 8: Generate Eval Document

Compile all outputs into .specs/<spec-name>/agent-eval.md .

Step 9: Offer Next Steps

Use AskUserQuestion to offer:

  • Create initial test cases — generate synthetic eval data based on the failure modes

  • Proceed to security audit — run agent:secure

  • Full review — run agent:review

Arguments

  • <spec-name> (optional) — reads the existing agent design from .specs/<spec-name>/

Examples:

  • agent:eval customer-support — design eval system for the customer-support agent

  • agent:eval — start fresh, will ask for details


Related Skills

  • agent:memory

  • agent:design

  • agent:prompt