agent:eval

Agent Evaluation System


Install skill "agent:eval" with this command: npx skills add ikatsuba/skills/ikatsuba-skills-agent-eval


Guides the user through building a comprehensive evaluation system for their AI agent. Applies patterns 10-17 from "Patterns for Building AI Agents" (Bhagwat & Gienow, 2025): failure mode taxonomy, business metrics, cross-referencing, iterating against evals, test suites, SME labeling, production datasets, and live evaluation.

When to use

Use this skill when the user needs to:

  • Define what "good" looks like for an AI agent

  • Create a failure mode taxonomy

  • Set up business metrics for agent performance

  • Build an evaluation test suite

  • Design SME labeling workflows

  • Plan production data evaluation pipelines

Instructions

Step 1: Understand the Agent

Use the AskUserQuestion tool to gather context:

  • What does the agent do? (domain, tasks, outputs)

  • Who are the end users?

  • What are the consequences of wrong outputs? (low = inconvenience, high = financial/legal/safety)

  • Is there an existing agent design? (check .specs/<spec-name>/)

  • Do you have existing test data or production logs?

Read any existing spec documents before proceeding.

Step 2: List Failure Modes (Pattern 10)

Build a classification of failure reasons. LLM outputs are nondeterministic — you need to understand not just WHAT fails, but WHY.

Use AskUserQuestion to explore failure categories with the user. Start with these common categories and adapt to the domain:

| Category | Description | Example |
| --- | --- | --- |
| Data Quality | Agent received wrong, incomplete, or ambiguous input | Missing fields, contradictory data |
| Reasoning Failure | Agent had correct data but drew wrong conclusions | Incorrect logic chain, hallucinated facts |
| Rule Misapplication | Agent misapplied domain-specific rules or policies | Wrong insurance code, incorrect legal precedent |
| Tool Failure | External tool/API call failed or returned unexpected results | Timeout, wrong API response format |
| Context Failure | Agent lost track of important context | Forgot earlier constraint, ignored user correction |
| Output Format | Correct answer but wrong format or structure | Missing required fields, wrong data types |

Ask the user to identify domain-specific failure modes.

Output:

Failure Mode Taxonomy

| ID | Category | Failure Mode | Description | Severity |
| --- | --- | --- | --- | --- |
| F1 | Reasoning | [Name] | [Description] | Critical / High / Medium / Low |
| F2 | Data Quality | [Name] | [Description] | Critical / High / Medium / Low |
| F3 | [Domain] | [Name] | [Description] | Critical / High / Medium / Low |
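A taxonomy like this can be kept in code so that test cases and SME labels reference failure modes by ID. A minimal Python sketch; the two entries below are illustrative placeholders, not part of the pattern:

```python
from dataclasses import dataclass
from enum import Enum

class Severity(Enum):
    CRITICAL = "critical"
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"

@dataclass(frozen=True)
class FailureMode:
    id: str           # e.g. "F1"
    category: str     # e.g. "Reasoning"
    name: str
    description: str
    severity: Severity

# Illustrative entries only; replace with your domain's taxonomy.
TAXONOMY = {
    fm.id: fm
    for fm in [
        FailureMode("F1", "Reasoning", "Hallucinated fact",
                    "Agent asserts information not present in its inputs",
                    Severity.CRITICAL),
        FailureMode("F2", "Data Quality", "Missing input field",
                    "Required field absent from the agent's input",
                    Severity.HIGH),
    ]
}
```

Keeping the taxonomy as data lets later steps (labeling schema, test case `failure_modes` tags) validate IDs against a single source of truth.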

Step 3: List Critical Business Metrics (Pattern 11)

Define metrics that connect agent performance to business value. Use AskUserQuestion to identify metrics in three categories:

  1. Accuracy metrics (baseline):
     • False positive rate
     • False negative rate
     • Overall accuracy / F1 score
  2. Domain-specific outcome metrics:
     • What domain-specific outcomes matter? (e.g., missed critical terms in legal, dollar loss in finance, resolution time in support)
  3. Human team metrics:
     • How does the equivalent human team perform?
     • What is the target agent performance vs. the human baseline?

Ask the user to identify the north star metric — the single most important metric.

Output:

Business Metrics

North Star Metric

[Metric name]: [Description and why it matters most]
Current baseline: [Human performance or current agent performance]
Target: [Goal]

Accuracy Metrics

| Metric | Current | Target | Measurement |
| --- | --- | --- | --- |
| False positive rate | [X%] | [Y%] | [How measured] |
| False negative rate | [X%] | [Y%] | [How measured] |
| Overall accuracy | [X%] | [Y%] | [How measured] |

Domain-Specific Metrics

| Metric | Current | Target | Business Impact |
| --- | --- | --- | --- |
| [Metric 1] | [X] | [Y] | [Why it matters] |
| [Metric 2] | [X] | [Y] | [Why it matters] |
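The baseline accuracy metrics fall out of a confusion matrix. A small sketch (the counts in the example call are made up):

```python
def accuracy_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Baseline accuracy metrics from confusion-matrix counts."""
    total = tp + fp + tn + fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {
        "false_positive_rate": fp / (fp + tn) if fp + tn else 0.0,
        "false_negative_rate": fn / (fn + tp) if fn + tp else 0.0,
        "accuracy": (tp + tn) / total if total else 0.0,
        "f1": (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0),
    }

# Example with invented counts:
m = accuracy_metrics(tp=80, fp=10, tn=90, fn=20)
```

These feed the Accuracy Metrics table directly; domain-specific metrics need their own per-domain measurement code.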

Step 4: Cross-Reference Failure Modes and Metrics (Pattern 12)

Map which failure modes drive which metrics. This turns metrics into actionable improvement work.

Failure Mode → Metric Impact Matrix

| Failure Mode | North Star Impact | Other Metrics Affected | Priority |
| --- | --- | --- | --- |
| F1: [Name] | HIGH — directly causes [metric] regression | [Other metrics] | P0 |
| F2: [Name] | MEDIUM — contributes to [metric] | [Other metrics] | P1 |
| F3: [Name] | LOW — rare but severe | [Other metrics] | P2 |
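The matrix can be expressed as a mapping so the prioritization is reproducible; the failure-mode IDs and metric names below are placeholders:

```python
# Hypothetical failure-mode -> metric impact mapping.
IMPACT = {
    "F1": {"north_star": "high", "metrics": ["accuracy", "false_negative_rate"]},
    "F2": {"north_star": "medium", "metrics": ["accuracy"]},
    "F3": {"north_star": "low", "metrics": ["latency"]},
}

RANK = {"high": 0, "medium": 1, "low": 2}

def prioritize(impact: dict) -> list[str]:
    """Order failure modes by north-star impact (P0 first)."""
    return sorted(impact, key=lambda fm: RANK[impact[fm]["north_star"]])
```

A real version would also weight by frequency and severity; this sketch only encodes the north-star ordering shown in the matrix.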

Define the improvement cycle:

Improvement Cycle

  1. SME Review — Domain experts review agent outputs, classify failure modes
  2. PM Prioritization — Cross-reference metrics + failure modes, set next target
    • Current: [X%] → Next target: [Y%]
  3. Engineering — Experiment with fixes using failure-mode-specific datasets
  4. Validation — Test against past production data, decide go/no-go

Step 5: Design Eval Test Suite (Patterns 13-14)

Help the user build an evaluation test suite.

Use AskUserQuestion to determine data sources:

  • Synthetic data — Use LLM to generate test cases (fastest to start)

  • Internal user data — Real data from internal testing

  • SME golden answers — Expert-created input/output pairs (highest quality)

  • Production data — Real user interactions (most realistic, available later)

Test suite structure:

Eval Test Suite

Suite Metadata

  • Total test cases: [N]
  • Data sources: [Synthetic / Internal / SME / Production]
  • Evaluation method: [LLM-as-judge / Exact match / Human review]
  • CI integration: [Yes/No — run on every code change]

Evaluation Criteria

| Criterion | Weight | Scoring | Description |
| --- | --- | --- | --- |
| Accuracy | 40% | Binary (pass/fail) | Factually correct output |
| Completeness | 25% | Binary | All required information present |
| Relevance | 20% | Binary | Focused on the user's actual question |
| Format | 15% | Binary | Correct structure and data types |

Regression Policy

  • Merge blocker: Any change that reduces overall accuracy below [X%]
  • Review required: Any change that regresses accuracy by > [Y%]
  • Paired improvements: If a regression in one area is necessary, pair with offsetting improvements elsewhere
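The regression policy reduces to a small gate function that CI can call after an eval run; the default thresholds here are illustrative, not recommendations:

```python
def merge_gate(baseline: float, candidate: float,
               floor: float = 0.90, review_delta: float = 0.02) -> str:
    """Regression policy sketch. Thresholds are placeholders.

    "block"  - candidate accuracy fell below the hard floor [X%]
    "review" - candidate regressed vs. baseline by more than [Y%]
    "pass"   - otherwise
    """
    if candidate < floor:
        return "block"
    if baseline - candidate > review_delta:
        return "review"
    return "pass"
```

Wiring this into CI (run suite, compute accuracy, call the gate, fail the build on "block") is what makes the eval suite a merge blocker rather than a dashboard.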

Test Case Template

| Field | Description |
| --- | --- |
| id | Unique test case identifier |
| input | The user input / agent prompt |
| expected_output | The correct or ideal response |
| failure_modes | Which failure modes this tests (F1, F2, ...) |
| metadata | Source, date added, domain category |

Scoring recommendation: Use binary (pass/fail) or categorical (good/fair/poor) scoring. Avoid numerical scales (1-10) — LLMs are better at categorical than numerical judgment.
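With binary per-criterion scoring, the weighted criteria table collapses to a weighted sum. A sketch using the example weights above:

```python
# Weights from the example criteria table; adjust per domain.
WEIGHTS = {"accuracy": 0.40, "completeness": 0.25,
           "relevance": 0.20, "format": 0.15}

def score(results: dict[str, bool]) -> float:
    """Combine per-criterion pass/fail results into a score in [0, 1]."""
    return sum(w for name, w in WEIGHTS.items() if results[name])

# Example: everything passes except relevance.
s = score({"accuracy": True, "completeness": True,
           "relevance": False, "format": True})
```

Note this keeps each criterion binary, as recommended; only the aggregate is numeric.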

Step 6: SME Labeling Plan (Pattern 15)

Design how subject matter experts will validate agent outputs.

Use AskUserQuestion to understand:

  • Who are the domain experts? (role, availability)

  • What tools will they use for labeling? (custom UI, spreadsheet, observability tool)

  • How many annotators per data point? (recommend 2+ for inter-rater reliability)

SME Labeling Plan

Annotators

RoleCountDomainAvailability
[Role 1][N][Domain area][Hours/week]

Labeling Schema

Each review includes:

  1. Overall grade: Pass / Partial / Fail
  2. Category tags: [List of failure mode IDs that apply]
  3. Subjective feedback: Free-text explanation (optional)

Labeling Workflow

  1. Agent generates output → logged to observability tool
  2. Automated flags trigger review (guardrail violations, CI failures, low-confidence outputs)
  3. Random sampling of unflagged outputs ([X%] sample rate)
  4. SME reviews full trace: user input → tool calls → reasoning → output
  5. SME labels using schema above
  6. Labels feed back into eval test suite
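Steps 2-3 of the workflow (flag-triggered review plus random sampling of unflagged outputs) can be sketched as follows; the 10% rate and fixed seed are illustrative:

```python
import math
import random

def select_for_review(traces: list[dict], sample_rate: float = 0.10,
                      seed: int = 0) -> list[dict]:
    """All flagged traces plus a random sample of unflagged ones.

    Each trace is assumed to carry a boolean "flagged" key; the
    sample rate and seed are placeholders for real config.
    """
    flagged = [t for t in traces if t["flagged"]]
    unflagged = [t for t in traces if not t["flagged"]]
    k = math.ceil(sample_rate * len(unflagged))
    return flagged + random.Random(seed).sample(unflagged, k)
```

Sampling unflagged outputs matters: without it, SMEs only ever see failures the automated flags already know about.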

Inter-Rater Reliability

  • Metric: Cohen's Kappa / Fleiss' Kappa
  • Target: > 0.7 (substantial agreement)
  • Calibration: Weekly sync to align on edge cases
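Cohen's kappa for two annotators can be computed directly from their label lists; a minimal sketch:

```python
from collections import Counter

def cohens_kappa(a: list[str], b: list[str]) -> float:
    """Cohen's kappa for two annotators labeling the same items."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    # Chance agreement from each annotator's label distribution.
    expected = sum(ca[lbl] * cb[lbl] for lbl in ca.keys() | cb.keys()) / n**2
    return (observed - expected) / (1 - expected) if expected != 1 else 1.0
```

For three or more annotators, switch to Fleiss' kappa as noted above; libraries such as scikit-learn also provide `cohen_kappa_score` if you prefer not to hand-roll it.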

Step 7: Production Data Pipeline (Patterns 16-17)

Design how production data flows into the evaluation system.

Production Data Pipeline

Data Collection

  • Observability tool: [Tool name — e.g., LangSmith, Braintrust, custom]
  • Logged fields: Input, output, tool calls, latency, token usage, model version
  • Storage: [Where datasets are stored; use a versioned store, not loose JSONL files]

Live Evaluation

  • Method: LLM-as-judge with defined evaluation prompt
  • Scoring: [Binary / Categorical] — strongly recommended over numerical
  • Sampling: Evaluate [X%] of production responses
  • Frequency: [Real-time / Hourly / Daily batch]

Evaluation Prompt Template

You are evaluating an AI agent's response.

User input: {input}
Agent output: {output}
Expected behavior: {criteria}

Grade the response as PASS or FAIL. Explain your reasoning in one sentence.
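A judge built on this template might look like the sketch below; `call_llm` is a placeholder for whatever model client the team uses, and the PASS/FAIL parsing is deliberately simple:

```python
PROMPT = """You are evaluating an AI agent's response.

User input: {input}
Agent output: {output}
Expected behavior: {criteria}

Grade the response as PASS or FAIL. Explain your reasoning in one sentence."""

def build_judge_prompt(record: dict, criteria: str) -> str:
    """Fill the evaluation prompt from a logged production record."""
    return PROMPT.format(input=record["input"],
                         output=record["output"],
                         criteria=criteria)

def judge(record: dict, criteria: str, call_llm) -> bool:
    """call_llm is a stand-in for your model client; it should take a
    prompt string and return the judge model's raw text."""
    verdict = call_llm(build_judge_prompt(record, criteria))
    return verdict.strip().upper().startswith("PASS")
```

Keeping the verdict binary matches the scoring recommendation in Step 5 and makes sampled production scores directly comparable to the offline suite.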

Dataset Versioning

  • Version datasets when: new failure modes discovered, distribution shift detected
  • Store: inputs, expected outputs, metadata (source, date, failure mode tags)
  • Review cadence: [Weekly / Monthly] — check if synthetic data still matches production reality

Feedback Loop

Production data → SME review → New test cases → Eval suite update → CI regression check

Step 8: Generate Eval Document

Compile all outputs into .specs/<spec-name>/agent-eval.md .

Step 9: Offer Next Steps

Use AskUserQuestion to offer:

  • Create initial test cases — generate synthetic eval data based on the failure modes

  • Proceed to security audit — run agent:secure

  • Full review — run agent:review

Arguments

  • <spec-name> (optional) — reads the existing agent design from .specs/<spec-name>/

Examples:

  • agent:eval customer-support — design eval system for the customer-support agent

  • agent:eval — start fresh, will ask for details


Related Skills

  • agent:memory

  • agent:design

  • agent:prompt