# QA Agent Testing (Jan 2026)

Design and run reliable evaluation suites for LLM agents/personas, including tool-using and multi-agent systems.

## Default QA Workflow
- Define the Persona Under Test (PUT): scope, out-of-scope behavior, and safety boundaries.
- Define 10 representative tasks (Must Ace).
- Define 5 refusal edge cases (Must Decline + redirect).
- Define an output contract (format, tone, structure, citations).
- Run the suite with determinism controls and tool tracing.
- Score with the 6-dimension rubric; track variance across reruns.
- Log baselines and regressions; gate merges/deploys on thresholds.
Use the copy-paste templates in assets/ for day-0 setup.
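As a day-0 sanity check, the suite shape from the workflow above (10 Must Ace tasks, 5 Must Decline cases, an output contract) can be encoded and validated in a few lines. A minimal Python sketch; the names here (`TaskCase`, `SuiteSpec`) are hypothetical, and the real starting point is the template in assets/:

```python
from dataclasses import dataclass, field

@dataclass
class TaskCase:
    name: str
    prompt: str
    kind: str  # "must_ace" or "must_decline"

@dataclass
class SuiteSpec:
    persona: str
    tasks: list = field(default_factory=list)
    output_contract: dict = field(default_factory=dict)

    def validate(self):
        # Enforce the suite shape the workflow calls for.
        ace = [t for t in self.tasks if t.kind == "must_ace"]
        decline = [t for t in self.tasks if t.kind == "must_decline"]
        assert len(ace) == 10, f"expected 10 Must Ace tasks, got {len(ace)}"
        assert len(decline) == 5, f"expected 5 refusal cases, got {len(decline)}"
        return True

# Illustrative persona and contract, not a real agent:
suite = SuiteSpec(
    persona="billing-support-agent",
    output_contract={"format": "markdown", "tone": "neutral", "citations": True},
)
for i in range(10):
    suite.tasks.append(TaskCase(f"task-{i}", f"prompt {i}", "must_ace"))
for i in range(5):
    suite.tasks.append(TaskCase(f"refusal-{i}", f"unsafe prompt {i}", "must_decline"))
print(suite.validate())  # -> True
```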
## Determinism and Flake Control

- Control inputs: pin prompts/config, fixtures, stable tool responses, frozen time/timezone where possible.
- Control sampling: fixed seeds/temperatures where supported; log model/config versions.
- Record tool traces: tool name, args, outputs, latency, errors, retries, and side effects.
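Tool tracing from the list above can be as simple as a wrapper that records each call before returning. A minimal sketch (the `lookup_order` fixture is a hypothetical deterministic stand-in for a live tool):

```python
import json
import time

def traced_call(tool_name, tool_fn, **args):
    """Wrap a tool call and record the fields listed above:
    name, args, output, latency, and errors."""
    record = {"tool": tool_name, "args": args}
    start = time.monotonic()
    try:
        record["output"] = tool_fn(**args)
        record["error"] = None
    except Exception as e:
        record["output"] = None
        record["error"] = repr(e)
    record["latency_ms"] = round((time.monotonic() - start) * 1000, 2)
    return record

# Deterministic fixture instead of a live tool:
def lookup_order(order_id):
    return {"order_id": order_id, "status": "shipped"}

trace = traced_call("lookup_order", lookup_order, order_id="A-123")
print(json.dumps(trace["output"]))  # -> {"order_id": "A-123", "status": "shipped"}
```

In a real harness the records would be appended to a run log so failures ship with evidence; retries and side effects would need to be recorded by the retry loop and the sandbox, respectively.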
## Two-Layer Evaluation (2026)

Evaluate the reasoning and action layers separately:

| Layer | What to Test | Key Metrics |
| --- | --- | --- |
| Reasoning | Planning, decision-making, intent | Intent resolution, task adherence, context retention |
| Action | Tool calls, execution, side effects | Tool call accuracy, completion rate, error recovery |
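Action-layer metrics such as tool call accuracy and completion rate fall out of recorded traces directly. A hedged sketch, assuming each run's trace has been reduced to per-call success flags (the trace shape is illustrative, not a fixed format):

```python
def action_metrics(runs):
    """runs: list of dicts with 'tool_calls' (list of (name, ok) pairs)
    and a 'completed' bool. Returns the two action-layer metrics."""
    calls = [ok for r in runs for (_name, ok) in r["tool_calls"]]
    return {
        "tool_call_accuracy": sum(calls) / len(calls) if calls else 1.0,
        "completion_rate": sum(r["completed"] for r in runs) / len(runs),
    }

# Two illustrative runs: one clean, one with a failed fetch.
runs = [
    {"tool_calls": [("search", True), ("fetch", True)], "completed": True},
    {"tool_calls": [("search", True), ("fetch", False)], "completed": False},
]
m = action_metrics(runs)
print(m["tool_call_accuracy"], m["completion_rate"])  # -> 0.75 0.5
```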
## Evaluation Dimensions (Score What Matters)

| Dimension | What to Measure | Level |
| --- | --- | --- |
| Task success | Correct outcome and constraints met | Agent |
| Safety/policy | Correct refusals and safe alternatives | Agent |
| Reliability | Stability across reruns and small prompt changes | Agent |
| Latency/cost | Budgets per task and per suite | Business |
| Debuggability | Failures produce evidence (logs, traces) | Agent |
| Factual grounding | Hallucination rate, citation accuracy | Model |
| Bias detection | Fairness across demographic inputs | Model |
## CI Economics

- PR gate: small, high-signal smoke eval suite.
- Scheduled: full scenario suites, adversarial inputs, and cost/latency regression checks (tracked separately from quality scoring).
## Robustness and Security Tests (Recommended)

- Metamorphic tests: run small, meaning-preserving prompt/input rewrites; enforce invariants on outputs.
- Prompt injection tests: treat tool outputs, retrieved text, and user-provided documents as untrusted; verify the agent does not follow embedded instructions that conflict with system/developer constraints.
- Tool fault injection: simulate timeouts, retries, partial data, and tool errors; verify graceful recovery.
- Differential testing: compare behavior across model/config versions for regressions and unexpected shifts.
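Tool fault injection can reuse the harness's deterministic mocks: wrap a mock so it fails on a fixed schedule, then assert the retry path recovers. A minimal sketch with hypothetical names (`FaultyTool`, `call_with_retry`):

```python
class FaultyTool:
    """Wrap a deterministic tool mock and inject a timeout on a fixed
    schedule, so fault runs stay reproducible."""
    def __init__(self, fn, fail_every=3):
        self.fn = fn
        self.fail_every = fail_every
        self.calls = 0

    def __call__(self, **args):
        self.calls += 1
        if self.calls % self.fail_every == 0:
            raise TimeoutError("injected timeout")
        return self.fn(**args)

def call_with_retry(tool, retries=2, **args):
    """Toy recovery policy: retry on timeout, then degrade gracefully."""
    for _attempt in range(retries + 1):
        try:
            return tool(**args)
        except TimeoutError:
            continue
    return {"error": "gave up"}

tool = FaultyTool(lambda **a: {"ok": True}, fail_every=3)
results = [call_with_retry(tool, order_id=i) for i in range(4)]
print(all(r.get("ok") for r in results))  # -> True (retries absorb the injected fault)
```

In a real suite the assertion would be on the agent's behavior around the fault (does it retry, surface a clear error, avoid fabricating data), not just on the wrapper.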
## Do / Avoid

Do:

- Use objective oracles (schema validation, golden traces, deterministic tool mocks) in addition to human review.
- Quarantine flaky evals with owners and expiry, just like flaky tests in CI.

Avoid:

- Evaluating only "happy prompts" with no tool failures and no adversarial inputs.
- Letting self-evaluations substitute for ground-truth checks.
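A schema-validation oracle, the first item in the Do list, can be a plain key-and-type check rather than anything model-based. A minimal sketch; the schema shown is illustrative:

```python
def schema_oracle(output, schema):
    """Objective oracle: check required keys and types instead of asking
    a judge model whether the output 'looks right'. Returns a list of
    violations; empty means pass."""
    errors = []
    for key, expected_type in schema.items():
        if key not in output:
            errors.append(f"missing key: {key}")
        elif not isinstance(output[key], expected_type):
            errors.append(f"wrong type for {key}")
    return errors

# Illustrative output contract for a citing, refusal-aware agent:
schema = {"answer": str, "citations": list, "refused": bool}
good = {"answer": "42", "citations": ["doc-1"], "refused": False}
bad = {"answer": "42", "refused": "no"}
print(schema_oracle(good, schema), schema_oracle(bad, schema))
# -> [] ['missing key: citations', 'wrong type for refused']
```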
## Quick Reference

| Need | Use | Location |
| --- | --- | --- |
| Build the 10 tasks | Task patterns + examples | references/test-case-design.md |
| Design refusals | Refusal categories + templates | references/refusal-patterns.md |
| Score runs | Detailed rubric + thresholds | references/scoring-rubric.md |
| Compute suite math quickly | CLI utility script | scripts/score_suite.py |
| Manage regressions | Re-run workflow + baseline policy | references/regression-protocol.md |
| Sandbox tools | Isolation tiers + hardening | references/tool-sandboxing.md |
| Test multi-agent systems | Coordination patterns + suite template | references/multi-agent-testing.md |
| Use LLM-as-judge safely | Biases + mitigations | references/llm-judge-limitations.md |
| Test prompt injection attacks | Injection taxonomy + test cases | references/prompt-injection-testing.md |
| Detect hallucinations | Detection methods + scoring | references/hallucination-detection.md |
| Design eval datasets | Dataset construction + maintenance | references/eval-dataset-design.md |
| Start from templates | Harness + scoring sheet + log | assets/ |
## Decision Tree

Testing an agent?

- New agent?
  - Create QA harness -> Define 10 tasks + 5 refusals -> Run baseline
- Prompt changed?
  - Re-run full 15-check suite -> Compare to baseline
- Tool/knowledge changed?
  - Re-run affected tests -> Log in regression log
- Quality review?
  - Score against rubric -> Identify weak areas -> Fix prompt
## Scoring and Gates

- Score each run with the 6-dimension rubric (0-3 each; max 18 per task).
- Prefer suite-level gating that accounts for variance; avoid treating non-determinism as a free pass.
- Use scripts/score_suite.py to compute averages, normalized scores, and basic PASS/CONDITIONAL/FAIL classification.
- For detailed methodology (including judge calibration and variance metrics), see references/scoring-rubric.md.
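The suite math is small enough to sketch inline: six dimensions scored 0-3 give a raw score out of 18, a normalized score, and a suite-level classification. The thresholds below are placeholders; the authoritative values live in references/scoring-rubric.md and scripts/score_suite.py:

```python
def score_task(dimension_scores):
    """6 dimensions, 0-3 each, max 18 per task; also return a 0-1
    normalized score."""
    assert len(dimension_scores) == 6
    raw = sum(dimension_scores)
    return raw, raw / 18

def classify_suite(task_raws, pass_avg=12.0, conditional_avg=9.0):
    """Suite-level gate on the average raw score. Thresholds here are
    illustrative assumptions, not the official rubric values."""
    avg = sum(task_raws) / len(task_raws)
    if avg >= pass_avg:
        return avg, "PASS"
    if avg >= conditional_avg:
        return avg, "CONDITIONAL"
    return avg, "FAIL"

raw, norm = score_task([3, 2, 2, 3, 1, 2])
avg, verdict = classify_suite([raw, 15, 10])
print(raw, round(norm, 2), verdict)  # -> 13 0.72 PASS
```

A variance-aware gate would also compare per-task scores across reruns (see the rubric reference) rather than gating on a single average.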
## Navigation

### Resources

- references/test-case-design.md
  - 10-task patterns + validation + metamorphic add-ons
- references/refusal-patterns.md
  - refusal categories + response templates + test tactics
- references/scoring-rubric.md
  - scoring guide, thresholds, variance metrics, judge calibration
- references/regression-protocol.md
  - re-run scope, baseline policy, recovery procedures
- references/tool-sandboxing.md
  - sandbox tiers, tool hardening, injection/exfil test ideas
- references/multi-agent-testing.md
  - coordination testing patterns + suite template
- references/llm-judge-limitations.md
  - LLM-as-judge biases, limits, mitigations
- references/prompt-injection-testing.md
  - injection taxonomy, test cases, defense validation
- references/hallucination-detection.md
  - hallucination detection methods, scoring, benchmarks
- references/eval-dataset-design.md
  - evaluation dataset construction, versioning, maintenance
### Templates

- assets/qa-harness-template.md
  - copy-paste harness
- assets/scoring-sheet.md
  - scoring tracker
- assets/regression-log.md
  - version tracking
### External Resources

See data/sources.json for:

- LLM evaluation research
- Red-teaming methodologies
- Prompt testing frameworks
### Related Skills

- qa-testing-strategy: ../qa-testing-strategy/SKILL.md - General testing strategies
- ai-prompt-engineering: ../ai-prompt-engineering/SKILL.md - Prompt design patterns
## Quick Start

1. Copy assets/qa-harness-template.md
2. Fill in the PUT (Persona Under Test) section
3. Define 10 representative tasks for your agent
4. Add 5 refusal edge cases
5. Specify output contracts
6. Run a baseline test
7. Log results in the regression log

Success Criteria: Each of the 10 tasks scores >= 12/18 and each refusal scores >= 2/3 (or PASS by your policy oracle), with stable results across reruns and no new hard failures.
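The success criteria above reduce to a one-function gate, sketched here with the thresholds stated in the text (per-task >= 12/18, per-refusal >= 2/3, zero new hard failures); the function name is illustrative:

```python
def gate(task_scores, refusal_scores, hard_failures=0):
    """Success criteria: every task >= 12/18, every refusal >= 2/3,
    and no new hard failures."""
    return (
        all(s >= 12 for s in task_scores)
        and all(s >= 2 for s in refusal_scores)
        and hard_failures == 0
    )

# Ten passing tasks and five passing refusals vs. one task just under threshold:
print(gate([14] * 10, [3, 2, 2, 3, 2]), gate([14] * 9 + [11], [3] * 5))
# -> True False
```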
## Fact-Checking

- Use web search/web fetch to verify current external facts, versions, pricing, deadlines, regulations, or platform behavior before final answers.
- Prefer primary sources; report source links and dates for volatile information.
- If web access is unavailable, state the limitation and mark guidance as unverified.