Agent QA Gates
A field-tested validation system for AI agent output. Born from production failures, not theory.
Quick Start
Before any agent delivers output, run the Pre-Ship Checklist:
- Accurate? — every number/date/metric has a source. Unsourced → prefix "estimated"
- Complete? — no missing pieces, no "I'll do that next"
- Actionable? — ends with clear next step or decision point
- Fits the channel? — check character limits for your delivery surface
- No leaks? — no internal context, private data, or secrets
- Not a duplicate? — verify no recent identical send
- Would the human be embarrassed? — if yes, don't ship
Gate Tiers
Four ascending tiers by risk level:
| Gate | Scope | Key Checks |
|---|---|---|
| Gate 0 | Internal (files, config, memory) | Mechanism changed not just text, no placeholders, file exists |
| Gate 1 | Human-facing (briefings, summaries) | Key info in first 2 lines, ≤3-line paragraphs, channel length limits |
| Gate 2 | External (email, public content, client materials) | No internal context leaked, recipient-appropriate tone, dedup check |
| Gate 3 | Code & technical | Builds clean, no secrets in code, error handling, tests pass |
See references/gates-detail.md for full gate checklists.
Severity Classification
Not all failures are equal:
- 🔴 BLOCK — cannot ship (secrets, privacy, hallucinated data, wrong recipient)
- 🟡 FIX — fix before shipping, <2 min (formatting, too long, missing citation)
- 🟢 NOTE — log and ship (style preference, minor optimization)
Protocol Gates
Recurring failure modes need dedicated gates. These are the most common:
Heartbeat / Periodic Check Output
- Binary output: alert text ONLY or status-OK ONLY. Never mixed.
- Every data point verified by current-session tool call. No hallucinated metrics.
- No stale data from previous cycles or pre-compaction sessions.
Post-Compaction / Context Reset
- Do not trust facts from the pre-reset session — verify from files and tools.
- Rerun pending checks from scratch.
- Zero carryover for periodic checks.
Scheduled Job / Cron Changes
- Explicit timeout set
- Explicit model set
- Verify schedule after creation
- Output fits destination channel limits
Sub-Agent Output Review
- Does output match the brief's success criteria?
- Any uncertainty flags unresolved?
- Is the reasoning (not just the conclusion) sound?
Gate Evolution
Gates should evolve based on real failures, not imagination:
- When a failure occurs → log it with root cause
- Same failure class occurs 2+ times → add a gate item
- Monthly: prune gates that haven't caught anything in 60 days
Anti-Patterns
- Gates that sound good but never catch anything → kill them
- Per-agent checklists that duplicate general gates → merge or reference
- "ADHD-friendly" or "high-quality" as gate items → not testable, replace with mechanical checks
- Aspirational gates nobody runs → either automate or cut
Adapting to Your System
This skill provides the pattern. Adapt it:
- Start with the Pre-Ship Checklist — it works for any agent system
- Add Protocol Gates for your top 3 recurring failure modes
- Set channel limits for your delivery surfaces
- Map real failures to gates — if a failure isn't gated, add the gate
- Kill gates that never fire — a shorter, sharper checklist wins
For the full reference implementation, see references/gates-detail.md.
For automation scripts, see scripts/qa-check.sh.