Hallucination Guard
4-layer defense against agent fabrication. Each layer is independent — use one or combine.
When Hallucinations Happen
Highest risk conditions (apply more layers when these are present):
- Extended sessions (>50 turns or >30min continuous work)
- Tasks involving file creation, code, git, or data analysis
- Agent reporting quantitative results (numbers, metrics, PnL)
- Multiple sequential "successes" with no errors or retries
Layer 0: Context Hygiene (Prevention)
Reduce hallucination probability before it starts.
For long tasks (>10 steps):
- Break into segments of ≤8 steps each
- Between segments: flush working state to a file, reload from file (not from in-context memory)
- Each segment starts with a `read` of the state file — never trust carried-over context for facts
For data-intensive tasks:
- Load source data from files at point of use, not from earlier context
- If a number was mentioned 20+ turns ago, re-read the source before citing it
Cost: Zero. This is a workflow discipline, not an API call.
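The flush/reload discipline above can be sketched in a few lines. This is a minimal illustration, not a prescribed API — the file name and state keys are hypothetical:

```python
import json
from pathlib import Path

STATE_FILE = Path("/tmp/task_state.json")  # illustrative path; use your task workspace

def flush_state(state: dict) -> None:
    # End of segment: persist working facts to disk.
    STATE_FILE.write_text(json.dumps(state, indent=2))

def load_state() -> dict:
    # Start of next segment: reload facts from the file, not from in-context memory.
    return json.loads(STATE_FILE.read_text()) if STATE_FILE.exists() else {}

# Segment boundary: flush, then treat the file as the only source of truth.
flush_state({"segment": 3, "rows_processed": 1200})
state = load_state()
```

Any fact cited in a later segment comes from `load_state()`, never from what the agent "remembers".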
Layer 1: Claim-Evidence Protocol (Detection)
Every agent claim of physical action must include tool-verified evidence.
The Rule
CLAIM: "I created/modified/committed X"
EVIDENCE: Tool output proving X exists and matches the claim
STATUS: VERIFIED (evidence confirms) or UNVERIFIED (no evidence yet)
Verification Commands by Claim Type
| Claim | Verify With |
|---|---|
| Created file | ls -la {path} && head -20 {path} |
| Modified file | grep -n '{expected_content}' {path} |
| Git commit | git log --oneline -3 |
| Git push | git log --oneline origin/{branch} -3 |
| Ran tests | Show actual test output (pass AND fail counts) |
| API response | Show raw response body |
| Data analysis | Show wc -l of source + sample rows |
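The table above can be wired into a small checker: run the verification command and let its exit code decide the status. A minimal sketch assuming shell access via `subprocess`; the claim-type keys and demo paths are illustrative:

```python
import subprocess

# Verification commands keyed by claim type (mirrors the table above).
VERIFY = {
    "created_file": "ls -la {path} && head -20 {path}",
    "modified_file": "grep -n '{expected}' {path}",
    "git_commit": "git log --oneline -3",
}

def check_claim(claim_type: str, **kwargs) -> str:
    # A claim stays UNVERIFIED until the tool output confirms it.
    cmd = VERIFY[claim_type].format(**kwargs)
    result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
    return "VERIFIED" if result.returncode == 0 else "UNVERIFIED"

# Demo: a file that exists verifies; a phantom path does not.
with open("/tmp/hg_demo.txt", "w") as f:
    f.write("hello")
status_ok = check_claim("created_file", path="/tmp/hg_demo.txt")
status_missing = check_claim("created_file", path="/tmp/hg_missing_demo.txt")
```

The exit code is a coarse signal; for content claims, also compare the captured `stdout` against what the agent asserted.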
Red Flags (claim likely fabricated)
- Claim references a file but no `read`/`exec` tool was called
- Exact round numbers in data (187 trades, +$126.50) without source
- "All tests passed" with no test output shown
- Multiple consecutive successes with zero errors
Cost: ~50 tokens per claim. One exec call per physical claim.
Layer 2: Cross-Model Audit (Verification)
Spawn a second agent (different model) to independently verify claims.
When to Use
- Critical outputs: financial reports, deployment decisions, data analysis
- When L1 evidence exists but numbers need independent validation
- After any task where the agent reported unusually perfect results
How to Run
See references/audit-prompt.md for the spawn template.
Key principles:
- Auditor receives ONLY the evidence (files, outputs) — not the original agent's conclusions
- Auditor independently extracts facts from evidence and compares to claims
- Auditor uses the cheapest model that can do the verification (flash for file checks, sonnet for logic)
Cost: 1 subagent spawn. Use flash/gemini for simple checks (~$0.001). Reserve sonnet/opus for complex logic verification.
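The evidence-only principle is easy to enforce mechanically when building the audit prompt. A sketch (the spawn call itself is environment-specific and omitted; prompt wording is illustrative, the real template lives in references/audit-prompt.md):

```python
def build_audit_prompt(claims: list[str], evidence: dict[str, str]) -> str:
    # The auditor sees raw evidence plus bare claims — never the original
    # agent's reasoning, narrative, or conclusions.
    evidence_block = "\n".join(
        f"--- {name} ---\n{body}" for name, body in evidence.items()
    )
    claim_block = "\n".join(f"- {c}" for c in claims)
    return (
        "You are an independent auditor. Using ONLY the evidence below,\n"
        "extract the facts and mark each claim VERIFIED or UNVERIFIED.\n\n"
        f"EVIDENCE:\n{evidence_block}\n\nCLAIMS:\n{claim_block}\n"
    )

prompt = build_audit_prompt(
    ["Committed fix to main"],
    {"git_log.txt": "abc1234 fix: handle empty input"},
)
```

Passing only file contents and tool outputs — not the agent's summary — is what makes the audit independent.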
Layer 3: Drift Detection (Monitoring)
Monitor long-running agent tasks for hallucination patterns.
When to Use
- Tasks expected to take >15 minutes
- Agent is working autonomously (coding agent, research agent)
- High-stakes tasks where undetected fabrication causes real damage
Setup
See references/drift-monitor.md for implementation.
Core signals:
- Claim/Tool Ratio: If claims > 3× tool calls → alert
- Zero-Error Streak: 8+ consecutive "successes" with 0 errors → suspicious
- Phantom References: Agent references files/branches never created → critical alert
Cost: Periodic check via sessions_history. No extra model calls unless alert triggers.
Choosing Layers
| Scenario | Recommended |
|---|---|
| Quick file creation | L1 only |
| Data report from CSV | L0 + L1 |
| Multi-step coding task | L0 + L1 + L2 |
| Autonomous long-running agent | All four layers |
| Routine conversation | None needed |
Integration with Other Skills
- War Room: Add L1 verification to each agent's output (verify cited data)
- Coding agents: Wrap with L3 drift monitor for long sessions
- Any task with `sessions_spawn`: Add L2 audit as a final verification step
References
- references/audit-prompt.md — Cross-model audit spawn template
- references/drift-monitor.md — Drift detection implementation
- references/taxonomy.md — Hallucination types with real-world examples