double-agent

This skill should be used when designing, implementing, or improving any AI system that requires quality assurance through separation of generation and evaluation roles. It implements the Generator-Evaluator dual-agent architecture (inspired by Anthropic's engineering blog and by GAN design), in which a Generator produces outputs and a dedicated Evaluator agent independently validates them through real interaction (e.g., Playwright browser operations), countering AI self-evaluation bias. Use when: building AI-generated UIs/code that needs quality checks, designing multi-agent pipelines with feedback loops, or upgrading existing coder+tester workflows to real interaction-based evaluation.

Install skill "double-agent" with this command: npx skills add zmy1006-sudo/double-agent

DoubleAgent Skill

Purpose

The DoubleAgent pattern solves a fundamental problem in AI-generated software: AI self-evaluation bias.

When a single AI agent both generates and evaluates its own output, it systematically overestimates quality: the same conflict of interest as a student grading their own exam. The solution is to forcibly separate the two cognitive roles into independent agents with different prompts, goals, and evaluation criteria.

This skill provides:

  1. Architecture templates for Generator-Evaluator agent pairs
  2. Evaluator prompt templates calibrated with few-shot scoring examples
  3. Iteration loop design for 5-15 round refinement cycles
  4. Playwright integration patterns for real browser-based evaluation
  5. Scoring rubric design to prevent score drift and grade inflation

Core Architecture

User Goal / Spec
      ↓
 ┌─────────────┐
 │  Generator  │ ← Produces output (code, UI, content, data)
 └──────┬──────┘
        │ output artifact
        ↓
 ┌─────────────────────────────────────┐
 │              Evaluator              │
 │  • Reads spec (NOT generator output)│
 │  • Operates artifact via Playwright │
 │    (click, fill form, navigate)     │
 │  • Scores on rubric (0-100)         │
 │  • Writes structured feedback       │
 └────────────────┬────────────────────┘
                  │ score + feedback
                  ↓
          ┌─────────────────┐
          │ Score ≥ target? │
          │   YES → Done    │
          │   NO → Loop     │
          └────────┬────────┘
                  │
                  └──→ Generator (next iteration)

Key principle: the Evaluator works from the original spec and the running artifact, never from the Generator's reasoning or self-assessment. It evaluates independently, as a real user encountering the product for the first time would.


When to Apply

| Scenario | Apply DoubleAgent? |
| --- | --- |
| AI-generated frontend UI with interactions | ✅ Yes |
| Multi-step workflow code (forms, flows) | ✅ Yes |
| API endpoint implementation + validation | ✅ Yes |
| Content generation (reports, copy, docs) | ✅ Yes (text-based evaluator) |
| Single-function refactoring | ⚠️ Optional |
| Simple config changes | ❌ Not needed |

Implementation Steps

Step 1: Define the Spec Contract

Write a clear spec that both agents will reference independently. The spec must be:

  • Concrete (measurable outcomes, not vague goals)
  • Observable (evaluable through interaction or inspection)
  • Versioned (so both agents work from the same contract)

See references/architecture.md for spec template.
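
For illustration, a minimal contract might look like the sketch below (field names are hypothetical, not the template from references/architecture.md):

spec = {
    "version": "1.2.0",
    "goal": "Newsletter signup form for the marketing site",
    "requirements": [
        # Concrete and observable: each line can be verified by interacting with the artifact
        {"id": "R1", "text": "Submitting a valid email shows a confirmation banner"},
        {"id": "R2", "text": "Submitting an invalid email shows an inline error"},
        {"id": "R3", "text": "The submit button is disabled while the request is in flight"},
    ],
}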

Step 2: Configure the Generator Agent

Assign the Generator a single role: produce output that satisfies the spec.

  • Do NOT ask the Generator to self-evaluate
  • Do NOT include evaluation criteria in the Generator's prompt
  • Provide: spec + iteration history + previous evaluator feedback
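
One way to assemble the Generator's input under these constraints, as a sketch (the prompt wording and agent interface are assumptions, not part of this skill's scripts):

def build_generator_prompt(spec, history):
    # Spec + iteration history + previous evaluator feedback -- and nothing about scoring
    lines = [f"Spec v{spec['version']}: {spec['goal']}"]
    lines += [f"- {r['id']}: {r['text']}" for r in spec["requirements"]]
    if history:
        last = history[-1]
        lines.append(f"Round {last['round']} evaluator feedback (score {last['score']}/100):")
        lines.append(last["feedback"])
    lines.append("Produce output that satisfies every requirement above.")
    return "\n".join(lines)

The prompt carries the Evaluator's feedback but never the rubric, so the Generator cannot optimize against the scoring function directly.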

Step 3: Configure the Evaluator Agent

Assign the Evaluator a single role: independently verify the spec is satisfied.

  • Load references/evaluator-prompts.md for calibrated prompt templates
  • Use Playwright MCP for UI/web artifacts (real browser interaction)
  • Use structured JSON output for scores to enable automated loop control
  • Calibrate with few-shot examples BEFORE running (prevents grade inflation)
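
The Evaluator's JSON output might carry a shape like the following (shown as the parsed Python value; field names are illustrative, not a fixed schema):

evaluation = {
    "score": 72,
    "per_requirement": [
        {"id": "R1", "pass": True,  "evidence": "r1.png"},
        {"id": "R2", "pass": False, "evidence": "r2.png"},
    ],
    "feedback": "R2: invalid email triggers a network request and shows no inline error.",
}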

Step 4: Design the Iteration Loop

MAX_ROUNDS = 15
PASS_THRESHOLD = 80   # out of 100
PLATEAU_WINDOW = 3    # rounds with no meaningful score gain

history = []          # shared memory: one entry per round

for round_num in range(1, MAX_ROUNDS + 1):
    output = generator.run(spec, history)      # Generator sees spec + prior feedback only
    evaluation = evaluator.run(spec, output)   # Evaluator sees spec + artifact (Playwright-based)

    history.append({"round": round_num, "score": evaluation.score,
                    "feedback": evaluation.feedback})

    if evaluation.score >= PASS_THRESHOLD:
        break

    # Plateau detection: the last few rounds show no meaningful improvement
    recent = [h["score"] for h in history[-PLATEAU_WINDOW:]]
    if len(recent) == PLATEAU_WINDOW and max(recent) - min(recent) <= 2:
        generator.switch_approach()  # Complete strategy reset

See scripts/iteration_loop.py for a complete implementation template.

Step 5: Calibrate the Evaluator

To prevent score drift, run the Evaluator on 3-5 known examples FIRST:

  • 1 example at ~30/100 (clearly bad)
  • 1 example at ~60/100 (mediocre)
  • 1 example at ~85/100 (good)
  • 1 example at ~95/100 (excellent)

If scores deviate >15 points from expected, adjust the Evaluator's prompt or rubric weights before the real run.
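
A minimal version of that check, assuming anchor artifacts are stored locally and the Evaluator exposes the same run() interface as in Step 4 (scripts/calibrate_evaluator.py is the complete utility):

ANCHORS = [  # (artifact, expected score) -- paths are hypothetical
    ("anchors/clearly_bad", 30),
    ("anchors/mediocre", 60),
    ("anchors/good", 85),
    ("anchors/excellent", 95),
]

def check_calibration(evaluator, spec, tolerance=15):
    ok = True
    for artifact, expected in ANCHORS:
        result = evaluator.run(spec, artifact)
        if abs(result.score - expected) > tolerance:
            print(f"DRIFT: {artifact} scored {result.score}, expected ~{expected}")
            ok = False
    return ok  # if False, adjust the prompt or rubric weights and re-run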


Scoring Rubric Design

Effective rubrics for software systems:

| Dimension | Weight | What to Measure |
| --- | --- | --- |
| Functional completeness | 30% | Does each spec requirement work end-to-end? |
| Interaction quality | 25% | Click/form/navigation behavior as a real user |
| Edge case handling | 20% | Error states, empty data, boundary inputs |
| Code/design quality | 15% | Consistency, readability, no obvious anti-patterns |
| Originality / craft | 10% | Avoids generic/template output when the spec requires uniqueness |

Adjust weights based on the domain. For content systems, increase "originality". For data pipelines, increase "edge case handling".
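
Combining the dimensions is then a weighted sum; a sketch, assuming the Evaluator reports each dimension on the same 0-100 scale:

WEIGHTS = {
    "functional_completeness": 0.30,
    "interaction_quality": 0.25,
    "edge_case_handling": 0.20,
    "code_design_quality": 0.15,
    "originality_craft": 0.10,
}

def overall_score(dimension_scores):
    # dimension_scores: {"functional_completeness": 80, ...}, each 0-100
    return sum(w * dimension_scores[d] for d, w in WEIGHTS.items())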


Playwright Integration (for UI artifacts)

When evaluating web/H5/mini-program outputs, the Evaluator should:

  1. Navigate to the deployed artifact URL
  2. Execute each spec requirement as a user action sequence
  3. Observe actual behavior (DOM state, network requests, visual output)
  4. Record pass/fail per requirement with screenshots
  5. Report structured JSON with score breakdown

Playwright MCP tool calls to use:

  • playwright_navigate → open URL
  • playwright_click → interact with elements
  • playwright_fill → fill form inputs
  • playwright_screenshot → capture evidence
  • playwright_get_visible_text → verify content
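
The same sequence in plain Python Playwright, as a reference for what the Evaluator's tool calls should accomplish (this sketch uses the standard playwright library rather than the MCP tools above; the URL, selectors, and expected text are placeholders):

from playwright.sync_api import sync_playwright

def check_requirement_r1(url):
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url)                               # playwright_navigate
        page.fill("#email", "user@example.com")      # playwright_fill
        page.click("button[type=submit]")            # playwright_click
        page.screenshot(path="evidence-r1.png")      # playwright_screenshot
        text = page.inner_text("body")               # playwright_get_visible_text
        browser.close()
    return {"id": "R1", "pass": "confirmation" in text.lower(),
            "evidence": "evidence-r1.png"}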

Reference Files

  • references/architecture.md — Detailed architecture patterns, spec templates, and design rationale
  • references/evaluator-prompts.md — Ready-to-use Evaluator prompt templates for different artifact types

Scripts

  • scripts/iteration_loop.py — Complete iteration loop implementation template
  • scripts/calibrate_evaluator.py — Evaluator calibration utility

