double-agent

This skill should be used when designing, implementing, or improving any AI system that requires quality assurance through separation of generation and evaluation roles. It implements the Generator-Evaluator dual-agent architecture (inspired by Anthropic's engineering blog and by GAN design), in which a Generator produces outputs and a dedicated Evaluator agent independently validates them through real interaction (e.g., Playwright browser operations), countering AI self-evaluation bias. Use when: building AI-generated UIs/code that needs quality checks, designing multi-agent pipelines with feedback loops, or upgrading existing coder+tester workflows to real interaction-based evaluation.

Install skill "double-agent" with this command: npx skills add zmy1006-sudo/double-agent

DoubleAgent Skill

Purpose

The DoubleAgent pattern solves a fundamental problem in AI-generated software: AI self-evaluation bias.

When a single AI agent both generates and evaluates its own output, it systematically overestimates quality: the same conflict of interest as a student grading their own exam. The solution is to forcibly separate the two cognitive roles into independent agents with different prompts, goals, and evaluation criteria.

This skill provides:

  1. Architecture templates for Generator-Evaluator agent pairs
  2. Evaluator prompt templates calibrated with few-shot scoring examples
  3. Iteration loop design for 5-15 round refinement cycles
  4. Playwright integration patterns for real browser-based evaluation
  5. Scoring rubric design to prevent score drift and grade inflation

Core Architecture

User Goal / Spec
      ↓
 ┌─────────────┐
 │  Generator  │ ← Produces output (code, UI, content, data)
 └──────┬──────┘
        │ output artifact
        ↓
 ┌─────────────────────────────────────┐
 │              Evaluator              │
 │  • Reads spec (NOT generator output)│
 │  • Operates artifact via Playwright │
 │    (click, fill form, navigate)     │
 │  • Scores on rubric (0-100)         │
 │  • Writes structured feedback       │
 └────────────────┬────────────────────┘
                  │ score + feedback
                  ↓
          ┌─────────────────┐
          │ Score ≥ target? │
          │   YES → Done    │
          │   NO → Loop     │
          └────────┬────────┘
                  │
                  └──→ Generator (next iteration)

Key principle: the Evaluator works from the original spec and the running artifact, never from the Generator's reasoning or self-assessment. It evaluates independently, as a real user encountering the product for the first time would.


When to Apply

| Scenario | Apply DoubleAgent? |
| --- | --- |
| AI-generated frontend UI with interactions | ✅ Yes |
| Multi-step workflow code (forms, flows) | ✅ Yes |
| API endpoint implementation + validation | ✅ Yes |
| Content generation (reports, copy, docs) | ✅ Yes (text-based evaluator) |
| Single-function refactoring | ⚠️ Optional |
| Simple config changes | ❌ Not needed |

Implementation Steps

Step 1: Define the Spec Contract

Write a clear spec that both agents will reference independently. The spec must be:

  • Concrete (measurable outcomes, not vague goals)
  • Observable (evaluable through interaction or inspection)
  • Versioned (so both agents work from the same contract)

See references/architecture.md for spec template.
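
For illustration, a minimal contract might look like the sketch below (field names are hypothetical, not the template from references/architecture.md):

spec = {
    "version": "1.2.0",
    "goal": "Newsletter signup form for the marketing site",
    "requirements": [
        # Concrete and observable: each line can be verified by interacting with the artifact
        {"id": "R1", "text": "Submitting a valid email shows a confirmation banner"},
        {"id": "R2", "text": "Submitting an invalid email shows an inline error"},
        {"id": "R3", "text": "The submit button is disabled while the request is in flight"},
    ],
}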

Step 2: Configure the Generator Agent

Assign the Generator a single role: produce output that satisfies the spec.

  • Do NOT ask the Generator to self-evaluate
  • Do NOT include evaluation criteria in the Generator's prompt
  • Provide: spec + iteration history + previous evaluator feedback
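
One way to assemble the Generator's input under these constraints, as a sketch (the prompt wording and agent interface are assumptions, not part of this skill's scripts):

def build_generator_prompt(spec, history):
    # Spec + iteration history + previous evaluator feedback -- and nothing about scoring
    lines = [f"Spec v{spec['version']}: {spec['goal']}"]
    lines += [f"- {r['id']}: {r['text']}" for r in spec["requirements"]]
    if history:
        last = history[-1]
        lines.append(f"Round {last['round']} evaluator feedback (score {last['score']}/100):")
        lines.append(last["feedback"])
    lines.append("Produce output that satisfies every requirement above.")
    return "\n".join(lines)

The prompt carries the Evaluator's feedback but never the rubric, so the Generator cannot optimize against the scoring function directly.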

Step 3: Configure the Evaluator Agent

Assign the Evaluator a single role: independently verify the spec is satisfied.

  • Load references/evaluator-prompts.md for calibrated prompt templates
  • Use Playwright MCP for UI/web artifacts (real browser interaction)
  • Use structured JSON output for scores to enable automated loop control
  • Calibrate with few-shot examples BEFORE running (prevents grade inflation)
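
The Evaluator's JSON output might carry a shape like the following (shown as the parsed Python value; field names are illustrative, not a fixed schema):

evaluation = {
    "score": 72,
    "per_requirement": [
        {"id": "R1", "pass": True,  "evidence": "r1.png"},
        {"id": "R2", "pass": False, "evidence": "r2.png"},
    ],
    "feedback": "R2: invalid email triggers a network request and shows no inline error.",
}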

Step 4: Design the Iteration Loop

MAX_ROUNDS = 15
PASS_THRESHOLD = 80   # out of 100
PLATEAU_WINDOW = 3    # rounds with no meaningful score gain

history = []          # shared memory: one entry per round

for round_num in range(1, MAX_ROUNDS + 1):
    output = generator.run(spec, history)      # Generator sees spec + prior feedback only
    evaluation = evaluator.run(spec, output)   # Evaluator sees spec + artifact (Playwright-based)

    history.append({"round": round_num, "score": evaluation.score,
                    "feedback": evaluation.feedback})

    if evaluation.score >= PASS_THRESHOLD:
        break

    # Plateau detection: the last few rounds show no meaningful improvement
    recent = [h["score"] for h in history[-PLATEAU_WINDOW:]]
    if len(recent) == PLATEAU_WINDOW and max(recent) - min(recent) <= 2:
        generator.switch_approach()  # Complete strategy reset

See scripts/iteration_loop.py for a complete implementation template.

Step 5: Calibrate the Evaluator

To prevent score drift, run the Evaluator on 3-5 known examples FIRST:

  • 1 example at ~30/100 (clearly bad)
  • 1 example at ~60/100 (mediocre)
  • 1 example at ~85/100 (good)
  • 1 example at ~95/100 (excellent)

If scores deviate >15 points from expected, adjust the Evaluator's prompt or rubric weights before the real run.
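
A minimal version of that check, assuming anchor artifacts are stored locally and the Evaluator exposes the same run() interface as in Step 4 (scripts/calibrate_evaluator.py is the complete utility):

ANCHORS = [  # (artifact, expected score) -- paths are hypothetical
    ("anchors/clearly_bad", 30),
    ("anchors/mediocre", 60),
    ("anchors/good", 85),
    ("anchors/excellent", 95),
]

def check_calibration(evaluator, spec, tolerance=15):
    ok = True
    for artifact, expected in ANCHORS:
        result = evaluator.run(spec, artifact)
        if abs(result.score - expected) > tolerance:
            print(f"DRIFT: {artifact} scored {result.score}, expected ~{expected}")
            ok = False
    return ok  # if False, adjust the prompt or rubric weights and re-run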


Scoring Rubric Design

Effective rubrics for software systems:

| Dimension | Weight | What to Measure |
| --- | --- | --- |
| Functional completeness | 30% | Does each spec requirement work end-to-end? |
| Interaction quality | 25% | Click/form/navigation behavior as a real user |
| Edge case handling | 20% | Error states, empty data, boundary inputs |
| Code/design quality | 15% | Consistency, readability, no obvious anti-patterns |
| Originality / craft | 10% | Avoids generic/template output when the spec requires uniqueness |

Adjust weights based on the domain. For content systems, increase "originality". For data pipelines, increase "edge case handling".
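
Combining the dimensions is then a weighted sum; a sketch, assuming the Evaluator reports each dimension on the same 0-100 scale:

WEIGHTS = {
    "functional_completeness": 0.30,
    "interaction_quality": 0.25,
    "edge_case_handling": 0.20,
    "code_design_quality": 0.15,
    "originality_craft": 0.10,
}

def overall_score(dimension_scores):
    # dimension_scores: {"functional_completeness": 80, ...}, each 0-100
    return sum(w * dimension_scores[d] for d, w in WEIGHTS.items())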


Playwright Integration (for UI artifacts)

When evaluating web/H5/mini-program outputs, the Evaluator should:

  1. Navigate to the deployed artifact URL
  2. Execute each spec requirement as a user action sequence
  3. Observe actual behavior (DOM state, network requests, visual output)
  4. Record pass/fail per requirement with screenshots
  5. Report structured JSON with score breakdown

Playwright MCP tool calls to use:

  • playwright_navigate → open URL
  • playwright_click → interact with elements
  • playwright_fill → fill form inputs
  • playwright_screenshot → capture evidence
  • playwright_get_visible_text → verify content
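
The same sequence in plain Python Playwright, as a reference for what the Evaluator's tool calls should accomplish (this sketch uses the standard playwright library rather than the MCP tools above; the URL, selectors, and expected text are placeholders):

from playwright.sync_api import sync_playwright

def check_requirement_r1(url):
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url)                               # playwright_navigate
        page.fill("#email", "user@example.com")      # playwright_fill
        page.click("button[type=submit]")            # playwright_click
        page.screenshot(path="evidence-r1.png")      # playwright_screenshot
        text = page.inner_text("body")               # playwright_get_visible_text
        browser.close()
    return {"id": "R1", "pass": "confirmation" in text.lower(),
            "evidence": "evidence-r1.png"}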

Reference Files

  • references/architecture.md — Detailed architecture patterns, spec templates, and design rationale
  • references/evaluator-prompts.md — Ready-to-use Evaluator prompt templates for different artifact types

Scripts

  • scripts/iteration_loop.py — Complete iteration loop implementation template
  • scripts/calibrate_evaluator.py — Evaluator calibration utility

