long-run-harness

Use when building a Planner→Generator→Evaluator multi-agent harness with the Claude SDK. Triggers: "build a harness", "multi-agent pipeline", "agent loop", "automate app building with Claude", "GAN-style agent system", "sprint-based Claude workflow", "I want Claude to plan, build, and evaluate automatically", "long-running orchestrator". NOT for: asking Claude to build an app directly, single-file edits, pure API usage questions.


Long-Running App Harness — SDK Implementation

Produces a runnable harness that orchestrates Claude agents via claude_agent_sdk. You are writing the harness, not running inside it.

Use query() + ClaudeAgentOptions for agentic loops; tool() + create_sdk_mcp_server() for structured output. Never anthropic.Anthropic() directly.

pip install claude-agent-sdk
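
A minimal sketch of both entry points together. submit_assessment previews the structured-output tool Phase 3 relies on; every name outside the SDK imports is illustrative:

import anyio
from claude_agent_sdk import (ClaudeAgentOptions, create_sdk_mcp_server,
                              query, tool)

@tool("submit_assessment", "Report a structured self-assessment",
      {"confident": bool, "concerns": str})
async def submit_assessment(args: dict) -> dict:
    # Structured fields arrive as args; no free-text parsing needed.
    return {"content": [{"type": "text", "text": "assessment recorded"}]}

harness_server = create_sdk_mcp_server(
    name="harness", version="1.0.0", tools=[submit_assessment])

async def main() -> None:
    options = ClaudeAgentOptions(
        mcp_servers={"harness": harness_server},
        allowed_tools=["mcp__harness__submit_assessment"],  # SDK's MCP tool naming
    )
    async for message in query(prompt="Assess the generated app", options=options):
        print(message)

anyio.run(main)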

Output structure:

harness/
  harness.py
  config.yaml
  config.py
  log.py
  agents/
    planner.py
    generator.py
    evaluator.py
  models/
    state.py
  prompts/
    planner.md
    generator.md
    evaluator.md
  harness-logs/

Routing

| User Signal | Route |
| --- | --- |
| "build a harness / pipeline" | Start at Phase 1 |
| "add an evaluator" | Jump to Phase 4 |
| "add state / handoff" | Jump to Phase 5 |
| "looping forever / broken" | Check feedback loop termination in Phase 5 |
| "just explain what a harness does" | Explain the concept, don't write code |

Phase 1: Design the Harness

Load: $SKILL_DIR/instructions/planner-questions.md

⚠️ HARD GATE: Ask the design questions. Get answers to 1–3 before writing any code:

  1. What does the harness build? (sets Generator tools + Evaluator rubric)
  2. Python or TypeScript? (default: Python)
  3. Models per agent? (default: all claude-opus-4-7; non-defaults → config.yaml)

Create skeleton:

mkdir -p harness/agents harness/models harness/prompts harness/harness-logs
touch harness/harness.py harness/log.py harness/agents/__init__.py harness/models/__init__.py

config.yaml + config.py — all tunable parameters here; never hardcode in agent files. Load: $SKILL_DIR/instructions/config.md for the full HarnessConfig dataclass.

cfg = HarnessConfig.load(Path(__file__).parent / "config.yaml")
# Always: cfg.agents.generator_model  — never: "claude-opus-4-7"
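
A hedged sketch of the dataclass shape (the authoritative version is in instructions/config.md; sub-section names beyond agents, loop, and verdict, and all default values, are assumptions):

from dataclasses import dataclass, field
from pathlib import Path
import yaml  # PyYAML, pulled in for config loading

@dataclass
class AgentConfig:
    planner_model: str = "claude-opus-4-7"
    generator_model: str = "claude-opus-4-7"
    evaluator_model: str = "claude-opus-4-7"

@dataclass
class LoopConfig:
    max_iterations: int = 8   # assumed default
    min_iterations: int = 2   # assumed default

@dataclass
class VerdictConfig:
    min_avg_score: float = 4.0  # assumed field name and threshold

@dataclass
class HarnessConfig:
    agents: AgentConfig = field(default_factory=AgentConfig)
    loop: LoopConfig = field(default_factory=LoopConfig)
    verdict: VerdictConfig = field(default_factory=VerdictConfig)

    @classmethod
    def load(cls, path: Path) -> "HarnessConfig":
        raw = yaml.safe_load(path.read_text()) or {}
        return cls(
            agents=AgentConfig(**raw.get("agents", {})),
            loop=LoopConfig(**raw.get("loop", {})),
            verdict=VerdictConfig(**raw.get("verdict", {})),
        )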

models/state.py — write first; all other files import from it. Load: $SKILL_DIR/instructions/context-handoff.md (HandoffState, EvalResult, format_handoff_for_prompt). Load: $SKILL_DIR/instructions/sprint-contracts.md (SprintContract + negotiation protocol).
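
One plausible shape for these models; the EvalResult fields follow the submit_grade schema in Phase 4, while SprintContract's fields and anything else not named on this page are assumptions:

from dataclasses import dataclass, field

@dataclass
class EvalResult:
    verdict: str                   # "PASS" | "FAIL", recomputed deterministically
    contract_results: list[dict]   # [{id, status, evidence}]
    rubric_scores: dict[str, int]  # {criterion_id: 1-5}
    feedback: str = ""

@dataclass
class SprintContract:
    sprint: int
    criteria: list[dict] = field(default_factory=list)  # negotiated pass/fail criteria

@dataclass
class HandoffState:
    sprint: int
    completed_features: list[str] = field(default_factory=list)
    open_issues: list[str] = field(default_factory=list)
    last_eval: EvalResult | None = None

def format_handoff_for_prompt(state: HandoffState) -> str:
    # Compact summary injected into the next Generator prompt.
    return (
        f"Sprint {state.sprint} handoff\n"
        f"Completed: {', '.join(state.completed_features) or 'none'}\n"
        f"Open issues: {', '.join(state.open_issues) or 'none'}"
    )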

log.py — dual stdout + timestamped file under harness-logs/. Load: $SKILL_DIR/instructions/logging.md for full implementation.

log.setup(PROJECT_DIR, label="run")  # once in main()
logger = log.get()                   # in every agent
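
A sketch of log.py consistent with that contract (the full implementation is in instructions/logging.md):

import logging
import sys
from datetime import datetime
from pathlib import Path

_logger: logging.Logger | None = None

def setup(project_dir: Path, label: str = "run") -> None:
    # Dual sink: stdout for live watching, timestamped file for post-mortems.
    global _logger
    log_dir = project_dir / "harness-logs"
    log_dir.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    logger = logging.getLogger("harness")
    logger.setLevel(logging.INFO)
    fmt = logging.Formatter("%(asctime)s %(levelname)s %(message)s")
    for handler in (logging.StreamHandler(sys.stdout),
                    logging.FileHandler(log_dir / f"{label}-{stamp}.log")):
        handler.setFormatter(fmt)
        logger.addHandler(handler)
    _logger = logger

def get() -> logging.Logger:
    assert _logger is not None, "call log.setup() before log.get()"
    return _logger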

Phase 2: Planner Agent

Load: $SKILL_DIR/instructions/planner-questions.md for system prompt template. Load: $SKILL_DIR/instructions/agent-patterns.md for full run_planner implementation.

run_planner(brief, session_id, cfg) → (reply, new_session_id). ClaudeAgentOptions(resume=session_id) continues the session without resending history.

spec, session_id = "", None
while "SPEC_COMPLETE" not in spec:
    user_input = input("[Planner asks]: ").strip() if session_id else initial_brief
    spec, session_id = run_planner(user_input, session_id, cfg)
SPEC_PATH.write_text(spec.replace("SPEC_COMPLETE", "").strip())
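
A sketch of run_planner itself, assuming the SDK's resume option; the prompt file path and cfg field names follow this page, the rest is illustrative (full version in agent-patterns.md):

import anyio
from pathlib import Path
from claude_agent_sdk import (AssistantMessage, ClaudeAgentOptions,
                              ResultMessage, TextBlock, query)

PLANNER_PROMPT = (Path(__file__).parent.parent / "prompts" / "planner.md").read_text()

def run_planner(user_input: str, session_id: str | None, cfg) -> tuple[str, str | None]:
    async def _run():
        options = ClaudeAgentOptions(
            model=cfg.agents.planner_model,
            system_prompt=PLANNER_PROMPT,
            resume=session_id,  # None on the first turn starts a fresh session
        )
        parts, new_id = [], session_id
        async for message in query(prompt=user_input, options=options):
            if isinstance(message, AssistantMessage):
                parts += [b.text for b in message.content if isinstance(b, TextBlock)]
            elif isinstance(message, ResultMessage):
                new_id = message.session_id  # resume token for the next turn
        return "".join(parts), new_id
    return anyio.run(_run)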

Phase 3: Generator Agent

Load: $SKILL_DIR/instructions/agent-patterns.md for run_generator + self_assess implementations.

def run_generator(
    spec, contract, project_dir,
    handoff=None, strategic_framing=None, cfg=None,
) -> str: ...

ClaudeAgentOptions(
    model=cfg.agents.generator_model,
    allowed_tools=["Write", "Read", "Edit", "Bash", "Glob"],
    cwd=str(project_dir), permission_mode="bypassPermissions",
)

After generation, call self_assess() — it catches gaps before the Evaluator sees the work, reporting through the submit_assessment MCP tool. If the Generator is not confident, run an extra pass with its concerns as strategic_framing, as sketched below.
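
A hedged sketch of that retry wiring (the self_assess signature and assessment keys are assumptions):

# self_assess() reports through the submit_assessment MCP tool sketched earlier.
assessment = self_assess(spec, contract, project_dir, cfg)
if not assessment["confident"]:
    run_generator(spec, contract, project_dir,
                  strategic_framing="Address first: " + assessment["concerns"],
                  cfg=cfg)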


Phase 4: Evaluator Agent

Load: $SKILL_DIR/instructions/agent-patterns.md for full implementation. Load: $SKILL_DIR/instructions/evaluation-rubrics.md for system prompt + rubric criteria.

Two roles: run_evaluator() (post-generation gate) + review_contract() (pre-sprint criteria review).

# submit_grade schema: contract_results[{id, status, evidence}], rubric_scores{id: 1–5}, feedback
def run_evaluator(spec, contract, app_url, rubric_track="A", cfg=None) -> EvalResult: ...

⚠️ Deterministic verdict: Never trust the verdict field the LLM reports. Recompute it in _build_eval_result() from contract_results + rubric_scores using cfg.verdict.* thresholds; see the sketch below.
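
A sketch of that recomputation, assuming a single cfg.verdict.min_avg_score threshold (the real cfg.verdict.* fields live in config.md):

def _build_eval_result(contract_results, rubric_scores, feedback, cfg) -> EvalResult:
    # Verdict is derived from structured data, never read off the LLM's reply.
    contracts_pass = all(r["status"] == "pass" for r in contract_results)
    avg = sum(rubric_scores.values()) / max(len(rubric_scores), 1)
    verdict = "PASS" if contracts_pass and avg >= cfg.verdict.min_avg_score else "FAIL"
    return EvalResult(verdict=verdict, contract_results=contract_results,
                      rubric_scores=rubric_scores, feedback=feedback)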


Phase 5: Harness Loop

Load: $SKILL_DIR/instructions/iteration-loop.md for run_sprint, strategic_decision, git_commit.

def main():
    cfg = HarnessConfig.load(Path(__file__).parent / "config.yaml")
    log.setup(PROJECT_DIR, label="run")

def run_sprint(spec, contract, project_dir, handoff=None, cfg=None):
    iteration = 0
    while iteration < cfg.loop.max_iterations:
        iteration += 1
        # 1. Generate — try/except; crash is a valid (poor) outcome
        # 2. Self-assess — extra pass if not confident
        # 3. git_commit("wip: sprint N iter I")
        # 4. Evaluate → EvalResult
        # 5a. Pass + iteration < min_iterations → quality-improvement continue
        #     Pass + min_iterations met → git_commit("feat") + return
        # 5b. Fail → strategic_decision() → REFINE or PIVOT → set strategic_framing
    # Exhausted: input() if isatty() else return last result

Git checkpoints (see iteration-loop.md for git_commit() helper):

| Event | Message |
| --- | --- |
| SPEC written | feat: generate SPEC.md |
| Contract negotiated | chore: sprint N contract |
| Each iteration | wip: sprint N iteration I |
| Sprint passes | feat: sprint N complete |
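
A minimal git_commit helper matching those checkpoints (sketch; the full helper lives in iteration-loop.md):

import subprocess
from pathlib import Path

def git_commit(project_dir: Path, message: str) -> None:
    subprocess.run(["git", "add", "-A"], cwd=project_dir, check=True)
    # --allow-empty keeps the checkpoint even when an iteration produced no diff.
    subprocess.run(["git", "commit", "--allow-empty", "-m", message],
                   cwd=project_dir, check=True)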

Setup: pip install claude-agent-sdk && export ANTHROPIC_API_KEY=sk-...
Verify: python -c "from agents.planner import run_planner; print('OK')"


Common Mistakes

| Mistake | Fix |
| --- | --- |
| Trusting LLM's verdict field | Recompute in _build_eval_result() from contract_results + rubric_scores |
| Hardcoding model names | Use cfg.agents.generator_model — never a string literal |
| Not calling handoff.save() before Evaluator | On crash, the Evaluator result is lost |
| Using input() in CI | Guard with sys.stdin.isatty() first (see sketch below) |
| Accumulating messages across sprints | Each sprint is a fresh query() call — no cross-sprint history |
| Marking completed_features from Generator claim | Only promote after Evaluator PASS verdict |
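
For the input()-in-CI row, a guard along these lines keeps headless runs from hanging (sketch; the prompt text is illustrative):

import sys

def ask_continue() -> bool:
    # Only prompt when a human is attached; CI and cron runs get the last result.
    if not sys.stdin.isatty():
        return False
    return input("Max iterations reached. Continue? [y/N] ").strip().lower() == "y"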

When to Simplify

| Component | Remove / simplify when |
| --- | --- |
| Planner agent | User provides SPEC directly |
| Contract negotiation | Human has strong opinions; use config-file mode |
| Generator self-assessment | Evaluator consistently passes first attempt |
| max_iterations → 3 | Correctness-only task, no quality/aesthetic goal |
| min_iterations → 1 | Early passes are always good enough |
| Refine/pivot strategic_decision | Single sprint or correctness task |
| HandoffState | Sprint fits in one context window |
| Evaluator | Task within Generator's reliable baseline |
