# Long-Running App Harness — SDK Implementation

Produces a runnable harness that orchestrates Claude agents via `claude_agent_sdk`.
You are writing the harness, not running inside it.

Use `query()` + `ClaudeAgentOptions` for agentic loops; `tool()` + `create_sdk_mcp_server()`
for structured output. Never call `anthropic.Anthropic()` directly.

```bash
pip install claude-agent-sdk
```
## Output structure

```
harness/
├── harness.py
├── config.yaml
├── config.py
├── log.py
├── agents/    planner.py, generator.py, evaluator.py
├── models/    state.py
└── prompts/   planner.md, generator.md, evaluator.md
```
## Routing
| User Signal | Route |
|---|---|
| "build a harness / pipeline" | Start at Phase 1 |
| "add an evaluator" | Jump to Phase 4 |
| "add state / handoff" | Jump to Phase 5 |
| "looping forever / broken" | Check feedback loop termination in Phase 5 |
| "just explain what a harness does" | Explain concept, don't write code |
## Phase 1: Design the Harness

Load: `$SKILL_DIR/instructions/planner-questions.md`

⚠️ HARD GATE: Ask the design questions. Get answers to 1–3 before writing any code:

- What does the harness build? (sets Generator tools + Evaluator rubric)
- Python or TypeScript? (default: Python)
- Models per agent? (default: all `claude-opus-4-7`; non-defaults → `config.yaml`)
Create the skeleton:

```bash
mkdir -p harness/agents harness/models harness/prompts harness/harness-logs
touch harness/harness.py harness/log.py harness/agents/__init__.py harness/models/__init__.py
```

`config.yaml` + `config.py` — all tunable parameters live here; never hardcode them in agent files.
Load: `$SKILL_DIR/instructions/config.md` for the full `HarnessConfig` dataclass.

```python
cfg = HarnessConfig.load(Path(__file__).parent / "config.yaml")
# Always: cfg.agents.generator_model — never: "claude-opus-4-7"
```
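As a sketch of what `config.py` might contain — the full dataclass lives in `config.md`, and the nested class names (`AgentModels`, `LoopConfig`) and exact fields here are illustrative assumptions:

```python
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class AgentModels:
    planner_model: str = "claude-opus-4-7"
    generator_model: str = "claude-opus-4-7"
    evaluator_model: str = "claude-opus-4-7"


@dataclass
class LoopConfig:
    max_iterations: int = 5
    min_iterations: int = 2


@dataclass
class HarnessConfig:
    agents: AgentModels = field(default_factory=AgentModels)
    loop: LoopConfig = field(default_factory=LoopConfig)

    @classmethod
    def load(cls, path: Path) -> "HarnessConfig":
        """Start from defaults, then overlay any keys present in config.yaml."""
        cfg = cls()
        if path.exists():
            import yaml  # PyYAML; assumed available where the harness runs
            data = yaml.safe_load(path.read_text()) or {}
            for key, value in (data.get("agents") or {}).items():
                setattr(cfg.agents, key, value)
            for key, value in (data.get("loop") or {}).items():
                setattr(cfg.loop, key, value)
        return cfg
```

Missing file or missing keys fall back to the defaults, so agent code can always rely on `cfg.agents.*` and `cfg.loop.*` existing.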
`models/state.py` — write this first; all other files import from it.
Load: `$SKILL_DIR/instructions/context-handoff.md` (`HandoffState`, `EvalResult`, `format_handoff_for_prompt`).
Load: `$SKILL_DIR/instructions/sprint-contracts.md` (`SprintContract` + negotiation protocol).
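A minimal sketch of the `HandoffState` shape — the real definition in `context-handoff.md` is authoritative, and the field names here are illustrative:

```python
import json
from dataclasses import asdict, dataclass, field
from pathlib import Path


@dataclass
class HandoffState:
    sprint: int = 0
    completed_features: list = field(default_factory=list)
    notes: str = ""

    def save(self, path: Path) -> None:
        """Persist to disk so a crash mid-sprint does not lose the handoff."""
        path.write_text(json.dumps(asdict(self), indent=2))

    @classmethod
    def load(cls, path: Path) -> "HandoffState":
        """Missing file means a fresh run: return an empty state."""
        if path.exists():
            return cls(**json.loads(path.read_text()))
        return cls()
```

Round-tripping through JSON keeps the handoff inspectable by hand between sprints.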
`log.py` — logs to both stdout and a timestamped file under `harness-logs/`.
Load: `$SKILL_DIR/instructions/logging.md` for the full implementation.

```python
log.setup(PROJECT_DIR, label="run")  # once in main()
logger = log.get()                   # in every agent
```
## Phase 2: Planner Agent

Load: `$SKILL_DIR/instructions/planner-questions.md` for the system prompt template.
Load: `$SKILL_DIR/instructions/agent-patterns.md` for the full `run_planner` implementation.

`run_planner(brief, session_id, cfg)` → `(reply, new_session_id)`.
`ClaudeAgentOptions(resume=session_id)` continues the session without resending history.
```python
spec, session_id = "", None
while "SPEC_COMPLETE" not in spec:
    user_input = input("[Planner asks]: ").strip() if session_id else initial_brief
    spec, session_id = run_planner(user_input, session_id, cfg)
SPEC_PATH.write_text(spec.replace("SPEC_COMPLETE", "").strip())
```
## Phase 3: Generator Agent

Load: `$SKILL_DIR/instructions/agent-patterns.md` for the `run_generator` + `self_assess` implementations.

```python
def run_generator(
    spec, contract, project_dir,
    handoff=None, strategic_framing=None, cfg=None,
) -> str: ...

ClaudeAgentOptions(
    model=cfg.agents.generator_model,
    allowed_tools=["Write", "Read", "Edit", "Bash", "Glob"],
    cwd=str(project_dir), permission_mode="bypassPermissions",
)
```

After generation, call `self_assess()` — it catches gaps before the Evaluator sees them, via the
`submit_assessment` MCP tool. If the Generator is not confident, run an extra pass with its
concerns as `strategic_framing`.
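The confidence gate reduces to a small helper. This is a hypothetical sketch: the function name `framing_from_assessment` and the assessment-dict keys (`confident`, `concerns`) are assumptions, not names from `agent-patterns.md`:

```python
def framing_from_assessment(assessment):
    """Turn a low-confidence self-assessment into a strategic_framing string
    for one extra generation pass; return None when no extra pass is needed.
    Keys 'confident' and 'concerns' are assumed, not the skill's actual schema."""
    if assessment.get("confident", False):
        return None
    concerns = assessment.get("concerns", ["unspecified gaps"])
    return "Address these self-reported gaps:\n" + "\n".join(f"- {c}" for c in concerns)
```

The caller feeds a non-`None` result back into `run_generator(strategic_framing=...)`.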
## Phase 4: Evaluator Agent

Load: `$SKILL_DIR/instructions/agent-patterns.md` for the full implementation.
Load: `$SKILL_DIR/instructions/evaluation-rubrics.md` for the system prompt + rubric criteria.

Two roles: `run_evaluator()` (post-generation gate) + `review_contract()` (pre-sprint criteria review).

```python
# submit_grade schema: contract_results[{id, status, evidence}], rubric_scores{id: 1–5}, feedback
def run_evaluator(spec, contract, app_url, rubric_track="A", cfg=None) -> EvalResult: ...
```

⚠️ Deterministic verdict: never trust the verdict field from the LLM. Recompute it in
`_build_eval_result()` from `contract_results` + `rubric_scores` using the `cfg.verdict.*` thresholds.
## Phase 5: Harness Loop

Load: `$SKILL_DIR/instructions/iteration-loop.md` for `run_sprint`, `strategic_decision`, `git_commit`.

```python
def main():
    cfg = HarnessConfig.load(Path(__file__).parent / "config.yaml")
    log.setup(PROJECT_DIR, label="run")

def run_sprint(spec, contract, project_dir, handoff=None, cfg=None):
    while iteration < cfg.loop.max_iterations:
        # 1. Generate — try/except; a crash is a valid (poor) outcome
        # 2. Self-assess — extra pass if not confident
        # 3. git_commit("wip: sprint N iter I")
        # 4. Evaluate → EvalResult
        # 5a. Pass + iteration < min_iterations → quality-improvement continue
        #     Pass + min_iterations met → git_commit("feat") + return
        # 5b. Fail → strategic_decision() → REFINE or PIVOT → set strategic_framing
    # Exhausted: input() if sys.stdin.isatty() else return last result
```
Git checkpoints (see iteration-loop.md for git_commit() helper):
| Event | Message |
|---|---|
| SPEC written | feat: generate SPEC.md |
| Contract negotiated | chore: sprint N contract |
| Each iteration | wip: sprint N iteration I |
| Sprint passes | feat: sprint N complete |
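The `git_commit()` helper from `iteration-loop.md` can be approximated with `subprocess` — this sketch is an assumption, not that file's actual implementation:

```python
import subprocess
from pathlib import Path


def git_commit(project_dir: Path, message: str) -> bool:
    """Stage everything and commit with the given message.
    Returns False when there is nothing to commit, so the loop can
    call it unconditionally after every iteration."""
    subprocess.run(["git", "add", "-A"], cwd=project_dir, check=True)
    result = subprocess.run(
        ["git", "commit", "-m", message],
        cwd=project_dir, capture_output=True, text=True,
    )
    return result.returncode == 0
```

Treating "nothing to commit" as a non-fatal `False` keeps checkpoint commits from crashing an iteration that made no file changes.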
## Setup

```bash
pip install claude-agent-sdk && export ANTHROPIC_API_KEY=sk-...
```

Verify the imports resolve:

```bash
python -c "from agents.planner import run_planner; print('OK')"
```
## Common Mistakes

| Mistake | Fix |
|---|---|
| Trusting the LLM's verdict field | Recompute in `_build_eval_result()` from `contract_results` + `rubric_scores` |
| Hardcoding model names | Use `cfg.agents.generator_model` — never a string literal |
| Not calling `handoff.save()` before the Evaluator | On a crash, the Evaluator result is lost |
| Using `input()` in CI | Guard with `sys.stdin.isatty()` first |
| Accumulating messages across sprints | Each sprint is a fresh `query()` call — no cross-sprint history |
| Marking `completed_features` from a Generator claim | Only promote after an Evaluator PASS verdict |
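The CI guard from the mistakes table, as a hypothetical helper (`ask_human` and its `default` parameter are illustrative names, not part of the skill's files):

```python
import sys


def ask_human(prompt: str, default: str = "stop") -> str:
    """Prompt a human only when a TTY is attached; in CI, where stdin is
    not a TTY, fall back to the default instead of blocking forever."""
    if sys.stdin.isatty():
        return input(prompt).strip() or default
    return default
```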
## When to Simplify

| Component | Remove / simplify when |
|---|---|
| Planner agent | User provides the SPEC directly |
| Contract negotiation | Human has strong opinions; use config-file mode |
| Generator self-assessment | Evaluator consistently passes the first attempt |
| `max_iterations` → 3 | Correctness-only task, no quality/aesthetic goal |
| `min_iterations` → 1 | Early passes are always good enough |
| Refine/pivot `strategic_decision` | Single sprint or correctness-only task |
| `HandoffState` | Sprint fits in one context window |
| Evaluator | Task is within the Generator's reliable baseline |