Agent Orchestration
Production-tested patterns for coordinating multiple AI agents and models. This skill covers the full spectrum from simple fallback chains to complex multi-model workflows with cross-validation and quality control loops.
When to Use
- Coordinating 2+ agents or models on a single workflow
- Building QC loops where one model checks another's work
- Routing tasks to the right model based on task type
- Setting up fallback chains for reliability
- Optimizing cost across subscription and API models
- Configuring ACPX (Agent Computer Protocol eXtended) for Claude Code and Codex
- Designing spawn patterns for runtime sub-agents
When NOT to Use
- Single-agent prompting or prompt engineering (use a prompt-engineering skill)
- Fine-tuning or training models (different domain entirely)
- Simple API calls to one model (just call the API)
- RAG or retrieval pipeline design (use a RAG-specific skill)
- Agent memory architecture (use the agent-memory-architecture skill)
1. Sub-Agent QC Workflow
The core pattern: Produce → Review → Cross-Check → Incorporate → Deliver.
The Five-Step Loop
┌─────────────┐
│ 1. PRODUCE  │  Sonnet 4.6 generates first draft
│  (Grinder)  │  Fast, cost-effective, good enough for 80% of tasks
└──────┬──────┘
       ▼
┌─────────────┐
│ 2. REVIEW   │  Same model self-reviews against criteria
│  (Self-QC)  │  Catches obvious errors, formatting issues
└──────┬──────┘
       ▼
┌─────────────┐
│ 3. CROSS    │  Different model (GPT-4o / Grok) validates
│    CHECK    │  Catches blind spots, model-specific biases
└──────┬──────┘
       ▼
┌─────────────┐
│ 4. INCORP.  │  Opus 4.6 synthesizes feedback
│ (Orchestr.) │  Resolves conflicts, applies judgment
└──────┬──────┘
       ▼
┌─────────────┐
│ 5. DELIVER  │  Final output with confidence score
│  (Output)   │  Includes provenance trail
└─────────────┘
Implementation Example
async def qc_workflow(task: str, context: dict) -> dict:
    """Five-step QC workflow with cross-model validation."""
    # Step 1: Produce (Sonnet: fast, cheap)
    draft = await call_model(
        model="claude-sonnet-4-6",
        prompt=f"Complete this task:\n{task}",
        context=context,
        max_tokens=4096
    )
    # Step 2: Self-review (same model, different prompt)
    self_review = await call_model(
        model="claude-sonnet-4-6",
        prompt=f"""Review this output for errors, omissions, and quality:
TASK: {task}
OUTPUT: {draft}
Score 1-10 on: accuracy, completeness, clarity.
List specific issues to fix.""",
        max_tokens=1024
    )
    # Step 3: Cross-check (different model family)
    cross_check = await call_model(
        model="gpt-4o",
        prompt=f"""Independent review. Do NOT assume the draft is correct.
TASK: {task}
DRAFT: {draft}
SELF-REVIEW: {self_review}
Identify: factual errors, logical gaps, missing context, biases.""",
        max_tokens=1024
    )
    # Step 4: Incorporate (Opus: best judgment)
    final = await call_model(
        model="claude-opus-4-6",
        prompt=f"""Synthesize and produce final output.
TASK: {task}
DRAFT: {draft}
SELF-REVIEW: {self_review}
CROSS-CHECK: {cross_check}
Resolve any conflicts. Produce the best possible final output.
Include a confidence score (0-100) and list any unresolved concerns.""",
        max_tokens=4096
    )
    # Step 5: Deliver with metadata
    return {
        "output": final,
        "provenance": {
            "producer": "claude-sonnet-4-6",
            "reviewer": "claude-sonnet-4-6",
            "cross_checker": "gpt-4o",
            "synthesizer": "claude-opus-4-6",
            "steps_completed": 5
        }
    }
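Step 5 promises a confidence score, but the workflow returns the synthesizer's raw text. The sketch below shows one way to pull the score out of that text. The helper name `extract_confidence` and the assumed phrasing ("Confidence: 85", "confidence score of 85") are illustrative assumptions, not part of the original workflow; a structured-output prompt would be more robust.

```python
import re
from typing import Optional

# Hypothetical pattern: assumes the model writes "Confidence: 85",
# "Confidence score = 85", "confidence is 85" or "confidence score of 85".
_CONFIDENCE_RE = re.compile(
    r"confidence(?:\s+score)?\s*(?:[:=]|is|of)\s*(\d{1,3})", re.IGNORECASE
)

def extract_confidence(final_text: str) -> Optional[int]:
    """Return the first 0-100 confidence score found in the text, else None."""
    match = _CONFIDENCE_RE.search(final_text)
    if match is None:
        return None
    score = int(match.group(1))
    # Reject values outside the range requested in the synthesis prompt.
    return score if 0 <= score <= 100 else None
```

A `None` result is a useful signal in its own right: the orchestrator can treat a missing score as a reason to re-run step 4 rather than deliver.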
When to Skip Steps
| Scenario | Skip | Rationale |
|---|---|---|
| Low-stakes internal task | Steps 3-4 | Self-review is sufficient |
| Time-critical (<30s budget) | Steps 2-4 | Single model, accept risk |
| High-stakes client deliverable | None | Full loop, every time |
| Coding task with tests | Step 3 | Tests serve as cross-check |
| Creative/subjective work | Step 3 | Cross-check adds noise, not signal |
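The table can be encoded as a lookup so the workflow decides which steps to run before it spends any tokens. This is a minimal sketch; the scenario keys are hypothetical labels for the table rows, and unknown scenarios default to the full loop as the conservative choice.

```python
# Steps of the five-step loop, numbered as in the diagram.
ALL_STEPS = (1, 2, 3, 4, 5)

# Scenario keys are hypothetical labels for the rows of the table.
SKIPPED_STEPS = {
    "low_stakes_internal": {3, 4},
    "time_critical": {2, 3, 4},
    "high_stakes_client": set(),
    "coding_with_tests": {3},
    "creative": {3},
}

def steps_to_run(scenario: str) -> list[int]:
    """Return the steps to execute; unknown scenarios get the full loop."""
    skipped = SKIPPED_STEPS.get(scenario, set())
    return [step for step in ALL_STEPS if step not in skipped]
```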
2. Model Staggering
Assign models to tasks based on their demonstrated strengths.
The Model Roster
| Model | Strength Zone | Cost Tier | Speed |
|---|---|---|---|
| Opus 4.6 | Strategy, synthesis, complex reasoning, judgment calls | $$$$$ | Slow |
| Sonnet 4.6 | Production work, coding, analysis, writing, general-purpose grinder | $$$ | Fast |
| GPT-4o | Coding, scoring rubrics, structured output, alternative perspective | $$$$ | Medium |
| Grok | X/Twitter analysis, social media content, real-time commentary | $$ | Fast |
| Gemini 2.5 Pro | Deep research, long context analysis, multimodal processing | $$$ | Medium |
| Haiku 4.5 | Classification, routing, simple extraction, high-volume tasks | $ | Very Fast |
Task Routing Rules
routing_rules:
  # Strategic / High-judgment tasks → Opus
  strategy:
    models: [claude-opus-4-6]
    triggers:
      - "requires judgment between competing priorities"
      - "synthesize conflicting information"
      - "make a recommendation with tradeoffs"
      - "review and improve another agent's work"
  # Production work → Sonnet
  production:
    models: [claude-sonnet-4-6]
    triggers:
      - "write code to specification"
      - "generate content from template"
      - "analyze data and report findings"
      - "standard business communication"
  # Coding with scoring → GPT
  coding_and_scoring:
    models: [gpt-4o]
    triggers:
      - "write and debug complex algorithms"
      - "score outputs against rubric"
      - "generate structured JSON/YAML"
      - "cross-validate another model's output"
  # Social / real-time → Grok
  social:
    models: [grok-3]
    triggers:
      - "analyze X/Twitter trends"
      - "generate social media content"
      - "real-time event commentary"
      - "meme-aware communication"
  # Deep research → Gemini
  research:
    models: [gemini-2.5-pro]
    triggers:
      - "analyze documents >100K tokens"
      - "cross-reference multiple long sources"
      - "multimodal analysis (images + text)"
      - "broad research synthesis"
  # High-volume classification → Haiku
  classification:
    models: [claude-haiku-4-5]
    triggers:
      - "classify items into categories"
      - "extract structured fields from text"
      - "route incoming requests"
      - "simple yes/no decisions"
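The triggers are natural-language descriptions, so something still has to map a task onto a category. The sketch below assumes that mapping has already happened (for example through a cheap classification call) and shows only the lookup. `pick_model`, the default category, and the token-count override are assumptions for illustration; the override mirrors the ">100K tokens" research trigger.

```python
# Category -> model, mirroring the routing rules.
ROUTES = {
    "strategy": "claude-opus-4-6",
    "production": "claude-sonnet-4-6",
    "coding_and_scoring": "gpt-4o",
    "social": "grok-3",
    "research": "gemini-2.5-pro",
    "classification": "claude-haiku-4-5",
}

# Assumption: unrecognized categories go to the general-purpose grinder.
DEFAULT_CATEGORY = "production"
# Threshold taken from the "analyze documents >100K tokens" trigger.
LONG_CONTEXT_TOKENS = 100_000

def pick_model(category: str, context_tokens: int = 0) -> str:
    """Pick a model for a pre-classified task; long inputs go to research."""
    if context_tokens > LONG_CONTEXT_TOKENS:
        return ROUTES["research"]
    return ROUTES.get(category, ROUTES[DEFAULT_CATEGORY])
```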
Staggering in Practice
Example: "Write a market analysis report"
1. Gemini 2.5 Pro → Research phase (long context, web search)
2. Sonnet 4.6 → Draft the report (fast production)
3. GPT-4o → Score against quality rubric (structured eval)
4. Opus 4.6 → Final synthesis and executive summary (judgment)
5. Haiku 4.5 → Extract key metrics into structured JSON (cheap, fast)
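The five stages can be driven by a small loop. This is a sketch under stated assumptions: `run_staggered` and the instruction strings are hypothetical, and `call` is whatever function you already use to reach a model, injected so the loop can be exercised without network access. Each stage sees the task plus every earlier stage's output, because the synthesis stage needs both the draft and the rubric score.

```python
from typing import Callable

# (model, instruction) pairs in execution order, from the example.
STAGES = [
    ("gemini-2.5-pro", "Research the topic"),
    ("claude-sonnet-4-6", "Draft the report"),
    ("gpt-4o", "Score the draft against the quality rubric"),
    ("claude-opus-4-6", "Write the final synthesis and executive summary"),
    ("claude-haiku-4-5", "Extract key metrics as JSON"),
]

def run_staggered(task: str, call: Callable[[str, str], str]) -> list[tuple[str, str]]:
    """Run each stage in order; every stage sees the task and all earlier outputs."""
    history: list[tuple[str, str]] = []
    for model, instruction in STAGES:
        earlier = "\n\n".join(f"[{m}]\n{out}" for m, out in history)
        prompt = f"{instruction}\n\nTASK: {task}\n\nEARLIER STAGES:\n{earlier}"
        history.append((model, call(model, prompt)))
    return history
```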
3. Fallback Chains
When a model is unavailable, rate-limited, or returns low-quality output, fall through to the next option.
Chain Configuration
fallback_chains:
  # Primary reasoning chain
  reasoning:
    - model: claude-opus-4-6
      timeout: 60s
      retry: 1
    - model: gpt-4o
      timeout: 45s
      retry: 1
    - model: claude-sonnet-4-6
      timeout: 30s
      retry: 2
    - model: gemini-2.5-pro
      timeout: 45s
      retry: 1
  # Fast production chain
  production:
    - model: claude-sonnet-4-6
      timeout: 30s
      retry: 2
    - model: gpt-4o
      timeout: 30s
      retry: 1
    - model: grok-3
      timeout: 20s
      retry: 1
  # Classification chain (optimize for cost)
  classification:
    - model: claude-haiku-4-5
      timeout: 10s
      retry: 3
    - model: claude-sonnet-4-6
      timeout: 15s
      retry: 1
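Timeouts and retries multiply, so it is worth checking how long a chain can take when every attempt times out. The sketch below assumes `retry` counts additional attempts after the first and that timeouts always use the `s` suffix shown in the configuration; `parse_seconds` and `worst_case_seconds` are illustrative names.

```python
def parse_seconds(value: str) -> float:
    """Convert a timeout such as '45s' to seconds. Only the 's' suffix is handled."""
    if not value.endswith("s"):
        raise ValueError(f"unsupported timeout format: {value!r}")
    return float(value[:-1])

def worst_case_seconds(chain: list[dict]) -> float:
    """Upper bound on wall-clock time if every attempt in the chain times out."""
    return sum(parse_seconds(e["timeout"]) * (e["retry"] + 1) for e in chain)

# The reasoning chain from the configuration.
REASONING = [
    {"model": "claude-opus-4-6", "timeout": "60s", "retry": 1},
    {"model": "gpt-4o", "timeout": "45s", "retry": 1},
    {"model": "claude-sonnet-4-6", "timeout": "30s", "retry": 2},
    {"model": "gemini-2.5-pro", "timeout": "45s", "retry": 1},
]

# 60*2 + 45*2 + 30*3 + 45*2 = 390 seconds
print(worst_case_seconds(REASONING))
```

Six and a half minutes is too long for an interactive request, which is a reason to keep a separate, shorter chain for time-critical work.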
Fallback Decision Logic
async def call_with_fallback(chain: str, prompt: str) -> dict:
    """Try models in order until one succeeds with acceptable quality."""
    for entry in CHAINS[chain]:
        for attempt in range(entry["retry"] + 1):
            try:
                result = await call_model(
                    model=entry["model"],
                    prompt=prompt,
                    timeout=entry["timeout"]
                )
                # Quality gate: reject low-confidence outputs
                if result.get("confidence", 100) < 30:
                    log(f"{entry['model']} returned low confidence, trying next")
                    break  # Move to next model, don't retry
                return {
                    "output": result,
                    "model_used": entry["model"],
                    "attempt": attempt + 1,
                    "fallback_depth": CHAINS[chain].index(entry)
                }
            except (TimeoutError, RateLimitError) as e:
                log(f"{entry['model']} attempt {attempt+1} failed: {e}")
                continue
    raise AllModelsFailed(f"No model in chain '{chain}' produced acceptable output")
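The function above depends on names it does not define (`CHAINS`, `call_model`, `log`, `RateLimitError`, `AllModelsFailed`). The following self-contained sketch stubs them out so the fallback behavior can be exercised offline. Everything here is a stand-in: the stub `call_model` simulates a rate-limited first model, the exception classes are hypothetical, and timeouts are plain seconds rather than the `30s` strings in the YAML.

```python
import asyncio

class RateLimitError(Exception):
    """Stand-in for a provider SDK's rate-limit error (hypothetical)."""

class AllModelsFailed(Exception):
    """Raised when every model in the chain has been exhausted."""

# Timeouts are plain seconds here; "30s" strings would need parsing first.
CHAINS = {
    "production": [
        {"model": "claude-sonnet-4-6", "timeout": 30, "retry": 2},
        {"model": "gpt-4o", "timeout": 30, "retry": 1},
    ]
}

calls = []  # records every attempted model, for inspection

async def call_model(model: str, prompt: str) -> dict:
    """Stub: the first model is always rate-limited, the second answers."""
    calls.append(model)
    if model == "claude-sonnet-4-6":
        raise RateLimitError("simulated 429")
    return {"text": f"{model} handled: {prompt}", "confidence": 90}

async def call_with_fallback(chain: str, prompt: str) -> dict:
    for depth, entry in enumerate(CHAINS[chain]):
        for attempt in range(entry["retry"] + 1):
            try:
                # wait_for enforces the timeout even if the client ignores it.
                result = await asyncio.wait_for(
                    call_model(entry["model"], prompt), timeout=entry["timeout"]
                )
            except (asyncio.TimeoutError, RateLimitError):
                continue  # retry the same model
            if result.get("confidence", 100) < 30:
                break  # low quality: skip remaining retries, try next model
            return {
                "output": result,
                "model_used": entry["model"],
                "attempt": attempt + 1,
                "fallback_depth": depth,
            }
    raise AllModelsFailed(f"No model in chain '{chain}' produced acceptable output")

result = asyncio.run(call_with_fallback("production", "ping"))
print(result["model_used"], result["fallback_depth"], calls)
```

Using `enumerate` for the depth avoids the repeated `index` lookup and stays correct if the same model appears twice in a chain.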
4. ACPX Configuration
ACPX (Agent Computer Protocol eXtended) enables tool-using agents to coordinate. This section covers its configuration for Claude Code and Codex environments.
Claude Code Configuration
In your project's CLAUDE.md:
# Agent Orchestration
## Sub-agent Spawning
When a task requires cross-model validation:
1. Use the Agent tool to spawn a sub-agent for the secondary task
2. The sub-agent inherits the project context but gets its own conversation
3. Results flow back to the orchestrator via the Agent tool response
## Model Selection
- Use claude-opus-4-6 for: architectural decisions, code review, complex debugging
- Use claude-sonnet-4-6 for: implementation, test writing, documentation
- Use claude-haiku-4-5 for: linting, formatting, simple refactors
## Tool Permissions
Sub-agents may: read files, search code, run tests
Sub-agents may NOT: push to git, modify CI/CD, delete files without confirmation
ACP Server Setup
{
  "mcpServers": {
    "orchestrator": {
      "command": "node",
      "args": ["./orchestrator-server.js"],
      "env": {
        "ANTHROPIC_API_KEY": "${ANTHROPIC_API_KEY}",
        "OPENAI_API_KEY": "${OPENAI_API_KEY}",
        "MAX_CONCURRENT_AGENTS": "5",
        "DEFAULT_CHAIN": "production"
      }
    }
  }
}
Codex Integration
# codex.yaml
agents:
  orchestrator:
    model: claude-opus-4-6
    role: "Route tasks and synthesize results"
    tools: [spawn_agent, review_output, merge_results]
  grinder:
    model: claude-sonnet-4-6
    role: "Execute implementation tasks"
    tools: [read_file, write_file, run_tests, search_code]
  validator:
    model: gpt-4o
    role: "Cross-validate outputs"
    tools: [read_file, run_tests, score_output]
5. Cost Optimization
Subscription vs API Economics
Subscription Models ($20-200/month flat):
Claude Pro/Max → Best for: daily interactive use, long sessions
ChatGPT Plus → Best for: GPT-4o access, plugins
Grok Premium → Best for: X integration, real-time
Gemini Advanced → Best for: Google ecosystem, long context
API Models (per-token):
claude-opus-4-6 → $15/M input, $75/M output
claude-sonnet-4-6 → $3/M input, $15/M output
claude-haiku-4-5 → $0.80/M input, $4/M output
gpt-4o → $2.50/M input, $10/M output
$0 Marginal Cost Routing
When you have active subscriptions, route interactive and exploratory work through subscriptions (zero marginal cost) and reserve API for automated/batch workflows.
Decision Tree:
Is this interactive/exploratory?
  YES → Route through subscription (Claude Code, ChatGPT, etc.)
  NO → Is this batch/automated?
    YES → Use API with cheapest adequate model
    NO → Is this high-volume (>1000 calls/day)?
      YES → Use Haiku via API ($0.80/M input)
      NO → Use Sonnet via API ($3/M input)
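The decision tree translates directly into a function. This is a minimal sketch; `choose_route` and its return labels are hypothetical, and "cheapest adequate model" is left as a label because what counts as adequate depends on the task.

```python
def choose_route(interactive: bool, batch: bool, calls_per_day: int) -> str:
    """Walk the decision tree top to bottom and return a routing label."""
    if interactive:
        return "subscription"
    if batch:
        return "api:cheapest-adequate"
    if calls_per_day > 1000:
        return "api:claude-haiku-4-5"
    return "api:claude-sonnet-4-6"
```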
Cost Tracking Template
Monthly AI Spend:
Subscriptions (fixed):
Claude Max $200.00
ChatGPT Plus $20.00
Grok Premium $30.00
Gemini Advanced $20.00
Subtotal Fixed $270.00
API Usage (variable):
Opus 4.6 42K tokens $3.78
Sonnet 4.6 380K tokens $6.84
Haiku 4.5 1.2M tokens $1.76
GPT-4o 95K tokens $1.19
Subtotal Variable $13.57
Total $283.57
Cost per task (avg) $0.28
Tasks completed 1,013
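The template records one token count per model, but API prices differ for input and output, so the variable lines are easier to audit if both counts are tracked. A small sketch of the arithmetic follows, using the per-token prices listed earlier in this section; `api_cost` and `cost_per_task` are illustrative names.

```python
# USD per million tokens (input, output), copied from the API price list in this section.
PRICES = {
    "claude-opus-4-6": (15.00, 75.00),
    "claude-sonnet-4-6": (3.00, 15.00),
    "claude-haiku-4-5": (0.80, 4.00),
    "gpt-4o": (2.50, 10.00),
}

def api_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Variable cost in USD for one model over a billing period."""
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

def cost_per_task(fixed: float, variable: float, tasks: int) -> float:
    """Average all-in cost per completed task, rounded to cents."""
    return round((fixed + variable) / tasks, 2)

# Template totals: $270.00 fixed + $13.57 variable over 1,013 tasks.
print(cost_per_task(270.00, 13.57, 1013))
```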
6. Spawn Patterns
Pattern 1: Runtime Sub-Agent (Within Claude Code)
Use the Agent tool to spawn sub-agents that inherit project context.
Orchestrator (Opus)
├── Agent: "Research the API surface" (Explore subagent)
├── Agent: "Implement the endpoint" (general-purpose subagent)
└── Agent: "Write tests" (general-purpose subagent)
Best for: tasks where sub-agents need file system access and project context.
Pattern 2: API-Spawned Agent (External)
Call model APIs directly for tasks that don't need project context.
# Spawn multiple validators in parallel
import asyncio

async def parallel_validate(content: str) -> list:
    tasks = [
        call_model("claude-sonnet-4-6", f"Review for accuracy:\n{content}"),
        call_model("gpt-4o", f"Review for accuracy:\n{content}"),
        call_model("gemini-2.5-pro", f"Review for accuracy:\n{content}"),
    ]
    return await asyncio.gather(*tasks)
Best for: cross-validation, scoring, classification — tasks that are self-contained.
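With a plain `asyncio.gather`, one failing validator raises and the other reviews are lost. The variant below is a sketch that passes `return_exceptions=True` so partial results survive. The stub `call_model` simulates an outage for one provider and stands in for a real client; the return shape with `reviews` and `failed` is an assumption for illustration.

```python
import asyncio

async def call_model(model: str, prompt: str) -> str:
    """Stub standing in for a real API client; one validator fails on purpose."""
    if model == "gpt-4o":
        raise RuntimeError("simulated outage")
    return f"{model}: looks accurate"

async def parallel_validate(content: str, models: list[str]) -> dict:
    """Run all validators concurrently and keep whatever succeeds."""
    results = await asyncio.gather(
        *(call_model(m, f"Review for accuracy:\n{content}") for m in models),
        return_exceptions=True,  # exceptions come back as values, in order
    )
    reviews = {m: r for m, r in zip(models, results) if not isinstance(r, BaseException)}
    failed = [m for m, r in zip(models, results) if isinstance(r, BaseException)]
    return {"reviews": reviews, "failed": failed}

outcome = asyncio.run(
    parallel_validate("draft text", ["claude-sonnet-4-6", "gpt-4o", "gemini-2.5-pro"])
)
print(outcome["failed"])
```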
Pattern 3: Orchestrator-Grinder Split
The orchestrator plans and delegates. Grinders execute. Never let a grinder make strategic decisions.
ORCHESTRATOR (Opus 4.6):
- Reads the task requirements
- Breaks into subtasks
- Assigns each subtask to appropriate grinder
- Reviews grinder outputs
- Synthesizes final deliverable
- Makes judgment calls on conflicts
GRINDER (Sonnet 4.6 / GPT-4o):
- Receives specific, scoped subtask
- Executes without strategic decisions
- Returns output with confidence score
- Flags uncertainty rather than guessing
Anti-Patterns to Avoid
| Anti-Pattern | Problem | Fix |
|---|---|---|
| Grinder makes strategic calls | Inconsistent decisions, wasted work | Escalate to orchestrator |
| Orchestrator does grinder work | Slow, expensive, bottleneck | Delegate production tasks |
| No quality gate between steps | Errors compound through pipeline | Add review step after each stage |
| Same model is the only reviewer of its own work | Blind spots persist | Add cross-model validation |
| Spawning agents for trivial tasks | Overhead exceeds task cost | Direct call for simple tasks |
| Infinite retry loops | Cost explosion | Max 3 retries, then escalate |
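The last row (max 3 retries, then escalate) can be enforced with a small wrapper. This is a sketch: `run_with_retry_cap` and the `Escalate` exception are hypothetical names, and the acceptance check is injected by the caller because what counts as acceptable output is task-specific.

```python
from typing import Callable

MAX_RETRIES = 3  # from the table: max 3 retries, then escalate

class Escalate(Exception):
    """Signals that the subtask should go back to the orchestrator (hypothetical)."""

def run_with_retry_cap(
    attempt_fn: Callable[[], str],
    is_acceptable: Callable[[str], bool],
) -> str:
    """Try once, retry up to MAX_RETRIES times, then escalate instead of looping."""
    for _ in range(1 + MAX_RETRIES):
        output = attempt_fn()
        if is_acceptable(output):
            return output
    raise Escalate(f"no acceptable output after {MAX_RETRIES} retries")
```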
7. Orchestrator vs Grinder Principle
This principle underlies every pattern in this skill.
The Rule
The orchestrator thinks. The grinder does. Never confuse the two.
Role Definitions
| Orchestrator | Grinder |
|---|---|
| Decides WHAT to do | Decides HOW to do it |
| Chooses which model/tool | Uses the tools it's given |
| Reviews and judges quality | Produces and reports confidence |
| Resolves conflicts between agents | Flags conflicts for resolution |
| Owns the final output | Owns its subtask output |
| Expensive, slow, high-judgment | Cheap, fast, high-throughput |
| 1 per workflow | N per workflow |
Decision Framework
"Should this be an orchestrator or grinder decision?"
Ask: "If two reasonable people disagreed on this, would it matter?"
YES → Orchestrator decision (judgment required)
NO → Grinder decision (execution, not judgment)
Ask: "Does this affect the overall workflow direction?"
YES → Orchestrator decision
NO → Grinder decision
Ask: "Could a junior employee do this with clear instructions?"
YES → Grinder task
NO → Orchestrator task
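The three questions can be combined into one check. The framework does not say what to do when the answers disagree, so this sketch assumes that any signal pointing at the orchestrator wins; `decision_owner` and its parameter names are hypothetical.

```python
def decision_owner(
    reasonable_people_disagree: bool,
    affects_workflow_direction: bool,
    junior_can_do_it: bool,
) -> str:
    """Assumption: any single signal toward the orchestrator is enough."""
    if reasonable_people_disagree or affects_workflow_direction or not junior_can_do_it:
        return "orchestrator"
    return "grinder"
```

For example, formatting citations to a given style answers no, no, yes and stays with a grinder, while choosing which sections a deliverable should contain goes to the orchestrator.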
Example Workflow: Client Deliverable
ORCHESTRATOR (Opus):
1. Read client brief → decide deliverable structure
2. Break into sections → assign to grinders
3. Review all sections → identify gaps
4. Resolve quality issues → request rewrites
5. Synthesize → produce final deliverable
6. Generate executive summary → deliver
GRINDER 1 (Sonnet): Write Section A per outline
GRINDER 2 (Sonnet): Write Section B per outline
GRINDER 3 (GPT-4o): Generate data tables and charts
GRINDER 4 (Gemini): Research background for Section C
GRINDER 5 (Haiku): Format citations and references
Total cost: 1 Opus call (synthesis) + 5 cheaper calls (production). Doing everything in Opus would take 6 Opus calls, each about 5x the per-token price of a Sonnet call (see the API prices in section 5).
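A worked version of that comparison in relative units. The sketch assumes every call uses the same number of tokens and prices every grinder call at the Sonnet rate; both are simplifications, since the GPT-4o, Gemini, and Haiku calls are priced differently.

```python
# Relative price per call: Opus is 5x Sonnet ($15 vs $3 input, $75 vs $15 output).
OPUS_UNITS = 5
GRINDER_UNITS = 1  # assumption: all grinder calls priced like Sonnet

def workflow_units(opus_calls: int, grinder_calls: int) -> int:
    """Total cost of a workflow in Sonnet-call units."""
    return opus_calls * OPUS_UNITS + grinder_calls * GRINDER_UNITS

all_opus = workflow_units(6, 0)   # 6 * 5 = 30 units
split = workflow_units(1, 5)      # 5 + 5 = 10 units
savings_ratio = all_opus / split  # 30 / 10 = 3.0
print(all_opus, split, savings_ratio)
```

Under these assumptions the split workflow costs about a third of the all-Opus version: each Opus call is 5x the price, but the workflow-level saving is 3x because one Opus call remains.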