Healer Skill
Healer is the observability and self-healing layer for CTO Play workflows. It monitors pod logs via Loki, detects issues, and orchestrates remediations.
When to Use
- Monitoring Play workflow execution
- Debugging agent failures (pre-flight, runtime)
- Understanding detection patterns (A10, A11, A12)
- Checking session status
Healer API Endpoints
| Endpoint | Method | Purpose |
|----------|--------|---------|
| `/health` | GET | Health check |
| `/api/v1/session/start` | POST | MCP calls this on play() |
| `/api/v1/session/{play_id}` | GET | Get session details |
| `/api/v1/sessions` | GET | List all sessions |
| `/api/v1/sessions/active` | GET | List active sessions only |
Check Active Sessions
curl http://localhost:8083/api/v1/sessions/active | jq
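The same check can be scripted. The following is a minimal Python sketch, assuming Healer listens on `http://localhost:8083` as in the curl example. The helper names `session_url` and `fetch_json` are hypothetical, and the response shape is not specified here, so the result is returned as plain decoded JSON.

```python
import json
import urllib.parse
import urllib.request

HEALER_BASE = "http://localhost:8083"  # host/port taken from the curl example


def session_url(play_id=None, active_only=False, base=HEALER_BASE):
    """Build a session URL from the endpoint table (hypothetical helper)."""
    if play_id is not None:
        # Percent-encode the id so it stays a single path segment.
        return f"{base}/api/v1/session/{urllib.parse.quote(play_id, safe='')}"
    if active_only:
        return f"{base}/api/v1/sessions/active"
    return f"{base}/api/v1/sessions"


def fetch_json(url, timeout=5.0):
    """GET a URL and decode the body as JSON. The response schema is an unknown."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)
```

With Healer running, `fetch_json(session_url(active_only=True))` should return the same data as the curl command.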
Detection Patterns
Priority 1: Pre-Flight Failures (within 60s of agent start)
| Pattern | Alert Code | Meaning |
|---------|------------|---------|
| `tool inventory mismatch` | A10 | Agent missing declared tools |
| `Tool inventory MISMATCH` | A10 | Specific tool unavailable |
| `declared tools.*missing` | A10 | Tools in config not in CLI |
| `cto-config.*(missing\|invalid)` | A11 | Config not loaded/synced |
| `mcp.*failed to initialize` | A12 | MCP server init failure |
| `tools-server.*unreachable` | A12 | Tools-server down |
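To make the pattern-to-alert mapping concrete, here is a rough Python sketch of a pre-flight classifier. It is not the implementation in `crates/healer/src/scanner.rs`. It assumes the patterns are case-sensitive regular expressions (the table lists two casings of the inventory message separately), that the first match wins, and that the 60s window is inclusive. The name `classify_preflight` is hypothetical.

```python
import re

# (regex, alert code, meaning), in the order the table lists them.
PREFLIGHT_PATTERNS = [
    (re.compile(r"tool inventory mismatch"), "A10", "Agent missing declared tools"),
    (re.compile(r"Tool inventory MISMATCH"), "A10", "Specific tool unavailable"),
    (re.compile(r"declared tools.*missing"), "A10", "Tools in config not in CLI"),
    (re.compile(r"cto-config.*(missing|invalid)"), "A11", "Config not loaded/synced"),
    (re.compile(r"mcp.*failed to initialize"), "A12", "MCP server init failure"),
    (re.compile(r"tools-server.*unreachable"), "A12", "Tools-server down"),
]

PREFLIGHT_WINDOW_SECS = 60  # "within 60s of agent start"


def classify_preflight(line, secs_since_start):
    """Return (alert_code, meaning) for the first matching pattern, else None."""
    if secs_since_start > PREFLIGHT_WINDOW_SECS:
        return None  # outside the pre-flight window (assumed inclusive of 60s)
    for regex, code, meaning in PREFLIGHT_PATTERNS:
        if regex.search(line):
            return code, meaning
    return None
```

For example, a line containing `mcp server failed to initialize` five seconds after agent start maps to A12 under these assumptions, while the same line at 120 seconds is not treated as a pre-flight failure.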
Priority 2: Runtime Failures
| Pattern | Severity | Action |
|---------|----------|--------|
| `panicked at`, `fatal error` | Critical | Immediate escalation |
| `timeout`, `connection refused` | High | Infrastructure issue |
| `max retries exceeded` | High | Agent exhausted attempts |
| `permission denied.*filesystem` | Critical | Can't read/write files |
| `unauthorized\|invalid token` | Critical | Auth broken |
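As an illustration of how severities might be applied to a batch of log lines, the sketch below picks the most severe runtime finding. It assumes the comma-separated entries in the table are alternatives, that matching is case-sensitive, and that Critical outranks High. The function name `worst_runtime_finding` is hypothetical and not taken from the Healer codebase.

```python
import re

# (regex, severity, action), mirroring the table above.
RUNTIME_PATTERNS = [
    (re.compile(r"panicked at|fatal error"), "Critical", "Immediate escalation"),
    (re.compile(r"timeout|connection refused"), "High", "Infrastructure issue"),
    (re.compile(r"max retries exceeded"), "High", "Agent exhausted attempts"),
    (re.compile(r"permission denied.*filesystem"), "Critical", "Can't read/write files"),
    (re.compile(r"unauthorized|invalid token"), "Critical", "Auth broken"),
]

SEVERITY_RANK = {"Critical": 0, "High": 1}  # lower rank = more severe (assumed)


def worst_runtime_finding(lines):
    """Return (severity, action, line) for the most severe match, or None."""
    best = None
    for line in lines:
        for regex, severity, action in RUNTIME_PATTERNS:
            if not regex.search(line):
                continue
            # Keep the earliest finding among equally severe ones.
            if best is None or SEVERITY_RANK[severity] < SEVERITY_RANK[best[0]]:
                best = (severity, action, line)
    return best
```

Ranking by severity rather than by first occurrence means a late panic is not hidden behind an earlier timeout in the same batch.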
Priority 3: Lifecycle Issues
| Pattern | Meaning |
|---------|---------|
| `template not found` | Prompt template missing |
| `prompt.*missing` | Agent instructions not loaded |
| `role.*undefined` | Agent role not set |
| `task context.*empty` | Task details not injected |
Dual-Model Architecture
┌────────────────────────────────────────────────────────────┐
│               DUAL-MODEL HEALER ARCHITECTURE               │
│                                                            │
│  DATA SOURCES                                              │
│  ├─ Loki (all pod logs)                                    │
│  ├─ Kubernetes (CodeRuns, Pods, Events)                    │
│  ├─ GitHub (PRs, comments, CI status)                      │
│  └─ CTO Config (expected tools, agent settings)            │
│              │                                             │
│              ▼                                             │
│  MODEL 1: EVALUATION AGENT                                 │
│  ├─ Parses and comprehends ALL logs                        │
│  ├─ Correlates events across agents                        │
│  ├─ Identifies root cause                                  │
│  └─ Creates GitHub Issue with analysis                     │
│              │                                             │
│              ▼                                             │
│  MODEL 2: REMEDIATION AGENT                                │
│  ├─ Reads the GitHub issue                                 │
│  ├─ Implements the fix                                     │
│  ├─ Creates PR with changes                                │
│  └─ Marks issue resolved                                   │
└────────────────────────────────────────────────────────────┘
Session Notification Flow
MCP play() call
      │
      ▼
POST /api/v1/session/start
      │
      └─ Payload: { play_id, repository, cto_config: { agents, tools }, tasks: [...] }
      │
      ▼
Healer stores session with expected tools per agent
      │
      ▼
CodeRuns start with Healer already aware
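The notification step can be sketched as a plain HTTP request. The example below only builds the request and does not send it. The top-level field names come from the payload shown in the flow, while the value shapes (lists of agents, tools, and tasks), the JSON content type, and the helper name `build_session_start_request` are assumptions.

```python
import json
import urllib.request


def build_session_start_request(play_id, repository, agents, tools, tasks,
                                base="http://localhost:8083"):
    """Build (but do not send) the POST that announces a new play to Healer."""
    payload = {
        "play_id": play_id,
        "repository": repository,
        # Value shapes below are assumed; only the key names appear in the flow.
        "cto_config": {"agents": agents, "tools": tools},
        "tasks": tasks,
    }
    return urllib.request.Request(
        f"{base}/api/v1/session/start",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},  # assumed content type
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would perform the call. Keeping construction separate from sending makes the payload easy to inspect before a play starts.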
Watch Logs
Pod Logs
Watch all CTO pods
kubectl logs -n cto -l app.kubernetes.io/part-of=cto -f --tail=100
Watch specific agent CodeRun
kubectl logs -n cto -l app=coderun -f
Loki Query
{namespace="cto"} |= "error" | json
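The LogQL query above can also be run against Loki's HTTP API. The sketch below builds a URL for Loki's `/loki/api/v1/query_range` endpoint with nanosecond `start` and `end` timestamps. The base URL `http://localhost:3100`, the five-minute lookback, and the helper name `loki_query_range_url` are assumptions. Substitute the address of the Loki instance in your cluster.

```python
import time
import urllib.parse


def loki_query_range_url(logql, lookback_secs=300, limit=100,
                         base="http://localhost:3100", now=None):
    """Build a query_range URL covering the last `lookback_secs` seconds."""
    end_ns = int((time.time() if now is None else now) * 1_000_000_000)
    start_ns = end_ns - lookback_secs * 1_000_000_000
    params = urllib.parse.urlencode({
        "query": logql,
        "start": start_ns,  # Unix epoch in nanoseconds
        "end": end_ns,
        "limit": limit,
    })
    return f"{base}/loki/api/v1/query_range?{params}"


url = loki_query_range_url('{namespace="cto"} |= "error" | json')
```

The resulting URL can be fetched with curl or `urllib.request.urlopen`. Authentication and tenant headers, if your Loki deployment requires them, are not covered here.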
Pre-Flight Checklist (Verify within 60s)
For every agent run, Healer verifies:
Prompts
- Agent type identified
- Role matches task
- Template loaded
- Language context set
MCP Tools (from CTO Config)
- CTO config loaded
- Remote tools accessible
- Local servers initialized
- Tools-server reachable
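One way to picture the checklist is as a record of eight booleans, one per item above. This is only an illustrative sketch. The class and field names are hypothetical and do not reflect how Healer represents the checks internally.

```python
from dataclasses import dataclass, fields


@dataclass
class PreflightChecklist:
    # Prompts
    agent_type_identified: bool = False
    role_matches_task: bool = False
    template_loaded: bool = False
    language_context_set: bool = False
    # MCP tools (from CTO config)
    cto_config_loaded: bool = False
    remote_tools_accessible: bool = False
    local_servers_initialized: bool = False
    tools_server_reachable: bool = False

    def failed(self):
        """Names of checks that have not passed, in declaration order."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]

    def passed(self):
        """True only when every check has passed."""
        return not self.failed()
```

Defaulting every field to `False` means a check that never reports in is treated as failed, which fits a verification that must complete within a fixed window.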
Escalation
When issues are detected:
- Evaluation Agent creates a GitHub issue with the root cause
- Remediation Agent attempts a fix (if automatable)
- Discord notification is sent for P0/P1 critical issues
- Human escalation if remediation fails
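The escalation path can be summarised as a small decision function. The sketch below follows the order of the list above. It assumes that an issue which cannot be automated is escalated to a human in the same way as a failed remediation, which the list does not state explicitly. The function and step names are hypothetical.

```python
def escalation_steps(priority, automatable, remediation_succeeded=False):
    """Return the ordered escalation steps for one detected issue."""
    steps = ["evaluation_agent_creates_github_issue"]
    if automatable:
        steps.append("remediation_agent_attempts_fix")
    if priority in ("P0", "P1"):
        steps.append("discord_notification")
    # Assumption: non-automatable issues are handled like failed remediations.
    if not automatable or not remediation_succeeded:
        steps.append("human_escalation")
    return steps
```

For a P0 issue whose automated fix fails, this yields all four steps. For a lower-priority issue that is fixed automatically, only the first two apply.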
Configuration
In `cto-config.json`:
{
  "defaults": {
    "play": {
      "healerEndpoint": "http://localhost:8083"
    },
    "remediation": {
      "maxIterations": 3,
      "syncTimeoutSecs": 300
    }
  }
}
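A consumer of this file might read the settings as follows. This is a minimal sketch, assuming the key layout shown above. The fallback values simply mirror the example and are not documented defaults, and `healer_settings` is a hypothetical helper.

```python
import json

EXAMPLE_CONFIG = """
{
  "defaults": {
    "play": { "healerEndpoint": "http://localhost:8083" },
    "remediation": { "maxIterations": 3, "syncTimeoutSecs": 300 }
  }
}
"""


def healer_settings(config):
    """Extract Healer-related settings from a parsed cto-config.json dict."""
    defaults = config.get("defaults", {})
    play = defaults.get("play", {})
    remediation = defaults.get("remediation", {})
    return {
        # Fallbacks mirror the example values; they are assumptions.
        "healer_endpoint": play.get("healerEndpoint", "http://localhost:8083"),
        "max_iterations": int(remediation.get("maxIterations", 3)),
        "sync_timeout_secs": int(remediation.get("syncTimeoutSecs", 300)),
    }


settings = healer_settings(json.loads(EXAMPLE_CONFIG))
```

In practice the dict would come from `json.load` on the real `cto-config.json` rather than from the embedded example string.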
Reference Documentation
- docs/heal-play.md - Full Healer specification
- crates/healer/ - Healer implementation
- crates/healer/src/scanner.rs - Detection patterns