Healer Skill

Healer is the observability and self-healing layer for CTO Play workflows. It monitors pod logs via Loki, detects issues, and orchestrates remediations.

When to Use

Monitoring Play workflow execution
Debugging agent failures (pre-flight, runtime)
Understanding detection patterns (A10, A11, A12)
Checking session status

Healer API Endpoints

Endpoint                     Method  Purpose
/health                      GET     Health check
/api/v1/session/start        POST    MCP calls this on play()
/api/v1/session/{play_id}    GET     Get session details
/api/v1/sessions             GET     List all sessions
/api/v1/sessions/active      GET     List active sessions only

Check Active Sessions

curl http://localhost:8083/api/v1/sessions/active | jq
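The same check can be scripted. The following is a minimal Python sketch using only the standard library. It assumes the Healer listens on http://localhost:8083, as in the curl example, and that the endpoints return JSON. The response schema is not described here, so the decoded body is returned untouched. The helper names `session_url` and `fetch_json` are hypothetical.

```python
import json
import urllib.request

HEALER_BASE = "http://localhost:8083"  # taken from the curl example


def session_url(play_id=None, active_only=False, base=HEALER_BASE):
    """Build the URL for one of the session endpoints listed above."""
    if play_id is not None:
        return f"{base}/api/v1/session/{play_id}"
    if active_only:
        return f"{base}/api/v1/sessions/active"
    return f"{base}/api/v1/sessions"


def fetch_json(url, timeout=5.0):
    """GET a Healer endpoint and decode the JSON body (schema not assumed)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)
```

Calling `fetch_json(session_url(active_only=True))` mirrors the curl command; passing a play ID to `session_url` targets a single session instead.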

Detection Patterns

Priority 1: Pre-Flight Failures (within 60s of agent start)

Pattern                          Alert Code  Meaning
tool inventory mismatch          A10         Agent missing declared tools
Tool inventory MISMATCH          A10         Specific tool unavailable
declared tools.*missing          A10         Tools in config not in CLI
cto-config.*(missing|invalid)    A11         Config not loaded/synced
mcp.*failed to initialize        A12         MCP server init failure
tools-server.*unreachable        A12         Tools-server down
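To illustrate how these patterns map log lines to alert codes, here is a rough Python sketch. It is not the implementation in crates/healer/src/scanner.rs. It assumes case-insensitive matching (so the two "tool inventory mismatch" spellings collapse into one entry, both being A10) and treats the 60-second window as a hard cutoff. The function name `preflight_alert` is hypothetical.

```python
import re

PREFLIGHT_WINDOW_SECS = 60  # "within 60s of agent start"

# (regex, alert code) pairs copied from the Priority 1 patterns.
PREFLIGHT_PATTERNS = [
    (r"tool inventory mismatch", "A10"),
    (r"declared tools.*missing", "A10"),
    (r"cto-config.*(missing|invalid)", "A11"),
    (r"mcp.*failed to initialize", "A12"),
    (r"tools-server.*unreachable", "A12"),
]
# Assumption: patterns are matched case-insensitively.
_COMPILED = [(re.compile(p, re.IGNORECASE), code) for p, code in PREFLIGHT_PATTERNS]


def preflight_alert(line, secs_since_start):
    """Return the alert code for a log line, or None if nothing applies."""
    if secs_since_start > PREFLIGHT_WINDOW_SECS:
        return None  # outside the pre-flight window
    for regex, code in _COMPILED:
        if regex.search(line):
            return code  # first matching pattern wins
    return None
```

For example, `preflight_alert("MCP server failed to initialize", 30)` yields `"A12"`, while the same line seen 120 seconds after agent start yields `None`.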

Priority 2: Runtime Failures

Pattern                          Severity  Action
panicked at, fatal error         Critical  Immediate escalation
timeout, connection refused      High      Infrastructure issue
max retries exceeded             High      Agent exhausted attempts
permission denied.*filesystem    Critical  Can't read/write files
unauthorized|invalid token       Critical  Auth broken
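When several runtime patterns fire in the same log stream, the most severe one is usually what matters. The sketch below shows one way to pick it, under the assumptions that Critical outranks High, that matching is case-insensitive, and that the comma-separated patterns in the first two rows are alternatives. `worst_runtime_failure` is a hypothetical name, not part of Healer.

```python
import re

# (regex, severity, action) rows from the Priority 2 patterns. Rows listing
# two patterns are written here as regex alternation.
RUNTIME_PATTERNS = [
    (r"panicked at|fatal error", "Critical", "Immediate escalation"),
    (r"timeout|connection refused", "High", "Infrastructure issue"),
    (r"max retries exceeded", "High", "Agent exhausted attempts"),
    (r"permission denied.*filesystem", "Critical", "Can't read/write files"),
    (r"unauthorized|invalid token", "Critical", "Auth broken"),
]
SEVERITY_RANK = {"High": 1, "Critical": 2}  # assumption: Critical > High


def worst_runtime_failure(lines):
    """Return (severity, action, line) for the most severe match, or None."""
    worst = None
    for line in lines:
        for pattern, severity, action in RUNTIME_PATTERNS:
            if re.search(pattern, line, re.IGNORECASE):
                if worst is None or SEVERITY_RANK[severity] > SEVERITY_RANK[worst[0]]:
                    worst = (severity, action, line)
                break  # first matching row wins for this line
    return worst
```

On ties the earliest line is kept, so the result points at the first occurrence of the worst failure class.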

Priority 3: Lifecycle Issues

Pattern                 Meaning
template not found      Prompt template missing
prompt.*missing         Agent instructions not loaded
role.*undefined         Agent role not set
task context.*empty     Task details not injected

Dual-Model Architecture

┌────────────────────────────────────────────────────────┐
│  DUAL-MODEL HEALER ARCHITECTURE                        │
│                                                        │
│  DATA SOURCES                                          │
│  ├─ Loki (all pod logs)                                │
│  ├─ Kubernetes (CodeRuns, Pods, Events)                │
│  ├─ GitHub (PRs, comments, CI status)                  │
│  └─ CTO Config (expected tools, agent settings)        │
│          │                                             │
│          ▼                                             │
│  MODEL 1: EVALUATION AGENT                             │
│  ├─ Parses and comprehends ALL logs                    │
│  ├─ Correlates events across agents                    │
│  ├─ Identifies root cause                              │
│  └─ Creates GitHub Issue with analysis                 │
│          │                                             │
│          ▼                                             │
│  MODEL 2: REMEDIATION AGENT                            │
│  ├─ Reads the GitHub issue                             │
│  ├─ Implements the fix                                 │
│  ├─ Creates PR with changes                            │
│  └─ Marks issue resolved                               │
└────────────────────────────────────────────────────────┘

Session Notification Flow

MCP play() call
      │
      ▼
POST /api/v1/session/start
      │
      └─ Payload: { play_id, repository, cto_config: { agents, tools }, tasks: [...] }
      │
      ▼
Healer stores session with expected tools per agent
      │
      ▼
CodeRuns start with Healer already aware
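The notification step can be pictured as a single JSON POST. The sketch below builds, but does not send, such a request in Python. The top-level field names come from the payload outline above; the value shapes (lists of strings for agents and tools, a list of task objects), the base URL, and the function name `build_session_start` are assumptions for illustration.

```python
import json
import urllib.request

HEALER_BASE = "http://localhost:8083"  # assumed Healer address


def build_session_start(play_id, repository, agents, tools, tasks, base=HEALER_BASE):
    """Build (but do not send) the POST request for /api/v1/session/start."""
    payload = {
        "play_id": play_id,
        "repository": repository,
        "cto_config": {"agents": agents, "tools": tools},
        "tasks": tasks,
    }
    return urllib.request.Request(
        f"{base}/api/v1/session/start",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would submit it; keeping construction separate makes the payload easy to inspect before anything goes over the wire.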

Watch Logs

Pod Logs

Watch all CTO pods

kubectl logs -n cto -l app.kubernetes.io/part-of=cto -f --tail=100

Watch specific agent CodeRun

kubectl logs -n cto -l app=coderun -f

Loki Query

{namespace="cto"} |= "error" | json
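Outside of Grafana, the same LogQL query can be sent to Loki's HTTP API. The sketch below only builds the request URL for the `/loki/api/v1/query_range` endpoint. The Loki address (http://localhost:3100, Loki's default port) is an assumption about this deployment, and `loki_query_range_url` is a hypothetical helper name.

```python
import time
import urllib.parse

LOKI_BASE = "http://localhost:3100"  # assumption: Loki's default HTTP port
QUERY = '{namespace="cto"} |= "error" | json'  # the query shown above


def loki_query_range_url(query=QUERY, minutes=15, limit=100, now_ns=None, base=LOKI_BASE):
    """Build a query_range URL covering the last `minutes` minutes."""
    end_ns = time.time_ns() if now_ns is None else now_ns
    start_ns = end_ns - minutes * 60 * 1_000_000_000  # nanosecond timestamps
    params = urllib.parse.urlencode(
        {"query": query, "start": start_ns, "end": end_ns, "limit": limit}
    )
    return f"{base}/loki/api/v1/query_range?{params}"
```

The `now_ns` parameter exists so the time window can be pinned when reproducing an incident; by default the window ends at the current time.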

Pre-Flight Checklist (Verify within 60s)

For every agent run, Healer verifies:

Prompts

Agent type identified
Role matches task
Template loaded
Language context set

MCP Tools (from CTO Config)

CTO config loaded
Remote tools accessible
Local servers initialized
Tools-server reachable
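One way to picture the checklist is as a record of eight booleans that must all be true before the pre-flight window closes. The following Python sketch models that; the class and field names are hypothetical and simply restate the checklist items above, not Healer's internal types.

```python
from dataclasses import dataclass, fields


@dataclass
class PreflightChecklist:
    # Prompts
    agent_type_identified: bool = False
    role_matches_task: bool = False
    template_loaded: bool = False
    language_context_set: bool = False
    # MCP tools (from CTO config)
    cto_config_loaded: bool = False
    remote_tools_accessible: bool = False
    local_servers_initialized: bool = False
    tools_server_reachable: bool = False

    def failed(self):
        """Names of the checks that have not passed yet, in checklist order."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]

    def passed(self):
        """True only when every check has passed."""
        return not self.failed()
```

Reporting the names of failed checks, rather than a single pass/fail flag, keeps the link to the alert codes visible: an unreachable tools-server, for instance, corresponds to A12.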

Escalation

When issues are detected:

Evaluation Agent creates GitHub issue with root cause
Remediation Agent attempts fix (if automatable)
Discord notification for P0/P1 critical issues
Human escalation if remediation fails
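The escalation path can be sketched as a small decision function. This is an illustration only, with several assumptions: the steps run in the order listed above, the remediation attempt cap corresponds to `maxIterations` (3 in the example configuration), and issues that are not automatable go straight to a human. `escalate` and `try_fix` are hypothetical names.

```python
def escalate(priority, automatable, try_fix, max_iterations=3):
    """Return the ordered list of escalation steps taken for one issue.

    `try_fix` is a hypothetical callable returning True when a fix succeeds.
    """
    steps = ["github_issue"]  # Evaluation Agent files the issue
    fixed = False
    if automatable:
        for _ in range(max_iterations):  # assumed cap on remediation attempts
            steps.append("remediation_attempt")
            if try_fix():
                fixed = True
                break
    if priority in ("P0", "P1"):
        steps.append("discord_notification")
    if not fixed:
        steps.append("human_escalation")  # remediation failed or was skipped
    return steps
```

Returning the list of steps instead of performing side effects keeps the sketch easy to reason about; a real integration would replace each label with the corresponding GitHub, Discord, or paging call.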

Configuration

In cto-config.json:

{
  "defaults": {
    "play": {
      "healerEndpoint": "http://localhost:8083"
    },
    "remediation": {
      "maxIterations": 3,
      "syncTimeoutSecs": 300
    }
  }
}
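A consumer of this file might read the Healer settings as follows. The Python sketch below is a hedged example: it assumes the key layout shown above and, as a further assumption, falls back to the example values when a key is absent. `healer_settings` is a hypothetical helper, not part of the CTO tooling.

```python
import json

# Assumption: the example values double as fallbacks when a key is missing.
DEFAULTS = {
    "healerEndpoint": "http://localhost:8083",
    "maxIterations": 3,
    "syncTimeoutSecs": 300,
}


def healer_settings(config_text):
    """Extract Healer-related settings from the text of cto-config.json."""
    defaults = json.loads(config_text).get("defaults", {})
    play = defaults.get("play", {})
    remediation = defaults.get("remediation", {})
    return {
        "healerEndpoint": play.get("healerEndpoint", DEFAULTS["healerEndpoint"]),
        "maxIterations": remediation.get("maxIterations", DEFAULTS["maxIterations"]),
        "syncTimeoutSecs": remediation.get("syncTimeoutSecs", DEFAULTS["syncTimeoutSecs"]),
    }
```

Typical use would be `healer_settings(open("cto-config.json").read())`, with the returned endpoint fed to whatever client talks to the Healer API.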

Reference Documentation

docs/heal-play.md - Full Healer specification
crates/healer/ - Healer implementation
crates/healer/src/scanner.rs - Detection patterns
