healer

Healer is the observability and self-healing layer for CTO Play workflows. It monitors pod logs via Loki, detects issues, and orchestrates remediations.

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Install skill "healer" with this command: npx skills add 5dlabs/cto/5dlabs-cto-healer

Healer Skill

When to Use

  • Monitoring Play workflow execution

  • Debugging agent failures (pre-flight, runtime)

  • Understanding detection patterns (A10, A11, A12)

  • Checking session status

Healer API Endpoints

Endpoint                   Method  Purpose
-------------------------  ------  -------------------------
/health                    GET     Health check
/api/v1/session/start      POST    MCP calls this on play()
/api/v1/session/{play_id}  GET     Get session details
/api/v1/sessions           GET     List all sessions
/api/v1/sessions/active    GET     List active sessions only
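The endpoints above can be wrapped in a few small helpers. This is a minimal sketch, assuming Healer listens on http://localhost:8083 (the address used in the curl example in this skill) and that responses are JSON. The helper names are hypothetical and not part of Healer.

```python
import json
import urllib.request
from urllib.parse import quote

# Assumption: base URL taken from the curl example in this skill.
HEALER_BASE = "http://localhost:8083"


def endpoint_url(path, base=HEALER_BASE):
    """Join the base URL and an endpoint path without doubling slashes."""
    return base.rstrip("/") + "/" + path.lstrip("/")


def session_url(play_id, base=HEALER_BASE):
    """URL for GET /api/v1/session/{play_id}; the ID is percent-encoded."""
    return endpoint_url("/api/v1/session/" + quote(play_id, safe=""), base)


def get_json(url, timeout=5.0):
    """Issue a GET request and decode the body as JSON (assumed format)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)
```

With a running Healer instance, get_json(endpoint_url("/api/v1/sessions/active")) would fetch the same data as the curl command in the next section.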

Check Active Sessions

curl http://localhost:8083/api/v1/sessions/active | jq

Detection Patterns

Priority 1: Pre-Flight Failures (within 60s of agent start)

Pattern                        Alert Code  Meaning
-----------------------------  ----------  ----------------------------
tool inventory mismatch        A10         Agent missing declared tools
Tool inventory MISMATCH        A10         Specific tool unavailable
declared tools.*missing        A10         Tools in config not in CLI
cto-config.*(missing|invalid)  A11         Config not loaded/synced
mcp.*failed to initialize      A12         MCP server init failure
tools-server.*unreachable      A12         Tools-server down
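To illustrate how these patterns map log lines to alert codes, here is a minimal sketch. The patterns and codes come from the table above; the function name, the case-insensitive matching (which folds the first two rows into one pattern), and the way the 60-second window is applied are assumptions for illustration. The actual logic lives in crates/healer/src/scanner.rs and may differ.

```python
import re

# Patterns and alert codes copied from the table above.
# Assumption: matching is case-insensitive, so the two "tool inventory
# mismatch" rows collapse into a single pattern.
PREFLIGHT_PATTERNS = [
    (re.compile(r"tool inventory mismatch", re.IGNORECASE), "A10"),
    (re.compile(r"declared tools.*missing", re.IGNORECASE), "A10"),
    (re.compile(r"cto-config.*(missing|invalid)", re.IGNORECASE), "A11"),
    (re.compile(r"mcp.*failed to initialize", re.IGNORECASE), "A12"),
    (re.compile(r"tools-server.*unreachable", re.IGNORECASE), "A12"),
]

PREFLIGHT_WINDOW_SECS = 60  # "within 60s of agent start"


def preflight_alert(line, secs_since_start):
    """Return the alert code for a log line, or None if nothing matches
    or the line falls outside the pre-flight window."""
    if secs_since_start > PREFLIGHT_WINDOW_SECS:
        return None
    for pattern, code in PREFLIGHT_PATTERNS:
        if pattern.search(line):
            return code
    return None
```

For example, preflight_alert("cto-config.json is invalid", 10) yields "A11", while the same line 120 seconds after agent start yields None.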

Priority 2: Runtime Failures

Pattern                        Severity  Action
-----------------------------  --------  ------------------------
panicked at, fatal error       Critical  Immediate escalation
timeout, connection refused    High      Infrastructure issue
max retries exceeded           High      Agent exhausted attempts
permission denied.*filesystem  Critical  Can't read/write files
unauthorized|invalid token     Critical  Auth broken
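The runtime table can be read as a severity lookup over a batch of log lines. The sketch below is illustrative only: the severities and action text are taken from the table, while the function name, the numeric ranking, and the reading of comma-separated entries as alternatives are assumptions.

```python
import re

# Severity and action text copied from the table above. Treating the
# comma-separated entries as alternatives is an assumption.
RUNTIME_RULES = [
    (re.compile(r"panicked at|fatal error"), "Critical", "Immediate escalation"),
    (re.compile(r"timeout|connection refused"), "High", "Infrastructure issue"),
    (re.compile(r"max retries exceeded"), "High", "Agent exhausted attempts"),
    (re.compile(r"permission denied.*filesystem"), "Critical", "Can't read/write files"),
    (re.compile(r"unauthorized|invalid token"), "Critical", "Auth broken"),
]

# Hypothetical ranking used only to compare findings.
SEVERITY_RANK = {"Critical": 2, "High": 1}


def worst_runtime_finding(lines):
    """Scan log lines and return (severity, action, line) for the most
    severe match, or None when no rule matches. Ties keep the earliest line."""
    worst = None
    for line in lines:
        for pattern, severity, action in RUNTIME_RULES:
            if pattern.search(line):
                if worst is None or SEVERITY_RANK[severity] > SEVERITY_RANK[worst[0]]:
                    worst = (severity, action, line)
                break  # first matching rule wins for this line
    return worst
```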

Priority 3: Lifecycle Issues

Pattern              Meaning
-------------------  -----------------------------
template not found   Prompt template missing
prompt.*missing      Agent instructions not loaded
role.*undefined      Agent role not set
task context.*empty  Task details not injected

Dual-Model Architecture

┌──────────────────────────────────────────────────┐
│  DUAL-MODEL HEALER ARCHITECTURE                  │
│                                                  │
│  DATA SOURCES                                    │
│  ├─ Loki (all pod logs)                          │
│  ├─ Kubernetes (CodeRuns, Pods, Events)          │
│  ├─ GitHub (PRs, comments, CI status)            │
│  └─ CTO Config (expected tools, agent settings)  │
│        │                                         │
│        ▼                                         │
│  MODEL 1: EVALUATION AGENT                       │
│  ├─ Parses and comprehends ALL logs              │
│  ├─ Correlates events across agents              │
│  ├─ Identifies root cause                        │
│  └─ Creates GitHub Issue with analysis           │
│        │                                         │
│        ▼                                         │
│  MODEL 2: REMEDIATION AGENT                      │
│  ├─ Reads the GitHub issue                       │
│  ├─ Implements the fix                           │
│  ├─ Creates PR with changes                      │
│  └─ Marks issue resolved                         │
└──────────────────────────────────────────────────┘

Session Notification Flow

MCP play() call
      │
      ▼
POST /api/v1/session/start
      │
      └─ Payload: { play_id, repository, cto_config: { agents, tools }, tasks: [...] }
      │
      ▼
Healer stores session with expected tools per agent
      │
      ▼
CodeRuns start with Healer already aware
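The session-start call in this flow could be assembled as follows. This is a sketch under stated assumptions: the top-level field names follow the payload outline above, but the value shapes (lists of strings, task dictionaries), the function name, and the base URL are illustrative and not confirmed by the Healer source. The request is built but not sent.

```python
import json
import urllib.request


def build_session_start_request(play_id, repository, agents, tools, tasks,
                                base="http://localhost:8083"):
    """Build (but do not send) the POST /api/v1/session/start request.

    Field names follow the payload outline above; the value shapes are
    assumptions for illustration.
    """
    payload = {
        "play_id": play_id,
        "repository": repository,
        "cto_config": {"agents": agents, "tools": tools},
        "tasks": tasks,
    }
    return urllib.request.Request(
        base.rstrip("/") + "/api/v1/session/start",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the returned object to urllib.request.urlopen would send it; keeping construction separate makes the payload easy to inspect first.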

Watch Logs

Pod Logs

Watch all CTO pods

kubectl logs -n cto -l app.kubernetes.io/part-of=cto -f --tail=100

Watch specific agent CodeRun

kubectl logs -n cto -l app=coderun -f

Loki Query

{namespace="cto"} |= "error" | json
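The same LogQL query can be sent to Loki's HTTP API through its query_range endpoint. The sketch below only builds the request URL. The Loki address passed in is an assumption (this skill does not state where Loki is reachable), and the function name is hypothetical.

```python
import time
from urllib.parse import urlencode


def loki_query_range_url(loki_base, logql, lookback_secs=300, limit=100, now=None):
    """Build a Loki query_range URL covering the last `lookback_secs` seconds.

    start and end are Unix timestamps in nanoseconds, one of the formats
    the Loki HTTP API accepts.
    """
    end_s = time.time() if now is None else now
    end_ns = int(end_s * 10**9)
    start_ns = end_ns - lookback_secs * 10**9
    params = {
        "query": logql,
        "start": start_ns,
        "end": end_ns,
        "limit": limit,
    }
    return loki_base.rstrip("/") + "/loki/api/v1/query_range?" + urlencode(params)
```

For instance, loki_query_range_url("http://loki:3100", '{namespace="cto"} |= "error" | json') targets the last five minutes of error lines, where http://loki:3100 is a placeholder address.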

Pre-Flight Checklist (Verify within 60s)

For every agent run, Healer verifies:

Prompts

  • Agent type identified

  • Role matches task

  • Template loaded

  • Language context set

MCP Tools (from CTO Config)

  • CTO config loaded

  • Remote tools accessible

  • Local servers initialized

  • Tools-server reachable
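The tool checks in this list amount to comparing the tools each agent is expected to have (stored from the session payload) against what the agent actually reports. A minimal sketch of that comparison follows; the function name and the dictionary shapes are assumptions, not Healer's data model.

```python
def missing_tools(expected_by_agent, available_by_agent):
    """Return {agent: sorted list of declared-but-unavailable tools}.

    Agents with a complete inventory are omitted from the result.
    Both arguments map an agent name to an iterable of tool names
    (an assumed shape for illustration).
    """
    report = {}
    for agent, expected in expected_by_agent.items():
        missing = set(expected) - set(available_by_agent.get(agent, ()))
        if missing:
            report[agent] = sorted(missing)
    return report
```

A non-empty result corresponds to the situation the A10 alert code describes: an agent is missing tools it declared.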

Escalation

When issues are detected:

  • Evaluation Agent creates GitHub issue with root cause

  • Remediation Agent attempts fix (if automatable)

  • Discord notification for P0/P1 critical issues

  • Human escalation if remediation fails
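The escalation steps above can be expressed as a small decision function. This is an illustrative sketch, not Healer's implementation: the function name and step labels are hypothetical, and treating a non-automatable issue as requiring human escalation is an assumption the list does not state explicitly.

```python
def escalation_steps(priority, automatable, remediation_succeeded=False):
    """Return the ordered escalation steps for one detected issue.

    priority is a label such as "P0" or "P1". Assumption: an issue that
    cannot be remediated automatically is escalated to a human.
    """
    steps = ["evaluation_agent: create GitHub issue with root cause"]
    if automatable:
        steps.append("remediation_agent: attempt fix")
    if priority in ("P0", "P1"):
        steps.append("notify: Discord")
    if not (automatable and remediation_succeeded):
        steps.append("escalate: human")
    return steps
```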

Configuration

In cto-config.json:

{
  "defaults": {
    "play": {
      "healerEndpoint": "http://localhost:8083"
    },
    "remediation": {
      "maxIterations": 3,
      "syncTimeoutSecs": 300
    }
  }
}
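A consumer of this configuration might read the Healer-related keys as sketched below. The key names come from the configuration above; the function name is hypothetical, and using the values shown as fallbacks when a key is absent is an assumption, since the real defaults are not documented here.

```python
import json

# Assumption: fall back to the values shown in the configuration above.
FALLBACKS = {
    "healerEndpoint": "http://localhost:8083",
    "maxIterations": 3,
    "syncTimeoutSecs": 300,
}


def healer_settings(config_text):
    """Extract Healer-related settings from cto-config.json text."""
    defaults = json.loads(config_text).get("defaults", {})
    play = defaults.get("play", {})
    remediation = defaults.get("remediation", {})
    return {
        "healerEndpoint": play.get("healerEndpoint", FALLBACKS["healerEndpoint"]),
        "maxIterations": remediation.get("maxIterations", FALLBACKS["maxIterations"]),
        "syncTimeoutSecs": remediation.get("syncTimeoutSecs", FALLBACKS["syncTimeoutSecs"]),
    }
```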

Reference Documentation

  • docs/heal-play.md - Full Healer specification

  • crates/healer/ - Healer implementation

  • crates/healer/src/scanner.rs - Detection patterns
