swarm-self-heal

When to use this skill

Use this skill when the user wants to:

Diagnose why a multi-agent swarm feels "stuck" or partially offline
Check gateway + channel + lane liveness in one run
Perform bounded auto-recovery (restart + retry only)
Capture auditable receipts for incident timelines
Keep a primary watchdog lane plus a backup lane in place

Commands

# Install/refresh watchdog scripts + cron wiring
bash skills/swarm-self-heal/scripts/setup.sh

# Run an immediate canary check
bash skills/swarm-self-heal/scripts/check.sh

# Run watchdog directly (uses deployed workspace path)
bash ~/.openclaw/workspace-studio/scripts/anvil_watchdog.sh

# Optional: increase lane ping timeout for slower providers
PING_TIMEOUT_SECONDS=180 bash ~/.openclaw/workspace-studio/scripts/anvil_watchdog.sh

What it checks

Gateway health via openclaw health
Channel readiness via openclaw channels status --json --probe
Passive lane recency via openclaw status --json (latest OpenClaw-compatible)
Active lane probe only when stale for main, builder-1, builder-2, reviewer, designer
Bounded recovery with a single restart pass + targeted re-probe of infra failures

Output contract

The watchdog output includes:

timestamp
targets
ok_agents
failed_agents
actions
VERDICT
RECEIPT

Safety model

Bounded recovery only (single restart pass per run)
No destructive state wipes
No blind reinstall behavior
Recovery actions are explicit in output

Notes

Cron wiring sets both primary and backup watchdog lanes to xhigh thinking.
Telegram target is auto-derived from config when available, with a safe fallback.
Healthy runs can be summarized as a single line to reduce operator noise.

swarm-self-heal

Safety Notice

Copy this and send it to your AI assistant to learn

When to use this skill

Commands

What it checks

Output contract

Safety model

Notes

Source Transparency

Related Skills

Alephnet Node

Keep Protocol

Gateway Watchdog Lite