When to use this skill
Use this skill when the user wants to:
- Diagnose why a multi-agent swarm feels "stuck" or partially offline
- Check gateway + channel + lane liveness in one run
- Perform bounded auto-recovery (restart + retry only)
- Capture auditable receipts for incident timelines
- Keep a primary watchdog lane plus a backup lane in place
Commands
# Install/refresh watchdog scripts + cron wiring
bash skills/swarm-self-heal/scripts/setup.sh
# Run an immediate canary check
bash skills/swarm-self-heal/scripts/check.sh
# Run watchdog directly (uses deployed workspace path)
bash ~/.openclaw/workspace-studio/scripts/anvil_watchdog.sh
# Optional: increase lane ping timeout for slower providers
PING_TIMEOUT_SECONDS=180 bash ~/.openclaw/workspace-studio/scripts/anvil_watchdog.sh
What it checks
- Gateway health via openclaw health
- Channel readiness via openclaw channels status --json --probe
- Passive lane recency via openclaw status --json (latest OpenClaw-compatible)
- Active lane probe only when stale for main, builder-1, builder-2, reviewer, designer
- Bounded recovery with a single restart pass + targeted re-probe of infra failures
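The "active probe only when stale" rule above can be sketched as a small gate. This is a minimal illustration and not the shipped watchdog: the STALE_AFTER_SECONDS threshold, the function names, and the "lane age_seconds" input format are all assumptions, and how lane ages are extracted from openclaw status --json is not shown here.

```bash
#!/usr/bin/env bash
# Illustrative sketch only; not the actual anvil_watchdog.sh logic.

# Hypothetical threshold: the real staleness window is not documented here.
STALE_AFTER_SECONDS="${STALE_AFTER_SECONDS:-900}"

# Succeed (exit 0) when a lane's last activity is older than the threshold.
# $1 = seconds since the lane was last seen active.
needs_active_probe() {
  local age_seconds="$1"
  [ "$age_seconds" -gt "$STALE_AFTER_SECONDS" ]
}

# Read "lane age_seconds" pairs on stdin (assumed format) and print only
# the lanes that would need an active probe.
stale_lanes() {
  local lane age
  while read -r lane age; do
    if needs_active_probe "$age"; then
      echo "$lane"
    fi
  done
}
```

For example, piping `printf 'main 10\nbuilder-1 5000\n'` into `stale_lanes` would list only builder-1, so recently active lanes are never pinged.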
Output contract
The watchdog output includes:
- timestamp
- targets
- ok_agents
- failed_agents
- actions
- VERDICT
- RECEIPT
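When wiring the watchdog into other tooling, it may help to confirm that a captured run mentions every field in the contract. The sketch below assumes only that each field name appears literally somewhere in the output; the exact output layout is not specified here, and missing_fields is a hypothetical helper, not part of the skill.

```bash
#!/usr/bin/env bash
# Illustrative sketch only; assumes field names appear verbatim in the output.

# Field names taken from the output contract.
CONTRACT_FIELDS="timestamp targets ok_agents failed_agents actions VERDICT RECEIPT"

# Print every contract field that does not appear as a whole word in the text.
# $1 = captured watchdog output.
missing_fields() {
  local text="$1" field
  for field in $CONTRACT_FIELDS; do
    printf '%s\n' "$text" | grep -qwF -- "$field" || echo "$field"
  done
}
```

A possible use is `missing_fields "$(bash ~/.openclaw/workspace-studio/scripts/anvil_watchdog.sh)"`, where empty output means all seven fields were found.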
Safety model
- Bounded recovery only (single restart pass per run)
- No destructive state wipes
- No blind reinstall behavior
- Recovery actions are explicit in output
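The bounded-recovery rule can be illustrated with a short sketch: check, restart at most once, re-check once, and print every action taken. The restart_services placeholder, the bounded_recover name, and the status strings are hypothetical; the real restart command and verdict values are not documented here.

```bash
#!/usr/bin/env bash
# Illustrative sketch only; not the actual recovery code.

# Hypothetical placeholder: the real restart command is not documented here.
restart_services() {
  echo "action:restart"
}

# Run a check; on failure restart exactly once, then re-check once.
# $1 = name of a function or command that exits 0 when healthy.
bounded_recover() {
  local check="$1"
  if "$check"; then
    echo "status:healthy"
    return 0
  fi
  restart_services   # single restart pass; never looped
  if "$check"; then
    echo "status:recovered"
    return 0
  fi
  echo "status:failed"
  return 1
}
```

Assuming openclaw health exits non-zero when the gateway is unhealthy, a wrapper such as `gateway_ok() { openclaw health >/dev/null 2>&1; }` could be passed as `bounded_recover gateway_ok`. Because there is no loop, a persistent failure ends the run with a failed status instead of repeated restarts.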
Notes
- Cron wiring sets both primary and backup watchdog lanes to xhigh thinking.
- Telegram target is auto-derived from config when available, with a safe fallback.
- Healthy runs can be summarized as a single line to reduce operator noise.