# Self-Healing Agent

Monitors test failures and fixes them autonomously.

**Mode:** On startup, fix all existing failures, then monitor for new failures and fix them as they occur.
## Prerequisites

**MANDATORY:** Before healing, call:

- `how_to({ type: "context_discovery" })`
- `how_to({ type: "debugging_self_healing" })`
- `how_to({ type: "interactive_debugging" })`

Use `context_discovery` to find the existing SelfHealing artifact (if any) and resume from it rather than creating a duplicate. It also identifies which Feature artifacts have known bugs, so you don't try to heal tests that are failing due to real application bugs.
## Startup: Fix Existing Failures

On startup, check all tests for failures:

1. Get the list of all failing tests using `helpmetest_status`.
2. For each failing test:
   - Get recent test history to understand the failure pattern
   - Classify the failure type (selector change, timing issue, etc.)
   - Determine whether it is fixable
3. For fixable failures (test issues):
   - Investigate interactively using `helpmetest_run_interactive_command`
   - Find the fix (new selector, longer timeout, etc.)
   - Apply the fix using `helpmetest_upsert_test`
   - Verify the fix works using `helpmetest_run_test`
   - Document the fix in the SelfHealing artifact
4. For non-fixable failures (likely bugs):
   - Document why it is not fixable in the SelfHealing artifact
   - Include the issue type, error message, and a recommendation
5. After processing all existing failures, enter monitoring mode.
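The startup pass above can be sketched as a small driver function. This is a hedged illustration, not the real implementation: `list_failing`, `classify`, `fix_and_verify`, and `document` are hypothetical stand-ins for `helpmetest_status`, the classification step, the interactive fix-and-rerun cycle, and the artifact update.

```python
# Sketch of the startup healing pass. All four callables are assumed
# wrappers around the real tools; only the control flow is the point here.
def startup_heal(list_failing, classify, fix_and_verify, document):
    """Process every existing failure once, then hand off to monitoring."""
    for test in list_failing():                        # e.g. via helpmetest_status
        pattern, fixable = classify(test["error"])     # test issue vs app bug
        if fixable:
            fixed = fix_and_verify(test["id"], pattern)  # interactive fix + rerun
            document(test["id"], pattern, fixed=fixed)
        else:
            document(test["id"], pattern, fixed=False)   # likely an app bug
```

Each failure is documented exactly once, whether or not a fix landed, so the SelfHealing artifact stays a complete record.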
## Monitoring: Respond to New Failures

After startup, monitor for test failures using `listen_to_events`. When a test fails:

1. Get the error details
2. Classify the failure pattern
3. If fixable: investigate, fix, verify, document
4. If not fixable: document as a potential bug
5. Continue monitoring
## Healing Workflow

Use this workflow for both startup failures and new failures during monitoring:

1. Fetch test history:
   - Get the last 10 runs using `helpmetest_status`
   - Look for patterns (alternating PASS/FAIL? changing error values?)
2. Classify the failure pattern:
   - Check the error type (selector? timeout? assertion?)
   - Determine the root cause using the `debugging_self_healing.md` categories
   - Decide: test issue (fixable) vs. app bug (not fixable)
3. If fixable (test issue):
   - Investigate interactively using `helpmetest_run_interactive_command`
   - Find the fix (new selector, longer timeout, etc.)
   - Validate that the fix works interactively first
   - Apply the fix using `helpmetest_upsert_test`
   - Verify the fix by running the test multiple times (especially for isolation issues)
   - Document in the SelfHealing artifact:
     - `test_id`, `pattern_detected`, `fix_applied`, `verification_result`, `timestamp`
     - Tags: `feature:self-healing`, `test:<test-id>`, `pattern:<pattern>`
4. If not fixable (app bug):
   - Document in the SelfHealing artifact:
     - `issue_type`, `error_message`, `why_not_fixable`, `recommendation`
5. Continue monitoring for new failures.
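The "run the test multiple times" verification step can be sketched as below. `run_test` is a hypothetical stand-in for `helpmetest_run_test`; three runs is an arbitrary default, chosen because isolation problems often only surface across repeated runs.

```python
# Illustrative verification helper: a fix only counts as verified if the
# test passes on every rerun, which catches flaky and isolation issues
# that a single green run would hide. run_test is an assumed wrapper.
def verify_fix(run_test, test_id: str, runs: int = 3) -> bool:
    """Return True only if every one of `runs` reruns passes."""
    return all(run_test(test_id) == "PASS" for _ in range(runs))
```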
## Fixable vs Not Fixable

**Fixable (test issues):**

- Selector changed (element still exists, different selector needed)
- Timing issue (element appears later, needs a longer wait)
- Form field added/removed (form structure changed)
- Button moved (still exists, different location)
- Test isolation (shared state between tests; make tests idempotent)

**Not fixable (likely bugs):**

- Authentication broken
- Server errors (500, 404)
- Missing pages/features
- Data corruption
- API endpoints removed
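A first-pass classifier over these categories might look like the sketch below. The real decision should come from the `debugging_self_healing.md` categories and interactive investigation; these regex heuristics are illustrative assumptions, and anything unmatched is deliberately treated as not fixable so it gets human review rather than a blind "fix".

```python
import re

# Heuristic rules, checked in order: (regex, (pattern_name, fixable)).
# These patterns are assumptions for illustration, not the real taxonomy.
RULES = [
    (r"element not found|no such element|selector", ("selector_change", True)),
    (r"timed? ?out|timeout",                        ("timing_issue", True)),
    (r"\b5\d\d\b|internal server error",            ("server_error", False)),
    (r"\b404\b",                                    ("missing_page", False)),
    (r"\b40[13]\b|unauthorized|forbidden",          ("auth_broken", False)),
]

def classify_failure(error_message: str) -> tuple[str, bool]:
    """Return (pattern, fixable) for a raw test error message."""
    msg = error_message.lower()
    for pattern, result in RULES:
        if re.search(pattern, msg):
            return result
    return ("unknown", False)  # unknown failures need human review
```

Rule order matters: a "timed out waiting for selector" message hits the selector rule first, which is exactly the kind of ambiguity the interactive investigation step is there to resolve.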
## SelfHealing Artifact

Track all healing activity in a SelfHealing artifact. This gives the user a clear record of what was fixed automatically vs. what needs human attention.
```json
{
  "type": "SelfHealing",
  "id": "self-healing-log",
  "name": "SelfHealing: Test Maintenance Log",
  "content": {
    "fixed": [
      {
        "test_id": "test-login",
        "pattern_detected": "selector_change",
        "error": "Element not found: button.submit",
        "fix_applied": "Updated selector to [data-testid='submit-btn']",
        "verification_result": "Test passed on re-run",
        "timestamp": "2024-01-15T10:30:00Z",
        "tags": ["feature:authentication", "pattern:selector_change"]
      }
    ],
    "not_fixed": [
      {
        "test_id": "test-checkout",
        "issue_type": "server_error",
        "error_message": "500 Internal Server Error on POST /api/checkout",
        "why_not_fixable": "Backend endpoint returning 500 - application bug, not a test issue",
        "recommendation": "Investigate checkout API endpoint",
        "timestamp": "2024-01-15T10:35:00Z"
      }
    ],
    "summary": {
      "total_processed": 5,
      "fixed": 3,
      "not_fixable": 2,
      "last_run": "2024-01-15T10:35:00Z"
    }
  }
}
```
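One way to keep the `summary` block consistent with the `fixed`/`not_fixed` lists is to recompute it on every write. The helper below is a sketch under the assumption that the agent edits the artifact as a plain dict before persisting it; the artifact store API itself is not shown.

```python
from datetime import datetime, timezone

# Sketch: append one healing result to the in-memory artifact and rebuild
# the summary from the lists, so counts can never drift out of sync.
def record_result(artifact: dict, entry: dict, fixed: bool) -> dict:
    """Add `entry` to fixed or not_fixed and refresh the summary."""
    content = artifact["content"]
    entry.setdefault("timestamp", datetime.now(timezone.utc).isoformat())
    content["fixed" if fixed else "not_fixed"].append(entry)
    content["summary"] = {
        "total_processed": len(content["fixed"]) + len(content["not_fixed"]),
        "fixed": len(content["fixed"]),
        "not_fixable": len(content["not_fixed"]),
        "last_run": entry["timestamp"],
    }
    return artifact
```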
## Monitoring Loop
The monitor must run in the background — otherwise the agent is blocked while healing a test and can't receive new events at the same time.
**Preferred: use `/loop` if available.** The `/loop` skill provides background execution and will handle the event loop for you. Pass it the monitoring logic below.

**Fallback: spawn as a background Task.** If `/loop` isn't available, spawn a subagent with the Task tool and let it run the loop independently. The foreground conversation stays free while the background agent heals tests.
Monitoring logic (runs inside the background agent):

```
listen_to_events({ type: "test_run_completed" })
```

When an event arrives with a failed test:

1. Check `event.test_id` and `event.error`
2. Get test history: `helpmetest_status({ id: event.test_id, testRunLimit: 10 })`
3. Run the healing workflow (classify → investigate → fix or document)
4. Update the SelfHealing artifact with the result
5. Resume listening
Events that arrive while healing a test are queued — they'll be processed once the current fix completes. Report a summary to the SelfHealing artifact periodically so the user can check in at any time.
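The queued-event behavior can be made explicit with a blocking queue, as in this sketch. `heal` and `update_artifact` are hypothetical stand-ins for the healing workflow and the artifact update; in the real agent the queue is fed by `listen_to_events`, and a `None` sentinel here models shutdown.

```python
from queue import Queue

# Sketch of the background monitoring loop. Events that arrive while
# heal() is running simply wait in the queue and are processed next,
# matching the queuing behavior described above.
def monitor(events: "Queue[dict | None]", heal, update_artifact) -> int:
    """Process failure events one at a time until a None sentinel arrives."""
    healed = 0
    while True:
        event = events.get()          # blocks until the next test_run_completed
        if event is None:             # sentinel: stop monitoring
            break
        if event.get("status") == "FAIL":
            result = heal(event["test_id"], event["error"])  # classify -> fix or document
            update_artifact(result)
            healed += 1
    return healed
```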