Error Recovery Protocol

When an error occurs, stop, work out what kind of error it is, and apply the matching recovery strategy. Never retry blindly.

Core principle: Every error carries a signal. Read the signal first, then act.

Error Classification

Classify every error into one of 4 categories — the recovery strategy depends on the category:

Transient Error

Retrying usually fixes it. Infrastructure or network related.

Examples: timeout, rate limit (429), connection drop, temporary service outage
Strategy: Wait & Retry with exponential backoff

Configuration Error

Environment or setup issue. Code is correct but setup is wrong.

Examples: missing env variable, wrong file path, permission denied, missing dependency
Strategy: Fix & Continue — identify the issue, fix it, re-run

Logic Error

Code or approach is wrong. Retrying produces the same error.

Examples: KeyError, TypeError, wrong algorithm, expectation mismatch
Strategy: Alternative Approach — try a different method

Permanent / External Error

Outside your control and cannot be fixed from your side. Caused by an external service or a permission boundary.

Examples: 403 Forbidden, 404 Not Found, quota exceeded, API deprecated
Strategy: Escalation — inform the user, ask for direction
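
A minimal sketch of the classification step, assuming errors arrive either as a Python exception or as an HTTP status code. The names Category and classify are hypothetical, and mapping 503 to "temporary service outage" is an assumption; the other mappings follow the examples listed above.

```python
from enum import Enum
from typing import Optional


class Category(Enum):
    TRANSIENT = "Transient"
    CONFIG = "Config"
    LOGIC = "Logic"
    PERMANENT = "Permanent"


def classify(exc: Optional[BaseException] = None,
             status: Optional[int] = None) -> Optional[Category]:
    """Return a category, or None when it cannot be determined."""
    # An HTTP status code, when present, is the clearest signal.
    if status is not None:
        if status in (429, 503):  # rate limit; 503 is an assumed outage code
            return Category.TRANSIENT
        if status in (403, 404):
            return Category.PERMANENT
        return None
    # Built-in exception types that match the examples in each category.
    if isinstance(exc, (TimeoutError, ConnectionError)):
        return Category.TRANSIENT
    if isinstance(exc, (FileNotFoundError, PermissionError, ModuleNotFoundError)):
        return Category.CONFIG
    if isinstance(exc, (KeyError, TypeError)):
        return Category.LOGIC
    return None  # unknown: the caller should escalate
```

Treat the result as a first guess only. A lookup like this cannot tell a missing environment variable (which also surfaces as a KeyError when read through os.environ) from a real logic bug, so the error message still needs to be read.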

Retry Strategy

For transient errors, use exponential backoff:

Retry 1: Retry immediately
Retry 2: Wait 2 seconds, then retry
Retry 3: Wait 4 seconds, then retry
Still failing: stop retrying -> move on or escalate

Maximum retries: 3. If all 3 fail -> re-evaluate the category; if the error still looks transient, escalate.
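
The schedule can be sketched as a small helper. This assumes three retries with waits of 0, 2 and 4 seconds and treats TimeoutError and ConnectionError as the transient signals; retry_with_backoff and RetriesExhausted are hypothetical names, and the sleep parameter exists only so the waits can be observed in tests.

```python
import time
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


class RetriesExhausted(Exception):
    """Every retry failed; the caller should re-evaluate or escalate."""


def retry_with_backoff(operation: Callable[[], T],
                       delays: Sequence[float] = (0, 2, 4),
                       transient: tuple = (TimeoutError, ConnectionError),
                       sleep: Callable[[float], None] = time.sleep) -> T:
    """Run `operation` once, then retry once after each delay in `delays`."""
    try:
        return operation()
    except transient as exc:
        last_error = exc
    for delay in delays:
        sleep(delay)  # 0 means "retry immediately"
        try:
            return operation()
        except transient as exc:
            last_error = exc
    # Non-transient exceptions are never caught here and propagate at once.
    raise RetriesExhausted(f"{len(delays)} retries failed") from last_error
```

Only the exception types in transient are retried, so a logic error such as a KeyError surfaces immediately instead of being repeated.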

Rate limit (429) special rule:

If response has Retry-After header, wait that duration
Otherwise wait 60 seconds, then retry
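
A sketch of how the wait time could be derived from the response headers. HTTP allows Retry-After to be either a number of seconds or an HTTP date, so both forms are handled; rate_limit_wait is a hypothetical helper, and the plain dictionary lookup assumes the header name is spelled exactly "Retry-After".

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Mapping, Optional

DEFAULT_WAIT_SECONDS = 60.0  # fallback from the rule above


def rate_limit_wait(headers: Mapping[str, str],
                    now: Optional[datetime] = None) -> float:
    """Return how many seconds to wait before retrying a 429 response."""
    value = headers.get("Retry-After")
    if value is None:
        return DEFAULT_WAIT_SECONDS
    value = value.strip()
    if value.isdigit():  # delay-seconds form, e.g. "120"
        return float(value)
    try:  # HTTP-date form, e.g. "Wed, 21 Oct 2015 07:28:00 GMT"
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return DEFAULT_WAIT_SECONDS  # unparseable header: use the fallback
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())
```

Falling back to 60 seconds for an unparseable header is a choice made for this sketch, not something the rule above specifies.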

Decision Tree

Error received
    |
Classify the error
    |
+------------------------------------+
| Transient?  -> Wait & Retry (max 3)|
| Config?     -> Fix & Continue      |
| Logic?      -> Alternative approach|
| Permanent?  -> Escalation          |
+------------------------------------+
    |
Every strategy fails -> Escalation
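
The tree reduces to a lookup with escalation as the fallback. A minimal sketch, with STRATEGY_BY_CATEGORY and next_strategy as hypothetical names:

```python
from typing import Optional

# Category -> first strategy to try, as listed in the decision tree.
STRATEGY_BY_CATEGORY = {
    "Transient": "Wait & Retry",
    "Config": "Fix & Continue",
    "Logic": "Alternative approach",
    "Permanent": "Escalation",
}


def next_strategy(category: Optional[str],
                  all_strategies_failed: bool = False) -> str:
    """Pick the strategy for a category; fall back to escalation."""
    if all_strategies_failed:
        return "Escalation"
    # Unknown or undetermined categories also end in escalation.
    return STRATEGY_BY_CATEGORY.get(category, "Escalation")
```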

Escalation Protocol

Escalate to the user when:

3 retries have failed
The error is permanent / external
2 different strategies have failed in a row
The error category cannot be determined
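
The four triggers can be checked in one place. This is a sketch that assumes the caller keeps simple counters; RecoveryState and should_escalate are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RecoveryState:
    category: Optional[str]     # None when the category cannot be determined
    retries_failed: int = 0     # failed retries of the same operation
    strategies_failed: int = 0  # different strategies that failed in a row


def should_escalate(state: RecoveryState) -> bool:
    """Return True when any escalation trigger listed above applies."""
    return (
        state.category is None
        or state.category == "Permanent"
        or state.retries_failed >= 3
        or state.strategies_failed >= 2
    )
```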

ERROR ESCALATION
================================
Failed step : [step name]
Error       : [error message summary]
Category    : [Transient / Config / Logic / Permanent]
Tried       : [what was attempted — short list]
Result      : All strategies exhausted
================================
Options:
  A) [Alternative approach suggestion]
  B) [Simpler / partial solution]
  C) Skip this step, continue
  D) Stop the task
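
One way to fill in the template programmatically, shown as a sketch. format_escalation is a hypothetical helper, and the sample values in the usage line are invented for illustration.

```python
from typing import Sequence

RULE = "=" * 32


def format_escalation(step: str, error: str, category: str,
                      tried: Sequence[str], options: Sequence[str]) -> str:
    """Render an escalation message in the layout of the template above."""
    lines = [
        "ERROR ESCALATION",
        RULE,
        f"Failed step : {step}",
        f"Error       : {error}",
        f"Category    : {category}",
        f"Tried       : {', '.join(tried)}",
        "Result      : All strategies exhausted",
        RULE,
        "Options:",
    ]
    # Label the options A), B), C), ... in the order given.
    for index, option in enumerate(options):
        lines.append(f"  {chr(ord('A') + index)}) {option}")
    return "\n".join(lines)


print(format_escalation("fetch data", "HTTP 403", "Permanent",
                        ["retry", "new token"],
                        ["Use cached data", "Stop the task"]))
```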

Partial Success

For bulk operations where some items succeed and some fail:

PARTIAL SUCCESS
================================
Successful : N / Total
Failed     : M items
================================
Failed items:
  - [item]: [reason]

Options:
  A) Retry only failed items
  B) Continue with successful items, skip failed
  C) Cancel all
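
A sketch of a bulk runner that keeps going after a failure and then produces the report above. run_bulk and format_partial_success are hypothetical names; catching Exception broadly is deliberate here so that one bad item does not stop the batch.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def run_bulk(items: Iterable[str],
             action: Callable[[str], None]) -> Tuple[List[str], Dict[str, str]]:
    """Apply `action` to every item, collecting successes and failure reasons."""
    succeeded: List[str] = []
    failed: Dict[str, str] = {}
    for item in items:
        try:
            action(item)
            succeeded.append(item)
        except Exception as exc:  # record the reason and continue
            failed[item] = str(exc) or type(exc).__name__
    return succeeded, failed


def format_partial_success(succeeded: List[str], failed: Dict[str, str]) -> str:
    """Render the partial-success report in the layout shown above."""
    total = len(succeeded) + len(failed)
    rule = "=" * 32
    lines = [
        "PARTIAL SUCCESS",
        rule,
        f"Successful : {len(succeeded)} / {total}",
        f"Failed     : {len(failed)} items",
        rule,
        "Failed items:",
    ]
    lines += [f"  - {item}: {reason}" for item, reason in failed.items()]
    return "\n".join(lines)
```

The failed dictionary doubles as the input for option A: passing its keys back into run_bulk retries only the failed items.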

Error Log

Log every error and recovery attempt:

[ERROR LOG]
Step     : [step name / number]
Error    : [message]
Category : [type]
Attempt 1: [strategy] -> [result]
Attempt 2: [strategy] -> [result]
Result   : Recovered / Escalated
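
The log entry can be modelled as a small record that accumulates attempts. A minimal sketch, with ErrorLog as a hypothetical name and the field layout copied from the template above:

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class ErrorLog:
    step: str
    error: str
    category: str
    attempts: List[Tuple[str, str]] = field(default_factory=list)

    def record(self, strategy: str, result: str) -> None:
        """Append one recovery attempt and its outcome."""
        self.attempts.append((strategy, result))

    def render(self, result: str) -> str:
        """Render the entry; `result` is 'Recovered' or 'Escalated'."""
        lines = [
            "[ERROR LOG]",
            f"Step     : {self.step}",
            f"Error    : {self.error}",
            f"Category : {self.category}",
        ]
        for number, (strategy, outcome) in enumerate(self.attempts, start=1):
            lines.append(f"Attempt {number}: {strategy} -> {outcome}")
        lines.append(f"Result   : {result}")
        return "\n".join(lines)
```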

When to Skip

Skip the recovery protocol when:

The error is expected behavior (e.g., "file not found" when checking existence)
The user said "ignore errors, continue"
The task is a one-off that will not be repeated

Guardrails

Never blind-retry a logic error — retrying won't help, change the approach.
Always log every attempt — even successful recoveries need a record.
Cross-skill: integrates with checkpoint-guardian (risk assessment before retry), memory-ledger (logs errors and fixes), and agent-reviewer (retrospective analysis).
