error-recovery

Error Handling

When to Use

Implementing input validation at system boundaries
Designing error responses for APIs or user interfaces
Building recovery mechanisms for transient failures
Establishing logging and monitoring patterns
Distinguishing between errors that need user action vs system intervention

Philosophy

Errors are not exceptional - they are expected. Good error handling treats errors as first-class citizens of the system design, not afterthoughts. The goal is to fail safely, provide actionable feedback, and enable recovery.

Error Classification

Operational Errors

Runtime problems that occur during normal operation. These are expected and must be handled.

Characteristics:

External system failures (network, database, filesystem)
Invalid user input
Resource exhaustion (memory, disk, connections)
Timeout conditions
Rate limiting

Response: Handle gracefully, log appropriately, provide user feedback, implement recovery.

Programmer Errors

Bugs in the code that should not happen if the code is correct.

Characteristics:

Type errors caught at runtime
Null/undefined access on required values
Failed assertions on invariants
Invalid internal state

Response: Fail fast, log full context, alert developers. Do not attempt recovery - fix the bug.

Core Patterns

Input Validation

Validate early, validate completely, provide specific feedback.

PATTERN: Fail-Fast Validation

Validate at system boundaries (API entry, user input, file reads)
Check all constraints before processing
Return ALL validation errors, not just the first one
Include field name, actual value (if safe), and expected format
Never trust data from external sources

Validation checklist:

Required fields present
Types correct
Values within allowed ranges
Formats match expectations (email, URL, date)
Business rules satisfied

Error Messages

Different audiences need different information.

User-facing errors:

Clear action the user can take
No technical jargon or stack traces
Consistent tone and format
Localization-ready

Internal/logged errors:

Full technical context
Request/correlation IDs
Timestamp and service identifier
Stack trace for programmer errors
Sanitized sensitive data

Recovery Strategies

Retry with backoff:

Transient failures (network timeouts, rate limits)
Exponential backoff with jitter
Maximum retry count with circuit breaker

Fallback:

Degraded functionality over complete failure
Cached data when live data unavailable
Default values when configuration missing

Compensation:

Undo partial operations on failure
Maintain consistency in distributed operations
Saga pattern for multi-step processes

Logging Levels

Level Use For

ERROR Operational errors requiring attention

WARN Recoverable issues, degraded performance

INFO Significant state changes, request lifecycle

DEBUG Detailed flow for troubleshooting

What to log:

Correlation/request ID
User context (sanitized)
Operation being attempted
Error type and message
Duration and timing

What NOT to log:

Passwords, tokens, secrets
Full credit card numbers
Personal identifiable information (PII)
Raw request/response bodies containing sensitive data

Best Practices

Fail fast on programmer errors - do not mask bugs
Handle operational errors gracefully with recovery options
Provide correlation IDs for tracing requests across services
Use structured logging (JSON) for machine parseability
Centralize error handling logic - avoid scattered try/catch blocks
Test error paths as rigorously as success paths
Monitor error rates and set alerts for anomalies

Anti-Patterns

Catching all exceptions silently (catch {} )
Logging sensitive data in error messages
Returning generic "Something went wrong" without context
Retrying non-idempotent operations without safeguards
Mixing validation errors with system errors in responses
Treating all errors the same regardless of recoverability

References

examples/error-patterns.md - Concrete examples across languages

error-recovery

Safety Notice

Copy this and send it to your AI assistant to learn

Source Transparency

Related Skills

specify-requirements

review

analyze