
Resilience Patterns

Patterns for building systems that handle failures, degrade gracefully, and recover automatically.

When to Use This Skill

  • Implementing circuit breakers

  • Designing retry strategies

  • Isolating failures with bulkheads

  • Building fault-tolerant systems

  • Handling cascading failures

Why Resilience Matters

In distributed systems, failure is not exceptional—it's normal.

Networks fail. Services crash. Databases time out. The question isn't IF but WHEN.

Resilience = The ability to handle failures gracefully

Goals:

  • Prevent cascading failures
  • Degrade gracefully
  • Recover automatically
  • Maintain availability

Core Resilience Patterns

  1. Retry Pattern

What: Automatically retry failed operations
When: Transient failures (network blips, temporary unavailability)

Simple retry:

┌─────────┐     ┌─────────┐     ┌─────────┐
│ Request │────►│ Failure │────►│  Retry  │───► Success
└─────────┘     └─────────┘     └─────────┘

With backoff:

Request → Fail → Wait 100ms → Retry
          Fail → Wait 200ms → Retry
          Fail → Wait 400ms → Retry
          Fail → Give up

Backoff strategies:

  • Fixed: Wait same time each retry
  • Linear: 100ms, 200ms, 300ms...
  • Exponential: 100ms, 200ms, 400ms, 800ms...
  • Exponential + Jitter: Add randomness to prevent thundering herd
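
A minimal sketch of retry with exponential backoff and full jitter, using only the Python standard library (the function name and parameters are illustrative, not taken from any particular resilience library):

```python
import random
import time


def retry_with_backoff(operation, max_attempts=4, base_delay=0.1, max_delay=2.0,
                       retry_on=(ConnectionError, TimeoutError)):
    """Retry `operation` on transient failures with exponential backoff and full jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: give up and surface the last failure
            # Exponential backoff: 100 ms, 200 ms, 400 ms, ... capped at max_delay.
            delay = min(base_delay * 2 ** (attempt - 1), max_delay)
            # Full jitter: sleep a random fraction of the delay to avoid a thundering herd.
            time.sleep(random.uniform(0, delay))
```

Only the listed transient exceptions are retried; anything else (for example a validation error) fails immediately.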

Retry Best Practices

Do:

  • Add jitter to prevent thundering herd
  • Set maximum retry count
  • Use exponential backoff
  • Only retry transient failures
  • Log retries for visibility

Don't:

  • Retry non-idempotent operations blindly
  • Retry client errors (400s)
  • Retry indefinitely
  • Use same delay for all retries

  2. Circuit Breaker Pattern

What: Stop calling a failing service temporarily
When: Service is consistently failing

States:

  CLOSED ──(failures exceed threshold)──► OPEN
  OPEN ──(timeout elapses)──► HALF-OPEN
  HALF-OPEN ──(probe succeeds)──► CLOSED
  HALF-OPEN ──(probe fails)──► OPEN

CLOSED: Normal operation, requests flow through
OPEN: Failures exceeded threshold, fail fast
HALF-OPEN: Testing if service recovered

Circuit Breaker Configuration

Key parameters:

Failure threshold: How many failures before the circuit opens

  • Too low: Opens on minor issues
  • Too high: Doesn't protect enough
  • Typical: 5-10 failures or 50% error rate

Timeout (open duration): How long to stay open

  • Too short: May retry too quickly
  • Too long: Slow recovery
  • Typical: 30-60 seconds

Success threshold: Successes to close from half-open

  • Typically 1-3 successful requests
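
A toy circuit breaker showing how the three states and the parameters above fit together (a sketch for illustration, not a production implementation; all names are assumptions):

```python
import time


class CircuitBreaker:
    """Minimal breaker: CLOSED -> OPEN after N failures, OPEN -> HALF-OPEN after a timeout."""

    def __init__(self, failure_threshold=5, open_seconds=30.0, success_threshold=1):
        self.failure_threshold = failure_threshold
        self.open_seconds = open_seconds
        self.success_threshold = success_threshold
        self.state = "CLOSED"
        self.failures = 0
        self.successes = 0
        self.opened_at = 0.0

    def call(self, operation):
        if self.state == "OPEN":
            if time.monotonic() - self.opened_at >= self.open_seconds:
                self.state = "HALF_OPEN"      # timeout elapsed: probe the dependency again
                self.successes = 0
            else:
                raise RuntimeError("circuit open: failing fast")
        try:
            result = operation()
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_failure(self):
        self.failures += 1
        if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
            self.state = "OPEN"               # trip (or re-trip) the breaker
            self.opened_at = time.monotonic()
            self.failures = 0

    def _on_success(self):
        if self.state == "HALF_OPEN":
            self.successes += 1
            if self.successes >= self.success_threshold:
                self.state = "CLOSED"         # dependency looks healthy again
                self.failures = 0
        else:
            self.failures = 0                 # a success resets the failure count
```

Each dependency gets its own breaker instance, wrapped around every call to that dependency.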

  3. Bulkhead Pattern

What: Isolate components to contain failures
When: Prevent one failure from taking down everything

Ship analogy:

┌─────────────────────────────────────────────┐
│ Ship without bulkheads                      │
│ ┌─────────────────────────────────────────┐ │
│ │     One hole → Entire ship floods       │ │
│ └─────────────────────────────────────────┘ │
└─────────────────────────────────────────────┘

┌─────────────────────────────────────────────┐
│ Ship with bulkheads                         │
│  ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐        │
│  │      │ │  X   │ │      │ │      │        │
│  │  OK  │ │Flood │ │  OK  │ │  OK  │        │
│  └──────┘ └──────┘ └──────┘ └──────┘        │
│  One compartment floods, others stay dry    │
└─────────────────────────────────────────────┘

Bulkhead Implementation

Thread pool isolation:

┌────────────────────────────────────────────────────────┐
│                       Application                      │
│                                                        │
│   ┌─────────────────┐         ┌─────────────────┐      │
│   │ Service A Pool  │         │ Service B Pool  │      │
│   │  [10 threads]   │         │  [10 threads]   │      │
│   └────────┬────────┘         └────────┬────────┘      │
│            │                           │               │
│            ▼                           ▼               │
│        Service A                   Service B           │
│         (slow)                     (healthy)           │
└────────────────────────────────────────────────────────┘

Service A being slow does not exhaust the threads available to Service B.

Semaphore isolation:

  • Limit concurrent requests per dependency
  • Lighter weight than thread pools
  • Good for async operations
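
A minimal semaphore bulkhead for async code (Python asyncio; the limits and class name are illustrative):

```python
import asyncio


class SemaphoreBulkhead:
    """Cap concurrent calls to one dependency so it cannot exhaust shared resources."""

    def __init__(self, max_concurrent=10, max_pending=20):
        self._semaphore = asyncio.Semaphore(max_concurrent)
        self._max_pending = max_pending
        self._pending = 0   # calls currently running or waiting for a slot

    async def run(self, coro_factory):
        if self._pending >= self._max_pending:
            # Bulkhead is saturated: reject immediately rather than queueing without bound.
            raise RuntimeError("bulkhead rejected the call")
        self._pending += 1
        try:
            async with self._semaphore:
                return await coro_factory()
        finally:
            self._pending -= 1
```

Give each dependency its own bulkhead instance so a slow Service A cannot starve calls to Service B.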

  4. Timeout Pattern

What: Limit how long to wait for operations
When: Always (every external call needs a timeout)

Without timeout: Request → Service hangs → Caller waits forever → Resources exhausted

With timeout: Request → Service hangs → Timeout after 5s → Caller handles failure

Timeout types:

  • Connection timeout: Time to establish connection
  • Read timeout: Time to receive response
  • Overall timeout: Total time allowed

Timeout Best Practices

Setting timeouts:

  • Connection: 1-5 seconds (fast fail)
  • Read: Based on p99 latency + buffer
  • Overall: Sum of connection + read + processing

Example:
  Connection timeout: 2s
  Read timeout: 10s (p99 is 5s, 2x buffer)
  Overall timeout: 15s

Cascade consideration: If A calls B calls C:

  • C timeout < B timeout < A timeout
  • Each layer has buffer for retries
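
A sketch of nested timeout budgets with asyncio, following the C < B < A rule above (the service names and numbers are illustrative):

```python
import asyncio


async def fetch_from_c():
    await asyncio.sleep(0.1)                      # stand-in for a real network call
    return "payload"


async def call_service_c():
    # Innermost call gets the tightest budget.
    return await asyncio.wait_for(fetch_from_c(), timeout=2.0)


async def call_service_b():
    # B's budget exceeds C's, leaving headroom for a retry plus processing.
    return await asyncio.wait_for(call_service_c(), timeout=5.0)


async def handle_request():
    # Outermost budget is the largest; the caller never waits longer than this.
    try:
        return await asyncio.wait_for(call_service_b(), timeout=10.0)
    except asyncio.TimeoutError:
        return "temporarily unavailable"          # hand off to a fallback


# asyncio.run(handle_request())
```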

  5. Fallback Pattern

What: Provide alternative when primary fails
When: There's a degraded but acceptable alternative

Fallback options when the primary fails:

  1. Cached data: Return stale but valid data
  2. Default value: Return safe default
  3. Degraded service: Reduced functionality
  4. Alternative service: Different provider
  5. Graceful error: Friendly error message

Example:
  Primary:    Real-time price service
  Fallback 1: Cached price (< 5 min old)
  Fallback 2: Last known price with warning
  Fallback 3: "Price temporarily unavailable"
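
A sketch of that fallback chain in Python (the cache and the always-failing primary are illustrative stand-ins):

```python
import time

_price_cache = {}   # symbol -> (price, fetched_at); stand-in for a real cache


def fetch_realtime_price(symbol):
    raise ConnectionError("primary price service unreachable")   # simulate an outage


def get_price(symbol):
    try:
        price = fetch_realtime_price(symbol)                     # primary
        _price_cache[symbol] = (price, time.time())
        return {"price": price, "source": "realtime"}
    except ConnectionError:
        cached = _price_cache.get(symbol)
        if cached and time.time() - cached[1] < 300:             # fallback 1: cache < 5 min old
            return {"price": cached[0], "source": "cache"}
        if cached:                                               # fallback 2: stale, with warning
            return {"price": cached[0], "source": "stale", "warning": "last known price"}
        return {"error": "Price temporarily unavailable"}        # fallback 3: graceful message
```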

  6. Rate Limiting Pattern

What: Control the rate of requests
When: Protect services from overload

Client-side rate limiting:

  • Limit outgoing requests
  • Prevent overwhelming dependencies

Server-side rate limiting:

  • Limit incoming requests
  • Protect from traffic spikes

See: rate-limiting-patterns skill for details

Pattern Combinations

Typical Resilience Stack

Request Flow:

Timeout (overall request timeout)
└── Retry (with exponential backoff)
    └── Circuit Breaker (fail fast if service down)
        └── Bulkhead (isolate from other calls)
            └── Service call
                └── Failure? ──► Fallback

Order of Application

Outer to inner:

  1. Timeout: Overall time limit
  2. Retry: Attempt recovery from transient failures
  3. Circuit Breaker: Stop calling failed services
  4. Bulkhead: Isolate this call from others
  5. [Call service]
  6. Fallback: Handle failures gracefully
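
A compact sketch of this nesting in Python asyncio (the circuit breaker layer is omitted for brevity and every name here is illustrative; libraries such as Polly or Resilience4j provide these building blocks ready-made):

```python
import asyncio
import random


async def unreliable_service():
    await asyncio.sleep(0.05)                    # stand-in for the real dependency call
    if random.random() < 0.3:
        raise ConnectionError("transient failure")
    return "ok"


async def with_retry(op, attempts=3):
    for attempt in range(attempts):
        try:
            return await op()
        except ConnectionError:
            if attempt == attempts - 1:
                raise
            await asyncio.sleep(0.1 * 2 ** attempt)   # exponential backoff


async def handle_request(bulkhead):
    async def guarded():
        async with bulkhead:                     # 4. bulkhead: cap concurrency for this dependency
            return await unreliable_service()    # 5. the actual call

    try:
        # 1. overall timeout wraps 2. retry, which wraps the guarded call;
        #    a circuit breaker (3.) would sit between retry and the bulkhead.
        return await asyncio.wait_for(with_retry(guarded), timeout=2.0)
    except Exception:
        return "fallback response"               # 6. fallback when every layer has failed


# asyncio.run(handle_request(asyncio.Semaphore(10)))
```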

Load Shedding

What is Load Shedding?

When system is overloaded:

  • Accept what you can handle
  • Reject the rest gracefully
  • Better to serve some users well than all users poorly

Priority-based shedding:

  • High priority: Never shed
  • Medium: Shed under moderate load
  • Low: Shed first

Implementation

Approaches:

  1. Queue-based:

    • Fixed-size queue
    • Reject when queue full
    • Serve based on priority
  2. Rate-based:

    • Maximum requests per second
    • Reject when exceeded
    • Return 503 or 429
  3. Adaptive:

    • Monitor latency/error rate
    • Reduce acceptance as stress increases
    • Recover as system stabilizes
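
A sketch of rate-based shedding (Python; the one-second window and the limit are illustrative; the caller turns a rejection into a 429 or 503):

```python
import threading
import time


class LoadShedder:
    """Accept at most `max_per_second` requests; callers turn rejections into 429/503."""

    def __init__(self, max_per_second=100):
        self.max_per_second = max_per_second
        self._lock = threading.Lock()
        self._window_start = time.monotonic()
        self._count = 0

    def try_accept(self):
        with self._lock:
            now = time.monotonic()
            if now - self._window_start >= 1.0:
                self._window_start = now     # start a new one-second window
                self._count = 0
            if self._count >= self.max_per_second:
                return False                 # shed this request
            self._count += 1
            return True


# shedder = LoadShedder(max_per_second=100)
# if not shedder.try_accept():
#     return HTTP 429 or 503, ideally with a Retry-After header
```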

Graceful Degradation

Levels of Degradation

Level 0: Full functionality
└── Everything works normally

Level 1: Non-essential features disabled
└── Recommendations off, analytics delayed

Level 2: Reduced functionality
└── Read-only mode, cached data only

Level 3: Minimal functionality
└── Core features only, no personalization

Level 4: Maintenance mode
└── Static page, "be back soon"

Transition: Automatic based on system health metrics or manual via feature flags
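
A sketch of mapping health metrics to a degradation level and gating features on it (the thresholds and feature names are illustrative assumptions, not recommendations):

```python
def degradation_level(p99_latency_ms, error_rate):
    """Pick a level from the list above; real thresholds would come from SLOs."""
    if error_rate > 0.50 or p99_latency_ms > 5000:
        return 3        # minimal functionality: core features only
    if error_rate > 0.20 or p99_latency_ms > 2000:
        return 2        # reduced functionality: read-only, cached data
    if error_rate > 0.05 or p99_latency_ms > 1000:
        return 1        # non-essential features disabled
    return 0            # full functionality


def feature_enabled(feature, level):
    # Level at which each feature is switched off; unknown features stay on until level 4.
    disabled_at = {"recommendations": 1, "live_chat": 1,
                   "realtime_inventory": 2, "personalization": 3}
    return level < disabled_at.get(feature, 4)
```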

Feature Degradation Examples

E-commerce during high load:

Full feature:

  • Real-time inventory
  • Personalized recommendations
  • Live chat support
  • Detailed analytics

Degraded:

  • Cached inventory (5 min delay)
  • Generic recommendations
  • Contact form only
  • Analytics queued

Minimal:

  • Static "in stock" status
  • No recommendations
  • Email support only
  • Analytics dropped

Health Checks

Types of Health Checks

  1. Liveness check: "Is the process alive?"

    • Simple ping
    • Returns 200 if running
    • Used for restart decisions
  2. Readiness check: "Can it handle traffic?"

    • Checks dependencies
    • Returns 200 if ready
    • Used for load balancer
  3. Deep health check: "Is everything working?"

    • Comprehensive checks
    • May be slower
    • Used for monitoring/debugging
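
A sketch of separate liveness and readiness endpoints using only the Python standard library (the paths, port, and dependency check are illustrative):

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def database_reachable():
    return True   # stand-in for a real dependency check with its own short timeout


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/healthz":                 # liveness: process is up, nothing else
            self._reply(200, {"status": "alive"})
        elif self.path == "/readyz":                # readiness: dependencies are reachable
            ready = database_reachable()
            self._reply(200 if ready else 503,
                        {"status": "ready" if ready else "not ready"})
        else:
            self._reply(404, {"error": "not found"})

    def _reply(self, code, body):
        payload = json.dumps(body).encode()
        self.send_response(code)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(payload)


# HTTPServer(("", 8080), HealthHandler).serve_forever()
```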

Health Check Best Practices

Do:

  • Keep liveness checks simple and fast
  • Check all critical dependencies in readiness
  • Include version/build info in response
  • Return appropriate status codes

Don't:

  • Block liveness on dependencies
  • Include heavy operations in health checks
  • Expose sensitive information
  • Forget to handle dependency timeouts

Testing Resilience

How to Test

  1. Unit tests:

    • Test each pattern in isolation
    • Mock failures
    • Verify behavior
  2. Integration tests:

    • Test pattern combinations
    • Inject failures
    • Verify recovery
  3. Chaos engineering:

    • Test in production-like environment
    • Random failures
    • Verify system behavior

See: chaos-engineering-fundamentals skill
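
A sketch of a unit test that mocks two transient failures followed by a success (Python unittest; the inline retry helper stands in for whatever implementation the project actually uses):

```python
import unittest
from unittest.mock import Mock


def retry(operation, attempts=3):
    """Tiny inline retry helper for the test; real code would import the project's version."""
    for attempt in range(attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == attempts - 1:
                raise


class RetryTest(unittest.TestCase):
    def test_recovers_after_transient_failures(self):
        op = Mock(side_effect=[ConnectionError(), ConnectionError(), "ok"])
        self.assertEqual(retry(op), "ok")
        self.assertEqual(op.call_count, 3)

    def test_gives_up_after_max_attempts(self):
        op = Mock(side_effect=ConnectionError())
        with self.assertRaises(ConnectionError):
            retry(op)
        self.assertEqual(op.call_count, 3)


if __name__ == "__main__":
    unittest.main()
```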

Implementation Considerations

Library vs Custom

Libraries (recommended):

  • Polly (.NET)
  • Resilience4j (Java)
  • Hystrix (Java, deprecated)
  • go-resilience (Go)

Benefits:

  • Battle-tested
  • Well-documented
  • Community support
  • Metrics built-in

Custom implementation:

  • Only when specific needs
  • High maintenance burden
  • Risk of subtle bugs

Monitoring Resilience

Metrics to track:

Circuit Breaker:

  • State changes
  • Open duration
  • Failure rate

Retries:

  • Retry count
  • Retry success rate
  • Final success/failure

Bulkhead:

  • Concurrent calls
  • Rejections
  • Queue depth

Timeouts:

  • Timeout count
  • Latency distribution

Best Practices

  1. Every external call needs a timeout. No call should wait forever.

  2. Retry only transient failures. Don't retry 400 errors.

  3. Circuit breaker per dependency. Different services need different protection.

  4. Bulkhead critical paths. Isolate important paths from less important ones.

  5. Plan fallbacks. Know what to do when things fail.

  6. Monitor everything. You can't fix what you can't see.

  7. Test failure paths. Happy-path tests aren't enough.

Related Skills

  • chaos-engineering-fundamentals: Testing resilience
  • distributed-transactions: Handling failures in transactions
  • rate-limiting-patterns: Controlling request rates
