
Resilience Patterns

Patterns for building systems that handle failures, degrade gracefully, and recover automatically.

When to Use This Skill

  • Implementing circuit breakers

  • Designing retry strategies

  • Isolating failures with bulkheads

  • Building fault-tolerant systems

  • Handling cascading failures

Why Resilience Matters

In distributed systems, failure is not exceptional—it's normal.

Networks fail. Services crash. Databases time out. The question isn't IF but WHEN.

Resilience = The ability to handle failures gracefully

Goals:

  • Prevent cascading failures
  • Degrade gracefully
  • Recover automatically
  • Maintain availability

Core Resilience Patterns

  1. Retry Pattern

What: Automatically retry failed operations
When: Transient failures (network blips, temporary unavailability)

Simple retry:

┌─────────┐     ┌─────────┐     ┌─────────┐
│ Request │────►│ Failure │────►│  Retry  │───► Success
└─────────┘     └─────────┘     └─────────┘

With backoff:

Request → Fail → Wait 100ms → Retry
          Fail → Wait 200ms → Retry
          Fail → Wait 400ms → Retry
          Fail → Give up

Backoff strategies:

  • Fixed: Wait same time each retry
  • Linear: 100ms, 200ms, 300ms...
  • Exponential: 100ms, 200ms, 400ms, 800ms...
  • Exponential + Jitter: Add randomness to prevent thundering herd
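
A minimal sketch of retry with exponential backoff and full jitter, using only the Python standard library (the function name and parameters are illustrative, not taken from any particular resilience library):

```python
import random
import time


def retry_with_backoff(operation, max_attempts=4, base_delay=0.1, max_delay=2.0,
                       retry_on=(ConnectionError, TimeoutError)):
    """Retry `operation` on transient failures with exponential backoff and full jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: give up and surface the last failure
            # Exponential backoff: 100 ms, 200 ms, 400 ms, ... capped at max_delay.
            delay = min(base_delay * 2 ** (attempt - 1), max_delay)
            # Full jitter: sleep a random fraction of the delay to avoid a thundering herd.
            time.sleep(random.uniform(0, delay))
```

Only the listed transient exceptions are retried; anything else (for example a validation error) fails immediately.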

Retry Best Practices

Do:

  • Add jitter to prevent thundering herd
  • Set maximum retry count
  • Use exponential backoff
  • Only retry transient failures
  • Log retries for visibility

Don't:

  • Retry non-idempotent operations blindly
  • Retry client errors (400s)
  • Retry indefinitely
  • Use same delay for all retries

  2. Circuit Breaker Pattern

What: Stop calling a failing service temporarily
When: Service is consistently failing

States:

  CLOSED ──(failures exceed threshold)──► OPEN
  OPEN ──(timeout elapses)──► HALF-OPEN
  HALF-OPEN ──(probe succeeds)──► CLOSED
  HALF-OPEN ──(probe fails)──► OPEN

CLOSED: Normal operation, requests flow through
OPEN: Failures exceeded threshold, fail fast
HALF-OPEN: Testing if service recovered

Circuit Breaker Configuration

Key parameters:

Failure threshold: How many failures before the circuit opens

  • Too low: Opens on minor issues
  • Too high: Doesn't protect enough
  • Typical: 5-10 failures or 50% error rate

Timeout (open duration): How long to stay open

  • Too short: May retry too quickly
  • Too long: Slow recovery
  • Typical: 30-60 seconds

Success threshold: Successes to close from half-open

  • Typically 1-3 successful requests
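
A toy circuit breaker showing how the three states and the parameters above fit together (a sketch for illustration, not a production implementation; all names are assumptions):

```python
import time


class CircuitBreaker:
    """Minimal breaker: CLOSED -> OPEN after N failures, OPEN -> HALF-OPEN after a timeout."""

    def __init__(self, failure_threshold=5, open_seconds=30.0, success_threshold=1):
        self.failure_threshold = failure_threshold
        self.open_seconds = open_seconds
        self.success_threshold = success_threshold
        self.state = "CLOSED"
        self.failures = 0
        self.successes = 0
        self.opened_at = 0.0

    def call(self, operation):
        if self.state == "OPEN":
            if time.monotonic() - self.opened_at >= self.open_seconds:
                self.state = "HALF_OPEN"      # timeout elapsed: probe the dependency again
                self.successes = 0
            else:
                raise RuntimeError("circuit open: failing fast")
        try:
            result = operation()
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_failure(self):
        self.failures += 1
        if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
            self.state = "OPEN"               # trip (or re-trip) the breaker
            self.opened_at = time.monotonic()
            self.failures = 0

    def _on_success(self):
        if self.state == "HALF_OPEN":
            self.successes += 1
            if self.successes >= self.success_threshold:
                self.state = "CLOSED"         # dependency looks healthy again
                self.failures = 0
        else:
            self.failures = 0                 # a success resets the failure count
```

Each dependency gets its own breaker instance, wrapped around every call to that dependency.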

  3. Bulkhead Pattern

What: Isolate components to contain failures
When: Prevent one failure from taking down everything

Ship analogy:

┌─────────────────────────────────────────────┐
│ Ship without bulkheads                      │
│ ┌─────────────────────────────────────────┐ │
│ │     One hole → Entire ship floods       │ │
│ └─────────────────────────────────────────┘ │
└─────────────────────────────────────────────┘

┌─────────────────────────────────────────────┐
│ Ship with bulkheads                         │
│  ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐        │
│  │      │ │  X   │ │      │ │      │        │
│  │  OK  │ │Flood │ │  OK  │ │  OK  │        │
│  └──────┘ └──────┘ └──────┘ └──────┘        │
│  One compartment floods, others stay dry    │
└─────────────────────────────────────────────┘

Bulkhead Implementation

Thread pool isolation:

┌────────────────────────────────────────────────────────┐
│                       Application                      │
│                                                        │
│   ┌─────────────────┐         ┌─────────────────┐      │
│   │ Service A Pool  │         │ Service B Pool  │      │
│   │  [10 threads]   │         │  [10 threads]   │      │
│   └────────┬────────┘         └────────┬────────┘      │
│            │                           │               │
│            ▼                           ▼               │
│        Service A                   Service B           │
│         (slow)                     (healthy)           │
└────────────────────────────────────────────────────────┘

Service A being slow does not exhaust the threads available to Service B.

Semaphore isolation:

  • Limit concurrent requests per dependency
  • Lighter weight than thread pools
  • Good for async operations
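
A minimal semaphore bulkhead for async code (Python asyncio; the limits and class name are illustrative):

```python
import asyncio


class SemaphoreBulkhead:
    """Cap concurrent calls to one dependency so it cannot exhaust shared resources."""

    def __init__(self, max_concurrent=10, max_pending=20):
        self._semaphore = asyncio.Semaphore(max_concurrent)
        self._max_pending = max_pending
        self._pending = 0   # calls currently running or waiting for a slot

    async def run(self, coro_factory):
        if self._pending >= self._max_pending:
            # Bulkhead is saturated: reject immediately rather than queueing without bound.
            raise RuntimeError("bulkhead rejected the call")
        self._pending += 1
        try:
            async with self._semaphore:
                return await coro_factory()
        finally:
            self._pending -= 1
```

Give each dependency its own bulkhead instance so a slow Service A cannot starve calls to Service B.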

  4. Timeout Pattern

What: Limit how long to wait for operations
When: Always (every external call needs a timeout)

Without timeout: Request → Service hangs → Caller waits forever → Resources exhausted

With timeout: Request → Service hangs → Timeout after 5s → Caller handles failure

Timeout types:

  • Connection timeout: Time to establish connection
  • Read timeout: Time to receive response
  • Overall timeout: Total time allowed

Timeout Best Practices

Setting timeouts:

  • Connection: 1-5 seconds (fast fail)
  • Read: Based on p99 latency + buffer
  • Overall: Sum of connection + read + processing

Example:
  Connection timeout: 2s
  Read timeout: 10s (p99 is 5s, 2x buffer)
  Overall timeout: 15s

Cascade consideration: If A calls B calls C:

  • C timeout < B timeout < A timeout
  • Each layer has buffer for retries
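
A sketch of nested timeout budgets with asyncio, following the C < B < A rule above (the service names and numbers are illustrative):

```python
import asyncio


async def fetch_from_c():
    await asyncio.sleep(0.1)                      # stand-in for a real network call
    return "payload"


async def call_service_c():
    # Innermost call gets the tightest budget.
    return await asyncio.wait_for(fetch_from_c(), timeout=2.0)


async def call_service_b():
    # B's budget exceeds C's, leaving headroom for a retry plus processing.
    return await asyncio.wait_for(call_service_c(), timeout=5.0)


async def handle_request():
    # Outermost budget is the largest; the caller never waits longer than this.
    try:
        return await asyncio.wait_for(call_service_b(), timeout=10.0)
    except asyncio.TimeoutError:
        return "temporarily unavailable"          # hand off to a fallback


# asyncio.run(handle_request())
```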

  5. Fallback Pattern

What: Provide alternative when primary fails
When: There's a degraded but acceptable alternative

Fallback options when the primary fails:

  1. Cached data: Return stale but valid data
  2. Default value: Return safe default
  3. Degraded service: Reduced functionality
  4. Alternative service: Different provider
  5. Graceful error: Friendly error message

Example:
  Primary:    Real-time price service
  Fallback 1: Cached price (< 5 min old)
  Fallback 2: Last known price with warning
  Fallback 3: "Price temporarily unavailable"
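
A sketch of that fallback chain in Python (the cache and the always-failing primary are illustrative stand-ins):

```python
import time

_price_cache = {}   # symbol -> (price, fetched_at); stand-in for a real cache


def fetch_realtime_price(symbol):
    raise ConnectionError("primary price service unreachable")   # simulate an outage


def get_price(symbol):
    try:
        price = fetch_realtime_price(symbol)                     # primary
        _price_cache[symbol] = (price, time.time())
        return {"price": price, "source": "realtime"}
    except ConnectionError:
        cached = _price_cache.get(symbol)
        if cached and time.time() - cached[1] < 300:             # fallback 1: cache < 5 min old
            return {"price": cached[0], "source": "cache"}
        if cached:                                               # fallback 2: stale, with warning
            return {"price": cached[0], "source": "stale", "warning": "last known price"}
        return {"error": "Price temporarily unavailable"}        # fallback 3: graceful message
```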

  6. Rate Limiting Pattern

What: Control the rate of requests
When: Protect services from overload

Client-side rate limiting:

  • Limit outgoing requests
  • Prevent overwhelming dependencies

Server-side rate limiting:

  • Limit incoming requests
  • Protect from traffic spikes

See: rate-limiting-patterns skill for details

Pattern Combinations

Typical Resilience Stack

Request Flow:

Timeout (overall request timeout)
└── Retry (with exponential backoff)
    └── Circuit Breaker (fail fast if service down)
        └── Bulkhead (isolate from other calls)
            └── Service call
                └── Failure? ──► Fallback

Order of Application

Outer to inner:

  1. Timeout: Overall time limit
  2. Retry: Attempt recovery from transient failures
  3. Circuit Breaker: Stop calling failed services
  4. Bulkhead: Isolate this call from others
  5. [Call service]
  6. Fallback: Handle failures gracefully
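
A compact sketch of this nesting in Python asyncio (the circuit breaker layer is omitted for brevity and every name here is illustrative; libraries such as Polly or Resilience4j provide these building blocks ready-made):

```python
import asyncio
import random


async def unreliable_service():
    await asyncio.sleep(0.05)                    # stand-in for the real dependency call
    if random.random() < 0.3:
        raise ConnectionError("transient failure")
    return "ok"


async def with_retry(op, attempts=3):
    for attempt in range(attempts):
        try:
            return await op()
        except ConnectionError:
            if attempt == attempts - 1:
                raise
            await asyncio.sleep(0.1 * 2 ** attempt)   # exponential backoff


async def handle_request(bulkhead):
    async def guarded():
        async with bulkhead:                     # 4. bulkhead: cap concurrency for this dependency
            return await unreliable_service()    # 5. the actual call

    try:
        # 1. overall timeout wraps 2. retry, which wraps the guarded call;
        #    a circuit breaker (3.) would sit between retry and the bulkhead.
        return await asyncio.wait_for(with_retry(guarded), timeout=2.0)
    except Exception:
        return "fallback response"               # 6. fallback when every layer has failed


# asyncio.run(handle_request(asyncio.Semaphore(10)))
```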

Load Shedding

What is Load Shedding?

When system is overloaded:

  • Accept what you can handle
  • Reject the rest gracefully
  • Better to serve some users well than all users poorly

Priority-based shedding:

  • High priority: Never shed
  • Medium: Shed under moderate load
  • Low: Shed first

Implementation

Approaches:

  1. Queue-based:

    • Fixed-size queue
    • Reject when queue full
    • Serve based on priority
  2. Rate-based:

    • Maximum requests per second
    • Reject when exceeded
    • Return 503 or 429
  3. Adaptive:

    • Monitor latency/error rate
    • Reduce acceptance as stress increases
    • Recover as system stabilizes
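
A sketch of rate-based shedding (Python; the one-second window and the limit are illustrative; the caller turns a rejection into a 429 or 503):

```python
import threading
import time


class LoadShedder:
    """Accept at most `max_per_second` requests; callers turn rejections into 429/503."""

    def __init__(self, max_per_second=100):
        self.max_per_second = max_per_second
        self._lock = threading.Lock()
        self._window_start = time.monotonic()
        self._count = 0

    def try_accept(self):
        with self._lock:
            now = time.monotonic()
            if now - self._window_start >= 1.0:
                self._window_start = now     # start a new one-second window
                self._count = 0
            if self._count >= self.max_per_second:
                return False                 # shed this request
            self._count += 1
            return True


# shedder = LoadShedder(max_per_second=100)
# if not shedder.try_accept():
#     return HTTP 429 or 503, ideally with a Retry-After header
```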

Graceful Degradation

Levels of Degradation

Level 0: Full functionality
└── Everything works normally

Level 1: Non-essential features disabled
└── Recommendations off, analytics delayed

Level 2: Reduced functionality
└── Read-only mode, cached data only

Level 3: Minimal functionality
└── Core features only, no personalization

Level 4: Maintenance mode
└── Static page, "be back soon"

Transition: Automatic based on system health metrics or manual via feature flags
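
A sketch of mapping health metrics to a degradation level and gating features on it (the thresholds and feature names are illustrative assumptions, not recommendations):

```python
def degradation_level(p99_latency_ms, error_rate):
    """Pick a level from the list above; real thresholds would come from SLOs."""
    if error_rate > 0.50 or p99_latency_ms > 5000:
        return 3        # minimal functionality: core features only
    if error_rate > 0.20 or p99_latency_ms > 2000:
        return 2        # reduced functionality: read-only, cached data
    if error_rate > 0.05 or p99_latency_ms > 1000:
        return 1        # non-essential features disabled
    return 0            # full functionality


def feature_enabled(feature, level):
    # Level at which each feature is switched off; unknown features stay on until level 4.
    disabled_at = {"recommendations": 1, "live_chat": 1,
                   "realtime_inventory": 2, "personalization": 3}
    return level < disabled_at.get(feature, 4)
```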

Feature Degradation Examples

E-commerce during high load:

Full feature:

  • Real-time inventory
  • Personalized recommendations
  • Live chat support
  • Detailed analytics

Degraded:

  • Cached inventory (5 min delay)
  • Generic recommendations
  • Contact form only
  • Analytics queued

Minimal:

  • Static "in stock" status
  • No recommendations
  • Email support only
  • Analytics dropped

Health Checks

Types of Health Checks

  1. Liveness check: "Is the process alive?"

    • Simple ping
    • Returns 200 if running
    • Used for restart decisions
  2. Readiness check: "Can it handle traffic?"

    • Checks dependencies
    • Returns 200 if ready
    • Used for load balancer
  3. Deep health check: "Is everything working?"

    • Comprehensive checks
    • May be slower
    • Used for monitoring/debugging
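
A sketch of separate liveness and readiness endpoints using only the Python standard library (the paths, port, and dependency check are illustrative):

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def database_reachable():
    return True   # stand-in for a real dependency check with its own short timeout


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/healthz":                 # liveness: process is up, nothing else
            self._reply(200, {"status": "alive"})
        elif self.path == "/readyz":                # readiness: dependencies are reachable
            ready = database_reachable()
            self._reply(200 if ready else 503,
                        {"status": "ready" if ready else "not ready"})
        else:
            self._reply(404, {"error": "not found"})

    def _reply(self, code, body):
        payload = json.dumps(body).encode()
        self.send_response(code)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(payload)


# HTTPServer(("", 8080), HealthHandler).serve_forever()
```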

Health Check Best Practices

Do:

  • Keep liveness checks simple and fast
  • Check all critical dependencies in readiness
  • Include version/build info in response
  • Return appropriate status codes

Don't:

  • Block liveness on dependencies
  • Include heavy operations in health checks
  • Expose sensitive information
  • Forget to handle dependency timeouts

Testing Resilience

How to Test

  1. Unit tests:

    • Test each pattern in isolation
    • Mock failures
    • Verify behavior
  2. Integration tests:

    • Test pattern combinations
    • Inject failures
    • Verify recovery
  3. Chaos engineering:

    • Test in production-like environment
    • Random failures
    • Verify system behavior

See: chaos-engineering-fundamentals skill
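
A sketch of a unit test that mocks two transient failures followed by a success (Python unittest; the inline retry helper stands in for whatever implementation the project actually uses):

```python
import unittest
from unittest.mock import Mock


def retry(operation, attempts=3):
    """Tiny inline retry helper for the test; real code would import the project's version."""
    for attempt in range(attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == attempts - 1:
                raise


class RetryTest(unittest.TestCase):
    def test_recovers_after_transient_failures(self):
        op = Mock(side_effect=[ConnectionError(), ConnectionError(), "ok"])
        self.assertEqual(retry(op), "ok")
        self.assertEqual(op.call_count, 3)

    def test_gives_up_after_max_attempts(self):
        op = Mock(side_effect=ConnectionError())
        with self.assertRaises(ConnectionError):
            retry(op)
        self.assertEqual(op.call_count, 3)


if __name__ == "__main__":
    unittest.main()
```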

Implementation Considerations

Library vs Custom

Libraries (recommended):

  • Polly (.NET)
  • Resilience4j (Java)
  • Hystrix (Java, deprecated)
  • go-resilience (Go)

Benefits:

  • Battle-tested
  • Well-documented
  • Community support
  • Metrics built-in

Custom implementation:

  • Only when specific needs
  • High maintenance burden
  • Risk of subtle bugs

Monitoring Resilience

Metrics to track:

Circuit Breaker:

  • State changes
  • Open duration
  • Failure rate

Retries:

  • Retry count
  • Retry success rate
  • Final success/failure

Bulkhead:

  • Concurrent calls
  • Rejections
  • Queue depth

Timeouts:

  • Timeout count
  • Latency distribution

Best Practices

  1. Every external call needs a timeout. No call should wait forever.

  2. Retry only transient failures. Don't retry 400 errors.

  3. Circuit breaker per dependency. Different services need different protection.

  4. Bulkhead critical paths. Isolate important paths from less important ones.

  5. Plan fallbacks. Know what to do when things fail.

  6. Monitor everything. You can't fix what you can't see.

  7. Test failure paths. Happy-path tests aren't enough.

Related Skills

  • chaos-engineering-fundamentals: Testing resilience
  • distributed-transactions: Handling failures in transactions
  • rate-limiting-patterns: Controlling request rates
