Retry Policy Designer
Design retry policies that actually work in production — not the naive "retry 3 times" that causes cascading failures. Analyze failure modes, calculate backoff curves, integrate circuit breakers, set timeout budgets, and generate production-ready retry configurations.
Use when: "design retry policy", "how should we handle retries", "exponential backoff", "retry strategy", "timeout design", "cascading failure prevention", "how many retries", or when adding resilience to service-to-service communication.
Commands
1. design — Create Retry Policy for a Service
Step 1: Classify the Operation
Ask about the operation to design retries for:
| Property | Options | Impact on retry design |
|---|---|---|
| Idempotent? | Yes/No | Non-idempotent = at most 0-1 retries, and only with deduplication (e.g. idempotency keys) |
| Latency tolerance | Real-time / Batch | Determines total timeout budget |
| Failure mode | Transient / Persistent | Transient = retry, persistent = fail fast |
| Downstream | Internal / External | External = more conservative retries |
| Impact of duplicate | None / Data corruption | Determines retry safety |
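As a sketch, the classification above can be collapsed into a retry cap. This is a hypothetical helper; the specific numbers are illustrative assumptions, not prescribed by the table:

```python
def max_retries_for(idempotent: bool, external: bool) -> int:
    """Map the classification table to a conservative retry cap."""
    if not idempotent:
        return 0          # duplicates could corrupt data; fail fast instead
    return 2 if external else 3  # be more conservative with external dependencies

print(max_retries_for(idempotent=False, external=True))  # 0
print(max_retries_for(idempotent=True, external=True))   # 2
```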
Step 2: Calculate Retry Parameters
```python
def design_retry_policy(
    timeout_budget_ms: int,       # Total time we can afford across all attempts
    base_delay_ms: int = 100,     # Initial backoff
    max_delay_ms: int = 30000,    # Cap on a single retry delay
    multiplier: float = 2.0,      # Exponential multiplier
    jitter_factor: float = 0.5,   # Random jitter (0-1)
):
    """Calculate how many retries fit in the timeout budget."""
    total_delay = 0.0
    retries = 0
    while True:
        delay = min(base_delay_ms * (multiplier ** retries), max_delay_ms)
        # Add expected jitter (half of max jitter on average)
        avg_jitter = delay * jitter_factor * 0.5
        # Stop before a retry would blow the budget, so the estimate
        # only counts delays that actually fit
        if total_delay + delay + avg_jitter > timeout_budget_ms:
            break
        total_delay += delay + avg_jitter
        retries += 1
    return {
        "max_retries": retries,
        "base_delay_ms": base_delay_ms,
        "max_delay_ms": max_delay_ms,
        "multiplier": multiplier,
        "jitter": f"±{jitter_factor * 100:.0f}%",
        "total_budget_ms": timeout_budget_ms,
        "estimated_total_delay_ms": int(total_delay),
        "retry_delays": [
            f"{min(base_delay_ms * (multiplier ** i), max_delay_ms):.0f}ms"
            for i in range(retries)
        ],
    }

# Example: API call with a 30s budget
policy = design_retry_policy(timeout_budget_ms=30000)
print(f"Max retries: {policy['max_retries']}")
print(f"Delays: {' → '.join(policy['retry_delays'])}")
```
Step 3: Determine What to Retry
Always retry:
- HTTP 429 (Too Many Requests) — respect Retry-After header
- HTTP 502, 503, 504 (server/gateway errors)
- Connection refused, reset, timeout
- DNS resolution failures (transient)
Never retry:
- HTTP 400 (Bad Request) — fix the request
- HTTP 401, 403 (Auth errors) — re-authenticate, don't retry
- HTTP 404 (Not Found) — resource doesn't exist
- HTTP 409 (Conflict) — resolve conflict, don't repeat
- HTTP 422 (Validation error) — fix input data
Retry with caution:
- HTTP 500 (Internal Server Error) — may be transient or persistent
- Timeout — only if operation is idempotent
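The rules above can be encoded as a small predicate. This is a sketch: the "retry with caution" cases are reduced to a single `idempotent` flag, and anything unlisted defaults to no retry:

```python
RETRYABLE = {429, 502, 503, 504}
NON_RETRYABLE = {400, 401, 403, 404, 409, 422}

def should_retry(status: int, idempotent: bool) -> bool:
    """Decide whether an HTTP status warrants a retry."""
    if status in RETRYABLE:
        return True
    if status in NON_RETRYABLE:
        return False
    if status == 500:
        return idempotent  # may be transient, but only safe to repeat if idempotent
    return False           # default: don't retry what we don't understand

print(should_retry(503, idempotent=False))  # True
print(should_retry(500, idempotent=False))  # False
```

The same default-deny stance applies to timeouts: the request may have succeeded server-side, so retry only when a duplicate is harmless.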
Step 4: Generate Configuration
```yaml
# Retry Policy: [Service Name]
retry:
  max_attempts: 4              # 1 initial + 3 retries
  backoff:
    type: exponential
    initial_interval: 200ms
    max_interval: 10s
    multiplier: 2.0
    randomization_factor: 0.5  # ±50% jitter
  timeout_budget: 30s          # Total time for all attempts
  retryable_status_codes:
    - 429
    - 502
    - 503
    - 504
  non_retryable_status_codes:
    - 400
    - 401
    - 403
    - 404
    - 409
    - 422
circuit_breaker:
  threshold: 5        # 5 failures to open
  reset_timeout: 30s  # Time before half-open
```
Language-specific examples:
Go:
```go
// Using github.com/cenkalti/backoff/v4
b := backoff.NewExponentialBackOff()
b.InitialInterval = 200 * time.Millisecond
b.MaxInterval = 10 * time.Second
b.Multiplier = 2.0
b.RandomizationFactor = 0.5
b.MaxElapsedTime = 30 * time.Second
```
Python:
```python
from tenacity import (
    retry,
    retry_if_exception_type,
    stop_after_delay,
    wait_exponential_jitter,
)

@retry(
    stop=stop_after_delay(30),  # 30s total budget
    wait=wait_exponential_jitter(initial=0.2, max=10, jitter=1),
    retry=retry_if_exception_type((ConnectionError, TimeoutError)),
)
def call_service():
    ...
```
JavaScript:
```javascript
const retry = require('async-retry');

// fetch only rejects on network errors, so HTTP failures
// must be thrown explicitly to trigger a retry
const res = await retry(async (bail) => {
  const res = await fetch(url);
  if (res.status === 429 || res.status >= 500) {
    throw new Error(`retryable: ${res.status}`);
  }
  if (!res.ok) {
    bail(new Error(`non-retryable: ${res.status}`)); // stop retrying 4xx
    return;
  }
  return res;
}, {
  retries: 3,
  factor: 2,
  minTimeout: 200,
  maxTimeout: 10000,
  randomize: true,
});
```
2. analyze — Audit Existing Retry Logic
Scan codebase for retry patterns and flag issues:
```shell
# Find retry implementations (rg skips binary files by default)
rg "retry|backoff|exponential|attempt.*max|max.*retry" \
  -g '!node_modules' -g '!vendor' 2>/dev/null
```
Common anti-patterns to flag:
- No jitter → thundering herd on recovery
- No max delay cap → later waits balloon to minutes
- No timeout budget → total time unbounded
- Retrying non-idempotent operations → data corruption risk
- Same retry policy for all endpoints → over/under-retrying
- No circuit breaker → retries pile up during outage
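The first anti-pattern (no jitter) has a well-known fix: "full jitter", popularized by AWS's exponential backoff guidance. A minimal sketch, assuming delays in seconds:

```python
import random

def full_jitter_delay(attempt: int, base_s: float = 0.2, cap_s: float = 10.0) -> float:
    """Sleep a uniform random time up to the capped exponential delay.

    Spreading clients across [0, delay] prevents synchronized retries
    from hammering a recovering service at the same instant.
    """
    return random.uniform(0, min(cap_s, base_s * (2 ** attempt)))

delays = [full_jitter_delay(i) for i in range(5)]
assert all(0 <= d <= 10.0 for d in delays)
```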
3. simulate — Visualize Retry Behavior
Show the timeline of retry attempts with delays:
```
Attempt 1: t=0ms    → FAIL (503)
           [wait 200ms ±50%]
Attempt 2: t=180ms  → FAIL (503)
           [wait 400ms ±50%]
Attempt 3: t=520ms  → FAIL (503)
           [wait 800ms ±50%]
Attempt 4: t=1180ms → SUCCESS (200)

Total time: 1.2s
```
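A timeline like this takes only a few lines to generate. A sketch, with scripted outcomes rather than real calls, and waits drawn uniformly within the ±50% jitter band:

```python
import random

def simulate(outcomes, base_ms=200, multiplier=2.0, jitter=0.5):
    """Print a retry timeline for a scripted list of attempt outcomes."""
    t = 0.0
    for attempt, ok in enumerate(outcomes, start=1):
        status = "SUCCESS (200)" if ok else "FAIL (503)"
        print(f"Attempt {attempt}: t={t:.0f}ms → {status}")
        if ok:
            break
        delay = base_ms * multiplier ** (attempt - 1)
        print(f"           [wait {delay:.0f}ms ±{jitter * 100:.0f}%]")
        t += random.uniform(delay * (1 - jitter), delay * (1 + jitter))
    print(f"Total time: {t / 1000:.1f}s")
    return t

simulate([False, False, False, True])
```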