Retry Policy Designer
Design retry policies that actually work in production — not the naive "retry 3 times" that causes cascading failures. Analyze failure modes, calculate backoff curves, integrate circuit breakers, set timeout budgets, and generate production-ready retry configurations.
Use when: "design retry policy", "how should we handle retries", "exponential backoff", "retry strategy", "timeout design", "cascading failure prevention", "how many retries", or when adding resilience to service-to-service communication.
Commands
1. design — Create Retry Policy for a Service
Step 1: Classify the Operation
Ask about the operation to design retries for:
| Property | Options | Impact on retry design |
|---|---|---|
| Idempotent? | Yes/No | Non-idempotent = at most 0-1 retries, and only with deduplication (e.g. idempotency keys) |
| Latency tolerance | Real-time / Batch | Determines total timeout budget |
| Failure mode | Transient / Persistent | Transient = retry, persistent = fail fast |
| Downstream | Internal / External | External = more conservative retries |
| Impact of duplicate | None / Data corruption | Determines retry safety |
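As a sketch, the classification above can be collapsed into a retry cap. This is a hypothetical helper; the specific numbers are illustrative assumptions, not prescribed by the table:

```python
def max_retries_for(idempotent: bool, external: bool) -> int:
    """Map the classification table to a conservative retry cap."""
    if not idempotent:
        return 0          # duplicates could corrupt data; fail fast instead
    return 2 if external else 3  # be more conservative with external dependencies

print(max_retries_for(idempotent=False, external=True))  # 0
print(max_retries_for(idempotent=True, external=True))   # 2
```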
Step 2: Calculate Retry Parameters
```python
def design_retry_policy(
    timeout_budget_ms: int,       # Total time we can afford across all attempts
    base_delay_ms: int = 100,     # Initial backoff
    max_delay_ms: int = 30000,    # Cap on a single retry delay
    multiplier: float = 2.0,      # Exponential multiplier
    jitter_factor: float = 0.5,   # Random jitter (0-1)
):
    """Calculate how many retries fit in the timeout budget."""
    total_delay = 0.0
    retries = 0
    while True:
        delay = min(base_delay_ms * (multiplier ** retries), max_delay_ms)
        # Add expected jitter (half of max jitter on average)
        avg_jitter = delay * jitter_factor * 0.5
        # Stop before a retry would blow the budget, so the estimate
        # only counts delays that actually fit
        if total_delay + delay + avg_jitter > timeout_budget_ms:
            break
        total_delay += delay + avg_jitter
        retries += 1
    return {
        "max_retries": retries,
        "base_delay_ms": base_delay_ms,
        "max_delay_ms": max_delay_ms,
        "multiplier": multiplier,
        "jitter": f"±{jitter_factor * 100:.0f}%",
        "total_budget_ms": timeout_budget_ms,
        "estimated_total_delay_ms": int(total_delay),
        "retry_delays": [
            f"{min(base_delay_ms * (multiplier ** i), max_delay_ms):.0f}ms"
            for i in range(retries)
        ],
    }

# Example: API call with a 30s budget
policy = design_retry_policy(timeout_budget_ms=30000)
print(f"Max retries: {policy['max_retries']}")
print(f"Delays: {' → '.join(policy['retry_delays'])}")
```
Step 3: Determine What to Retry
Always retry:
- HTTP 429 (Too Many Requests) — respect Retry-After header
- HTTP 502, 503, 504 (server/gateway errors)
- Connection refused, reset, timeout
- DNS resolution failures (transient)
Never retry:
- HTTP 400 (Bad Request) — fix the request
- HTTP 401, 403 (Auth errors) — re-authenticate, don't retry
- HTTP 404 (Not Found) — resource doesn't exist
- HTTP 409 (Conflict) — resolve conflict, don't repeat
- HTTP 422 (Validation error) — fix input data
Retry with caution:
- HTTP 500 (Internal Server Error) — may be transient or persistent
- Timeout — only if operation is idempotent
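The rules above can be encoded as a small predicate. This is a sketch: the "retry with caution" cases are reduced to a single `idempotent` flag, and anything unlisted defaults to no retry:

```python
RETRYABLE = {429, 502, 503, 504}
NON_RETRYABLE = {400, 401, 403, 404, 409, 422}

def should_retry(status: int, idempotent: bool) -> bool:
    """Decide whether an HTTP status warrants a retry."""
    if status in RETRYABLE:
        return True
    if status in NON_RETRYABLE:
        return False
    if status == 500:
        return idempotent  # may be transient, but only safe to repeat if idempotent
    return False           # default: don't retry what we don't understand

print(should_retry(503, idempotent=False))  # True
print(should_retry(500, idempotent=False))  # False
```

The same default-deny stance applies to timeouts: the request may have succeeded server-side, so retry only when a duplicate is harmless.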
Step 4: Generate Configuration
```yaml
# Retry Policy: [Service Name]
retry:
  max_attempts: 4              # 1 initial + 3 retries
  backoff:
    type: exponential
    initial_interval: 200ms
    max_interval: 10s
    multiplier: 2.0
    randomization_factor: 0.5  # ±50% jitter
  timeout_budget: 30s          # Total time for all attempts
  retryable_status_codes:
    - 429
    - 502
    - 503
    - 504
  non_retryable_status_codes:
    - 400
    - 401
    - 403
    - 404
    - 409
    - 422
circuit_breaker:
  threshold: 5        # 5 failures to open
  reset_timeout: 30s  # Time before half-open
```
Language-specific examples:
Go:
```go
// Using github.com/cenkalti/backoff/v4
b := backoff.NewExponentialBackOff()
b.InitialInterval = 200 * time.Millisecond
b.MaxInterval = 10 * time.Second
b.Multiplier = 2.0
b.RandomizationFactor = 0.5
b.MaxElapsedTime = 30 * time.Second
```
Python:
```python
from tenacity import (
    retry,
    retry_if_exception_type,
    stop_after_delay,
    wait_exponential_jitter,
)

@retry(
    stop=stop_after_delay(30),  # 30s total budget
    wait=wait_exponential_jitter(initial=0.2, max=10, jitter=1),
    retry=retry_if_exception_type((ConnectionError, TimeoutError)),
)
def call_service():
    ...
```
JavaScript:
```javascript
const retry = require('async-retry');

// fetch only rejects on network errors, so HTTP failures
// must be thrown explicitly to trigger a retry
const res = await retry(async (bail) => {
  const res = await fetch(url);
  if (res.status === 429 || res.status >= 500) {
    throw new Error(`retryable: ${res.status}`);
  }
  if (!res.ok) {
    bail(new Error(`non-retryable: ${res.status}`)); // stop retrying 4xx
    return;
  }
  return res;
}, {
  retries: 3,
  factor: 2,
  minTimeout: 200,
  maxTimeout: 10000,
  randomize: true,
});
```
2. analyze — Audit Existing Retry Logic
Scan codebase for retry patterns and flag issues:
```shell
# Find retry implementations (rg skips binary files by default)
rg "retry|backoff|exponential|attempt.*max|max.*retry" \
  -g '!node_modules' -g '!vendor' 2>/dev/null
```
Common anti-patterns to flag:
- No jitter → thundering herd on recovery
- No max delay cap → later waits balloon to minutes
- No timeout budget → total time unbounded
- Retrying non-idempotent operations → data corruption risk
- Same retry policy for all endpoints → over/under-retrying
- No circuit breaker → retries pile up during outage
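The first anti-pattern (no jitter) has a well-known fix: "full jitter", popularized by AWS's exponential backoff guidance. A minimal sketch, assuming delays in seconds:

```python
import random

def full_jitter_delay(attempt: int, base_s: float = 0.2, cap_s: float = 10.0) -> float:
    """Sleep a uniform random time up to the capped exponential delay.

    Spreading clients across [0, delay] prevents synchronized retries
    from hammering a recovering service at the same instant.
    """
    return random.uniform(0, min(cap_s, base_s * (2 ** attempt)))

delays = [full_jitter_delay(i) for i in range(5)]
assert all(0 <= d <= 10.0 for d in delays)
```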
3. simulate — Visualize Retry Behavior
Show the timeline of retry attempts with delays:
```
Attempt 1: t=0ms    → FAIL (503)
           [wait 200ms ±50%]
Attempt 2: t=180ms  → FAIL (503)
           [wait 400ms ±50%]
Attempt 3: t=520ms  → FAIL (503)
           [wait 800ms ±50%]
Attempt 4: t=1180ms → SUCCESS (200)

Total time: 1.2s
```
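A timeline like this takes only a few lines to generate. A sketch, with scripted outcomes rather than real calls, and waits drawn uniformly within the ±50% jitter band:

```python
import random

def simulate(outcomes, base_ms=200, multiplier=2.0, jitter=0.5):
    """Print a retry timeline for a scripted list of attempt outcomes."""
    t = 0.0
    for attempt, ok in enumerate(outcomes, start=1):
        status = "SUCCESS (200)" if ok else "FAIL (503)"
        print(f"Attempt {attempt}: t={t:.0f}ms → {status}")
        if ok:
            break
        delay = base_ms * multiplier ** (attempt - 1)
        print(f"           [wait {delay:.0f}ms ±{jitter * 100:.0f}%]")
        t += random.uniform(delay * (1 - jitter), delay * (1 + jitter))
    print(f"Total time: {t / 1000:.1f}s")
    return t

simulate([False, False, False, True])
```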