retry-policy-designer

Design retry and backoff policies for distributed systems. Analyze failure modes, recommend exponential backoff with jitter, circuit breaker integration, timeout budgets, and generate ready-to-use retry configurations for HTTP clients, message queues, and RPC calls.

Safety Notice

This listing is from the official public ClawHub registry. Review SKILL.md and referenced scripts before running.

Install skill "retry-policy-designer" with this command: npx skills add charlie-morrison/retry-policy-designer

Retry Policy Designer

Design retry policies that actually work in production — not the naive "retry 3 times" that causes cascading failures. Analyze failure modes, calculate backoff curves, integrate circuit breakers, set timeout budgets, and generate production-ready retry configurations.

Use when: "design retry policy", "how should we handle retries", "exponential backoff", "retry strategy", "timeout design", "cascading failure prevention", "how many retries", or when adding resilience to service-to-service communication.

Commands

1. design — Create Retry Policy for a Service

Step 1: Classify the Operation

Ask about the operation the retries are being designed for:

| Property | Options | Impact on retry design |
|---|---|---|
| Idempotent? | Yes / No | Non-idempotent = max 0-1 retries |
| Latency tolerance | Real-time / Batch | Determines total timeout budget |
| Failure mode | Transient / Persistent | Transient = retry, persistent = fail fast |
| Downstream | Internal / External | External = more conservative retries |
| Impact of duplicate | None / Data corruption | Determines retry safety |
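
The classification above can be turned into a rough ceiling on retries. The following is a minimal sketch, not part of the skill itself: the `Operation` class, the `retry_ceiling` function, and the concrete numbers (3 for real-time, 6 for batch) are illustrative assumptions; only the rules "non-idempotent = max 0-1 retries", "persistent = fail fast", and "external = more conservative" come from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Operation:
    idempotent: bool          # Idempotent? Yes / No
    realtime: bool            # Latency tolerance: real-time (True) or batch (False)
    transient_failure: bool   # Failure mode: transient (True) or persistent (False)
    external: bool            # Downstream: external (True) or internal (False)
    duplicate_corrupts: bool  # Impact of duplicate: data corruption (True) or none


def retry_ceiling(op: Operation) -> int:
    """Upper bound on retries. The concrete numbers are assumptions."""
    if not op.transient_failure:
        return 0  # persistent failure: fail fast
    if not op.idempotent:
        # Table rule: non-idempotent operations get at most 0-1 retries
        return 0 if op.duplicate_corrupts else 1
    ceiling = 3 if op.realtime else 6  # hypothetical defaults
    if op.external:
        ceiling = max(1, ceiling - 1)  # external downstream: more conservative
    return ceiling
```

The returned value is only a cap; the actual retry count should still be checked against the timeout budget calculated in Step 2.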

Step 2: Calculate Retry Parameters

def design_retry_policy(
    timeout_budget_ms: int,      # Total time we can afford
    base_delay_ms: int = 100,    # Initial backoff
    max_delay_ms: int = 30000,   # Cap on single retry delay
    multiplier: float = 2.0,     # Exponential multiplier
    jitter_factor: float = 0.5,  # Max additive jitter as a fraction of the delay (0-1)
):
    """Calculate how many retries fit in the timeout budget.

    Only backoff delays are counted; time spent inside each attempt is not.
    """
    if min(base_delay_ms, max_delay_ms) <= 0 or multiplier < 1:
        raise ValueError("delays must be > 0 and multiplier must be >= 1")

    total_delay = 0.0
    delays = []
    delay = min(base_delay_ms, max_delay_ms)

    while True:
        # Add expected jitter (half of max jitter on average)
        avg_jitter = delay * jitter_factor * 0.5
        # Stop before the delay that would exceed the budget, so the
        # reported total covers only the retries that actually fit
        if total_delay + delay + avg_jitter > timeout_budget_ms:
            break
        total_delay += delay + avg_jitter
        delays.append(delay)
        delay = min(delay * multiplier, max_delay_ms)

    return {
        "max_retries": len(delays),
        "base_delay_ms": base_delay_ms,
        "max_delay_ms": max_delay_ms,
        "multiplier": multiplier,
        "jitter": f"+0 to {jitter_factor*100:.0f}%",
        "total_budget_ms": timeout_budget_ms,
        "estimated_total_delay_ms": int(total_delay),
        "retry_delays": [f"{d:.0f}ms" for d in delays],
    }

# Example: API call with 30s budget
policy = design_retry_policy(timeout_budget_ms=30000)
print(f"Max retries: {policy['max_retries']}")          # Max retries: 7
print(f"Delays: {' → '.join(policy['retry_delays'])}")  # 100ms → 200ms → ... → 6400ms
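
design_retry_policy works with the expected jitter only. A minimal sketch of how a single jittered delay could be sampled at runtime follows; `jittered_delay_ms` is a hypothetical helper, and it assumes additive jitter drawn uniformly from 0 to `delay * jitter_factor`, which is the interpretation implied by the `avg_jitter` line in the function above.

```python
import random


def jittered_delay_ms(retry_index, base_delay_ms=100, max_delay_ms=30000,
                      multiplier=2.0, jitter_factor=0.5, rng=random):
    """Delay before retry `retry_index` (0-based), with additive jitter."""
    delay = min(base_delay_ms * (multiplier ** retry_index), max_delay_ms)
    # Uniform jitter in [0, delay * jitter_factor]; its mean is
    # delay * jitter_factor / 2, the same value avg_jitter estimates.
    return delay + rng.uniform(0, delay * jitter_factor)


rng = random.Random(42)  # seeded only so the example is repeatable
delays = [jittered_delay_ms(i, rng=rng) for i in range(5)]
print([f"{d:.0f}ms" for d in delays])
```

Passing a seeded `random.Random` keeps tests deterministic; production code would normally use the default generator so that clients do not share the same jitter sequence.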

Step 3: Determine What to Retry

Always retry:

  • HTTP 429 (Too Many Requests) — respect Retry-After header
  • HTTP 502, 503, 504 (server/gateway errors)
  • Connection refused, connection reset, connect timeout
  • DNS resolution failures (transient)

Never retry:

  • HTTP 400 (Bad Request) — fix the request
  • HTTP 401, 403 (Auth errors) — re-authenticate, don't retry
  • HTTP 404 (Not Found) — resource doesn't exist
  • HTTP 409 (Conflict) — resolve conflict, don't repeat
  • HTTP 422 (Validation error) — fix input data

Retry with caution:

  • HTTP 500 (Internal Server Error) — may be transient or persistent
  • Request (read) timeout: retry only if the operation is idempotent, since the server may already have processed the request
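
The three lists above map naturally onto a small decision function. This is a sketch under stated assumptions: `should_retry` and `retry_after_seconds` are hypothetical names, HTTP 500 is treated as retryable only for idempotent operations (one possible reading of "retry with caution"), unlisted status codes are not retried, and only the delta-seconds form of the Retry-After header is parsed.

```python
from typing import Optional

ALWAYS_RETRY = {429, 502, 503, 504}
NEVER_RETRY = {400, 401, 403, 404, 409, 422}


def should_retry(status: int, idempotent: bool) -> bool:
    """Decide whether an HTTP status code is worth retrying."""
    if status in ALWAYS_RETRY:
        return True
    if status in NEVER_RETRY:
        return False
    if status == 500:
        # Assumption: "retry with caution" means idempotent operations only
        return idempotent
    return False  # assumption: anything unlisted is not retried


def retry_after_seconds(header_value: Optional[str]) -> Optional[float]:
    """Parse the delta-seconds form of Retry-After; other forms return None."""
    if header_value is None:
        return None
    value = header_value.strip()
    if value.isascii() and value.isdigit():
        return float(value)
    return None  # HTTP-date form is not handled in this sketch
```

For a 429 response, a caller could wait `retry_after_seconds(...)` when it returns a value and fall back to the computed backoff otherwise.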

Step 4: Generate Configuration

# Retry Policy: [Service Name]
retry:
  max_attempts: 4          # 1 initial + 3 retries
  backoff:
    type: exponential
    initial_interval: 200ms
    max_interval: 10s
    multiplier: 2.0
    randomization_factor: 0.5  # ±50% jitter
  timeout_budget: 30s      # Total time for all attempts
  retryable_status_codes:
    - 429
    - 502
    - 503
    - 504
  non_retryable_status_codes:
    - 400
    - 401
    - 403
    - 404
    - 409
    - 422
  circuit_breaker:
    threshold: 5            # 5 failures to open
    reset_timeout: 30s      # Time before half-open
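
The circuit_breaker settings in the configuration above can be illustrated with a minimal in-process breaker. This is a sketch, not a reference implementation: the `CircuitBreaker` class is hypothetical, it assumes `threshold` counts consecutive failures, and its half-open state lets every request through once `reset_timeout` has elapsed rather than a single probe.

```python
import time


class CircuitBreaker:
    """Opens after `threshold` consecutive failures; half-open after `reset_timeout_s`."""

    def __init__(self, threshold=5, reset_timeout_s=30.0, clock=time.monotonic):
        self.threshold = threshold
        self.reset_timeout_s = reset_timeout_s
        self.clock = clock        # injectable so tests can control time
        self.failures = 0
        self.opened_at = None     # None means the circuit is closed

    def allow_request(self) -> bool:
        if self.opened_at is None:
            return True
        # Half-open: once the reset timeout has passed, allow a trial request
        return self.clock() - self.opened_at >= self.reset_timeout_s

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None     # close the circuit

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = self.clock()  # open, or re-open after a failed trial
```

A retry loop would call `allow_request()` before each attempt and stop retrying while it returns False, so that retries do not pile up during an outage.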

Language-specific examples:

Go:

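// Note: `retry` stands in for your backoff library of choice. The option
// names are illustrative; check them against the library you actually use.
retryPolicy := retry.NewExponentialBackoff(
    retry.WithInitialInterval(200 * time.Millisecond),
    retry.WithMaxInterval(10 * time.Second),
    retry.WithMultiplier(2.0),
    retry.WithRandomizationFactor(0.5),
    retry.WithMaxElapsedTime(30 * time.Second),
)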

Python:

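from tenacity import (
    retry,
    retry_if_exception_type,
    stop_after_attempt,
    stop_after_delay,
    wait_exponential_jitter,
)

@retry(
    # Stop at 4 attempts or 30 s, whichever comes first
    stop=stop_after_attempt(4) | stop_after_delay(30),
    # tenacity's jitter is an additive maximum in seconds, not a ±fraction
    wait=wait_exponential_jitter(initial=0.2, max=10, jitter=0.1),
    retry=retry_if_exception_type((ConnectionError, TimeoutError)),
)
def call_service():
    ...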

JavaScript:

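const retry = require('async-retry');
async function callService(url) {
    return retry(async (bail) => {
        const res = await fetch(url);
        // fetch resolves on HTTP errors, so turn retryable statuses into throws
        if ([429, 502, 503, 504].includes(res.status)) {
            throw new Error(`retryable status ${res.status}`);
        }
        // Any other non-2xx status is not retried
        if (!res.ok) return bail(new Error(`non-retryable status ${res.status}`));
        return res;
    }, {
        retries: 3,
        factor: 2,
        minTimeout: 200,
        maxTimeout: 10000,
        randomize: true,
    });
}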
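
The library examples above hide the retry loop. For reference, here is a standard-library-only sketch of the same policy (4 attempts, 200 ms initial interval, 10 s cap, multiplier 2.0, ±50% randomization, 30 s budget). `RetryableError` and `call_with_retries` are hypothetical names, and the budget check assumes the caller would rather fail early than start a wait that cannot finish in time.

```python
import random
import time


class RetryableError(Exception):
    """Hypothetical marker for failures that are safe to retry."""


def call_with_retries(fn, max_attempts=4, initial_s=0.2, max_interval_s=10.0,
                      multiplier=2.0, randomization=0.5, budget_s=30.0,
                      sleep=time.sleep, clock=time.monotonic, rng=random):
    """Call fn(), retrying RetryableError with jittered exponential backoff."""
    deadline = clock() + budget_s
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except RetryableError:
            if attempt == max_attempts:
                raise  # attempts exhausted
            interval = min(initial_s * multiplier ** (attempt - 1), max_interval_s)
            # Symmetric jitter: interval * [1 - randomization, 1 + randomization]
            wait = interval * rng.uniform(1 - randomization, 1 + randomization)
            if clock() + wait > deadline:
                raise  # the next wait would exceed the timeout budget
            sleep(wait)
```

`sleep`, `clock`, and `rng` are injectable so the loop can be tested without real waiting. Any exception other than `RetryableError` propagates immediately, which corresponds to the "never retry" cases in Step 3.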

2. analyze — Audit Existing Retry Logic

Scan the codebase for retry patterns and flag issues:

# Find retry implementations (ripgrep skips binary files by default)
rg "retry|backoff|exponential|attempt.*max|max.*retry" \
  -g '!node_modules' -g '!vendor' 2>/dev/null

Common anti-patterns to flag:

  • No jitter → thundering herd on recovery
  • No max delay cap → individual retries too slow
  • No timeout budget → total time unbounded
  • Retrying non-idempotent operations → data corruption risk
  • Same retry policy for all endpoints → over/under-retrying
  • No circuit breaker → retries pile up during outage
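
Several of the anti-patterns above can be checked mechanically once a policy is available as data. The sketch below assumes the policy has been loaded into a dict whose keys mirror the Step 4 configuration (`backoff`, `randomization_factor`, `max_interval`, `timeout_budget`, `max_attempts`, `circuit_breaker`); `audit_policy` is a hypothetical helper, and the "same policy for all endpoints" check is left out because it needs more than one policy to compare.

```python
def audit_policy(policy: dict, idempotent: bool) -> list:
    """Return the anti-patterns found in a single retry policy dict."""
    backoff = policy.get("backoff", {})
    findings = []
    if not backoff.get("randomization_factor"):
        findings.append("no jitter: thundering herd on recovery")
    if "max_interval" not in backoff:
        findings.append("no max delay cap: individual retries too slow")
    if "timeout_budget" not in policy:
        findings.append("no timeout budget: total time unbounded")
    if not idempotent and policy.get("max_attempts", 1) > 1:
        findings.append("retrying a non-idempotent operation: data corruption risk")
    if "circuit_breaker" not in policy:
        findings.append("no circuit breaker: retries pile up during outage")
    return findings


naive = {"max_attempts": 3, "backoff": {"type": "fixed"}}
for finding in audit_policy(naive, idempotent=False):
    print(f"- {finding}")
```

An empty list means none of these five checks fired; it does not prove the policy is appropriate for the endpoint.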

3. simulate — Visualize Retry Behavior

Show the timeline of retry attempts with delays:

Attempt 1: t=0ms        → FAIL (503)
  [wait 200ms ±50%]
Attempt 2: t=180ms      → FAIL (503)
  [wait 400ms ±50%]
Attempt 3: t=520ms      → FAIL (503)
  [wait 800ms ±50%]
Attempt 4: t=1180ms     → SUCCESS (200)
Total time: 1.2s
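
A timeline like the one above can be generated with a few lines of code. This is a minimal sketch: `simulate` is a hypothetical function, it takes the per-attempt status codes as input, applies symmetric ±randomization to each nominal wait, and ignores the time spent inside each attempt, which is a simplification.

```python
import random


def simulate(outcomes, initial_ms=200, multiplier=2.0, max_ms=10000,
             randomization=0.5, seed=None):
    """Return timeline lines for a list of per-attempt HTTP status codes."""
    rng = random.Random(seed)
    t = 0.0
    lines = []
    for i, status in enumerate(outcomes):
        ok = 200 <= status < 300
        label = "SUCCESS" if ok else "FAIL"
        lines.append(f"Attempt {i + 1}: t={t:.0f}ms → {label} ({status})")
        if ok or i == len(outcomes) - 1:
            break  # stop on success or when no attempts remain
        nominal = min(initial_ms * multiplier ** i, max_ms)
        wait = nominal * rng.uniform(1 - randomization, 1 + randomization)
        lines.append(f"  [wait {nominal:.0f}ms ±{randomization:.0%}: {wait:.0f}ms]")
        t += wait
    lines.append(f"Total time: {t / 1000:.1f}s")
    return lines


print("\n".join(simulate([503, 503, 503, 200], seed=1)))
```

Fixing `seed` makes a run repeatable; leaving it as None gives a different jitter sample each time, which is closer to how real clients behave.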

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.
