
Reliability Strategy Builder

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Copy the command below and send it to your AI assistant to install this skill:

npx skills add monkey1sai/openai-cli/monkey1sai-openai-cli-reliability-strategy-builder


Build resilient systems with proper failure handling and SLOs.

Reliability Patterns

  1. Circuit Breaker

Prevent cascading failures by stopping requests to failing services.

```typescript
class CircuitBreaker {
  private state: "closed" | "open" | "half-open" = "closed";
  private failureCount = 0;
  private lastFailureTime?: Date;

  async execute<T>(operation: () => Promise<T>): Promise<T> {
    if (this.state === "open") {
      if (this.shouldAttemptReset()) {
        this.state = "half-open";
      } else {
        throw new Error("Circuit breaker is OPEN");
      }
    }

    try {
      const result = await operation();
      this.onSuccess();
      return result;
    } catch (error) {
      this.onFailure();
      throw error;
    }
  }

  private onSuccess() {
    this.failureCount = 0;
    this.state = "closed";
  }

  private onFailure() {
    this.failureCount++;
    this.lastFailureTime = new Date();

    if (this.failureCount >= 5) {
      this.state = "open";
    }
  }

  private shouldAttemptReset(): boolean {
    if (!this.lastFailureTime) return false;
    const elapsed = Date.now() - this.lastFailureTime.getTime();
    return elapsed > 60_000; // 1 minute
  }
}
```

  2. Retry with Backoff

Handle transient failures with exponential backoff.

```typescript
const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

async function retryWithBackoff<T>(
  operation: () => Promise<T>,
  maxRetries = 3,
  baseDelay = 1000
): Promise<T> {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await operation();
    } catch (error) {
      if (attempt === maxRetries) throw error;

      // Exponential backoff: 1s, 2s, 4s
      const delay = baseDelay * Math.pow(2, attempt);
      await sleep(delay);
    }
  }
  throw new Error("Max retries exceeded"); // unreachable; satisfies the type checker
}
```

  3. Fallback Pattern

Provide degraded functionality when the primary dependency fails.

```typescript
async function getUserWithFallback(userId: string): Promise<User> {
  try {
    // Try the primary database
    return await primaryDb.users.findById(userId);
  } catch (error) {
    logger.warn("Primary DB failed, using cache");

    // Fallback to cache
    const cached = await cache.get(`user:${userId}`);
    if (cached) return cached;

    // Final fallback: return a minimal user object
    return {
      id: userId,
      name: "Unknown User",
      email: "unavailable",
    };
  }
}
```

  4. Bulkhead Pattern

Isolate failures to prevent resource exhaustion.

```typescript
class ThreadPool {
  private pools = new Map<string, Semaphore>();

  constructor() {
    // Separate pools for different operation classes
    this.pools.set("critical", new Semaphore(100));
    this.pools.set("standard", new Semaphore(50));
    this.pools.set("background", new Semaphore(10));
  }

  async execute<T>(priority: string, operation: () => Promise<T>): Promise<T> {
    const pool = this.pools.get(priority);
    if (!pool) throw new Error(`Unknown priority: ${priority}`);
    await pool.acquire();

    try {
      return await operation();
    } finally {
      pool.release();
    }
  }
}
```
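The `Semaphore` used above is not defined in this skill; a minimal counting semaphore could look like the following sketch (one possible implementation, not the one the skill assumes):

```typescript
// Minimal counting semaphore: `permits` concurrent holders, FIFO waiters.
class Semaphore {
  private queue: Array<() => void> = [];

  constructor(private permits: number) {}

  async acquire(): Promise<void> {
    if (this.permits > 0) {
      this.permits--;
      return;
    }
    // No permit available: park until release() hands one over.
    await new Promise<void>((resolve) => this.queue.push(resolve));
  }

  release(): void {
    const next = this.queue.shift();
    if (next) {
      next(); // Pass the permit directly to the next waiter.
    } else {
      this.permits++;
    }
  }
}
```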

SLO Definitions

SLO Template

```yaml
service: user-api
slos:
  - name: Availability
    description: API should be available for successful requests
    target: 99.9%
    measurement:
      type: ratio
      success: status_code < 500
      total: all_requests
      window: 30 days

  - name: Latency
    description: 95% of requests complete within 500ms
    target: 95%
    measurement:
      type: percentile
      metric: request_duration_ms
      threshold: 500
      percentile: 95
      window: 7 days

  - name: Error Rate
    description: Less than 1% of requests result in errors
    target: 99%
    measurement:
      type: ratio
      success: status_code < 400 OR status_code IN [401, 403, 404]
      total: all_requests
      window: 24 hours
```
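A ratio SLI like the Availability entry can be computed from request counters; a minimal sketch (the counter shape and function names are illustrative assumptions):

```typescript
interface RequestCounts {
  total: number;
  status5xx: number;
}

// Availability SLI: fraction of requests that did not fail with a 5xx,
// matching `success: status_code < 500` in the template above.
function availabilitySli(counts: RequestCounts): number {
  if (counts.total === 0) return 1; // no traffic: treat as meeting the SLO
  return (counts.total - counts.status5xx) / counts.total;
}

function meetsSlo(sli: number, targetPercent: number): boolean {
  return sli * 100 >= targetPercent;
}
```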

Error Budget

Error Budget = 100% - SLO

Example:
  SLO: 99.9% availability
  Error Budget: 0.1% = 43.2 minutes/month of downtime allowed
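The arithmetic above can be sketched as a small helper (the function name is an illustrative assumption):

```typescript
// Convert an availability SLO into an allowed-downtime budget.
function errorBudgetMinutes(sloPercent: number, windowDays: number): number {
  const budgetFraction = (100 - sloPercent) / 100; // e.g. 99.9% -> 0.001
  const windowMinutes = windowDays * 24 * 60;      // 30 days -> 43,200 minutes
  return budgetFraction * windowMinutes;
}

// errorBudgetMinutes(99.9, 30) -> 43.2 minutes per 30-day window
```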

Failure Mode Analysis

| Component   | Failure Mode | Impact | Probability | Detection               | Mitigation                     |
| ----------- | ------------ | ------ | ----------- | ----------------------- | ------------------------------ |
| Database    | Unresponsive | HIGH   | Medium      | Health checks every 10s | Circuit breaker, read replicas |
| API Gateway | Overload     | HIGH   | Low         | Request queue depth     | Rate limiting, auto-scaling    |
| Cache       | Eviction     | MEDIUM | High        | Cache hit rate          | Fallback to DB, larger cache   |
| Queue       | Backed up    | LOW    | Medium      | Queue depth metric      | Add workers, DLQ               |

Reliability Checklist

Infrastructure

  • Load balancer with health checks

  • Multiple availability zones

  • Auto-scaling configured

  • Database replication

  • Regular backups (tested!)

Application

  • Circuit breakers on external calls

  • Retry logic with backoff

  • Timeouts on all I/O

  • Fallback mechanisms

  • Graceful degradation

Monitoring

  • SLO dashboard

  • Error budgets tracked

  • Alerting on SLO violations

  • Latency percentiles (p50, p95, p99)

  • Dependency health checks

Operations

  • Incident response runbook

  • On-call rotation

  • Postmortem template

  • Disaster recovery plan

  • Chaos engineering tests

Incident Response Plan

Severity Levels

SEV1 (Critical): Complete service outage, data loss

  • Response time: <15 minutes
  • Page on-call immediately

SEV2 (High): Partial outage, degraded performance

  • Response time: <1 hour
  • Alert on-call

SEV3 (Medium): Minor issues, workarounds available

  • Response time: <4 hours
  • Create ticket

SEV4 (Low): Cosmetic issues, no user impact

  • Response time: Next business day
  • Backlog

Incident Response Steps

  1. Acknowledge: Confirm receipt within the SLA

  2. Assess: Determine severity and impact

  3. Communicate: Update the status page

  4. Mitigate: Stop the bleeding (rollback, scale, disable)

  5. Resolve: Fix the root cause

  6. Document: Write a postmortem

Best Practices

  • Design for failure: Assume components will fail

  • Fail fast: Don't let slow failures cascade

  • Isolate failures: Bulkhead pattern

  • Graceful degradation: Reduce functionality, don't crash

  • Monitor SLOs: Track error budgets

  • Test failure modes: Chaos engineering

  • Document runbooks: Clear incident response

Output Checklist

  • Circuit breakers implemented

  • Retry logic with backoff

  • Fallback mechanisms

  • Bulkhead isolation

  • SLOs defined (availability, latency, errors)

  • Error budgets calculated

  • Failure mode analysis

  • Monitoring dashboard

  • Incident response plan

  • Runbooks documented

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.
