testing-chaos

Run chaos engineering tests to build resilient systems

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Install this skill with:

npx skills add wojons/skills/wojons-skills-testing-chaos

Chaos Testing

Run chaos engineering experiments to build resilient systems by intentionally injecting failures and observing system behavior.

When to use me

Use this skill when:

  • Building highly available and resilient systems
  • Testing failure recovery and auto-remediation
  • Validating disaster recovery plans
  • Ensuring graceful degradation under stress
  • Testing monitoring and alerting systems
  • Building confidence in production resilience
  • Preparing for unexpected failure scenarios

What I do

  • Failure injection experiments:

    • Network latency and packet loss
    • Service dependency failures
    • Resource exhaustion (CPU, memory, disk)
    • Database connection failures
    • Third-party API outages
  • Resilience validation:

    • Circuit breaker pattern testing
    • Retry and backoff strategy validation
    • Fallback and default behavior testing
    • Load shedding and rate limiting
    • Failover and redundancy testing
  • Coordination with other testing:

    • Builds on performance and load testing
    • Complements disaster recovery testing
    • Informs reliability and availability testing
    • Validates monitoring and observability
  • Experiment design:

    • Hypothesis-driven experimentation
    • Blast radius containment
    • Progressive fault injection
    • Automated experiment orchestration
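The fault-injection and progressive-experiment ideas above can be sketched in a few lines of Node. This is a minimal illustration, not this skill's actual implementation; the names (`injectFaults`, `runProgressiveExperiment`, `errorRate`) are made up for the example.

```javascript
// Minimal fault-injection wrapper: with probability `errorRate` the call
// fails with an injected error; otherwise it is delayed by `latencyMs`
// before the real work runs. `random` is injectable for deterministic tests.
function injectFaults(fn, { latencyMs = 0, errorRate = 0, random = Math.random } = {}) {
  return async (...args) => {
    if (random() < errorRate) {
      throw new Error('chaos: injected failure');
    }
    if (latencyMs > 0) {
      await new Promise((resolve) => setTimeout(resolve, latencyMs));
    }
    return fn(...args);
  };
}

// Progressive fault injection: ramp the error rate up step by step and
// record how often the wrapped call still succeeds at each level.
async function runProgressiveExperiment(fn, steps, callsPerStep) {
  const results = [];
  for (const errorRate of steps) {
    const chaotic = injectFaults(fn, { errorRate });
    let ok = 0;
    for (let i = 0; i < callsPerStep; i++) {
      try { await chaotic(); ok++; } catch { /* injected failure, counted as miss */ }
    }
    results.push({ errorRate, successRate: ok / callsPerStep });
  }
  return results;
}
```

Starting with a low error rate and ramping up is one way to contain blast radius: each step is a small, observable increase rather than a single large shock.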

Examples

# Chaos engineering tools
npm run test:chaos:start           # Start chaos experiments
npm run test:chaos:stop            # Stop all chaos experiments
npm run test:chaos:status          # Check experiment status

# Specific failure injections
npm run test:chaos:network         # Network failure injection
npm run test:chaos:service         # Service dependency failures
npm run test:chaos:resource        # Resource exhaustion
npm run test:chaos:database        # Database failures

# Integration with other tests
npm run test:performance -- --chaos # Performance under failure
npm run test:reliability -- --chaos # Reliability with faults

# Experiment scenarios
npm run test:chaos:scenario:api-outage      # API dependency outage
npm run test:chaos:scenario:db-failover     # Database failover
npm run test:chaos:scenario:latency-spike   # Network latency spike
npm run test:chaos:scenario:memory-leak     # Memory pressure

# Safety controls
npm run test:chaos:safety-check    # Pre-experiment safety check
npm run test:chaos:rollback        # Emergency rollback
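The circuit-breaker testing listed under resilience validation assumes a breaker to test. As a reference point, here is a small circuit breaker written from scratch (not any particular library's API): after a threshold of consecutive failures the circuit opens and calls fail fast until a reset window elapses, then a trial call may close it again.

```javascript
// Minimal circuit breaker sketch. `now` is injectable so the open/half-open
// transition can be tested without real waiting.
class CircuitBreaker {
  constructor(fn, { failureThreshold = 3, resetMs = 5000, now = Date.now } = {}) {
    this.fn = fn;
    this.failureThreshold = failureThreshold;
    this.resetMs = resetMs;
    this.now = now;
    this.failures = 0;   // consecutive failures seen
    this.openedAt = null; // timestamp when the circuit opened, or null
  }

  get state() {
    if (this.openedAt === null) return 'closed';
    return this.now() - this.openedAt >= this.resetMs ? 'half-open' : 'open';
  }

  async call(...args) {
    if (this.state === 'open') throw new Error('circuit open: failing fast');
    try {
      const result = await this.fn(...args); // closed or half-open: try the call
      this.failures = 0;
      this.openedAt = null;                  // success closes the circuit
      return result;
    } catch (err) {
      this.failures += 1;
      if (this.failures >= this.failureThreshold) this.openedAt = this.now();
      throw err;
    }
  }
}
```

A chaos experiment against this pattern would inject dependency failures and verify that the breaker opens (protecting the caller) and later recovers on its own.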

Output format

Chaos Test Results:
──────────────────────────────
Experiment: Database Primary Node Failure
Hypothesis: System will failover to replica within 30 seconds
Blast Radius: Staging environment, canary deployment
Duration: 15 minutes

Experiment Execution:
  1. Baseline metrics collected
  2. Database primary node terminated (simulated)
  3. System behavior observed for 10 minutes
  4. Metrics compared to baseline

Results:
  ✅ Failover Time: 22 seconds (within 30s target)
  ✅ Data Consistency: No data loss detected
  ✅ User Impact: 15% error rate during failover (acceptable)
  ✅ Recovery: Automatic, no manual intervention required
  ✅ Monitoring: Alerts triggered within 45 seconds

System Behavior Under Failure:
  - API response time increased from 150ms to 850ms during failover
  - Error rate spiked to 15% for 45 seconds
  - Read-only operations continued uninterrupted
  - Write operations queued and retried successfully

Integration with Other Testing:
  - Performance testing: Established baseline for comparison
  - Reliability testing: Validated MTTR (Mean Time To Recovery)
  - Monitoring testing: Alert effectiveness verified
  - Disaster recovery: Automated failover confirmed

Safety Controls:
  - Automatic rollback on critical failure thresholds
  - Manual abort available throughout
  - Canary deployment limited the impact
  - Post-experiment verification passed

Lessons Learned:
  1. Need to improve connection pooling during failover
  2. Alert thresholds should be adjusted for transient failures
  3. User-facing error messages during failover need improvement
  4. Database health checks could be more frequent

Recommendation:
  - System demonstrates good resilience to database failures
  - Implement suggested improvements from lessons learned
  - Schedule regular chaos experiments (monthly)
  - Expand blast radius gradually as confidence increases
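The "write operations queued and retried successfully" behavior in the sample report above is typically built on retry with exponential backoff. A minimal sketch (the helper name and parameters are illustrative):

```javascript
// Retry a failing async call with exponential backoff: delays grow as
// baseDelayMs * 2^attempt, up to maxAttempts tries. `sleep` is injectable
// so tests can observe the delays without real waiting.
async function retryWithBackoff(fn, { maxAttempts = 4, baseDelayMs = 50, sleep } = {}) {
  const wait = sleep || ((ms) => new Promise((resolve) => setTimeout(resolve, ms)));
  let lastError;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn(attempt);
    } catch (err) {
      lastError = err;
      if (attempt < maxAttempts - 1) {
        await wait(baseDelayMs * 2 ** attempt); // 50ms, 100ms, 200ms, ...
      }
    }
  }
  throw lastError; // all attempts exhausted
}
```

During a failover window like the one in the report, this pattern lets transient write failures succeed on a later attempt instead of surfacing to the user.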

Notes

  • Start with small, controlled experiments
  • Always have a rollback plan and automatic abort mechanisms
  • Test in staging before production
  • Document hypotheses and validate outcomes
  • Use feature flags to control chaos injection
  • Monitor system metrics closely during experiments
  • Learn from failures and improve system design
  • Chaos testing complements, doesn't replace, other testing
  • Consider business impact and schedule experiments appropriately
  • Build a culture of resilience, not just technical fixes
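The feature-flag and blast-radius notes above can be combined into a simple guard: faults run only when the flag is on and the target is inside the allowed blast radius, so a misconfigured experiment defaults to doing nothing. The names here (`chaosEnabled`, `blastRadius`, `maybeInject`) are illustrative, not a specific tool's API.

```javascript
// Returns true only when chaos is explicitly enabled AND the target's
// environment is listed in the allowed blast radius.
function chaosEnabled(flags, target) {
  return flags.chaosEnabled === true
    && Array.isArray(flags.blastRadius)
    && flags.blastRadius.includes(target.environment);
}

// Runs `fault` against `target` only if the guard passes; otherwise no-op.
function maybeInject(flags, target, fault) {
  if (!chaosEnabled(flags, target)) return { injected: false };
  fault(target);
  return { injected: true };
}
```

Failing closed like this means forgetting to scope an experiment cannot accidentally widen it to production.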
