# Benchmark Suite Skill

## Overview
This skill provides automated performance testing: benchmark execution, regression detection, performance validation, and quality assessment, all aimed at ensuring optimal system performance.
## When to Use
- Running performance benchmark suites before deployment
- Detecting performance regressions between versions
- Validating SLA compliance through automated testing
- Load, stress, and endurance testing
- CI/CD pipeline performance quality gates
- Comparing performance across configurations
## Quick Start
```bash
# Run comprehensive benchmark suite
npx claude-flow benchmark-run --suite comprehensive --duration 300

# Execute specific benchmark
npx claude-flow benchmark-run --suite throughput --iterations 10

# Compare with baseline
npx claude-flow benchmark-compare --current <results> --baseline <baseline>

# Quality assessment
npx claude-flow quality-assess --target swarm-performance --criteria throughput,latency

# Performance validation
npx claude-flow validate-performance --results <file> --criteria <file>
```
## Architecture
```
+-----------------------------------------------------------+
|                      Benchmark Suite                      |
+-----------------------------------------------------------+
|  Benchmark Runner  |  Regression Detector  |  Validator   |
+--------------------+-----------------------+--------------+
          |                      |                  |
          v                      v                  v
+------------------+  +-------------------+  +--------------+
| Benchmark Types  |  | Detection Methods |  | Validation   |
| - Throughput     |  | - Statistical     |  | - SLA        |
| - Latency        |  | - ML-based        |  | - Regression |
| - Scalability    |  | - Threshold       |  | - Scalability|
| - Coordination   |  | - Trend Analysis  |  | - Reliability|
+------------------+  +-------------------+  +--------------+
          |                      |                  |
          v                      v                  v
+-----------------------------------------------------------+
|                   Reporter & Comparator                   |
+-----------------------------------------------------------+
```
## Benchmark Types

### Standard Benchmarks
| Benchmark | Metrics | Duration | Targets |
|---|---|---|---|
| Throughput | requests/sec, tasks/sec, messages/sec | 5 min | min: 1000, optimal: 5000 |
| Latency | p50, p90, p95, p99, max | 5 min | p50 < 100ms, p99 < 1s |
| Scalability | linear coefficient, efficiency retention | variable | coefficient > 0.8 |
| Coordination | message latency, sync time | 5 min | < 50ms |
| Fault Tolerance | recovery time, failover success | 10 min | < 30s recovery |
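As a rough sketch of how the latency percentiles in the table above can be derived from raw samples, here is a nearest-rank percentile helper. This is an illustrative method only; the suite's internal computation may differ.

```javascript
// Nearest-rank percentile over a list of latency samples (ms).
// Illustrative only; not the benchmark suite's actual implementation.
function percentile(samples, p) {
  if (samples.length === 0) throw new Error('no samples');
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}

// Summarize a sample set into the percentiles the table reports.
function latencySummary(samples) {
  return {
    p50: percentile(samples, 50),
    p90: percentile(samples, 90),
    p95: percentile(samples, 95),
    p99: percentile(samples, 99),
    max: Math.max(...samples),
  };
}
```

With samples 1..100 ms, `latencySummary` yields p50 = 50 and p99 = 99, matching the nearest-rank definition.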
### Test Campaign Types
- **Load Testing**: Gradual ramp-up to sustained load
- **Stress Testing**: Find breaking points
- **Volume Testing**: Large data set handling
- **Endurance Testing**: Long-duration stability
- **Spike Testing**: Sudden load changes
- **Configuration Testing**: Different settings comparison
## Core Capabilities

### Comprehensive Benchmarking
```javascript
// Run benchmark suite
const results = await benchmarkSuite.run({
  duration: 300000,      // 5 minutes
  iterations: 10,        // 10 iterations
  warmupTime: 30000,     // 30 seconds warmup
  cooldownTime: 10000,   // 10 seconds cooldown
  parallel: false,       // Sequential execution
  baseline: previousRun  // Compare with baseline
});

// Results include:
// - summary: Overall scores and status
// - detailed: Per-benchmark results
// - baseline_comparison: Delta from baseline
// - recommendations: Optimization suggestions
```
### Regression Detection
Multi-algorithm detection:
| Method | Description | Use Case |
|---|---|---|
| Statistical | CUSUM change-point detection | Detect gradual degradation |
| Machine Learning | Anomaly detection models | Identify unusual patterns |
| Threshold | Fixed limit comparisons | Hard performance limits |
| Trend | Time series regression | Long-term degradation |
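To illustrate the statistical method in the table, here is a minimal one-sided CUSUM detector for upward drift in a latency series. The `targetMean`, `slack`, and `threshold` tuning values are assumptions for this sketch, not claude-flow parameters.

```javascript
// One-sided CUSUM: accumulate excess over (targetMean + slack) and flag a
// change point once the cumulative sum crosses the threshold.
// Parameter names and tuning are illustrative assumptions.
function cusumDetect(series, targetMean, slack, threshold) {
  let s = 0;
  for (let i = 0; i < series.length; i++) {
    s = Math.max(0, s + (series[i] - targetMean - slack));
    if (s > threshold) return { detected: true, index: i };
  }
  return { detected: false, index: -1 };
}
```

A series that shifts from a 100 ms mean to 120 ms is flagged a few points after the shift, while a stable series accumulates nothing; the slack term is what makes CUSUM robust to noise yet sensitive to sustained drift.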
```bash
# Detect performance regressions
npx claude-flow detect-regression --current <results> --historical <data>

# Set up automated regression monitoring
npx claude-flow regression-monitor --enable --sensitivity 0.95
```
### Automated Performance Testing
```javascript
// Execute test campaign
const campaign = await tester.runTestCampaign({
  tests: [
    { type: 'load', config: loadTestConfig },
    { type: 'stress', config: stressTestConfig },
    { type: 'endurance', config: enduranceConfig }
  ],
  constraints: {
    maxDuration: 3600000, // 1 hour max
    failFast: true        // Stop on first failure
  }
});
```
### Performance Validation
Validation framework with multi-criteria assessment:
| Validation Type | Criteria |
|---|---|
| SLA Validation | Availability, response time, throughput, error rate |
| Regression Validation | Comparison with historical data |
| Scalability Validation | Linear scaling, efficiency retention |
| Reliability Validation | Error handling, recovery, consistency |
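A minimal sketch of multi-criteria validation: measured metrics are checked against `{ min, max }` bounds, mirroring the `successCriteria` shape used in the examples later in this document. The validator's real API may differ.

```javascript
// Validate measured metrics against { min, max } criteria.
// Criteria shape mirrors this document's successCriteria examples;
// illustrative only, not the skill's actual validator.
function validateMetrics(measured, criteria) {
  const failures = [];
  for (const [name, bound] of Object.entries(criteria)) {
    const value = measured[name];
    if (value === undefined) {
      failures.push(`${name}: missing`);
      continue;
    }
    if (bound.min !== undefined && value < bound.min)
      failures.push(`${name}: ${value} < min ${bound.min}`);
    if (bound.max !== undefined && value > bound.max)
      failures.push(`${name}: ${value} > max ${bound.max}`);
  }
  return { passed: failures.length === 0, failures };
}
```

Collecting every failure rather than stopping at the first makes the validation report actionable in one pass.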
## MCP Integration
```javascript
// Comprehensive benchmark integration
const benchmarkIntegration = {
  // Execute performance benchmarks
  async runBenchmarks(config = {}) {
    const [benchmark, metrics, trends, cost] = await Promise.all([
      mcp.benchmark_run({ suite: config.suite || 'comprehensive' }),
      mcp.metrics_collect({ components: ['system', 'agents', 'coordination'] }),
      mcp.trend_analysis({ metric: 'performance', period: '24h' }),
      mcp.cost_analysis({ timeframe: '24h' })
    ]);

    return { benchmark, metrics, trends, cost, timestamp: Date.now() };
  },

  // Quality assessment
  async assessQuality(criteria) {
    return await mcp.quality_assess({
      target: 'swarm-performance',
      criteria: criteria || ['throughput', 'latency', 'reliability', 'scalability']
    });
  }
};
```
## Key Metrics

### Benchmark Targets
```javascript
const benchmarkTargets = {
  throughput: {
    requests_per_second: { min: 1000, optimal: 5000 },
    tasks_per_second: { min: 100, optimal: 500 },
    messages_per_second: { min: 10000, optimal: 50000 }
  },
  latency: {
    p50: { max: 100 },   // 100ms
    p90: { max: 200 },   // 200ms
    p95: { max: 500 },   // 500ms
    p99: { max: 1000 },  // 1s
    max: { max: 5000 }   // 5s
  },
  scalability: {
    linear_coefficient: { min: 0.8 },
    efficiency_retention: { min: 0.7 }
  }
};
```
### CI/CD Quality Gates
| Gate | Criteria | Action on Failure |
|---|---|---|
| Performance | < 10% degradation | Block deployment |
| Latency | p99 < 1s | Warning |
| Error Rate | < 0.5% | Block deployment |
| Scalability | > 80% linear | Warning |
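The gate table distinguishes blocking failures from warnings. A hedged sketch of how such an evaluator might work (the gate predicates and metric field names here are assumptions, not the pipeline's actual API):

```javascript
// CI/CD quality gates: blocking gates fail the build, others only warn.
// Gate definitions mirror the table above; metric names are illustrative.
const gates = [
  { name: 'Performance', check: m => m.degradationPct < 10, action: 'block' },
  { name: 'Latency',     check: m => m.p99 < 1000,          action: 'warn' },
  { name: 'Error Rate',  check: m => m.errorRate < 0.005,   action: 'block' },
  { name: 'Scalability', check: m => m.linearCoefficient > 0.8, action: 'warn' },
];

// Returns whether deployment may proceed, plus which gates tripped.
function evaluateGates(metrics) {
  const blocked = [], warnings = [];
  for (const g of gates) {
    if (!g.check(metrics)) {
      (g.action === 'block' ? blocked : warnings).push(g.name);
    }
  }
  return { deployAllowed: blocked.length === 0, blocked, warnings };
}
```

Separating `block` from `warn` lets a slow-but-correct build ship with a flagged latency warning while a regression or error spike still stops the deployment.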
## Load Testing Example
```javascript
// Load test with gradual ramp-up
const loadTest = {
  type: 'load',
  phases: [
    { phase: 'ramp-up', duration: 60000, startLoad: 10, endLoad: 100 },
    { phase: 'sustained', duration: 300000, load: 100 },
    { phase: 'ramp-down', duration: 30000, startLoad: 100, endLoad: 0 }
  ],
  successCriteria: {
    p99_latency: { max: 1000 },
    error_rate: { max: 0.01 },
    throughput: { min: 80 } // % of expected
  }
};
```
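One way a runner might consume a phased config like the one above is to expand it into a per-second target-load schedule by linear interpolation. This helper is an illustrative assumption, not part of the claude-flow API.

```javascript
// Expand a phased load config into a per-second target-load schedule.
// Linearly interpolates between startLoad and endLoad within each phase;
// phases with a fixed `load` hold that value. Illustrative helper only.
function buildSchedule(phases) {
  const schedule = [];
  for (const p of phases) {
    const seconds = Math.floor(p.duration / 1000);
    const start = p.startLoad ?? p.load;
    const end = p.endLoad ?? p.load;
    for (let t = 0; t < seconds; t++) {
      const frac = seconds > 1 ? t / (seconds - 1) : 0;
      schedule.push(Math.round(start + (end - start) * frac));
    }
  }
  return schedule;
}
```

For example, a 3-second ramp from 0 to 100 followed by 2 seconds sustained at 100 expands to `[0, 50, 100, 100, 100]`.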
## Stress Testing Example
```javascript
// Stress test to find breaking point
const stressTest = {
  type: 'stress',
  startLoad: 100,
  maxLoad: 10000,
  loadIncrement: 100,
  duration: 60000, // Per load level
  breakingCriteria: {
    error_rate: { max: 0.05 },   // 5% errors
    latency_p99: { max: 5000 },  // 5s latency
    timeout_rate: { max: 0.10 }  // 10% timeouts
  }
};
```
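The step-load loop the config above implies can be sketched as follows. The caller-supplied `measure` function (load in, metrics out) and the return shape are assumptions for illustration, not the skill's actual interface.

```javascript
// Step load upward until the breaking criteria are exceeded.
// `measure` is a caller-supplied function: load -> { errorRate, p99 }.
// Criteria shape mirrors the stressTest config above; illustrative only.
function findBreakingPoint(measure, { startLoad, maxLoad, loadIncrement, breakingCriteria }) {
  for (let load = startLoad; load <= maxLoad; load += loadIncrement) {
    const m = measure(load);
    if (m.errorRate > breakingCriteria.error_rate.max ||
        m.p99 > breakingCriteria.latency_p99.max) {
      return { breakingLoad: load, metrics: m };
    }
  }
  // System survived every load level up to maxLoad.
  return { breakingLoad: null, metrics: null };
}
```

In practice each load level would run for the configured `duration` before measuring; here `measure` stands in for that whole sampling step.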
## Integration Points
| Integration | Purpose |
|---|---|
| Performance Monitor | Continuous monitoring data for benchmarking |
| Load Balancer | Validates load balancing effectiveness |
| Topology Optimizer | Tests topology configurations |
| CI/CD Pipeline | Automated quality gates |
## Best Practices
1. **Consistent Environment**: Run benchmarks in consistent, isolated environments
2. **Warmup Period**: Always include warmup to eliminate cold-start effects
3. **Multiple Iterations**: Run multiple iterations for statistical significance
4. **Baseline Maintenance**: Keep the baseline updated with expected performance
5. **Historical Tracking**: Store all benchmark results for trend analysis
6. **Realistic Workloads**: Use production-like workload patterns
## Example: CI/CD Integration
```bash
#!/bin/bash
# ci-performance-gate.sh

# Run benchmark suite
RESULTS=$(npx claude-flow benchmark-run --suite quick --output json)

# Compare with baseline
COMPARISON=$(npx claude-flow benchmark-compare \
  --current "$RESULTS" \
  --baseline ./baseline.json)

# Check for regressions
if echo "$COMPARISON" | jq -e '.regression_detected == true' > /dev/null; then
  echo "Performance regression detected!"
  echo "$COMPARISON" | jq '.regressions'
  exit 1
fi

echo "Performance validation passed"
exit 0
```
## Related Skills
- `optimization-monitor`: Real-time performance monitoring
- `optimization-analyzer`: Bottleneck analysis and reporting
- `optimization-load-balancer`: Load distribution optimization
- `optimization-topology`: Topology performance testing
## Version History

- **1.0.0** (2026-01-02): Initial release. Converted from the benchmark-suite agent, with comprehensive benchmarking, regression detection, automated testing, and performance validation.