# Benchmark Suite Skill

## Overview
This skill provides automated performance testing: benchmark execution, regression detection, performance validation, and quality assessment, all aimed at ensuring optimal system performance.
## When to Use
- Running performance benchmark suites before deployment
- Detecting performance regressions between versions
- Validating SLA compliance through automated testing
- Load, stress, and endurance testing
- CI/CD pipeline performance quality gates
- Comparing performance across configurations
## Quick Start
```bash
# Run comprehensive benchmark suite
npx claude-flow benchmark-run --suite comprehensive --duration 300

# Execute specific benchmark
npx claude-flow benchmark-run --suite throughput --iterations 10

# Compare with baseline
npx claude-flow benchmark-compare --current <results> --baseline <baseline>

# Quality assessment
npx claude-flow quality-assess --target swarm-performance --criteria throughput,latency

# Performance validation
npx claude-flow validate-performance --results <file> --criteria <file>
```
## Architecture
```
+-----------------------------------------------------------+
|                      Benchmark Suite                      |
+-----------------------------------------------------------+
|  Benchmark Runner  |  Regression Detector  |  Validator   |
+--------------------+-----------------------+--------------+
          |                      |                  |
          v                      v                  v
+------------------+  +-------------------+  +--------------+
| Benchmark Types  |  | Detection Methods |  | Validation   |
| - Throughput     |  | - Statistical     |  | - SLA        |
| - Latency        |  | - ML-based        |  | - Regression |
| - Scalability    |  | - Threshold       |  | - Scalability|
| - Coordination   |  | - Trend Analysis  |  | - Reliability|
+------------------+  +-------------------+  +--------------+
          |                      |                  |
          v                      v                  v
+-----------------------------------------------------------+
|                   Reporter & Comparator                   |
+-----------------------------------------------------------+
```
## Benchmark Types

### Standard Benchmarks
| Benchmark | Metrics | Duration | Targets |
|---|---|---|---|
| Throughput | requests/sec, tasks/sec, messages/sec | 5 min | min: 1000, optimal: 5000 |
| Latency | p50, p90, p95, p99, max | 5 min | p50 < 100ms, p99 < 1s |
| Scalability | linear coefficient, efficiency retention | variable | coefficient > 0.8 |
| Coordination | message latency, sync time | 5 min | < 50ms |
| Fault Tolerance | recovery time, failover success | 10 min | < 30s recovery |
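As a rough sketch of how the latency percentiles in the table above can be derived from raw samples, here is a nearest-rank percentile helper. This is an illustrative method only; the suite's internal computation may differ.

```javascript
// Nearest-rank percentile over a list of latency samples (ms).
// Illustrative only; not the benchmark suite's actual implementation.
function percentile(samples, p) {
  if (samples.length === 0) throw new Error('no samples');
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}

// Summarize a sample set into the percentiles the table reports.
function latencySummary(samples) {
  return {
    p50: percentile(samples, 50),
    p90: percentile(samples, 90),
    p95: percentile(samples, 95),
    p99: percentile(samples, 99),
    max: Math.max(...samples),
  };
}
```

With samples 1..100 ms, `latencySummary` yields p50 = 50 and p99 = 99, matching the nearest-rank definition.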
### Test Campaign Types
- **Load Testing**: Gradual ramp-up to sustained load
- **Stress Testing**: Find breaking points
- **Volume Testing**: Large data set handling
- **Endurance Testing**: Long-duration stability
- **Spike Testing**: Sudden load changes
- **Configuration Testing**: Different settings comparison
## Core Capabilities

### Comprehensive Benchmarking
```javascript
// Run benchmark suite
const results = await benchmarkSuite.run({
  duration: 300000,      // 5 minutes
  iterations: 10,        // 10 iterations
  warmupTime: 30000,     // 30 seconds warmup
  cooldownTime: 10000,   // 10 seconds cooldown
  parallel: false,       // Sequential execution
  baseline: previousRun  // Compare with baseline
});

// Results include:
// - summary: Overall scores and status
// - detailed: Per-benchmark results
// - baseline_comparison: Delta from baseline
// - recommendations: Optimization suggestions
```
### Regression Detection
Multi-algorithm detection:
| Method | Description | Use Case |
|---|---|---|
| Statistical | CUSUM change-point detection | Detect gradual degradation |
| Machine Learning | Anomaly detection models | Identify unusual patterns |
| Threshold | Fixed limit comparisons | Hard performance limits |
| Trend | Time series regression | Long-term degradation |
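To illustrate the statistical method in the table, here is a minimal one-sided CUSUM detector for upward drift in a latency series. The `targetMean`, `slack`, and `threshold` tuning values are assumptions for this sketch, not claude-flow parameters.

```javascript
// One-sided CUSUM: accumulate excess over (targetMean + slack) and flag a
// change point once the cumulative sum crosses the threshold.
// Parameter names and tuning are illustrative assumptions.
function cusumDetect(series, targetMean, slack, threshold) {
  let s = 0;
  for (let i = 0; i < series.length; i++) {
    s = Math.max(0, s + (series[i] - targetMean - slack));
    if (s > threshold) return { detected: true, index: i };
  }
  return { detected: false, index: -1 };
}
```

A series that shifts from a 100 ms mean to 120 ms is flagged a few points after the shift, while a stable series accumulates nothing; the slack term is what makes CUSUM robust to noise yet sensitive to sustained drift.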
```bash
# Detect performance regressions
npx claude-flow detect-regression --current <results> --historical <data>

# Set up automated regression monitoring
npx claude-flow regression-monitor --enable --sensitivity 0.95
```
### Automated Performance Testing
```javascript
// Execute test campaign
const campaign = await tester.runTestCampaign({
  tests: [
    { type: 'load', config: loadTestConfig },
    { type: 'stress', config: stressTestConfig },
    { type: 'endurance', config: enduranceConfig }
  ],
  constraints: {
    maxDuration: 3600000, // 1 hour max
    failFast: true        // Stop on first failure
  }
});
```
### Performance Validation
Validation framework with multi-criteria assessment:
| Validation Type | Criteria |
|---|---|
| SLA Validation | Availability, response time, throughput, error rate |
| Regression Validation | Comparison with historical data |
| Scalability Validation | Linear scaling, efficiency retention |
| Reliability Validation | Error handling, recovery, consistency |
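A minimal sketch of multi-criteria validation: measured metrics are checked against `{ min, max }` bounds, mirroring the `successCriteria` shape used in the examples later in this document. The validator's real API may differ.

```javascript
// Validate measured metrics against { min, max } criteria.
// Criteria shape mirrors this document's successCriteria examples;
// illustrative only, not the skill's actual validator.
function validateMetrics(measured, criteria) {
  const failures = [];
  for (const [name, bound] of Object.entries(criteria)) {
    const value = measured[name];
    if (value === undefined) {
      failures.push(`${name}: missing`);
      continue;
    }
    if (bound.min !== undefined && value < bound.min)
      failures.push(`${name}: ${value} < min ${bound.min}`);
    if (bound.max !== undefined && value > bound.max)
      failures.push(`${name}: ${value} > max ${bound.max}`);
  }
  return { passed: failures.length === 0, failures };
}
```

Collecting every failure rather than stopping at the first makes the validation report actionable in one pass.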
## MCP Integration
```javascript
// Comprehensive benchmark integration
const benchmarkIntegration = {
  // Execute performance benchmarks
  async runBenchmarks(config = {}) {
    const [benchmark, metrics, trends, cost] = await Promise.all([
      mcp.benchmark_run({ suite: config.suite || 'comprehensive' }),
      mcp.metrics_collect({ components: ['system', 'agents', 'coordination'] }),
      mcp.trend_analysis({ metric: 'performance', period: '24h' }),
      mcp.cost_analysis({ timeframe: '24h' })
    ]);

    return { benchmark, metrics, trends, cost, timestamp: Date.now() };
  },

  // Quality assessment
  async assessQuality(criteria) {
    return await mcp.quality_assess({
      target: 'swarm-performance',
      criteria: criteria || ['throughput', 'latency', 'reliability', 'scalability']
    });
  }
};
```
## Key Metrics

### Benchmark Targets
```javascript
const benchmarkTargets = {
  throughput: {
    requests_per_second: { min: 1000, optimal: 5000 },
    tasks_per_second: { min: 100, optimal: 500 },
    messages_per_second: { min: 10000, optimal: 50000 }
  },
  latency: {
    p50: { max: 100 },   // 100ms
    p90: { max: 200 },   // 200ms
    p95: { max: 500 },   // 500ms
    p99: { max: 1000 },  // 1s
    max: { max: 5000 }   // 5s
  },
  scalability: {
    linear_coefficient: { min: 0.8 },
    efficiency_retention: { min: 0.7 }
  }
};
```
### CI/CD Quality Gates
| Gate | Criteria | Action on Failure |
|---|---|---|
| Performance | < 10% degradation | Block deployment |
| Latency | p99 < 1s | Warning |
| Error Rate | < 0.5% | Block deployment |
| Scalability | > 80% linear | Warning |
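The gate table distinguishes blocking failures from warnings. A hedged sketch of how such an evaluator might work (the gate predicates and metric field names here are assumptions, not the pipeline's actual API):

```javascript
// CI/CD quality gates: blocking gates fail the build, others only warn.
// Gate definitions mirror the table above; metric names are illustrative.
const gates = [
  { name: 'Performance', check: m => m.degradationPct < 10, action: 'block' },
  { name: 'Latency',     check: m => m.p99 < 1000,          action: 'warn' },
  { name: 'Error Rate',  check: m => m.errorRate < 0.005,   action: 'block' },
  { name: 'Scalability', check: m => m.linearCoefficient > 0.8, action: 'warn' },
];

// Returns whether deployment may proceed, plus which gates tripped.
function evaluateGates(metrics) {
  const blocked = [], warnings = [];
  for (const g of gates) {
    if (!g.check(metrics)) {
      (g.action === 'block' ? blocked : warnings).push(g.name);
    }
  }
  return { deployAllowed: blocked.length === 0, blocked, warnings };
}
```

Separating `block` from `warn` lets a slow-but-correct build ship with a flagged latency warning while a regression or error spike still stops the deployment.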
## Load Testing Example
```javascript
// Load test with gradual ramp-up
const loadTest = {
  type: 'load',
  phases: [
    { phase: 'ramp-up', duration: 60000, startLoad: 10, endLoad: 100 },
    { phase: 'sustained', duration: 300000, load: 100 },
    { phase: 'ramp-down', duration: 30000, startLoad: 100, endLoad: 0 }
  ],
  successCriteria: {
    p99_latency: { max: 1000 },
    error_rate: { max: 0.01 },
    throughput: { min: 80 } // % of expected
  }
};
```
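One way a runner might consume a phased config like the one above is to expand it into a per-second target-load schedule by linear interpolation. This helper is an illustrative assumption, not part of the claude-flow API.

```javascript
// Expand a phased load config into a per-second target-load schedule.
// Linearly interpolates between startLoad and endLoad within each phase;
// phases with a fixed `load` hold that value. Illustrative helper only.
function buildSchedule(phases) {
  const schedule = [];
  for (const p of phases) {
    const seconds = Math.floor(p.duration / 1000);
    const start = p.startLoad ?? p.load;
    const end = p.endLoad ?? p.load;
    for (let t = 0; t < seconds; t++) {
      const frac = seconds > 1 ? t / (seconds - 1) : 0;
      schedule.push(Math.round(start + (end - start) * frac));
    }
  }
  return schedule;
}
```

For example, a 3-second ramp from 0 to 100 followed by 2 seconds sustained at 100 expands to `[0, 50, 100, 100, 100]`.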
## Stress Testing Example
```javascript
// Stress test to find breaking point
const stressTest = {
  type: 'stress',
  startLoad: 100,
  maxLoad: 10000,
  loadIncrement: 100,
  duration: 60000, // Per load level
  breakingCriteria: {
    error_rate: { max: 0.05 },   // 5% errors
    latency_p99: { max: 5000 },  // 5s latency
    timeout_rate: { max: 0.10 }  // 10% timeouts
  }
};
```
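The step-load loop the config above implies can be sketched as follows. The caller-supplied `measure` function (load in, metrics out) and the return shape are assumptions for illustration, not the skill's actual interface.

```javascript
// Step load upward until the breaking criteria are exceeded.
// `measure` is a caller-supplied function: load -> { errorRate, p99 }.
// Criteria shape mirrors the stressTest config above; illustrative only.
function findBreakingPoint(measure, { startLoad, maxLoad, loadIncrement, breakingCriteria }) {
  for (let load = startLoad; load <= maxLoad; load += loadIncrement) {
    const m = measure(load);
    if (m.errorRate > breakingCriteria.error_rate.max ||
        m.p99 > breakingCriteria.latency_p99.max) {
      return { breakingLoad: load, metrics: m };
    }
  }
  // System survived every load level up to maxLoad.
  return { breakingLoad: null, metrics: null };
}
```

In practice each load level would run for the configured `duration` before measuring; here `measure` stands in for that whole sampling step.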
## Integration Points
| Integration | Purpose |
|---|---|
| Performance Monitor | Continuous monitoring data for benchmarking |
| Load Balancer | Validates load balancing effectiveness |
| Topology Optimizer | Tests topology configurations |
| CI/CD Pipeline | Automated quality gates |
## Best Practices
1. **Consistent Environment**: Run benchmarks in consistent, isolated environments
2. **Warmup Period**: Always include warmup to eliminate cold-start effects
3. **Multiple Iterations**: Run multiple iterations for statistical significance
4. **Baseline Maintenance**: Keep the baseline updated with expected performance
5. **Historical Tracking**: Store all benchmark results for trend analysis
6. **Realistic Workloads**: Use production-like workload patterns
## Example: CI/CD Integration
```bash
#!/bin/bash
# ci-performance-gate.sh

# Run benchmark suite
RESULTS=$(npx claude-flow benchmark-run --suite quick --output json)

# Compare with baseline
COMPARISON=$(npx claude-flow benchmark-compare \
  --current "$RESULTS" \
  --baseline ./baseline.json)

# Check for regressions
if echo "$COMPARISON" | jq -e '.regression_detected == true' > /dev/null; then
  echo "Performance regression detected!"
  echo "$COMPARISON" | jq '.regressions'
  exit 1
fi

echo "Performance validation passed"
exit 0
```
## Related Skills
- `optimization-monitor`: Real-time performance monitoring
- `optimization-analyzer`: Bottleneck analysis and reporting
- `optimization-load-balancer`: Load distribution optimization
- `optimization-topology`: Topology performance testing
## Version History

- **1.0.0** (2026-01-02): Initial release. Converted from the benchmark-suite agent, with comprehensive benchmarking, regression detection, automated testing, and performance validation.