Performance Monitor Skill

Overview

This skill provides real-time performance monitoring for swarm-based systems: metrics collection, bottleneck detection, SLA compliance tracking, anomaly detection, and resource utilization monitoring.

When to Use

Continuously monitoring swarm performance
Detecting performance bottlenecks before they impact operations
Tracking SLA compliance and generating alerts
Detecting anomalies in system metrics
Tracking and forecasting resource utilization
Building real-time performance dashboards

Quick Start

Start comprehensive monitoring

npx claude-flow performance-report --format detailed --timeframe 24h

Real-time bottleneck analysis

npx claude-flow bottleneck-analyze --component swarm-coordination

Health check all components

npx claude-flow health-check --components '["swarm", "agents", "coordination"]'

Collect specific metrics

npx claude-flow metrics-collect --components '["cpu", "memory", "network"]'

Architecture

Core Capabilities

Multi-Dimensional Metrics Collection

// Real-time metrics collection
const metrics = await mcp.metrics_collect({
  components: ['cpu', 'memory', 'network', 'agents']
});

// System metrics include:
// - CPU: usage, load average, core utilization
// - Memory: usage, available, pressure
// - I/O: disk usage, disk I/O, network I/O
// - Processes: count, threads, handles

Bottleneck Detection

Detects and categorizes bottlenecks:

CPU Bottlenecks: High CPU usage, core saturation
Memory Bottlenecks: Memory pressure, leak detection
I/O Bottlenecks: Disk saturation, network congestion
Coordination Bottlenecks: Agent communication delays
Task Queue Bottlenecks: Queue backup, processing delays

Analyze specific component

npx claude-flow bottleneck-analyze --component task-queue

Full system analysis

npx claude-flow bottleneck-analyze
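
To make the categories above concrete, here is a minimal sketch of threshold-based bottleneck classification. It does not reproduce how `bottleneck-analyze` works internally. The metric names (`cpuUsage`, `agentMessageDelayMs`, and so on) and every limit are hypothetical values chosen for illustration.

```javascript
// Hypothetical rules: metric names and limits are illustrative assumptions,
// not values taken from claude-flow.
const BOTTLENECK_RULES = [
  { category: 'cpu', metric: 'cpuUsage', limit: 0.85 },
  { category: 'memory', metric: 'memoryPressure', limit: 0.8 },
  { category: 'io', metric: 'diskUtilization', limit: 0.9 },
  { category: 'coordination', metric: 'agentMessageDelayMs', limit: 500 },
  { category: 'task-queue', metric: 'queueDepth', limit: 1000 },
];

// Return one finding per rule whose metric exceeds its limit,
// ordered by how far over the limit the metric is (worst first).
function detectBottlenecks(snapshot, rules = BOTTLENECK_RULES) {
  return rules
    .filter((rule) => typeof snapshot[rule.metric] === 'number')
    .filter((rule) => snapshot[rule.metric] > rule.limit)
    .map((rule) => ({
      category: rule.category,
      metric: rule.metric,
      value: snapshot[rule.metric],
      limit: rule.limit,
      ratio: snapshot[rule.metric] / rule.limit,
    }))
    .sort((a, b) => b.ratio - a.ratio);
}

const findings = detectBottlenecks({
  cpuUsage: 0.92,
  memoryPressure: 0.4,
  diskUtilization: 0.5,
  agentMessageDelayMs: 1200,
  queueDepth: 150,
});
console.log(findings.map((f) => f.category));
```

Sorting by the ratio of value to limit gives a rough priority order across metrics with different units. A production system would likely weight categories differently.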

SLA Monitoring

Configure and monitor SLA metrics:

| Metric | Description | Typical Threshold |
|---|---|---|
| Availability | System uptime percentage | 99.9% |
| Response Time | Request latency | < 1000ms |
| Throughput | Requests per second | > 100 RPS |
| Error Rate | Failed requests percentage | < 0.1% |
| Recovery Time | Time to recover from failure | < 300s |
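
As an illustration, the thresholds above could be checked against a window of measurements along these lines. This is a sketch only. The field names (`availabilityPct`, `responseTimeMs`, and so on) are hypothetical, and treating availability and throughput as minimums is an assumption about how the thresholds are meant to be read.

```javascript
// Thresholds taken from the SLA table; field names are hypothetical.
const SLA_TARGETS = [
  { metric: 'availabilityPct', expected: '>= 99.9', ok: (v) => v >= 99.9 },
  { metric: 'responseTimeMs', expected: '< 1000', ok: (v) => v < 1000 },
  { metric: 'throughputRps', expected: '> 100', ok: (v) => v > 100 },
  { metric: 'errorRatePct', expected: '< 0.1', ok: (v) => v < 0.1 },
  { metric: 'recoveryTimeSec', expected: '< 300', ok: (v) => v < 300 },
];

// Compare measured values with each target and list the violations.
// Metrics missing from `measured` are skipped rather than failed.
function checkSla(measured, targets = SLA_TARGETS) {
  const violations = [];
  for (const target of targets) {
    const value = measured[target.metric];
    if (value === undefined) continue;
    if (!target.ok(value)) {
      violations.push({ metric: target.metric, value, expected: target.expected });
    }
  }
  return { compliant: violations.length === 0, violations };
}

const slaReport = checkSla({
  availabilityPct: 99.95,
  responseTimeMs: 1200,
  throughputRps: 250,
  errorRatePct: 0.02,
  recoveryTimeSec: 120,
});
console.log(slaReport);
```

Each violation carries the expected bound, so an alert message can state both the measured value and the target it missed.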

Anomaly Detection

Multi-model anomaly detection:

Statistical: 3-sigma rule deviation detection
Machine Learning: Trained anomaly detection models
Time Series: LSTM-based temporal anomaly detection
Behavioral: Agent behavior pattern analysis
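
The statistical model is the simplest of the four to illustrate. The sketch below applies the 3-sigma rule to a baseline of samples, using the population standard deviation. The function names and sample values are illustrative and are not part of the skill's API.

```javascript
// Arithmetic mean of a non-empty array of numbers.
function mean(values) {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Population standard deviation (divides by n, not n - 1).
function stdDev(values) {
  const m = mean(values);
  const variance = values.reduce((sum, v) => sum + (v - m) ** 2, 0) / values.length;
  return Math.sqrt(variance);
}

// Flag a sample that lies more than `sigmas` standard deviations
// away from the baseline mean.
function isAnomaly(baseline, sample, sigmas = 3) {
  const m = mean(baseline);
  const sd = stdDev(baseline);
  if (sd === 0) return sample !== m; // flat baseline: any change is a deviation
  return Math.abs(sample - m) > sigmas * sd;
}

// Illustrative response times in ms (mean 40, standard deviation about 1.73).
const baselineMs = [40, 42, 38, 41, 39, 40, 43, 37, 41, 39];
console.log(isAnomaly(baselineMs, 41));
console.log(isAnomaly(baselineMs, 120));
```

The rule assumes roughly normally distributed samples. Skewed metrics such as latency often need a transform or a percentile-based bound instead.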

Key Metrics

KPIs Monitored

| Category | Metrics |
|---|---|
| Availability | Uptime, MTBF, MTTR |
| Performance | Response time (p50/p90/p95/p99), throughput |
| Efficiency | Resource utilization, cost per transaction |
| Reliability | Error rate, success rate, fault tolerance |
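
Two of these KPIs reduce to short formulas. The sketch below computes response-time percentiles with the nearest-rank method (one of several common percentile definitions) and steady-state availability as MTBF / (MTBF + MTTR). The function names and sample data are illustrative.

```javascript
// Nearest-rank percentile: the smallest sample such that at least
// p percent of all samples are less than or equal to it.
function percentile(samples, p) {
  if (samples.length === 0) throw new Error('percentile requires at least one sample');
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100));
  return sorted[rank - 1];
}

// Steady-state availability from mean time between failures (MTBF)
// and mean time to repair (MTTR), both in the same unit.
function availability(mtbf, mttr) {
  return mtbf / (mtbf + mttr);
}

// Illustrative response times in ms.
const responseTimesMs = [12, 15, 20, 22, 25, 30, 35, 40, 80, 200];
const latency = {
  p50: percentile(responseTimesMs, 50),
  p90: percentile(responseTimesMs, 90),
  p99: percentile(responseTimesMs, 99),
};
console.log(latency); // { p50: 25, p90: 80, p99: 200 }
console.log(availability(999, 1)); // 0.999
```

With only ten samples the p99 is simply the maximum. High percentiles become meaningful once the window holds hundreds of samples or more.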

Resource Tracking

CPU: Current, peak, average utilization with percentiles
Memory: Usage trends, leak detection, pressure indicators
Network: Bandwidth utilization, latency, packet loss
Agents: Per-agent efficiency, responsiveness, reliability
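
A minimal sketch of the kind of summary this implies: current, peak, and average values for a window of samples, plus a least-squares slope as a crude trend indicator. A steadily positive slope in memory usage is one possible leak signal. The function names, the sampling interval, and the data are assumptions for illustration.

```javascript
// Current (latest), peak, and average value of a non-empty sample window.
function summarize(samples) {
  const total = samples.reduce((sum, v) => sum + v, 0);
  return {
    current: samples[samples.length - 1],
    peak: Math.max(...samples),
    average: total / samples.length,
  };
}

// Least-squares slope of value against sample index,
// i.e. the average change per sampling interval.
function trendSlope(samples) {
  const n = samples.length;
  const meanX = (n - 1) / 2;
  const meanY = samples.reduce((sum, v) => sum + v, 0) / n;
  let numerator = 0;
  let denominator = 0;
  samples.forEach((y, x) => {
    numerator += (x - meanX) * (y - meanY);
    denominator += (x - meanX) ** 2;
  });
  return denominator === 0 ? 0 : numerator / denominator;
}

// Illustrative memory usage in MB, one sample per interval.
const memoryMb = [500, 510, 520, 530, 540];
console.log(summarize(memoryMb)); // { current: 540, peak: 540, average: 520 }
console.log(trendSlope(memoryMb)); // 10 (MB per interval)
```

Multiplying the slope by a number of future intervals gives a naive linear forecast, which is usually enough for a first capacity estimate.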

MCP Integration

// Comprehensive monitoring setup
const monitoring = {
  // Start all monitors in parallel
  async startMonitoring() {
    const [health, performance, bottlenecks] = await Promise.all([
      mcp.health_check({ components: ['swarm', 'coordination'] }),
      mcp.performance_report({ format: 'detailed', timeframe: '24h' }),
      mcp.bottleneck_analyze({})
    ]);

    return { health, performance, bottlenecks };
  },

  // Agent performance tracking: collect metrics for each agent in the swarm
  async monitorAgents(swarmId) {
    const agents = await mcp.agent_list({ swarmId });
    const metrics = new Map();

    for (const agent of agents) {
      metrics.set(agent.id, await mcp.agent_metrics({ agentId: agent.id }));
    }

    return metrics;
  }
};

Alert Configuration

Configure performance alerts

npx claude-flow alert-config --metric cpu_usage --threshold 80 --severity warning

Set up anomaly detection

npx claude-flow anomaly-setup --models '["statistical", "ml", "time_series"]'

Configure notification channels

npx claude-flow notification-config --channels '["slack", "email", "webhook"]'

Integration Points

| Integration | Purpose |
|---|---|
| Load Balancer | Provides performance data for load balancing decisions |
| Topology Optimizer | Supplies network and coordination metrics |
| Resource Allocator | Shares resource utilization and forecasting data |
| Task Orchestrator | Monitors task execution performance |

Best Practices

Baseline Establishment: Collect baseline metrics before monitoring for anomalies
Alert Tuning: Start with conservative thresholds, tune based on false positive rate
Multi-Layer Monitoring: Monitor at system, agent, and task levels
Historical Analysis: Retain metrics for trend analysis and capacity planning
Proactive Detection: Use predictive analytics to detect issues before impact
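
The alert-tuning practice can be sketched as a small feedback step: review past alerts, measure what share were false alarms, and raise the threshold when that share is too high. This assumes a hypothetical alert record with an `actionable` flag set during review. The 20% tolerance, the step size, and the ceiling are illustrative defaults, not recommendations from the skill.

```javascript
// Share of reviewed alerts that turned out not to need action.
function falseAlarmShare(alerts) {
  if (alerts.length === 0) return 0;
  const falseAlarms = alerts.filter((alert) => !alert.actionable).length;
  return falseAlarms / alerts.length;
}

// Raise the threshold by one step when too many alerts were false alarms;
// otherwise keep it unchanged. All defaults are illustrative.
function suggestThreshold(current, alerts, options = {}) {
  const { maxFalseShare = 0.2, step = 5, ceiling = 95 } = options;
  if (falseAlarmShare(alerts) <= maxFalseShare) return current;
  return Math.min(current + step, ceiling);
}

// Hypothetical review of four cpu_usage alerts fired at a threshold of 80.
const reviewedAlerts = [
  { metric: 'cpu_usage', actionable: false },
  { metric: 'cpu_usage', actionable: false },
  { metric: 'cpu_usage', actionable: true },
  { metric: 'cpu_usage', actionable: false },
];
console.log(falseAlarmShare(reviewedAlerts)); // 0.75
console.log(suggestThreshold(80, reviewedAlerts)); // 85
```

Adjusting in small steps with a ceiling keeps a noisy review period from pushing the threshold so high that real incidents go unreported.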

Example: Dashboard Data Provider

// Real-time dashboard data
const dashboardData = {
  overview: {
    swarmHealth: 'healthy',
    activeAgents: 12,
    totalTasks: 1547,
    averageResponseTime: 45 // ms
  },
  performance: {
    throughput: 250, // tasks/sec
    latency: { p50: 40, p90: 85, p99: 120 }, // ms
    errorRate: 0.02, // percentage
    utilization: 0.72 // fraction of capacity (0.72 = 72%)
  },
  alerts: [],
  timestamp: Date.now()
};

Related Skills

optimization-benchmark: Comprehensive performance benchmarking
optimization-load-balancer: Dynamic load distribution
optimization-resources: Resource allocation and scaling
optimization-topology: Network topology optimization

Version History

1.0.0 (2026-01-02): Initial release - converted from performance-monitor agent with metrics collection, bottleneck detection, SLA monitoring, anomaly detection, and dashboard integration

optimization-monitor
