Monitoring - Complete API Reference
Monitor system health, track errors, and receive alerts when issues occur.
Chat Commands
Service Control
/monitor start     # Start monitoring
/monitor stop      # Stop monitoring
/monitor status    # Check monitoring status
Health Checks
/monitor health              # Run health check
/monitor health --verbose    # Detailed health info
/monitor providers           # Check LLM provider status
Alerts
/monitor alerts                             # View recent alerts
/monitor alerts --unread                    # Unread alerts only
/monitor alert-targets                      # View alert destinations
/monitor alert-targets add email <addr>     # Add email target
/monitor alert-targets add webhook <url>    # Add webhook target
/monitor alert-targets remove <id>          # Remove target
Configuration
/monitor config                 # View config
/monitor cooldown 300           # Set alert cooldown (seconds)
/monitor threshold cpu 80       # Set CPU alert threshold
/monitor threshold memory 90    # Set memory threshold
TypeScript API Reference
Create Monitoring Service
import { createMonitoringService } from 'clodds/monitoring';

const monitor = createMonitoringService({
  // Health check interval
  intervalMs: 60000, // 1 minute

  // Alert targets
  alertTargets: [
    { type: 'email', address: 'alerts@example.com' },
    { type: 'webhook', url: 'https://hooks.example.com/alerts' },
  ],

  // Alert cooldown (prevent spam)
  alertCooldownMs: 300000, // 5 minutes

  // Thresholds
  thresholds: {
    cpu: 80,       // Alert at 80% CPU
    memory: 90,    // Alert at 90% memory
    errorRate: 10, // Alert at 10% error rate
  },
});
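The alertCooldownMs option is described as preventing alert spam, but how the service applies it is not shown here. The following is a minimal sketch of one common approach, assuming repeat alerts are suppressed per alert key until the cooldown window has elapsed. CooldownGate and shouldSend are hypothetical names used for illustration; they are not part of clodds/monitoring.

```typescript
// Hypothetical helper, not part of clodds/monitoring: suppresses repeat
// alerts that share a key until the cooldown window has elapsed.
class CooldownGate {
  private lastSent = new Map<string, number>();
  private cooldownMs: number;

  constructor(cooldownMs: number) {
    this.cooldownMs = cooldownMs;
  }

  // Returns true if an alert with this key may be sent at time `now` (ms).
  shouldSend(key: string, now: number = Date.now()): boolean {
    const last = this.lastSent.get(key);
    if (last !== undefined && now - last < this.cooldownMs) {
      return false; // still inside the cooldown window
    }
    this.lastSent.set(key, now);
    return true;
  }
}

const gate = new CooldownGate(300000); // 5 minutes, as in the config above
const first = gate.shouldSend('high_cpu', 0);       // true: nothing sent yet
const repeat = gate.shouldSend('high_cpu', 60000);  // false: only 1 minute later
const later = gate.shouldSend('high_cpu', 300000);  // true: window has elapsed
```

Keying the gate by alert type means a high_cpu alert in its cooldown does not hide a separate high_memory alert. A real implementation might key on finer detail, such as the provider name.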
Start/Stop Monitoring
// Start monitoring
await monitor.start();

// Check if running
const isRunning = monitor.isRunning();

// Stop monitoring
await monitor.stop();
Health Checks
// Run health check
const health = await monitor.runHealthCheck();

console.log(`Overall: ${health.status}`); // 'healthy' | 'degraded' | 'unhealthy'

console.log('\nSystem:');
console.log(`  CPU: ${health.system.cpu}%`);
console.log(`  Memory: ${health.system.memory}%`);
console.log(`  Disk: ${health.system.disk}%`);

console.log('\nProviders:');
for (const [name, status] of Object.entries(health.providers)) {
  console.log(`  ${name}: ${status.status} (${status.latencyMs}ms)`);
}

console.log('\nServices:');
for (const [name, status] of Object.entries(health.services)) {
  console.log(`  ${name}: ${status.status}`);
}
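A health result can also be filtered programmatically rather than only printed. The sketch below picks out providers whose latency exceeds a limit. The HealthReport interface is inferred from the fields used in the example above and is an assumption, not the library's published type. slowProviders and the sample data are hypothetical.

```typescript
// Shape inferred from the fields logged above; an assumption, not the
// library's published type.
interface ProviderHealth {
  status: string;
  latencyMs: number;
}

interface HealthReport {
  status: 'healthy' | 'degraded' | 'unhealthy';
  system: { cpu: number; memory: number; disk: number };
  providers: Record<string, ProviderHealth>;
}

// Hypothetical helper: names of providers slower than maxLatencyMs,
// sorted slowest first.
function slowProviders(health: HealthReport, maxLatencyMs: number): string[] {
  return Object.entries(health.providers)
    .filter(([, p]) => p.latencyMs > maxLatencyMs)
    .sort(([, a], [, b]) => b.latencyMs - a.latencyMs)
    .map(([name]) => name);
}

// Sample data for illustration only.
const sample: HealthReport = {
  status: 'degraded',
  system: { cpu: 42, memory: 61, disk: 70 },
  providers: {
    alpha: { status: 'healthy', latencyMs: 120 },
    beta: { status: 'healthy', latencyMs: 950 },
    gamma: { status: 'healthy', latencyMs: 2400 },
  },
};

const slow = slowProviders(sample, 500); // ['gamma', 'beta']
```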
Provider Health
// Check LLM provider status
const providers = await monitor.checkProviders();

for (const provider of providers) {
  console.log(`${provider.name}:`);
  console.log(`  Status: ${provider.status}`);
  console.log(`  Latency: ${provider.latencyMs}ms`);
  console.log(`  Last error: ${provider.lastError || 'none'}`);
  console.log(`  Error rate: ${provider.errorRate}%`);
}
Alert Management
// Get recent alerts
const alerts = await monitor.getAlerts({ limit: 10 });

for (const alert of alerts) {
  console.log(`[${alert.severity}] ${alert.title}`);
  console.log(`  ${alert.message}`);
  console.log(`  Time: ${alert.timestamp}`);
  console.log(`  Acknowledged: ${alert.acknowledged}`);
}

// Acknowledge alert
await monitor.acknowledgeAlert(alertId);

// Get unread count
const unread = await monitor.getUnreadAlertCount();
Alert Targets
// Add alert target
await monitor.addAlertTarget({
  type: 'email',
  address: 'team@example.com',
});

await monitor.addAlertTarget({
  type: 'webhook',
  url: 'https://hooks.slack.com/...',
});

// List targets
const targets = monitor.getAlertTargets();

// Remove target
await monitor.removeAlertTarget(targetId);
Event Handlers
// Listen for events
monitor.on('alert', (alert) => {
  console.log(`🚨 Alert: ${alert.title}`);
});

monitor.on('healthCheck', (health) => {
  if (health.status !== 'healthy') {
    console.log(`⚠️ System ${health.status}`);
  }
});

monitor.on('providerDown', (provider) => {
  console.log(`❌ Provider down: ${provider.name}`);
});

monitor.on('providerRecovered', (provider) => {
  console.log(`✅ Provider recovered: ${provider.name}`);
});
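The providerDown and providerRecovered events can be paired to measure how long an outage lasted. The sketch below is one way to do that, under the assumption that each event payload carries a name field as in the handlers above. DowntimeTracker and its methods are hypothetical and not part of clodds/monitoring. Timestamps are passed in explicitly so the logic can be exercised without a running monitor.

```typescript
// Hypothetical helper: records how long each provider stayed down.
class DowntimeTracker {
  private downSince = new Map<string, number>();
  private totals = new Map<string, number>();

  markDown(name: string, now: number = Date.now()): void {
    // Keep the earliest timestamp if the down event fires more than once.
    if (!this.downSince.has(name)) {
      this.downSince.set(name, now);
    }
  }

  // Returns the length of the outage that just ended, in milliseconds.
  markRecovered(name: string, now: number = Date.now()): number {
    const start = this.downSince.get(name);
    if (start === undefined) {
      return 0; // recovery without a recorded outage
    }
    this.downSince.delete(name);
    const outageMs = now - start;
    this.totals.set(name, (this.totals.get(name) ?? 0) + outageMs);
    return outageMs;
  }

  totalDowntimeMs(name: string): number {
    return this.totals.get(name) ?? 0;
  }
}

const tracker = new DowntimeTracker();
tracker.markDown('alpha', 1000);
tracker.markDown('alpha', 2000); // duplicate event, ignored
const outageMs = tracker.markRecovered('alpha', 6000); // 5000

// Possible wiring, assuming the payload shape shown above:
// monitor.on('providerDown', (p) => tracker.markDown(p.name));
// monitor.on('providerRecovered', (p) => tracker.markRecovered(p.name));
```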
Manual Alerts
// Send manual alert
await monitor.sendAlert({
  severity: 'warning', // 'info' | 'warning' | 'error' | 'critical'
  title: 'Custom Alert',
  message: 'Something important happened',
  metadata: { key: 'value' },
});
Alert Types
| Type | Trigger |
|------|---------|
| provider_down | LLM provider not responding |
| high_cpu | CPU usage above threshold |
| high_memory | Memory usage above threshold |
| high_error_rate | Error rate above threshold |
| unhandled_exception | Uncaught exception |
| unhandled_rejection | Unhandled promise rejection |
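When handling alerts in code, the alert types listed above can be modeled as a string literal union so the compiler flags any type left unhandled. This is a sketch only. Whether clodds/monitoring exports a matching type is not stated here, so AlertType and describeTrigger are hypothetical names.

```typescript
// Alert type names copied from the list above. Whether the library exports
// a matching type is not stated, so this union is an assumption.
type AlertType =
  | 'provider_down'
  | 'high_cpu'
  | 'high_memory'
  | 'high_error_rate'
  | 'unhandled_exception'
  | 'unhandled_rejection';

// Hypothetical helper: maps an alert type to its documented trigger.
function describeTrigger(type: AlertType): string {
  switch (type) {
    case 'provider_down':
      return 'LLM provider not responding';
    case 'high_cpu':
      return 'CPU usage above threshold';
    case 'high_memory':
      return 'Memory usage above threshold';
    case 'high_error_rate':
      return 'Error rate above threshold';
    case 'unhandled_exception':
      return 'Uncaught exception';
    case 'unhandled_rejection':
      return 'Unhandled promise rejection';
    default: {
      // Compile-time exhaustiveness check: fails to type-check if a new
      // member is added to AlertType without a matching case.
      const unreachable: never = type;
      throw new Error(`Unknown alert type: ${String(unreachable)}`);
    }
  }
}

const trigger = describeTrigger('high_cpu');
```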
Configuration
// Update config
monitor.configure({
  intervalMs: 30000,
  alertCooldownMs: 600000,
  thresholds: {
    cpu: 85,
    memory: 95,
    errorRate: 5,
  },
});
Best Practices
- Set appropriate thresholds - Avoid alert fatigue
- Use cooldowns - Prevent alert spam
- Multiple targets - Email + webhook for redundancy
- Acknowledge alerts - Track what's been handled
- Monitor providers - Know when APIs are down
- Check health regularly - Don't just rely on alerts