Monitoring Expert
Core Workflow
- Analysis: Understand the monitoring requirements for the application or infrastructure.
- Design: Design a monitoring solution that includes logging, metrics, tracing, and alerting.
- Implementation: Implement the monitoring solution using appropriate tools and technologies.
- Configuration: Configure dashboards and alerts for effective monitoring.
- Optimization: Continuously optimize the monitoring solution for performance and reliability.
- Alerting: Set up alerting mechanisms to notify relevant stakeholders of potential issues.
Reference Guide
Load the detailed guidance based on context:
| Topic | Reference | Load When |
|---|---|---|
| Alerting Rules | references/alerting-rules.md | When configuring alerting systems |
| Capacity Planning | references/capacity-planning.md | When planning for resource growth or scaling |
| Dashboards | references/dashboards.md | When building or reviewing monitoring dashboards |
| OpenTelemetry | references/opentelemetry.md | When implementing distributed tracing or OTel instrumentation |
| Performance Testing | references/performance-testing.md | When load testing or benchmarking systems |
| Prometheus Metrics | references/prometheus-metrics.md | When defining or querying Prometheus metrics |
| Structured Logging | references/structured-logging.md | When implementing application logging |
Constraints
MUST DO
- Use structured JSON logging for better log management.
- Include request IDs in logs for traceability.
- Collect key performance metrics such as latency, error rates, and throughput.
- Set up alerts for critical paths.
- Use appropriate metrics aggregation methods (e.g., rate, histogram) based on the metric type.
- Implement health check endpoints for services to monitor their availability.
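The first two requirements above (structured JSON logs that carry a request ID) can be sketched with the Python standard library alone. This is a minimal illustration, not a prescribed implementation: the names `request_id_var`, `JsonFormatter`, `build_logger`, and `handle_request`, the `checkout` logger name, and the `fields` key used to pass extra data are all assumptions made for the example.

```python
import contextvars
import io
import json
import logging
import uuid

# Holds the ID of the request currently being handled (hypothetical name).
request_id_var = contextvars.ContextVar("request_id", default="-")


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "request_id": request_id_var.get(),
        }
        # Merge structured fields passed via extra={"fields": {...}}.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


def build_logger(stream) -> logging.Logger:
    """Create a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger("checkout")  # assumed service name
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def handle_request(logger: logging.Logger, incoming_request_id=None) -> str:
    """Bind a request ID for the duration of one request and log an event."""
    request_id = incoming_request_id or uuid.uuid4().hex
    token = request_id_var.set(request_id)
    try:
        logger.info(
            "order placed",
            extra={"fields": {"latency_ms": 42, "status": 200}},
        )
    finally:
        # Restore the previous value so IDs never leak across requests.
        request_id_var.reset(token)
    return request_id


if __name__ == "__main__":
    buffer = io.StringIO()
    handle_request(build_logger(buffer), "req-123")
    print(buffer.getvalue(), end="")
```

Storing the ID in a `ContextVar` rather than passing it to every log call keeps call sites short and works across threads and `asyncio` tasks; reusing an ID supplied by the caller, where one exists, lets a single request be followed across services.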
MUST NOT DO
- Log sensitive information such as passwords or personal data.
- Set up alerts for non-critical issues; noisy alerts lead to alert fatigue.
- Use default configurations without customizing them for the specific application or infrastructure.
- Ignore monitoring data when troubleshooting issues.
- Over-instrument code to the point that it adds performance overhead.
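One way to keep sensitive data out of logs is to redact known fields before a payload reaches the logger. The sketch below is illustrative only: the `redact` helper, the `SENSITIVE_KEYS` set, and the `[REDACTED]` placeholder are assumptions, and a real deployment would need a key list matched to its own data model.

```python
# Keys treated as sensitive in this example (an assumed, non-exhaustive list).
SENSITIVE_KEYS = frozenset({"password", "token", "authorization", "ssn", "email"})


def redact(value, sensitive_keys=SENSITIVE_KEYS):
    """Return a copy of value with sensitive dictionary entries masked.

    Walks nested dicts and lists; key matching is case-insensitive.
    """
    if isinstance(value, dict):
        return {
            key: "[REDACTED]"
            if str(key).lower() in sensitive_keys
            else redact(item, sensitive_keys)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [redact(item, sensitive_keys) for item in value]
    return value


if __name__ == "__main__":
    event = {
        "user": "ada",
        "password": "hunter2",
        "profile": {"Email": "ada@example.com", "tags": [{"token": "abc"}]},
    }
    print(redact(event))
```

Because the function returns a new structure instead of mutating its input, the original payload stays usable by the application while only the masked copy is logged. Key-based redaction does not catch secrets embedded in free-text messages, so it complements careful log statements and does not replace them.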