CloudWatch Alarm Creator
Эксперт по мониторингу AWS CloudWatch и настройке алармов.
Основные принципы
-
Выбор порогов: Основывайте на исторических данных и бизнес-требованиях
-
Статистические методы: Выбирайте подходящую статистику (Average, Sum, Maximum) по характеристикам метрик
-
Периоды оценки: Баланс между отзывчивостью и подавлением шума
-
Actionable алерты: Каждый аларм должен иметь понятный путь устранения
-
Оптимизация стоимости: Эффективные стратегии для минимизации расходов
EC2 Alarm
{ "AlarmName": "HighCPUUtilization", "MetricName": "CPUUtilization", "Namespace": "AWS/EC2", "Statistic": "Average", "Period": 300, "EvaluationPeriods": 2, "Threshold": 80, "ComparisonOperator": "GreaterThanThreshold", "Dimensions": [ { "Name": "InstanceId", "Value": "i-1234567890abcdef0" } ], "AlarmActions": ["arn:aws:sns:region:account:topic"], "TreatMissingData": "notBreaching" }
ALB Alarm
{ "AlarmName": "HighTargetResponseTime", "MetricName": "TargetResponseTime", "Namespace": "AWS/ApplicationELB", "Statistic": "Average", "Period": 60, "EvaluationPeriods": 3, "DatapointsToAlarm": 2, "Threshold": 1.0, "ComparisonOperator": "GreaterThanThreshold", "Dimensions": [ { "Name": "LoadBalancer", "Value": "app/my-alb/1234567890" } ], "TreatMissingData": "ignore" }
RDS Alarm
{ "AlarmName": "HighDatabaseConnections", "MetricName": "DatabaseConnections", "Namespace": "AWS/RDS", "Statistic": "Average", "Period": 300, "EvaluationPeriods": 2, "Threshold": 100, "ComparisonOperator": "GreaterThanThreshold", "Dimensions": [ { "Name": "DBInstanceIdentifier", "Value": "my-database" } ] }
Terraform Configuration
resource "aws_cloudwatch_metric_alarm" "ec2_cpu_high" { alarm_name = "ec2-cpu-high-${var.instance_id}" comparison_operator = "GreaterThanThreshold" evaluation_periods = 2 metric_name = "CPUUtilization" namespace = "AWS/EC2" period = 300 statistic = "Average" threshold = 80 alarm_description = "CPU utilization exceeds 80%"
dimensions = { InstanceId = var.instance_id }
alarm_actions = [aws_sns_topic.alerts.arn] ok_actions = [aws_sns_topic.alerts.arn]
tags = { Environment = var.environment ManagedBy = "terraform" } }
resource "aws_cloudwatch_metric_alarm" "custom_metric" { alarm_name = "custom-error-rate" comparison_operator = "GreaterThanThreshold" evaluation_periods = 3 threshold = 5 alarm_description = "Error rate exceeds 5%"
metric_query { id = "error_rate" expression = "errors/requests*100" label = "Error Rate" return_data = true }
metric_query { id = "errors" metric { metric_name = "Errors" namespace = "MyApp" period = 60 stat = "Sum" } }
metric_query { id = "requests" metric { metric_name = "Requests" namespace = "MyApp" period = 60 stat = "Sum" } } }
Composite Alarm
{ "AlarmName": "CompositeSystemHealth", "AlarmRule": "ALARM(HighCPU) AND (ALARM(HighMemory) OR ALARM(HighDisk))", "AlarmActions": ["arn:aws:sns:region:account:critical-alerts"], "AlarmDescription": "System health degraded - multiple metrics breaching" }
Anomaly Detection
{ "AlarmName": "AnomalyDetectionCPU", "MetricName": "CPUUtilization", "Namespace": "AWS/EC2", "ThresholdMetricId": "ad1", "ComparisonOperator": "GreaterThanUpperThreshold", "EvaluationPeriods": 2, "Metrics": [ { "Id": "m1", "MetricStat": { "Metric": { "Namespace": "AWS/EC2", "MetricName": "CPUUtilization", "Dimensions": [{"Name": "InstanceId", "Value": "i-123"}] }, "Period": 300, "Stat": "Average" } }, { "Id": "ad1", "Expression": "ANOMALY_DETECTION_BAND(m1, 2)" } ] }
SNS Integration
resource "aws_sns_topic" "alerts" { name = "cloudwatch-alerts" }
resource "aws_sns_topic_subscription" "email" { topic_arn = aws_sns_topic.alerts.arn protocol = "email" endpoint = "ops-team@example.com" }
resource "aws_sns_topic_subscription" "lambda" { topic_arn = aws_sns_topic.alerts.arn protocol = "lambda" endpoint = aws_lambda_function.alert_handler.arn }
TreatMissingData Options
Значение Описание Использование
notBreaching
Missing = OK Стандартные метрики
breaching
Missing = ALARM Heartbeat мониторинг
ignore
Сохранять текущее ALB метрики
missing
Missing = INSUFFICIENT По умолчанию
Рекомендации по порогам
EC2: CPUUtilization: warning: 70% critical: 85% period: 300s
StatusCheckFailed: threshold: 1 period: 60s
ALB: TargetResponseTime: p95_warning: 500ms p99_critical: 1000ms
HTTPCode_ELB_5XX: threshold: 10 period: 60s
RDS: CPUUtilization: warning: 70% critical: 85%
FreeableMemory: critical: 256MB
DiskQueueDepth: warning: 5 critical: 10
Стоимость оптимизации
-
Консолидируйте алармы через composite alarms
-
Используйте более длинные периоды где возможно
-
Удаляйте неиспользуемые алармы регулярно
-
Группируйте ресурсы через теги
Тестирование алармов
Переключить состояние для тестирования уведомлений
aws cloudwatch set-alarm-state
--alarm-name "HighCPUUtilization"
--state-value ALARM
--state-reason "Testing notifications"
Лучшие практики
-
2 из 3 datapoints — фильтрация временных спайков
-
Percentile-based thresholds — для latency метрик (P95, P99)
-
Multi-level alerts — Warning и Critical уровни
-
Документируйте runbooks — для каждого типа аларма
-
Регулярный аудит — пересматривайте эффективность порогов