# Alerting & On-Call
Configure effective alerting and on-call management for production systems.
## When to Use This Skill

Use this skill when:

- Setting up alerting rules and thresholds
- Configuring on-call rotations and schedules
- Implementing alert routing and escalation
- Reducing alert fatigue
- Managing incident response workflows
## Prerequisites

- Monitoring system (Prometheus, Datadog, etc.)
- On-call platform (PagerDuty, Opsgenie, Grafana OnCall)
- Communication channels (Slack, email)
## Alerting Best Practices

### Alert Categories
```yaml
# Severity levels
critical:
  examples:
    - Service completely down
    - Data loss imminent
    - Security breach
  response: Immediate page, wake people up

high:
  examples:
    - Service degraded significantly
    - Error rate above SLO
    - Capacity near limit
  response: Page during business hours, notify after hours

medium:
  examples:
    - Performance degradation
    - Non-critical component failure
    - Warning thresholds exceeded
  response: Notify via Slack, review next business day

low:
  examples:
    - Informational alerts
    - Capacity planning triggers
    - Routine maintenance needed
  response: Email notification, weekly review
```
### Alert Design Principles

```yaml
# Good alert characteristics
alerts:
  actionable:
    - Every alert should require human action
    - Include runbook links
    - Clear remediation steps
  relevant:
    - Alert on symptoms, not causes
    - Focus on user impact
    - Avoid alerting on expected behavior
  timely:
    - Appropriate thresholds
    - Suitable evaluation windows
    - Account for normal variance
  unique:
    - No duplicate alerts
    - Proper alert grouping
    - Clear ownership
```
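To make "alert on symptoms, not causes" concrete, here is a minimal sketch contrasting the two styles. The metric names and thresholds are illustrative, not taken from a specific system; `instance:cpu_utilization:ratio` is a hypothetical recording rule.

```yaml
groups:
  - name: design_examples
    rules:
      # Symptom-based: pages only when users are actually affected
      - alert: UserFacingErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical

      # Cause-based: high CPU alone may not hurt anyone; keep it low-severity
      - alert: HighCpu
        expr: avg(instance:cpu_utilization:ratio) > 0.9  # hypothetical metric
        for: 15m
        labels:
          severity: medium
```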
## Prometheus Alerting

### Alert Rules

```yaml
# prometheus/rules/alerts.yml
groups:
  - name: service_alerts
    rules:
      # High-level service health
      - alert: ServiceDown
        expr: up{job="myapp"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service {{ $labels.instance }} is down"
          description: "{{ $labels.job }} on {{ $labels.instance }} has been down for more than 1 minute."
          runbook_url: "https://wiki.example.com/runbooks/service-down"

      # Error rate alert
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
            /
          sum(rate(http_requests_total[5m])) by (service)
            > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate for {{ $labels.service }}"
          description: "Error rate is {{ $value | humanizePercentage }} for the last 5 minutes"

      # Latency alert (SLO-based)
      - alert: HighLatency
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
          ) > 0.5
        for: 5m
        labels:
          severity: high
        annotations:
          summary: "P95 latency above 500ms for {{ $labels.service }}"
```
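These rules only run once Prometheus loads the file and knows where to send the resulting alerts. A minimal sketch of the wiring, assuming the rule file path above and an Alertmanager reachable at `alertmanager:9093`:

```yaml
# prometheus.yml (fragment)
rule_files:
  - prometheus/rules/alerts.yml

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']
```

Running `promtool check rules prometheus/rules/alerts.yml` catches syntax errors before a reload.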
### Alertmanager Configuration

```yaml
# alertmanager.yml
global:
  resolve_timeout: 5m
  slack_api_url: 'https://hooks.slack.com/services/xxx'
  pagerduty_url: 'https://events.pagerduty.com/v2/enqueue'

templates:
  - '/etc/alertmanager/templates/*.tmpl'

route:
  receiver: 'default-receiver'
  group_by: ['alertname', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

  routes:
    # Critical alerts go to PagerDuty
    - match:
        severity: critical
      receiver: 'pagerduty-critical'
      group_wait: 0s
      repeat_interval: 1h

    # High severity during business hours
    - match:
        severity: high
      receiver: 'slack-high'
      active_time_intervals:
        - business-hours

    # Route by team
    - match_re:
        team: platform.*
      receiver: 'platform-team'

receivers:
  - name: 'default-receiver'
    slack_configs:
      - channel: '#alerts'
        send_resolved: true

  - name: 'pagerduty-critical'
    pagerduty_configs:
      - service_key: 'xxx'
        severity: critical
        description: '{{ .CommonAnnotations.summary }}'
        details:
          firing: '{{ template "pagerduty.firing" . }}'

  - name: 'slack-high'
    slack_configs:
      - channel: '#alerts-high'
        title: '{{ .CommonAnnotations.summary }}'
        text: '{{ .CommonAnnotations.description }}'
        actions:
          - type: button
            text: 'Runbook'
            url: '{{ .CommonAnnotations.runbook_url }}'
          - type: button
            text: 'Dashboard'
            url: '{{ .CommonAnnotations.dashboard_url }}'
  - name: 'platform-team'
    slack_configs:
      - channel: '#platform-alerts'

time_intervals:
  - name: business-hours
    time_intervals:
      - weekdays: ['monday:friday']
        times:
          - start_time: '09:00'
            end_time: '17:00'
inhibit_rules:
  - source_match:
      severity: critical
    target_match:
      severity: high
    equal: ['service']
```
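"Avoid alerting on expected behavior" extends to planned maintenance: a route can also reference a time interval to silence notifications during a known window. A minimal sketch, supported on recent Alertmanager versions; the `batch-jobs` matcher and `weekly-maintenance` interval are hypothetical names to adapt to your environment:

```yaml
route:
  routes:
    - match:
        service: batch-jobs          # hypothetical service label
      receiver: 'default-receiver'
      mute_time_intervals:
        - weekly-maintenance

time_intervals:
  - name: weekly-maintenance
    time_intervals:
      - weekdays: ['saturday']
        times:
          - start_time: '02:00'
            end_time: '04:00'
```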
## PagerDuty Integration

### Service Configuration

```hcl
# Terraform example
resource "pagerduty_service" "myapp" {
  name                    = "MyApp Production"
  description             = "Production application service"
  escalation_policy       = pagerduty_escalation_policy.default.id
  alert_creation          = "create_alerts_and_incidents"
  auto_resolve_timeout    = 14400 # 4 hours
  acknowledgement_timeout = 600   # 10 minutes

  incident_urgency_rule {
    type = "use_support_hours"

    during_support_hours {
      type    = "constant"
      urgency = "high"
    }

    outside_support_hours {
      type    = "constant"
      urgency = "low"
    }
  }
}

resource "pagerduty_escalation_policy" "default" {
  name      = "Default Escalation"
  num_loops = 2

  rule {
    escalation_delay_in_minutes = 10
    target {
      type = "schedule_reference"
      id   = pagerduty_schedule.primary.id
    }
  }

  rule {
    escalation_delay_in_minutes = 15
    target {
      type = "user_reference"
      id   = pagerduty_user.manager.id
    }
  }
}
```
### Schedule Configuration

```hcl
resource "pagerduty_schedule" "primary" {
  name      = "Primary On-Call"
  time_zone = "America/New_York"

  layer {
    name                         = "Weekly Rotation"
    start                        = "2024-01-01T00:00:00-05:00"
    rotation_virtual_start       = "2024-01-01T00:00:00-05:00"
    rotation_turn_length_seconds = 604800 # 1 week
    users                        = [for user in pagerduty_user.oncall : user.id]
  }

  # Override layer for holidays
  layer {
    name                         = "Holiday Coverage"
    start                        = "2024-01-01T00:00:00-05:00"
    rotation_virtual_start       = "2024-01-01T00:00:00-05:00"
    rotation_turn_length_seconds = 86400
    users                        = [pagerduty_user.holiday_coverage.id]

    restriction {
      type              = "daily_restriction"
      start_time_of_day = "00:00:00"
      duration_seconds  = 86400
      start_day_of_week = 0 # Sunday
    }
  }
}
```
## Grafana OnCall

### Integration Setup

```yaml
# docker-compose.yml addition
services:
  oncall:
    image: grafana/oncall
    environment:
      - SECRET_KEY=your-secret-key
      - BASE_URL=http://oncall:8080
      - GRAFANA_API_URL=http://grafana:3000
    ports:
      - "8080:8080"
```
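Alerts typically reach Grafana OnCall through its Alertmanager (or generic webhook) integration: OnCall exposes an integration URL, and Alertmanager forwards firing and resolved notifications to it. A minimal sketch of the Alertmanager side — the URL below is a placeholder for whatever the integration page in OnCall gives you:

```yaml
# alertmanager.yml (fragment)
receivers:
  - name: 'grafana-oncall'
    webhook_configs:
      - url: 'http://oncall:8080/integrations/v1/alertmanager/<integration-token>/'  # placeholder
        send_resolved: true
```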
### Escalation Chain

```yaml
# Example escalation chain structure
escalation_chains:
  - name: "Production Critical"
    steps:
      - step: 1
        type: notify
        persons:
          - "@oncall-primary"
        wait_delay: 0
      - step: 2
        type: notify
        persons:
          - "@oncall-secondary"
        wait_delay: 5m
      - step: 3
        type: notify
        persons:
          - "@engineering-manager"
        wait_delay: 10m
      - step: 4
        type: trigger_action
        action: "escalate_to_incident_commander"
        wait_delay: 15m
```
## Alert Templates

### Slack Alert Template

```
{{ define "slack.title" }}
[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] {{ .CommonLabels.alertname }}
{{ end }}

{{ define "slack.text" }}
{{ range .Alerts }}
Alert: {{ .Annotations.summary }}
Severity: {{ .Labels.severity }}
Description: {{ .Annotations.description }}
Runbook: {{ .Annotations.runbook_url }}
{{ end }}
{{ end }}
```
### PagerDuty Details Template

```
{{ define "pagerduty.firing" }}
{{ range .Alerts.Firing }}
Alert: {{ .Labels.alertname }}
Service: {{ .Labels.service }}
Instance: {{ .Labels.instance }}
Value: {{ .Annotations.value }}
Started: {{ .StartsAt.Format "2006-01-02 15:04:05" }}
{{ end }}
{{ end }}
```
## On-Call Best Practices

### Rotation Guidelines

```yaml
on_call_guidelines:
  rotation_length: 1 week
  handoff_time: "10:00 AM Monday"

  responsibilities:
    - Monitor alerts during shift
    - Respond within SLA (critical: 5min, high: 15min)
    - Document incidents
    - Hand off unresolved issues

  support:
    - Secondary on-call for backup
    - Clear escalation path
    - Manager availability for major incidents

  wellness:
    - Maximum 1 week on-call per month
    - Comp time after high-alert periods
    - No-interrupt recovery day after shift
```
### Runbook Template

```markdown
# Alert: High Error Rate

## Summary
Error rate has exceeded the threshold of 5% for the service.

## Impact
Users may experience errors when accessing the application.

## Investigation Steps

1. Check service logs:
   `kubectl logs -l app=myapp -n production`
2. Review recent deployments:
   `kubectl rollout history deployment/myapp`
3. Check database connectivity:
   `kubectl exec -it myapp -- nc -zv postgres 5432`
4. Review error traces in the APM dashboard.

## Remediation

If caused by a recent deployment, roll it back:
`kubectl rollout undo deployment/myapp -n production`

If database related, restart the database pods:
`kubectl delete pod -l app=postgres -n production`

## Escalation

If not resolved within 15 minutes, escalate to:

- Database team: @db-oncall
- Platform team: @platform-oncall
```
## Alert Fatigue Reduction
### Strategies
```yaml
fatigue_reduction:
  aggregate_alerts:
    - Group related alerts
    - Use inhibit rules
    - Implement alert correlation

  tune_thresholds:
    - Base on SLOs, not arbitrary values
    - Account for normal variance
    - Use appropriate evaluation windows

  automate_responses:
    - Auto-remediation for known issues
    - Self-healing infrastructure
    - Automated scaling

  regular_review:
    - Weekly alert review
    - Remove unused alerts
    - Update thresholds based on data
```
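As a sketch of the "automate responses" idea: Alertmanager can post well-understood alerts to a webhook receiver that triggers remediation (a job runner, a serverless function, an operator) before a human is paged. The `DiskSpaceLow` alert and the endpoint URL below are hypothetical:

```yaml
# alertmanager.yml (fragment): route known, safe-to-automate alerts to a webhook
route:
  routes:
    - match:
        alertname: DiskSpaceLow      # hypothetical alert
        severity: medium
      receiver: 'auto-remediation'

receivers:
  - name: 'auto-remediation'
    webhook_configs:
      - url: 'http://remediation-service.internal/hooks/disk-cleanup'  # hypothetical endpoint
        send_resolved: true
```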
## Common Issues

### Issue: Alert Storm

**Problem**: Too many alerts firing simultaneously

**Solution**: Implement proper grouping and inhibition rules
### Issue: Missed Alerts

**Problem**: Critical alerts not reaching on-call

**Solution**: Test escalation policies, verify contact methods
### Issue: False Positives

**Problem**: Alerts firing without actual issues

**Solution**: Tune thresholds, increase evaluation windows
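One way to cut false positives is to derive the threshold from the SLO and lengthen the rate window and `for:` clause so brief spikes do not page anyone. A minimal sketch, assuming a 99.9% availability SLO (the 0.1% error budget becomes the threshold); the alert name is illustrative and the metrics mirror the rules earlier in this document:

```yaml
- alert: ErrorBudgetBurn
  # Threshold comes from the SLO (0.1% errors), not an arbitrary number
  expr: |
    sum(rate(http_requests_total{status=~"5.."}[30m]))
      / sum(rate(http_requests_total[30m])) > 0.001
  # Longer window plus hold time so short spikes don't page
  for: 15m
  labels:
    severity: high
```

A production-grade version of this usually uses multi-window burn-rate alerts rather than a single threshold.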
## Best Practices
- Define clear severity levels
- Every alert needs a runbook
- Test on-call notifications regularly
- Review and tune alerts weekly
- Implement proper escalation paths
- Use alert grouping and inhibition
- Track alert metrics (MTTR, frequency); see the sketch after this list
- Practice incident response regularly
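For tracking alert frequency, Prometheus's built-in `ALERTS` series shows which alerts fire most often and for how long. A minimal sketch of recording rules for a weekly review; the rule names are illustrative:

```yaml
groups:
  - name: alert_hygiene
    rules:
      # Currently firing alerts, per alert name
      - record: alerts:firing:count
        expr: sum by (alertname) (ALERTS{alertstate="firing"})
      # Samples spent in the firing state over the last 7 days
      # (proportional to time firing, given a fixed evaluation interval)
      - record: alerts:firing_samples:7d
        expr: sum by (alertname) (count_over_time(ALERTS{alertstate="firing"}[7d]))
```

MTTR itself is usually better pulled from the on-call platform's incident data.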
## Related Skills
- prometheus-grafana - Monitoring setup
- incident-response - Incident handling
- runbook-creation - Writing and maintaining runbooks