# Alerting & On-Call
Configure effective alerting and on-call management for production systems.
## When to Use This Skill

Use this skill when:

- Setting up alerting rules and thresholds
- Configuring on-call rotations and schedules
- Implementing alert routing and escalation
- Reducing alert fatigue
- Managing incident response workflows
## Prerequisites

- Monitoring system (Prometheus, Datadog, etc.)
- On-call platform (PagerDuty, Opsgenie, Grafana OnCall)
- Communication channels (Slack, email)
## Alerting Best Practices

### Alert Categories
```yaml
# Severity levels
critical:
  examples:
    - Service completely down
    - Data loss imminent
    - Security breach
  response: Immediate page, wake people up

high:
  examples:
    - Service degraded significantly
    - Error rate above SLO
    - Capacity near limit
  response: Page during business hours, notify after hours

medium:
  examples:
    - Performance degradation
    - Non-critical component failure
    - Warning thresholds exceeded
  response: Notify via Slack, review next business day

low:
  examples:
    - Informational alerts
    - Capacity planning triggers
    - Routine maintenance needed
  response: Email notification, weekly review
```
### Alert Design Principles

```yaml
# Good alert characteristics
alerts:
  actionable:
    - Every alert should require human action
    - Include runbook links
    - Clear remediation steps
  relevant:
    - Alert on symptoms, not causes
    - Focus on user impact
    - Avoid alerting on expected behavior
  timely:
    - Appropriate thresholds
    - Suitable evaluation windows
    - Account for normal variance
  unique:
    - No duplicate alerts
    - Proper alert grouping
    - Clear ownership
```
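To make "alert on symptoms, not causes" concrete, here is a minimal sketch contrasting the two styles. The metric names and thresholds are illustrative, not taken from a specific system; `instance:cpu_utilization:ratio` is a hypothetical recording rule.

```yaml
groups:
  - name: design_examples
    rules:
      # Symptom-based: pages only when users are actually affected
      - alert: UserFacingErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical

      # Cause-based: high CPU alone may not hurt anyone; keep it low-severity
      - alert: HighCpu
        expr: avg(instance:cpu_utilization:ratio) > 0.9  # hypothetical metric
        for: 15m
        labels:
          severity: medium
```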
## Prometheus Alerting

### Alert Rules

```yaml
# prometheus/rules/alerts.yml
groups:
  - name: service_alerts
    rules:
      # High-level service health
      - alert: ServiceDown
        expr: up{job="myapp"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service {{ $labels.instance }} is down"
          description: "{{ $labels.job }} on {{ $labels.instance }} has been down for more than 1 minute."
          runbook_url: "https://wiki.example.com/runbooks/service-down"

      # Error rate alert
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
            /
          sum(rate(http_requests_total[5m])) by (service)
            > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate for {{ $labels.service }}"
          description: "Error rate is {{ $value | humanizePercentage }} for the last 5 minutes"

      # Latency alert (SLO-based)
      - alert: HighLatency
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
          ) > 0.5
        for: 5m
        labels:
          severity: high
        annotations:
          summary: "P95 latency above 500ms for {{ $labels.service }}"
```
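These rules only run once Prometheus loads the file and knows where to send the resulting alerts. A minimal sketch of the wiring, assuming the rule file path above and an Alertmanager reachable at `alertmanager:9093`:

```yaml
# prometheus.yml (fragment)
rule_files:
  - prometheus/rules/alerts.yml

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']
```

Running `promtool check rules prometheus/rules/alerts.yml` catches syntax errors before a reload.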
### Alertmanager Configuration

```yaml
# alertmanager.yml
global:
  resolve_timeout: 5m
  slack_api_url: 'https://hooks.slack.com/services/xxx'
  pagerduty_url: 'https://events.pagerduty.com/v2/enqueue'

templates:
  - '/etc/alertmanager/templates/*.tmpl'

route:
  receiver: 'default-receiver'
  group_by: ['alertname', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

  routes:
    # Critical alerts go to PagerDuty
    - match:
        severity: critical
      receiver: 'pagerduty-critical'
      group_wait: 0s
      repeat_interval: 1h

    # High severity during business hours
    - match:
        severity: high
      receiver: 'slack-high'
      active_time_intervals:
        - business-hours

    # Route by team
    - match_re:
        team: platform.*
      receiver: 'platform-team'

receivers:
  - name: 'default-receiver'
    slack_configs:
      - channel: '#alerts'
        send_resolved: true

  - name: 'pagerduty-critical'
    pagerduty_configs:
      - service_key: 'xxx'
        severity: critical
        description: '{{ .CommonAnnotations.summary }}'
        details:
          firing: '{{ template "pagerduty.firing" . }}'

  - name: 'slack-high'
    slack_configs:
      - channel: '#alerts-high'
        title: '{{ .CommonAnnotations.summary }}'
        text: '{{ .CommonAnnotations.description }}'
        actions:
          - type: button
            text: 'Runbook'
            url: '{{ .CommonAnnotations.runbook_url }}'
          - type: button
            text: 'Dashboard'
            url: '{{ .CommonAnnotations.dashboard_url }}'
  - name: 'platform-team'
    slack_configs:
      - channel: '#platform-alerts'

time_intervals:
  - name: business-hours
    time_intervals:
      - weekdays: ['monday:friday']
        times:
          - start_time: '09:00'
            end_time: '17:00'
inhibit_rules:
  - source_match:
      severity: critical
    target_match:
      severity: high
    equal: ['service']
```
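"Avoid alerting on expected behavior" extends to planned maintenance: a route can also reference a time interval to silence notifications during a known window. A minimal sketch, supported on recent Alertmanager versions; the `batch-jobs` matcher and `weekly-maintenance` interval are hypothetical names to adapt to your environment:

```yaml
route:
  routes:
    - match:
        service: batch-jobs          # hypothetical service label
      receiver: 'default-receiver'
      mute_time_intervals:
        - weekly-maintenance

time_intervals:
  - name: weekly-maintenance
    time_intervals:
      - weekdays: ['saturday']
        times:
          - start_time: '02:00'
            end_time: '04:00'
```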
## PagerDuty Integration

### Service Configuration

```hcl
# Terraform example
resource "pagerduty_service" "myapp" {
  name                    = "MyApp Production"
  description             = "Production application service"
  escalation_policy       = pagerduty_escalation_policy.default.id
  alert_creation          = "create_alerts_and_incidents"
  auto_resolve_timeout    = 14400 # 4 hours
  acknowledgement_timeout = 600   # 10 minutes

  incident_urgency_rule {
    type = "use_support_hours"

    during_support_hours {
      type    = "constant"
      urgency = "high"
    }

    outside_support_hours {
      type    = "constant"
      urgency = "low"
    }
  }
}

resource "pagerduty_escalation_policy" "default" {
  name      = "Default Escalation"
  num_loops = 2

  rule {
    escalation_delay_in_minutes = 10
    target {
      type = "schedule_reference"
      id   = pagerduty_schedule.primary.id
    }
  }

  rule {
    escalation_delay_in_minutes = 15
    target {
      type = "user_reference"
      id   = pagerduty_user.manager.id
    }
  }
}
```
### Schedule Configuration

```hcl
resource "pagerduty_schedule" "primary" {
  name      = "Primary On-Call"
  time_zone = "America/New_York"

  layer {
    name                         = "Weekly Rotation"
    start                        = "2024-01-01T00:00:00-05:00"
    rotation_virtual_start       = "2024-01-01T00:00:00-05:00"
    rotation_turn_length_seconds = 604800 # 1 week
    users                        = [for user in pagerduty_user.oncall : user.id]
  }

  # Override layer for holidays
  layer {
    name                         = "Holiday Coverage"
    start                        = "2024-01-01T00:00:00-05:00"
    rotation_virtual_start       = "2024-01-01T00:00:00-05:00"
    rotation_turn_length_seconds = 86400
    users                        = [pagerduty_user.holiday_coverage.id]

    restriction {
      type              = "daily_restriction"
      start_time_of_day = "00:00:00"
      duration_seconds  = 86400
      start_day_of_week = 0 # Sunday
    }
  }
}
```
## Grafana OnCall

### Integration Setup

```yaml
# docker-compose.yml addition
services:
  oncall:
    image: grafana/oncall
    environment:
      - SECRET_KEY=your-secret-key
      - BASE_URL=http://oncall:8080
      - GRAFANA_API_URL=http://grafana:3000
    ports:
      - "8080:8080"
```
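Alerts typically reach Grafana OnCall through its Alertmanager (or generic webhook) integration: OnCall exposes an integration URL, and Alertmanager forwards firing and resolved notifications to it. A minimal sketch of the Alertmanager side — the URL below is a placeholder for whatever the integration page in OnCall gives you:

```yaml
# alertmanager.yml (fragment)
receivers:
  - name: 'grafana-oncall'
    webhook_configs:
      - url: 'http://oncall:8080/integrations/v1/alertmanager/<integration-token>/'  # placeholder
        send_resolved: true
```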
### Escalation Chain

```yaml
# Example escalation chain structure
escalation_chains:
  - name: "Production Critical"
    steps:
      - step: 1
        type: notify
        persons:
          - "@oncall-primary"
        wait_delay: 0
      - step: 2
        type: notify
        persons:
          - "@oncall-secondary"
        wait_delay: 5m
      - step: 3
        type: notify
        persons:
          - "@engineering-manager"
        wait_delay: 10m
      - step: 4
        type: trigger_action
        action: "escalate_to_incident_commander"
        wait_delay: 15m
```
## Alert Templates

### Slack Alert Template

```
{{ define "slack.title" }}
[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] {{ .CommonLabels.alertname }}
{{ end }}

{{ define "slack.text" }}
{{ range .Alerts }}
Alert: {{ .Annotations.summary }}
Severity: {{ .Labels.severity }}
Description: {{ .Annotations.description }}
Runbook: {{ .Annotations.runbook_url }}
{{ end }}
{{ end }}
```
### PagerDuty Details Template

```
{{ define "pagerduty.firing" }}
{{ range .Alerts.Firing }}
Alert: {{ .Labels.alertname }}
Service: {{ .Labels.service }}
Instance: {{ .Labels.instance }}
Value: {{ .Annotations.value }}
Started: {{ .StartsAt.Format "2006-01-02 15:04:05" }}
{{ end }}
{{ end }}
```
## On-Call Best Practices

### Rotation Guidelines

```yaml
on_call_guidelines:
  rotation_length: 1 week
  handoff_time: "10:00 AM Monday"

  responsibilities:
    - Monitor alerts during shift
    - Respond within SLA (critical: 5min, high: 15min)
    - Document incidents
    - Hand off unresolved issues

  support:
    - Secondary on-call for backup
    - Clear escalation path
    - Manager availability for major incidents

  wellness:
    - Maximum 1 week on-call per month
    - Comp time after high-alert periods
    - No-interrupt recovery day after shift
```
### Runbook Template

```markdown
# Alert: High Error Rate

## Summary
Error rate has exceeded the threshold of 5% for the service.

## Impact
Users may experience errors when accessing the application.

## Investigation Steps

1. Check service logs:
   `kubectl logs -l app=myapp -n production`
2. Review recent deployments:
   `kubectl rollout history deployment/myapp`
3. Check database connectivity:
   `kubectl exec -it myapp -- nc -zv postgres 5432`
4. Review error traces in the APM dashboard.

## Remediation

If caused by a recent deployment, roll it back:
`kubectl rollout undo deployment/myapp -n production`

If database related, restart the database pods:
`kubectl delete pod -l app=postgres -n production`

## Escalation

If not resolved within 15 minutes, escalate to:

- Database team: @db-oncall
- Platform team: @platform-oncall
```
## Alert Fatigue Reduction
### Strategies
```yaml
fatigue_reduction:
  aggregate_alerts:
    - Group related alerts
    - Use inhibit rules
    - Implement alert correlation

  tune_thresholds:
    - Base on SLOs, not arbitrary values
    - Account for normal variance
    - Use appropriate evaluation windows

  automate_responses:
    - Auto-remediation for known issues
    - Self-healing infrastructure
    - Automated scaling

  regular_review:
    - Weekly alert review
    - Remove unused alerts
    - Update thresholds based on data
```
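As a sketch of the "automate responses" idea: Alertmanager can post well-understood alerts to a webhook receiver that triggers remediation (a job runner, a serverless function, an operator) before a human is paged. The `DiskSpaceLow` alert and the endpoint URL below are hypothetical:

```yaml
# alertmanager.yml (fragment): route known, safe-to-automate alerts to a webhook
route:
  routes:
    - match:
        alertname: DiskSpaceLow      # hypothetical alert
        severity: medium
      receiver: 'auto-remediation'

receivers:
  - name: 'auto-remediation'
    webhook_configs:
      - url: 'http://remediation-service.internal/hooks/disk-cleanup'  # hypothetical endpoint
        send_resolved: true
```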
## Common Issues

### Issue: Alert Storm

**Problem**: Too many alerts firing simultaneously

**Solution**: Implement proper grouping and inhibition rules
### Issue: Missed Alerts

**Problem**: Critical alerts not reaching on-call

**Solution**: Test escalation policies, verify contact methods
### Issue: False Positives

**Problem**: Alerts firing without actual issues

**Solution**: Tune thresholds, increase evaluation windows
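One way to cut false positives is to derive the threshold from the SLO and lengthen the rate window and `for:` clause so brief spikes do not page anyone. A minimal sketch, assuming a 99.9% availability SLO (the 0.1% error budget becomes the threshold); the alert name is illustrative and the metrics mirror the rules earlier in this document:

```yaml
- alert: ErrorBudgetBurn
  # Threshold comes from the SLO (0.1% errors), not an arbitrary number
  expr: |
    sum(rate(http_requests_total{status=~"5.."}[30m]))
      / sum(rate(http_requests_total[30m])) > 0.001
  # Longer window plus hold time so short spikes don't page
  for: 15m
  labels:
    severity: high
```

A production-grade version of this usually uses multi-window burn-rate alerts rather than a single threshold.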
## Best Practices
- Define clear severity levels
- Every alert needs a runbook
- Test on-call notifications regularly
- Review and tune alerts weekly
- Implement proper escalation paths
- Use alert grouping and inhibition
- Track alert metrics (MTTR, frequency); see the sketch after this list
- Practice incident response regularly
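For tracking alert frequency, Prometheus's built-in `ALERTS` series shows which alerts fire most often and for how long. A minimal sketch of recording rules for a weekly review; the rule names are illustrative:

```yaml
groups:
  - name: alert_hygiene
    rules:
      # Currently firing alerts, per alert name
      - record: alerts:firing:count
        expr: sum by (alertname) (ALERTS{alertstate="firing"})
      # Samples spent in the firing state over the last 7 days
      # (proportional to time firing, given a fixed evaluation interval)
      - record: alerts:firing_samples:7d
        expr: sum by (alertname) (count_over_time(ALERTS{alertstate="firing"}[7d]))
```

MTTR itself is usually better pulled from the on-call platform's incident data.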
## Related Skills
- prometheus-grafana - Monitoring setup
- incident-response - Incident handling
- runbook-creation - Writing and maintaining runbooks