Alerting & On-Call

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Install skill "alerting-oncall" with this command: npx skills add bagelhole/devops-security-agent-skills/bagelhole-devops-security-agent-skills-alerting-oncall

Configure effective alerting and on-call management for production systems.

When to Use This Skill

Use this skill when:

  • Setting up alerting rules and thresholds

  • Configuring on-call rotations and schedules

  • Implementing alert routing and escalation

  • Reducing alert fatigue

  • Managing incident response workflows

Prerequisites

  • Monitoring system (Prometheus, Datadog, etc.)

  • On-call platform (PagerDuty, Opsgenie, Grafana OnCall)

  • Communication channels (Slack, email)

Alerting Best Practices

Alert Categories

Severity levels

critical:

  • Service completely down
  • Data loss imminent
  • Security breach

  Response: immediate page, wake people up

high:

  • Service degraded significantly
  • Error rate above SLO
  • Capacity near limit

  Response: page during business hours, notify after hours

medium:

  • Performance degradation
  • Non-critical component failure
  • Warning thresholds exceeded

  Response: notify via Slack, review next business day

low:

  • Informational alerts
  • Capacity planning triggers
  • Routine maintenance needed

  Response: email notification, weekly review
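As a rough sketch of how these severities can drive the response, a minimal Alertmanager route tree keyed on the severity label (receiver names here are illustrative; the full Alertmanager configuration later in this document expands on the same idea):

```yaml
route:
  receiver: 'slack-notifications'      # default for medium/low: notify, review later
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty-critical'   # page immediately, any hour
    - match:
        severity: high
      receiver: 'pagerduty-high'
      active_time_intervals:
        - business-hours               # page only during business hours
```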

Alert Design Principles

Good alert characteristics

```yaml
alerts:
  actionable:
    - Every alert should require human action
    - Include runbook links
    - Clear remediation steps
  relevant:
    - Alert on symptoms, not causes
    - Focus on user impact
    - Avoid alerting on expected behavior
  timely:
    - Appropriate thresholds
    - Suitable evaluation windows
    - Account for normal variance
  unique:
    - No duplicate alerts
    - Proper alert grouping
    - Clear ownership
```
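To make "alert on symptoms, not causes" concrete, a hedged sketch contrasting a user-facing symptom alert with a cause-level signal that is better kept at low severity (the CPU expression is an illustrative addition, not part of this skill's rule set):

```yaml
# Symptom: users are receiving errors, so someone should be paged
- alert: HighErrorRate
  expr: |
    sum(rate(http_requests_total{status=~"5.."}[5m]))
      / sum(rate(http_requests_total[5m])) > 0.05
  labels:
    severity: critical

# Cause: CPU is busy; useful context, but rarely pageworthy on its own
- alert: HighCPUUsage
  expr: 1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) > 0.9
  labels:
    severity: low
```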

Prometheus Alerting

Alert Rules

```yaml
# prometheus/rules/alerts.yml
groups:
  - name: service_alerts
    rules:
      # High-level service health
      - alert: ServiceDown
        expr: up{job="myapp"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service {{ $labels.instance }} is down"
          description: "{{ $labels.job }} on {{ $labels.instance }} has been down for more than 1 minute."
          runbook_url: "https://wiki.example.com/runbooks/service-down"

      # Error rate alert
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
            /
          sum(rate(http_requests_total[5m])) by (service) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate for {{ $labels.service }}"
          description: "Error rate is {{ $value | humanizePercentage }} for the last 5 minutes"

      # Latency alert (SLO-based)
      - alert: HighLatency
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
          ) > 0.5
        for: 5m
        labels:
          severity: high
        annotations:
          summary: "P95 latency above 500ms for {{ $labels.service }}"
```

Alertmanager Configuration

```yaml
# alertmanager.yml
global:
  resolve_timeout: 5m
  slack_api_url: 'https://hooks.slack.com/services/xxx'
  pagerduty_url: 'https://events.pagerduty.com/v2/enqueue'

templates:
  - '/etc/alertmanager/templates/*.tmpl'

route:
  receiver: 'default-receiver'
  group_by: ['alertname', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

  routes:
    # Critical alerts go to PagerDuty
    - match:
        severity: critical
      receiver: 'pagerduty-critical'
      group_wait: 0s
      repeat_interval: 1h

    # High severity during business hours
    - match:
        severity: high
      receiver: 'slack-high'
      active_time_intervals:
        - business-hours

    # Route by team
    - match_re:
        team: platform.*
      receiver: 'platform-team'

receivers:
  - name: 'default-receiver'
    slack_configs:
      - channel: '#alerts'
        send_resolved: true

  - name: 'pagerduty-critical'
    pagerduty_configs:
      - service_key: 'xxx'
        severity: critical
        description: '{{ .CommonAnnotations.summary }}'
        details:
          firing: '{{ template "pagerduty.firing" . }}'

  - name: 'slack-high'
    slack_configs:
      - channel: '#alerts-high'
        title: '{{ .CommonAnnotations.summary }}'
        text: '{{ .CommonAnnotations.description }}'
        actions:
          - type: button
            text: 'Runbook'
            url: '{{ .CommonAnnotations.runbook_url }}'
          - type: button
            text: 'Dashboard'
            url: '{{ .CommonAnnotations.dashboard_url }}'

  - name: 'platform-team'
    slack_configs:
      - channel: '#platform-alerts'

time_intervals:
  - name: business-hours
    time_intervals:
      - weekdays: ['monday:friday']
        times:
          - start_time: '09:00'
            end_time: '17:00'

inhibit_rules:
  - source_match:
      severity: critical
    target_match:
      severity: high
    equal: ['service']
```

PagerDuty Integration

Service Configuration

```hcl
# Terraform example
resource "pagerduty_service" "myapp" {
  name                    = "MyApp Production"
  description             = "Production application service"
  escalation_policy       = pagerduty_escalation_policy.default.id
  alert_creation          = "create_alerts_and_incidents"
  auto_resolve_timeout    = 14400 # 4 hours
  acknowledgement_timeout = 600   # 10 minutes

  incident_urgency_rule {
    type = "use_support_hours"

    during_support_hours {
      type    = "constant"
      urgency = "high"
    }

    outside_support_hours {
      type    = "constant"
      urgency = "low"
    }
  }
}

resource "pagerduty_escalation_policy" "default" {
  name      = "Default Escalation"
  num_loops = 2

  rule {
    escalation_delay_in_minutes = 10
    target {
      type = "schedule_reference"
      id   = pagerduty_schedule.primary.id
    }
  }

  rule {
    escalation_delay_in_minutes = 15
    target {
      type = "user_reference"
      id   = pagerduty_user.manager.id
    }
  }
}
```

Schedule Configuration

resource "pagerduty_schedule" "primary" { name = "Primary On-Call" time_zone = "America/New_York"

layer { name = "Weekly Rotation" start = "2024-01-01T00:00:00-05:00" rotation_virtual_start = "2024-01-01T00:00:00-05:00" rotation_turn_length_seconds = 604800 # 1 week users = [for user in pagerduty_user.oncall : user.id] }

Override layer for holidays

layer { name = "Holiday Coverage" start = "2024-01-01T00:00:00-05:00" rotation_virtual_start = "2024-01-01T00:00:00-05:00" rotation_turn_length_seconds = 86400 users = [pagerduty_user.holiday_coverage.id]

restriction {
  type              = "daily_restriction"
  start_time_of_day = "00:00:00"
  duration_seconds  = 86400
  start_day_of_week = 0  # Sunday
}

} }

Grafana OnCall

Integration Setup

```yaml
# docker-compose.yml addition
services:
  oncall:
    image: grafana/oncall
    environment:
      - SECRET_KEY=your-secret-key
      - BASE_URL=http://oncall:8080
      - GRAFANA_API_URL=http://grafana:3000
    ports:
      - "8080:8080"
```

Escalation Chain

```yaml
# Example escalation chain structure
escalation_chains:
  - name: "Production Critical"
    steps:
      - step: 1
        type: notify
        persons:
          - "@oncall-primary"
        wait_delay: 0
      - step: 2
        type: notify
        persons:
          - "@oncall-secondary"
        wait_delay: 5m
      - step: 3
        type: notify
        persons:
          - "@engineering-manager"
        wait_delay: 10m
      - step: 4
        type: trigger_action
        action: "escalate_to_incident_commander"
        wait_delay: 15m
```

Alert Templates

Slack Alert Template

{{ define "slack.title" }} [{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] {{ .CommonLabels.alertname }} {{ end }}

{{ define "slack.text" }} {{ range .Alerts }} Alert: {{ .Annotations.summary }} Severity: {{ .Labels.severity }} Description: {{ .Annotations.description }} Runbook: {{ .Annotations.runbook_url }} {{ end }} {{ end }}

PagerDuty Details Template

{{ define "pagerduty.firing" }} {{ range .Alerts.Firing }} Alert: {{ .Labels.alertname }} Service: {{ .Labels.service }} Instance: {{ .Labels.instance }} Value: {{ .Annotations.value }} Started: {{ .StartsAt.Format "2006-01-02 15:04:05" }} {{ end }} {{ end }}

On-Call Best Practices

Rotation Guidelines

```yaml
on_call_guidelines:
  rotation_length: 1 week
  handoff_time: "10:00 AM Monday"

  responsibilities:
    - Monitor alerts during shift
    - "Respond within SLA (critical: 5min, high: 15min)"
    - Document incidents
    - Hand off unresolved issues

  support:
    - Secondary on-call for backup
    - Clear escalation path
    - Manager availability for major incidents

  wellness:
    - Maximum 1 week on-call per month
    - Comp time after high-alert periods
    - No-interrupt recovery day after shift
```

Runbook Template

Alert: High Error Rate

Summary

Error rate has exceeded the threshold of 5% for the service.

Impact

Users may experience errors when accessing the application.

Investigation Steps

  1. Check service logs: kubectl logs -l app=myapp -n production
  2. Review recent deployments: kubectl rollout history deployment/myapp
  3. Check database connectivity: kubectl exec -it myapp -- nc -zv postgres 5432
  4. Review error traces in APM dashboard

Remediation

If caused by a recent deployment, roll back:

```bash
kubectl rollout undo deployment/myapp -n production
```

If database related, restart the database pods:

```bash
kubectl delete pod -l app=postgres -n production
```

Escalation

If not resolved within 15 minutes, escalate to:

- Database team: @db-oncall

- Platform team: @platform-oncall

Alert Fatigue Reduction

Strategies

```yaml
fatigue_reduction:
  aggregate_alerts:
    - Group related alerts
    - Use inhibit rules
    - Implement alert correlation
    
  tune_thresholds:
    - Base on SLOs, not arbitrary values
    - Account for normal variance
    - Use appropriate evaluation windows
    
  automate_responses:
    - Auto-remediation for known issues
    - Self-healing infrastructure
    - Automated scaling
    
  regular_review:
    - Weekly alert review
    - Remove unused alerts
    - Update thresholds based on data
```

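One way to base thresholds on SLOs rather than arbitrary values is a multi-window burn-rate alert, a pattern popularized by the Google SRE Workbook. A minimal sketch, assuming a 99.9% availability SLO and the http_requests_total metrics used earlier; the 14.4x factor and the 1h/5m windows are the commonly cited fast-burn values, not something prescribed by this skill:

```yaml
- alert: ErrorBudgetFastBurn
  expr: |
    (
      sum(rate(http_requests_total{status=~"5.."}[1h]))
        / sum(rate(http_requests_total[1h])) > 14.4 * 0.001
    )
    and
    (
      sum(rate(http_requests_total{status=~"5.."}[5m]))
        / sum(rate(http_requests_total[5m])) > 14.4 * 0.001
    )
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "Error budget burning at >14x the sustainable rate for the 99.9% SLO"
```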
Common Issues

Issue: Alert Storm

Problem: Too many alerts firing simultaneously
Solution: Implement proper grouping and inhibition rules

Issue: Missed Alerts

Problem: Critical alerts not reaching on-call
Solution: Test escalation policies, verify contact methods
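One hedged way to verify the delivery path continuously is an always-firing heartbeat alert routed to a dead man's switch; a sketch (the external check that pages when the heartbeat stops arriving is assumed, not shown):

```yaml
- alert: Watchdog
  expr: vector(1)   # always firing on purpose
  labels:
    severity: none
  annotations:
    summary: "Heartbeat alert; if it stops arriving, the alerting pipeline itself is broken"
```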

Issue: False Positives

Problem: Alerts firing without actual issues
Solution: Tune thresholds, increase evaluation windows
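A sketch of what "tune thresholds, increase evaluation windows" might look like for the HighErrorRate rule above: widen the rate window, require the condition to persist, and ignore services with negligible traffic (the 1 request/second floor is an illustrative value, not a recommendation):

```yaml
- alert: HighErrorRate
  expr: |
    (
      sum(rate(http_requests_total{status=~"5.."}[10m])) by (service)
        / sum(rate(http_requests_total[10m])) by (service) > 0.05
    )
    and sum(rate(http_requests_total[10m])) by (service) > 1
  for: 15m
  labels:
    severity: critical
```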

Best Practices

- Define clear severity levels

- Every alert needs a runbook

- Test on-call notifications regularly

- Review and tune alerts weekly

- Implement proper escalation paths

- Use alert grouping and inhibition

- Track alert metrics (MTTR, frequency)

- Practice incident response regularly

Related Skills

- prometheus-grafana - Monitoring setup

- incident-response - Incident handling

- runbook-creation - Runbook creation

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.
