DevOps Automation
Automate DevOps workflows, including CI/CD pipelines, monitoring, incident management, and infrastructure operations. The patterns here are based on n8n's IT Ops workflow templates.
Overview
This skill covers:
- CI/CD pipeline automation
- Monitoring and alerting
- Incident management
- Infrastructure automation
- Deployment workflows
CI/CD Automation
GitHub Actions Integration
workflow: "GitHub CI/CD Notifications"
triggers:
  - github_push
  - github_pull_request
  - github_workflow_run
on_push:
  action:
    - trigger_ci: if_main_branch
    - notify_slack:
        channel: "#deployments"
        message: |
          📦 *New Push to {branch}*
          Commit: `{commit_sha_short}`
          Author: {author}
          Message: {commit_message}
          [View Diff]({compare_url})
on_pr_opened:
  action:
    - notify_slack:
        channel: "#code-review"
        message: |
          🔀 *New Pull Request*
          Title: {pr_title}
          Author: {author}
          Branch: {head} → {base}
          [Review PR]({pr_url})
    - assign_reviewers: based_on_codeowners
    - run_ci_checks
on_workflow_complete:
  action:
    - notify_slack:
        message: |
          {status_emoji} *Build {status}*
          Workflow: {workflow_name}
          Branch: {branch}
          Duration: {duration}
          {if_failed: [View Logs]({logs_url})}
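The `on_push` message above is a plain placeholder template, so it can be filled with ordinary string formatting. The sketch below is one possible rendering, assuming the event dict follows the shape of a GitHub push webhook payload (`ref`, `after`, `pusher`, `head_commit`, `compare`); the function name `render_push_message` is hypothetical and not part of n8n.

```python
PUSH_TEMPLATE = (
    "📦 *New Push to {branch}*\n"
    "Commit: `{commit_sha_short}`\n"
    "Author: {author}\n"
    "Message: {commit_message}\n"
    "[View Diff]({compare_url})"
)


def render_push_message(event: dict) -> str:
    """Fill the push template from a push-event dict (field names are assumed)."""
    # "refs/heads/feature/x" -> "feature/x"
    branch = event["ref"].split("/", 2)[-1]
    # Use only the first line of the commit message; tolerate an empty message.
    lines = event["head_commit"]["message"].splitlines() or [""]
    return PUSH_TEMPLATE.format(
        branch=branch,
        commit_sha_short=event["after"][:7],
        author=event["pusher"]["name"],
        commit_message=lines[0],
        compare_url=event["compare"],
    )
```

The rendered string would then be passed as the message text of whatever Slack step the workflow uses.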
Deployment Pipeline
deployment_pipeline:
  stages:
    build:
      trigger: push_to_main
      steps:
        - checkout_code
        - install_dependencies
        - run_tests
        - build_artifact
        - push_to_registry
    staging:
      trigger: build_success
      steps:
        - deploy_to_staging
        - run_integration_tests
        - notify_qa
    production:
      trigger: manual_approval
      steps:
        - create_backup
        - deploy_to_production
        - run_smoke_tests
        - notify_team
  rollback:
    trigger: deployment_failed OR manual
    steps:
      - revert_to_previous
      - notify_team
      - create_incident
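The stage ordering and rollback trigger above can be sketched as a small runner. This is a minimal illustration, not an n8n implementation: the `Step` type and both function names are hypothetical, each step is assumed to return `True` on success, and the manual approval gate before production is left out.

```python
from typing import Callable, Dict, List

Step = Callable[[], bool]  # hypothetical step type: returns True on success


def run_stage(steps: List[Step]) -> bool:
    """Run steps in order and stop at the first one that fails."""
    return all(step() for step in steps)


def run_pipeline(stages: Dict[str, List[Step]], rollback: List[Step]) -> str:
    """Run stages in order; if the production stage fails, run the rollback steps."""
    for name, steps in stages.items():
        if run_stage(steps):
            continue
        if name == "production":
            # Run every rollback step, even if an earlier one reports failure.
            for step in rollback:
                step()
        return f"failed:{name}"
    return "success"
```

A failed build or staging stage simply stops the pipeline here, on the assumption that nothing has been deployed to production yet and so there is nothing to revert.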
Monitoring & Alerting
Alert Routing
alert_routing:
  sources:
    - prometheus
    - datadog
    - cloudwatch
    - new_relic
  severity_levels:
    critical:
      response_time: 5_minutes
      channels: [pagerduty, slack_urgent, sms]
      escalation: immediate
    high:
      response_time: 15_minutes
      channels: [slack_alerts, email]
      escalation: after_15_minutes
    medium:
      response_time: 1_hour
      channels: [slack_alerts]
    low:
      response_time: 24_hours
      channels: [slack_logging]
  routing_rules:
    - if: service == "payments"
      team: payments_oncall
      severity_boost: +1
    - if: service == "auth"
      team: security_oncall
    - default:
        team: platform_oncall
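The routing rules above amount to a lookup by service followed by a severity adjustment. The sketch below shows one way to express that, assuming `severity_boost: +1` means "raise the severity by one level, capped at critical"; the source does not define the boost precisely, and the function name `route_alert` is hypothetical.

```python
SEVERITIES = ["low", "medium", "high", "critical"]  # ordered least to most severe

CHANNELS = {
    "critical": ["pagerduty", "slack_urgent", "sms"],
    "high": ["slack_alerts", "email"],
    "medium": ["slack_alerts"],
    "low": ["slack_logging"],
}

# service -> (on-call team, severity boost in levels)
ROUTES = {
    "payments": ("payments_oncall", 1),
    "auth": ("security_oncall", 0),
}
DEFAULT_ROUTE = ("platform_oncall", 0)


def route_alert(service: str, severity: str) -> dict:
    """Pick the team and channels for an alert, applying any severity boost."""
    team, boost = ROUTES.get(service, DEFAULT_ROUTE)
    # Assumption: a boost moves up the severity list and never exceeds critical.
    level = min(SEVERITIES.index(severity) + boost, len(SEVERITIES) - 1)
    boosted = SEVERITIES[level]
    return {"team": team, "severity": boosted, "channels": CHANNELS[boosted]}
```

For example, a `high` alert from the payments service would be routed to `payments_oncall` as `critical` and fan out to the critical channels.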
Alert Templates
alert_templates:
  infrastructure:
    cpu_high:
      title: "🔥 High CPU Usage"
      body: |
        Server: {host}
        CPU: {cpu_percent}%
        Duration: {duration}
        Threshold: {threshold}%
        [View Dashboard]({grafana_url})
    memory_critical:
      title: "💾 Critical Memory"
      body: |
        Server: {host}
        Memory: {memory_percent}%
        Available: {available_mb}MB
        [SSH to Server]({ssh_link})
    disk_full:
      title: "💿 Disk Space Critical"
      body: |
        Server: {host}
        Disk: {disk_percent}%
        Available: {available_gb}GB
        Suggestion: Clean logs or expand volume
  application:
    error_spike:
      title: "📈 Error Rate Spike"
      body: |
        Service: {service}
        Error Rate: {error_rate}%
        Normal: {baseline}%
        Top Errors:
        {top_errors}
    latency_high:
      title: "🐢 High Latency"
      body: |
        Service: {service}
        P99 Latency: {p99_ms}ms
        Threshold: {threshold_ms}ms
Incident Management
Incident Workflow
incident_workflow:
  detection:
    sources: [monitoring, user_report, automated_check]
  triage:
    auto_severity:
      - if: affects_payments
        severity: critical
      - if: affects_auth
        severity: critical
      - if: affects_api AND error_rate > 10%
        severity: high
  response:
    critical:
      - create_incident_channel: "#inc-{timestamp}"
      - page_oncall: immediately
      - notify_stakeholders: [engineering_lead, product]
      - start_war_room: zoom_link
      - create_status_page: incident
    high:
      - create_incident_channel
      - notify_oncall: slack
      - create_ticket: jira
  communication:
    internal:
      frequency: every_30_minutes
      channel: incident_channel
      template: |
        📊 *Incident Update*
        Status: {status}
        Impact: {impact}
        Next update: {next_update_time}
        Current actions:
        {action_items}
    external:
      channel: status_page
      template: customer_facing_update
  resolution:
    steps:
      - confirm_resolution
      - update_status_page: resolved
      - notify_stakeholders
      - schedule_postmortem
      - close_incident_channel: after_24h
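The `auto_severity` rules in the triage step above are evaluated top to bottom, which can be sketched as a short function. The function name and parameters are hypothetical; the source does not say what happens when no rule matches, so this sketch returns `None` and assumes such incidents fall back to manual triage.

```python
from typing import Optional


def auto_severity(
    affects_payments: bool,
    affects_auth: bool,
    affects_api: bool,
    error_rate_percent: float,
) -> Optional[str]:
    """Apply the triage rules in order and return the first matching severity."""
    if affects_payments or affects_auth:
        return "critical"
    if affects_api and error_rate_percent > 10:
        return "high"
    # Assumption: no rule matched, so severity is left for a human to assign.
    return None
```

Note that the API rule uses a strict comparison, so an error rate of exactly 10% does not trigger it.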
Postmortem Template
postmortem_template:
  sections:
    summary:
      - incident_title
      - duration
      - severity
      - impact
    timeline:
      format: |
        | Time | Event |
        |------|-------|
        | {time} | {event} |
    root_cause:
      - what_happened
      - why_it_happened
      - contributing_factors
    impact:
      - users_affected
      - revenue_impact
      - sla_breach
    resolution:
      - how_it_was_fixed
      - time_to_detect
      - time_to_resolve
    action_items:
      format: |
        | Action | Owner | Due Date | Status |
        |--------|-------|----------|--------|
    lessons_learned:
      - what_went_well
      - what_went_poorly
      - lucky_breaks
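Two parts of the postmortem template above lend themselves to automation: the `time_to_detect` and `time_to_resolve` figures and the timeline table. The sketch below is illustrative only. Both function names are hypothetical, and it assumes both durations are measured from the moment the incident started, which the source does not specify.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Tuple


def resolution_metrics(
    started: datetime, detected: datetime, resolved: datetime
) -> Dict[str, timedelta]:
    """Compute detection and resolution times, both measured from incident start."""
    return {
        "time_to_detect": detected - started,
        "time_to_resolve": resolved - started,
    }


def render_timeline(events: List[Tuple[datetime, str]]) -> str:
    """Render (time, event) pairs as the Markdown table used in the template."""
    rows = ["| Time | Event |", "|------|-------|"]
    rows += [f"| {when:%H:%M} | {what} |" for when, what in events]
    return "\n".join(rows)
```

If a team measures resolution time from detection instead, only the second subtraction changes.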
Infrastructure Automation
Server Provisioning
provisioning_workflow:
  trigger: jira_ticket OR slack_request
  steps:
    1. validate_request:
      check: [budget_approval, security_review]
    2. create_infrastructure:
      terraform:
        - vpc
        - security_groups
        - ec2_instances
        - load_balancer
    3. configure_server:
      ansible:
        - base_configuration
        - security_hardening
        - monitoring_agent
        - application_setup
    4. validate:
      - health_check
      - security_scan
      - performance_baseline
    5. notify:
      slack: "✅ Server {hostname} is ready"
      include: [ssh_access, dashboard_link]
Scheduled Maintenance
maintenance_automation:
  tasks:
    certificate_renewal:
      schedule: "30 days before expiry"
      action:
        - request_new_cert: letsencrypt
        - deploy_cert
        - verify_ssl
        - notify: if_failure
    security_patching:
      schedule: "weekly"
      action:
        - check_updates
        - if_critical: immediate_patch
        - else: schedule_maintenance_window
    log_rotation:
      schedule: "daily"
      action:
        - rotate_logs
        - compress_old
        - upload_to_s3
        - delete_local: older_than_7_days
    backup_verification:
      schedule: "weekly"
      action:
        - restore_to_test_env
        - run_integrity_checks
        - report_status
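The `certificate_renewal` schedule above ("30 days before expiry") reduces to a date comparison that a daily job could run. This is a minimal sketch with a hypothetical function name; it assumes the expiry date has already been read from the certificate by some other step.

```python
from datetime import date, timedelta

RENEWAL_WINDOW = timedelta(days=30)


def needs_renewal(expires_on: date, today: date) -> bool:
    """Return True once the certificate is within 30 days of expiry (or past it)."""
    return expires_on - today <= RENEWAL_WINDOW
```

Passing `today` explicitly rather than calling `date.today()` inside the function keeps the check easy to test.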
Kubernetes Automation
K8s Workflows
kubernetes_automation:
  deployment:
    trigger: docker_image_pushed
    steps:
      - update_manifest: with_new_image_tag
      - apply_to_staging
      - run_tests
      - if_success: apply_to_production
  scaling:
    trigger: metric_threshold
    rules:
      - if: cpu > 80%
        action: scale_up
        max_replicas: 10
      - if: cpu < 20%
        action: scale_down
        min_replicas: 2
  rollback:
    trigger: health_check_failed
    action:
      - kubectl_rollout_undo
      - notify_team
      - create_incident
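The scaling rules above can be read as a function from the current replica count and CPU usage to a desired replica count. The sketch below assumes each rule changes the count by one replica per evaluation, which the source does not state; the function name `desired_replicas` is hypothetical. In a real cluster this decision is usually delegated to a HorizontalPodAutoscaler rather than computed by hand.

```python
MAX_REPLICAS = 10
MIN_REPLICAS = 2


def desired_replicas(current: int, cpu_percent: float) -> int:
    """Apply the scale-up and scale-down thresholds, respecting the replica bounds."""
    if cpu_percent > 80:
        # Assumption: scale up one replica at a time, never above the maximum.
        return min(current + 1, MAX_REPLICAS)
    if cpu_percent < 20:
        # Assumption: scale down one replica at a time, never below the minimum.
        return max(current - 1, MIN_REPLICAS)
    return current
```

Between 20% and 80% CPU the count is left unchanged, which gives the rules a wide dead band and helps avoid flapping.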
Output Example
Request: "Set up deployment notifications for GitHub Actions"
Output:
# GitHub Actions Deployment Notifications
## n8n Workflow
```yaml
trigger: GitHub Webhook
events: [workflow_run]
```
## Notification Templates
Build Started:
🚀 *Deployment Started*
Branch: main
Commit: abc1234
Author: @developer
Triggered by: Push
[View Workflow](https://github.com/...)
Build Success:
✅ *Deployment Successful*
Environment: Production
Duration: 3m 42s
Version: v1.2.3
Changes:
• Feature X
• Bug fix Y
[View Deployment](https://app.example.com)
Build Failed:
❌ *Deployment Failed*
Stage: Test
Error: npm test failed
[View Logs](https://github.com/...)
[Retry](https://github.com/...)
## Slack Integration
channel: "#deployments"
mention_on_failure: "@oncall"
thread_replies: true
---
*DevOps Automation Skill - Part of Claude Office Skills*