# Runbook Creation Skill
## When to Use This Skill
Use this skill when:
- Runbook Creation tasks - Working on operational runbook templates for incident response and procedures
- Planning or design - Need guidance on Runbook Creation approaches
- Best practices - Want to follow established patterns and standards
## Overview
Create operational runbooks for incident response, maintenance procedures, and operational tasks.
## MANDATORY: Documentation-First Approach
Before creating runbooks:
- Invoke the docs-management skill for runbook patterns
- Verify SRE best practices via MCP servers (perplexity)
- Base guidance on Google SRE principles
## Runbook Types
Runbook Categories:
┌─────────────────────────────────────┐
│ Incident Response Runbooks          │
│   • Alert-triggered procedures      │
│   • Escalation paths                │
│   • Communication templates         │
├─────────────────────────────────────┤
│ Operational Runbooks                │
│   • Deployment procedures           │
│   • Maintenance tasks               │
│   • Backup/restore operations       │
├─────────────────────────────────────┤
│ Troubleshooting Runbooks            │
│   • Diagnostic procedures           │
│   • Common issue resolution         │
│   • Debug workflows                 │
├─────────────────────────────────────┤
│ Emergency Runbooks                  │
│   • Disaster recovery               │
│   • Security incident response      │
│   • Business continuity             │
└─────────────────────────────────────┘
## Standard Runbook Template
Runbook: [TITLE]
| Property | Value |
|---|---|
| ID | RB-[NUMBER] |
| Category | [Incident/Operational/Troubleshooting/Emergency] |
| Service | [Service Name] |
| Owner | [Team/Individual] |
| Last Updated | [YYYY-MM-DD] |
| Last Tested | [YYYY-MM-DD] |
| Review Frequency | [Quarterly/Monthly/Annually] |
Overview
Purpose: [What this runbook helps you accomplish]
When to Use: [Conditions that trigger this runbook]
Expected Outcome: [What success looks like]
Estimated Duration: [Time to complete]
Prerequisites
Required Access
- [System/Tool 1] - [Role/Permission needed]
- [System/Tool 2] - [Role/Permission needed]
Required Knowledge
- [Skill/Knowledge 1]
- [Skill/Knowledge 2]
Tools Needed
| Tool | Purpose | Access URL |
|---|---|---|
| [Tool 1] | [Purpose] | [URL/Link] |
| [Tool 2] | [Purpose] | [URL/Link] |
Quick Reference
Quick Commands:
┌────────────────────────────────────────────────────────────────┐
│ Check service status: kubectl get pods -n [namespace] │
│ View logs: kubectl logs -f [pod-name] -n [namespace] │
│ Restart service: kubectl rollout restart deployment/[name] │
│ Check metrics: [monitoring-url] │
└────────────────────────────────────────────────────────────────┘
Procedure
Step 1: [Step Name]
Objective: [What this step accomplishes]
Actions:
- [Action 1]
  # Command example
  kubectl get pods -n production
- [Action 2]
Expected Result: [What you should see]
If This Fails: Go to Troubleshooting Section
Step 2: [Step Name]
Objective: [What this step accomplishes]
Actions:
- [Action 1]
- [Action 2]
Decision Point:
┌─────────────────────────────────────┐
│ Is the service responding?          │
│                                     │
│ YES → Continue to Step 3            │
│ NO  → Go to Escalation section      │
└─────────────────────────────────────┘
Step 3: [Verification]
Objective: Verify the issue is resolved
Verification Checklist:
- Service is responding to health checks
- Metrics show normal values
- No new errors in logs
- Users can access the service
Troubleshooting
Issue: [Common Issue 1]
Symptoms: [What you observe]
Cause: [Root cause]
Resolution:
- [Step 1]
- [Step 2]
Issue: [Common Issue 2]
Symptoms: [What you observe]
Cause: [Root cause]
Resolution:
- [Step 1]
- [Step 2]
Escalation
When to Escalate
- Issue not resolved after [X] minutes
- Impact affects [threshold]
- Required access not available
- Unsure of next steps
Escalation Path
| Level | Contact | Method | Response Time |
|---|---|---|---|
| L1 | On-call Engineer | PagerDuty | 15 min |
| L2 | Team Lead | Slack #incidents | 30 min |
| L3 | Engineering Manager | Phone | 1 hour |
| L4 | VP Engineering | Phone | As needed |
Communication
Status Updates
Template:
[TIMESTAMP] - [SERVICE] - [STATUS]
Current Status: [Investigating/Identified/Monitoring/Resolved]
Impact: [Description of user impact]
Next Update: [Time of next update]
Actions Taken:
- [Action 1]
- [Action 2]
Next Steps:
- [Planned action]
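The header fields of the status update can be filled in by a small shell function so every update has the same shape. This is a minimal sketch: `format_status_update` is a hypothetical helper, and the "Actions Taken" and "Next Steps" lists are still appended by hand.

```bash
#!/usr/bin/env bash
# Hypothetical helper: render the header of the status update template.
# Usage: format_status_update <timestamp> <service> <status> <impact> <next_update>
format_status_update() {
  local timestamp="$1" service="$2" status="$3" impact="$4" next_update="$5"

  # Only the four states named in the template are accepted.
  case "$status" in
    Investigating|Identified|Monitoring|Resolved) ;;
    *) echo "unknown status: $status" >&2; return 1 ;;
  esac

  printf '%s - %s - %s\n' "$timestamp" "$service" "$status"
  printf 'Current Status: %s\n' "$status"
  printf 'Impact: %s\n' "$impact"
  printf 'Next Update: %s\n' "$next_update"
}
```

Rejecting unknown states keeps updates limited to the vocabulary stakeholders already expect.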
Stakeholder Notification
| Stakeholder | When to Notify | Method |
|---|---|---|
| Engineering | Immediately | Slack |
| Product | If user-impacting | Slack |
| Support | If customer-facing | Email |
| Leadership | If SEV1/SEV2 | Phone |
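The notification rules can be encoded so the incident lead does not have to re-derive them under pressure. A sketch under assumptions: `stakeholders_to_notify` is a hypothetical helper, and the `yes`/`no` flags are an illustrative way to pass the two impact conditions.

```bash
#!/usr/bin/env bash
# Hypothetical helper: list who to notify, and how, for a given incident.
# Usage: stakeholders_to_notify <severity> <user_impacting: yes|no> <customer_facing: yes|no>
stakeholders_to_notify() {
  local severity="$1" user_impacting="$2" customer_facing="$3"

  echo "Engineering (Slack)"   # always notified immediately
  if [ "$user_impacting" = "yes" ]; then echo "Product (Slack)"; fi
  if [ "$customer_facing" = "yes" ]; then echo "Support (Email)"; fi
  case "$severity" in
    SEV1|SEV2) echo "Leadership (Phone)" ;;
  esac
  return 0
}
```

For example, `stakeholders_to_notify SEV2 yes no` lists Engineering, Product, and Leadership.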
Post-Incident
Cleanup Tasks
- Remove any temporary fixes
- Update monitoring/alerts if needed
- Document any new learnings
Post-Incident Review
- Schedule post-mortem meeting
- Gather timeline and evidence
- Identify action items
Appendix
Related Runbooks
- [RB-XXX: Related Runbook 1]
- [RB-YYY: Related Runbook 2]
Reference Documentation
- [Link to architecture docs]
- [Link to service docs]
Revision History
| Version | Date | Author | Changes |
|---|---|---|---|
| 1.0 | [Date] | [Name] | Initial version |
| 1.1 | [Date] | [Name] | [Changes] |
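A draft built from this template can be checked for missing sections before review. This is a minimal sketch, assuming the runbook is saved as Markdown with each top-level section written as a `## ` heading; `missing_sections` is a hypothetical helper and the section list is an illustrative subset of the template.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print every expected section that has no "## <name>" heading.
# Usage: missing_sections <runbook.md>
missing_sections() {
  local file="$1" section
  for section in Overview Prerequisites Procedure Troubleshooting Escalation Communication; do
    # -x matches the whole line, -F treats the heading as a literal string
    if ! grep -qxF "## ${section}" "$file"; then
      echo "$section"
    fi
  done
}
```

An empty result means every listed section is present; it says nothing about whether the content is complete.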
## Incident Response Runbook Template
# Incident Runbook: [Alert Name]
| Property | Value |
|----------|-------|
| **Alert** | [Alert Name/ID] |
| **Severity** | [SEV1/SEV2/SEV3/SEV4] |
| **Service** | [Service Name] |
| **SLO Impact** | [Which SLO is affected] |
---
## Alert Details
**Trigger Condition:**
```text
[Alert query/condition]
Example: error_rate > 1% for 5 minutes
```
**Alert Meaning:** [What this alert indicates]
**False Positive Indicators:** [Signs this might be a false alarm]
Immediate Actions (First 5 Minutes)
1. Acknowledge Alert
# Acknowledge in PagerDuty
pd incident:acknowledge
# Or via Slack
/pd ack
2. Assess Impact
Quick Health Checks:
# Check service status
curl -s https://api.example.com/health | jq .
# Check error rate
kubectl logs -n production -l app=service --tail=100 | grep -c ERROR
# Check pod status
kubectl get pods -n production -l app=service
Impact Assessment:
| Check | Command | Expected | Actual |
|---|---|---|---|
| Health endpoint | `curl /health` | 200 OK | [Result] |
| Error rate | `grep ERROR` | < 10 | [Result] |
| Pod status | `kubectl get pods` | Running | [Result] |
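The expected values in the impact assessment can be turned into pass/fail checks. A sketch under assumptions: the function names are hypothetical, "< 10" is read as fewer than 10 `ERROR` lines in the sampled logs, and the values are collected separately (see the usage comments).

```bash
#!/usr/bin/env bash
# Hypothetical helpers: compare collected values against the expected column.
check_health_code() { [ "$1" = "200" ]; }   # health endpoint returned 200
check_error_count() { [ "$1" -lt 10 ]; }    # fewer than 10 ERROR lines

# Pass only if at least one pod exists and every phase is "Running".
check_pod_phases() {
  local phase
  [ "$#" -gt 0 ] || return 1
  for phase in "$@"; do
    [ "$phase" = "Running" ] || return 1
  done
}

# Usage: report <check name> <actual value> <check command> [args...]
report() {
  local name="$1" actual="$2"
  shift 2
  if "$@"; then
    echo "PASS ${name} (actual: ${actual})"
  else
    echo "FAIL ${name} (actual: ${actual})"
  fi
}

# Example collection (not run here):
#   code=$(curl -s -o /dev/null -w '%{http_code}' https://api.example.com/health)
#   report "Health endpoint" "$code" check_health_code "$code"
#   phases=$(kubectl get pods -n production -l app=service \
#     -o jsonpath='{.items[*].status.phase}')
#   report "Pod status" "$phases" check_pod_phases $phases
```

The `Actual` column can then be filled from the `report` output instead of being read by eye.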
3. Initial Communication
Post in #incidents:
🔴 INCIDENT: [Service] - [Brief Description]
Severity: [SEV level]
Impact: [User impact]
Status: Investigating
Lead: @[your-name]
Diagnosis
Common Causes and Checks
Cause 1: High Traffic
# Check pod CPU/memory usage (a proxy for load; kubectl top does not report request rate)
kubectl top pods -n production -l app=service
# Check HPA status
kubectl get hpa -n production
If traffic spike confirmed:
- Scale replicas: kubectl scale deployment/service --replicas=10 -n production
- Enable rate limiting if available
Cause 2: Database Issues
# Check database connections
kubectl exec -it [pod] -- psql -c "SELECT count(*) FROM pg_stat_activity;"
# Check slow queries
kubectl logs -l app=service | grep "slow query"
If database issues:
- Check connection pool exhaustion
- Look for long-running queries
- Consider read replica failover
Cause 3: Dependency Failure
# Check external dependencies
curl -s https://status.dependency.com/api/v2/status.json | jq .
# Check circuit breaker status
kubectl logs -l app=service | grep "circuit"
If dependency failure:
- Verify external service status
- Check for timeout configuration
- Consider enabling fallback behavior
Resolution Steps
Quick Fixes
| Issue | Quick Fix | Command |
|---|---|---|
| Pod crash loop | Restart deployment | `kubectl rollout restart deployment/service` |
| Memory pressure | Increase limits | `kubectl edit deployment/service` |
| Config error | Rollback config | `kubectl rollout undo deployment/service` |
Rollback Procedure
# List recent deployments
kubectl rollout history deployment/service -n production
# Rollback to previous version
kubectl rollout undo deployment/service -n production
# Rollback to specific revision
kubectl rollout undo deployment/service -n production --to-revision=2
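The rollback commands above can be wrapped so the operator also waits for the rollback to finish. This is a minimal sketch: `rollback_and_wait` is a hypothetical helper, and the 5-minute timeout is an assumed value to tune per service.

```bash
#!/usr/bin/env bash
# Hypothetical helper: roll a deployment back and wait for it to settle.
# Usage: rollback_and_wait <deployment> <namespace> [revision]
rollback_and_wait() {
  local deployment="$1" namespace="$2" revision="${3:-}"

  if [ -n "$revision" ]; then
    kubectl rollout undo "deployment/${deployment}" -n "$namespace" \
      --to-revision="$revision" || return 1
  else
    # Without --to-revision, kubectl rolls back to the previous revision.
    kubectl rollout undo "deployment/${deployment}" -n "$namespace" || return 1
  fi

  # Block until the rollback finishes or the (assumed) timeout expires.
  kubectl rollout status "deployment/${deployment}" -n "$namespace" --timeout=5m
}
```

For example, `rollback_and_wait service production 2` mirrors the "specific revision" command and returns non-zero if the rollout does not complete in time.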
Resolution Verification
Verification Checklist:
- Alert has cleared
- Health checks passing
- Error rate below threshold
- No user complaints in support channels
- Metrics returning to baseline
Monitoring Period: Monitor for 15 minutes after resolution
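The monitoring period can be run as a repeated check rather than watched by hand. A sketch under assumptions: `monitor_until_stable` is a hypothetical helper, and the check command and interval are supplied by the operator.

```bash
#!/usr/bin/env bash
# Hypothetical helper: run a check repeatedly and fail on the first failure.
# Usage: monitor_until_stable <checks> <interval_seconds> <command> [args...]
monitor_until_stable() {
  local checks="$1" interval="$2" i=1
  shift 2
  while [ "$i" -le "$checks" ]; do
    if ! "$@"; then
      echo "check ${i}/${checks} failed" >&2
      return 1
    fi
    # No need to wait after the final check.
    if [ "$i" -lt "$checks" ]; then sleep "$interval"; fi
    i=$((i + 1))
  done
  echo "all ${checks} checks passed"
}
```

For example, `monitor_until_stable 16 60 curl -sf -o /dev/null https://api.example.com/health` probes the health endpoint once a minute across a 15-minute window.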
Closure
Update Status
✅ RESOLVED: [Service] - [Brief Description]
Duration: [X] minutes
Root Cause: [Brief cause]
Resolution: [What fixed it]
Follow-up: [Any action items]
Post-Incident Tasks
- Update incident timeline
- Create post-mortem doc if SEV1/SEV2
- File tickets for follow-up work
- Update runbook if needed
## Database Failover Runbook
# Runbook: Database Failover
| Property | Value |
|----------|-------|
| **ID** | RB-DB-001 |
| **Category** | Emergency |
| **Service** | PostgreSQL Primary |
| **Owner** | Platform Team |
| **Last Tested** | 2025-01-15 |
---
## Overview
**Purpose:** Failover from primary database to replica when primary is unavailable.
**When to Use:**
- Primary database unresponsive for > 5 minutes
- Primary database corruption detected
- Planned maintenance requiring failover
**Expected Outcome:** Application traffic routed to new primary
**Estimated Duration:** 15-30 minutes
---
## Prerequisites
### Required Access
- [ ] Azure Portal - Contributor on resource group
- [ ] kubectl - cluster-admin
- [ ] Database credentials - postgres superuser
### Pre-Failover Checks
```bash
# Verify replica is healthy and caught up
az postgres flexible-server replica list --resource-group rg-prod --name pg-primary
# Check replication lag on the replica (WAL received but not yet replayed, in bytes)
psql -h pg-replica.postgres.database.azure.com -U postgres -c \
  "SELECT pg_last_wal_receive_lsn() - pg_last_wal_replay_lsn() AS lag_bytes;"
```
**Acceptable lag:** < 1MB
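The lag threshold can be checked mechanically before promoting the replica. This is a minimal sketch, assuming the lag query's result has been captured as a plain integer and that 1MB means 1,048,576 bytes; `lag_acceptable` is a hypothetical helper, not part of any CLI.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeed only if replication lag is under the limit.
# Usage: lag_acceptable <lag_bytes> [max_bytes]
lag_acceptable() {
  local lag_bytes="$1" max_bytes="${2:-1048576}"   # default: 1024 * 1024 bytes

  # An empty or non-numeric value (for example a NULL result) is an error, not a pass.
  case "$lag_bytes" in
    ''|*[!0-9]*) echo "lag is not a whole number: '$lag_bytes'" >&2; return 2 ;;
  esac

  [ "$lag_bytes" -lt "$max_bytes" ]
}
```

For example, capture the value with `psql -At -c "..."` and run `lag_acceptable "$lag" || echo "replica is too far behind"`.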
Failover Procedure
Step 1: Confirm Primary is Unavailable
# Test primary connectivity
psql -h pg-primary.postgres.database.azure.com -U postgres -c "SELECT 1;"
# Check Azure status
az postgres flexible-server show --resource-group rg-prod --name pg-primary --query "state"
Expected: Connection timeout or error state
Step 2: Notify Stakeholders
🔴 DATABASE FAILOVER INITIATED
Target: pg-primary → pg-replica
Reason: [Primary unavailable/Maintenance/etc.]
Expected Downtime: 5-10 minutes
Step 3: Promote Replica
# Promote replica to primary (Azure Flexible Server)
az postgres flexible-server replica stop-replication \
--resource-group rg-prod \
--name pg-replica
# Verify promotion
az postgres flexible-server show \
--resource-group rg-prod \
--name pg-replica \
--query "replicationRole"
Expected: replicationRole: None (standalone)
Step 4: Update Connection Strings
# Update Kubernetes secret in the namespace the applications run in
kubectl create secret generic db-connection \
  --from-literal=host=pg-replica.postgres.database.azure.com \
  -n production --dry-run=client -o yaml | kubectl apply -n production -f -
# Restart applications to pick up new connection
kubectl rollout restart deployment -l uses-database=true -n production
Step 5: Verify Application Connectivity
# Check application logs
kubectl logs -n production -l app=api-service --tail=50 | grep -i database
# Test application health
curl -s https://api.example.com/health | jq .database
Post-Failover
Immediate Tasks
- Verify all applications connected to new primary
- Check for data consistency
- Monitor error rates
Recovery Tasks (Next 24 Hours)
- Investigate original primary failure
- Create new replica from new primary
- Update DNS/connection strings permanently
- Document incident and learnings
Rollback
If failover causes issues:
# If original primary is recoverable
# Stop writes to new primary
kubectl scale deployment --replicas=0 -l uses-database=true -n production
# Bring the original primary back online (assumes it was stopped, not deleted)
az postgres flexible-server start --resource-group rg-prod --name pg-primary
# Revert connection strings
kubectl create secret generic db-connection \
  --from-literal=host=pg-primary.postgres.database.azure.com \
  -n production --dry-run=client -o yaml | kubectl apply -n production -f -
# Scale applications back up (they were scaled to 0 above); new pods read the reverted secret
kubectl scale deployment --replicas=[N] -l uses-database=true -n production
## Runbook Quality Checklist
| Criterion | Description | Check |
|---|---|---|
| Actionable | Every step has a specific action | [ ] |
| Testable | Can be practiced in non-prod | [ ] |
| Current | Reflects current system state | [ ] |
| Complete | Covers happy and error paths | [ ] |
| Accessible | Available during incidents | [ ] |
| Versioned | Changes tracked with dates | [ ] |
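The "Current" criterion can be partly automated by comparing a runbook's Last Tested date with its Review Frequency. A sketch under assumptions: `runbook_is_stale` is a hypothetical helper, the day limits per frequency are illustrative, and date parsing relies on GNU `date -d`.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeed (exit 0) when a runbook is overdue for testing.
# Usage: runbook_is_stale <last_tested YYYY-MM-DD> <frequency> [today YYYY-MM-DD]
runbook_is_stale() {
  local last_tested="$1" frequency="$2" today="${3:-$(date +%F)}"
  local max_days then_s now_s

  # Illustrative limits, padded to the longest month, quarter, and year.
  case "$frequency" in
    Monthly)   max_days=31 ;;
    Quarterly) max_days=92 ;;
    Annually)  max_days=366 ;;
    *) echo "unknown review frequency: $frequency" >&2; return 2 ;;
  esac

  # GNU date: convert both dates to epoch seconds (UTC avoids DST skew).
  then_s="$(date -u -d "$last_tested" +%s)" || return 2
  now_s="$(date -u -d "$today" +%s)" || return 2

  [ $(( (now_s - then_s) / 86400 )) -gt "$max_days" ]
}
```

For example, `runbook_is_stale 2025-01-15 Quarterly 2025-12-26` succeeds, flagging a runbook last tested in January as overdue by late December.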
## Workflow
When creating runbooks:
1. Identify Need: What operation/incident needs documentation?
2. Gather Information: Interview operators, review past incidents
3. Draft Runbook: Use appropriate template
4. Validate Steps: Walk through with subject matter expert
5. Test in Non-Prod: Execute runbook in staging
6. Publish: Add to runbook collection
7. Train Team: Ensure operators know where to find it
8. Maintain: Review and update regularly
## References
For detailed guidance:
Last Updated: 2025-12-26