Runbook Creation Skill

When to Use This Skill

Use this skill when:

Runbook Creation tasks - Working on operational runbook templates for incident response and procedures
Planning or design - Need guidance on Runbook Creation approaches
Best practices - Want to follow established patterns and standards

Overview

Create operational runbooks for incident response, maintenance procedures, and operational tasks.

MANDATORY: Documentation-First Approach

Before creating runbooks:

Invoke docs-management skill for runbook patterns
Verify SRE best practices via MCP servers (perplexity)
Base guidance on Google SRE principles

Runbook Types

Runbook Categories:

┌───────── │ Incident Response Runbooks │ • Alert-triggered procedures │ • Escalation paths │ • Communication templates ├───────── │ Operational Runbooks │ • Deployment procedures │ • Maintenance tasks │ • Backup/restore operations ├───────── │ Troubleshooting Runbooks │ • Diagnostic procedures │ • Common issue resolution │ • Debug workflows ├───────── │ Emergency Runbooks │ • Disaster recovery │ • Security incident response │ • Business continuity └───────── ────────────────────────────────────────────────────────────────────┐ │ │ │ │ ────────────────────────────────────────────────────────────────────┤ │ │ │ │ ────────────────────────────────────────────────────────────────────┤ │ │ │ │ ────────────────────────────────────────────────────────────────────┤ │ │ │ │ ────────────────────────────────────────────────────────────────────┘

Standard Runbook Template

Runbook: [TITLE]

Property	Value
ID	RB-[NUMBER]
Category	[Incident/Operational/Troubleshooting/Emergency]
Service	[Service Name]
Owner	[Team/Individual]
Last Updated	[YYYY-MM-DD]
Last Tested	[YYYY-MM-DD]
Review Frequency	[Quarterly/Monthly/Annually]

Overview

Purpose: [What this runbook helps you accomplish]

When to Use: [Conditions that trigger this runbook]

Expected Outcome: [What success looks like]

Estimated Duration: [Time to complete]

Prerequisites

Required Access

[System/Tool 1] - [Role/Permission needed]
[System/Tool 2] - [Role/Permission needed]

Required Knowledge

[Skill/Knowledge 1]
[Skill/Knowledge 2]

Tools Needed

Tool	Purpose	Access URL
[Tool 1]	[Purpose]	[URL/Link]
[Tool 2]	[Purpose]	[URL/Link]

Quick Reference

Quick Commands:
┌────────────────────────────────────────────────────────────────┐
│ Check service status: kubectl get pods -n [namespace]          │
│ View logs: kubectl logs -f [pod-name] -n [namespace]           │
│ Restart service: kubectl rollout restart deployment/[name]     │
│ Check metrics: [monitoring-url]                                │
└────────────────────────────────────────────────────────────────┘

Procedure

Step 1: [Step Name]

Objective: [What this step accomplishes]

Actions:

- 
[Action 1]

# Command example
kubectl get pods -n production

- 
[Action 2]

Expected Result: [What you should see]

If This Fails: Go to Troubleshooting Section

Step 2: [Step Name]

Objective: [What this step accomplishes]

Actions:

- [Action 1]

- [Action 2]

Decision Point:

┌─────────────────────────────────────┐
│ Is the service responding?          │
│                                     │
│ YES → Continue to Step 3            │
│ NO  → Go to Step 4 (Escalation)     │
└─────────────────────────────────────┘

Step 3: [Verification]

Objective: Verify the issue is resolved

Verification Checklist:

-  Service is responding to health checks

-  Metrics show normal values

-  No new errors in logs

-  Users can access the service

Troubleshooting

Issue: [Common Issue 1]

Symptoms: [What you observe]

Cause: [Root cause]

Resolution:

- [Step 1]

- [Step 2]

Issue: [Common Issue 2]

Symptoms: [What you observe]

Cause: [Root cause]

Resolution:

- [Step 1]

- [Step 2]

Escalation

When to Escalate

-  Issue not resolved after [X] minutes

-  Impact affects [threshold]

-  Required access not available

-  Unsure of next steps

Escalation Path

Level
Contact
Method
Response Time

L1
On-call Engineer
PagerDuty
15 min

L2
Team Lead
Slack #incidents
30 min

L3
Engineering Manager
Phone
1 hour

L4
VP Engineering
Phone
As needed

Communication

Status Updates

Template:

[TIMESTAMP] - [SERVICE] - [STATUS]

Current Status: [Investigating/Identified/Monitoring/Resolved]
Impact: [Description of user impact]
Next Update: [Time of next update]

Actions Taken:
- [Action 1]
- [Action 2]

Next Steps:
- [Planned action]

Stakeholder Notification

Stakeholder
When to Notify
Method

Engineering
Immediately
Slack

Product
If user-impacting
Slack

Support
If customer-facing
Email

Leadership
If SEV1/SEV2
Phone

Post-Incident

Cleanup Tasks

-  Remove any temporary fixes

-  Update monitoring/alerts if needed

-  Document any new learnings

Post-Incident Review

-  Schedule post-mortem meeting

-  Gather timeline and evidence

-  Identify action items

Appendix

Related Runbooks

- [RB-XXX: Related Runbook 1]

- [RB-YYY: Related Runbook 2]

Reference Documentation

- [Link to architecture docs]

- [Link to service docs]

Revision History

Version
Date
Author
Changes

1.0
[Date]
[Name]
Initial version

1.1
[Date]
[Name]
[Changes]

Incident Response Runbook Template

# Incident Runbook: [Alert Name]

| Property | Value |
|----------|-------|
| **Alert** | [Alert Name/ID] |
| **Severity** | [SEV1/SEV2/SEV3/SEV4] |
| **Service** | [Service Name] |
| **SLO Impact** | [Which SLO is affected] |

---

## Alert Details

**Trigger Condition:**
```text

[Alert query/condition]
Example: error_rate > 1% for 5 minutes

Alert Meaning: [What this alert indicates]

False Positive Indicators: [Signs this might be a false alarm]

Immediate Actions (First 5 Minutes)

1. Acknowledge Alert

# Acknowledge in PagerDuty
pd incident:acknowledge

# Or via Slack
/pd ack

2. Assess Impact

Quick Health Checks:

# Check service status
curl -s https://api.example.com/health | jq .

# Check error rate
kubectl logs -l app=service --tail=100 | grep -c ERROR

# Check pod status
kubectl get pods -n production -l app=service

Impact Assessment:

Check
Command
Expected
Actual

Health endpoint
curl /health

200 OK
[Result]

Error rate
grep ERROR

&#x3C; 10
[Result]

Pod status
kubectl get pods

Running
[Result]

3. Initial Communication

Post in #incidents:

🔴 INCIDENT: [Service] - [Brief Description]
Severity: [SEV level]
Impact: [User impact]
Status: Investigating
Lead: @[your-name]

Diagnosis

Common Causes and Checks

Cause 1: High Traffic

# Check request rate
kubectl top pods -n production -l app=service

# Check HPA status
kubectl get hpa -n production

If traffic spike confirmed:

- Scale replicas: kubectl scale deployment/service --replicas=10

- Enable rate limiting if available

Cause 2: Database Issues

# Check database connections
kubectl exec -it [pod] -- psql -c "SELECT count(*) FROM pg_stat_activity;"

# Check slow queries
kubectl logs -l app=service | grep "slow query"

If database issues:

- Check connection pool exhaustion

- Look for long-running queries

- Consider read replica failover

Cause 3: Dependency Failure

# Check external dependencies
curl -s https://status.dependency.com/api/v2/status.json | jq .

# Check circuit breaker status
kubectl logs -l app=service | grep "circuit"

If dependency failure:

- Verify external service status

- Check for timeout configuration

- Consider enabling fallback behavior

Resolution Steps

Quick Fixes

Issue
Quick Fix
Command

Pod crash loop
Restart deployment
kubectl rollout restart deployment/service

Memory pressure
Increase limits
kubectl edit deployment/service

Config error
Rollback config
kubectl rollout undo deployment/service

Rollback Procedure

# List recent deployments
kubectl rollout history deployment/service -n production

# Rollback to previous version
kubectl rollout undo deployment/service -n production

# Rollback to specific revision
kubectl rollout undo deployment/service -n production --to-revision=2

Resolution Verification

Verification Checklist:

-  Alert has cleared

-  Health checks passing

-  Error rate below threshold

-  No user complaints in support channels

-  Metrics returning to baseline

Monitoring Period: Monitor for 15 minutes after resolution

Closure

Update Status

✅ RESOLVED: [Service] - [Brief Description]
Duration: [X] minutes
Root Cause: [Brief cause]
Resolution: [What fixed it]
Follow-up: [Any action items]

Post-Incident Tasks

-  Update incident timeline

-  Create post-mortem doc if SEV1/SEV2

-  File tickets for follow-up work

-  Update runbook if needed

Database Failover Runbook

# Runbook: Database Failover

| Property | Value |
|----------|-------|
| **ID** | RB-DB-001 |
| **Category** | Emergency |
| **Service** | PostgreSQL Primary |
| **Owner** | Platform Team |
| **Last Tested** | 2025-01-15 |

---

## Overview

**Purpose:** Failover from primary database to replica when primary is unavailable.

**When to Use:**
- Primary database unresponsive for > 5 minutes
- Primary database corruption detected
- Planned maintenance requiring failover

**Expected Outcome:** Application traffic routed to new primary

**Estimated Duration:** 15-30 minutes

---

## Prerequisites

### Required Access

- [ ] Azure Portal - Contributor on resource group
- [ ] kubectl - cluster-admin
- [ ] Database credentials - postgres superuser

### Pre-Failover Checks

```bash
# Verify replica is healthy and caught up
az postgres flexible-server replica list --resource-group rg-prod --name pg-primary

# Check replication lag
psql -h pg-replica.postgres.database.azure.com -U postgres -c \
  "SELECT pg_last_wal_receive_lsn() - pg_last_wal_replay_lsn() AS lag_bytes;"

Acceptable lag: &#x3C; 1MB

Failover Procedure

Step 1: Confirm Primary is Unavailable

# Test primary connectivity
psql -h pg-primary.postgres.database.azure.com -U postgres -c "SELECT 1;"

# Check Azure status
az postgres flexible-server show --resource-group rg-prod --name pg-primary --query "state"

Expected: Connection timeout or error state

Step 2: Notify Stakeholders

🔴 DATABASE FAILOVER INITIATED
Target: pg-primary → pg-replica
Reason: [Primary unavailable/Maintenance/etc.]
Expected Downtime: 5-10 minutes

Step 3: Promote Replica

# Promote replica to primary (Azure Flexible Server)
az postgres flexible-server replica stop-replication \
  --resource-group rg-prod \
  --name pg-replica

# Verify promotion
az postgres flexible-server show \
  --resource-group rg-prod \
  --name pg-replica \
  --query "replicationRole"

Expected: replicationRole: None
 (standalone)

Step 4: Update Connection Strings

# Update Kubernetes secret
kubectl create secret generic db-connection \
  --from-literal=host=pg-replica.postgres.database.azure.com \
  --dry-run=client -o yaml | kubectl apply -f -

# Restart applications to pick up new connection
kubectl rollout restart deployment -l uses-database=true -n production

Step 5: Verify Application Connectivity

# Check application logs
kubectl logs -l app=api-service --tail=50 | grep -i database

# Test application health
curl -s https://api.example.com/health | jq .database

Post-Failover

Immediate Tasks

-  Verify all applications connected to new primary

-  Check for data consistency

-  Monitor error rates

Recovery Tasks (Next 24 Hours)

-  Investigate original primary failure

-  Create new replica from new primary

-  Update DNS/connection strings permanently

-  Document incident and learnings

Rollback

If failover causes issues:

# If original primary is recoverable
# Stop writes to new primary
kubectl scale deployment --replicas=0 -l uses-database=true -n production

# Restore original primary
az postgres flexible-server update --resource-group rg-prod --name pg-primary --state Enabled

# Revert connection strings
kubectl create secret generic db-connection \
  --from-literal=host=pg-primary.postgres.database.azure.com \
  --dry-run=client -o yaml | kubectl apply -f -

# Restart applications
kubectl rollout restart deployment -l uses-database=true -n production

Runbook Quality Checklist

Criterion
Description
Check

Actionable
Every step has a specific action
[ ]

Testable
Can be practiced in non-prod
[ ]

Current
Reflects current system state
[ ]

Complete
Covers happy and error paths
[ ]

Accessible
Available during incidents
[ ]

Versioned
Changes tracked with dates
[ ]

Workflow

When creating runbooks:

- Identify Need: What operation/incident needs documentation?

- Gather Information: Interview operators, review past incidents

- Draft Runbook: Use appropriate template

- Validate Steps: Walk through with subject matter expert

- Test in Non-Prod: Execute runbook in staging

- Publish: Add to runbook collection

- Train Team: Ensure operators know where to find it

- Maintain: Review and update regularly

References

For detailed guidance:

Last Updated: 2025-12-26

runbook-creation

Safety Notice

Copy this and send it to your AI assistant to learn

Runbook: [TITLE]

Overview

Prerequisites

Required Access

Required Knowledge

Tools Needed

Quick Reference

Source Transparency

Related Skills

design-thinking

plantuml-syntax

system-prompt-engineering