# Incident Runbook Generator

Create actionable runbooks for common incidents.

## Runbook Template
## Runbook: Database Connection Pool Exhausted

**Severity:** P1 (Critical)
**Estimated Time to Resolve:** 15-30 minutes
**Owner:** Database Team (On-call)
### Symptoms

- Application errors: "connection pool exhausted"
- Increased API latency (>5s)
- Failed health checks
- CloudWatch alarm: `DatabaseConnectionsHigh`
### Detection

- Alert: `DatabaseConnectionPoolExhausted`
- Metrics: `active_connections > max_connections * 0.9`
- Logs: "Error: connect ETIMEDOUT"
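The metric threshold above can be sketched as a small shell check; the helper name and integer-ratio trick are illustrative, and in production the two inputs would come from `pg_stat_activity` and `SHOW max_connections`:

```shell
#!/usr/bin/env bash
# Illustrative check mirroring the alert rule:
# fire when active_connections > max_connections * 0.9.
check_pool_utilization() {
  local active=$1 max=$2
  # Integer arithmetic: active*10 > max*9  <=>  active > max*0.9
  if [ $((active * 10)) -gt $((max * 9)) ]; then
    echo "ALERT"
  else
    echo "OK"
  fi
}

# Inputs would normally come from the database, e.g.:
#   active=$(psql -Atc "SELECT count(*) FROM pg_stat_activity;")
#   max=$(psql -Atc "SHOW max_connections;")
```

For example, `check_pool_utilization 95 100` prints `ALERT`, while `check_pool_utilization 90 100` prints `OK` because the rule is strictly greater-than.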
### Immediate Actions (5 min)

1. Verify the issue

   ```sql
   -- Check current connections
   SELECT count(*) FROM pg_stat_activity;
   ```
2. Identify long-running queries

   ```sql
   SELECT pid, now() - pg_stat_activity.query_start AS duration, query
   FROM pg_stat_activity
   WHERE state = 'active'
   ORDER BY duration DESC
   LIMIT 10;
   ```
3. Kill blocking queries (if safe)

   ```sql
   SELECT pg_terminate_backend(pid)
   FROM pg_stat_activity
   WHERE state = 'idle in transaction'
     AND now() - state_change > interval '5 minutes';
   ```
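Before terminating anything, it can help to review the candidate sessions first. This sketch wraps a read-only version of the same predicate as the kill query (the helper name is hypothetical):

```shell
#!/usr/bin/env bash
# Emit a read-only survey of the sessions the kill query would terminate,
# so the on-call can confirm nothing critical is included.
idle_candidates_sql() {
  cat <<'SQL'
SELECT pid, usename, application_name, now() - state_change AS idle_for
FROM pg_stat_activity
WHERE state = 'idle in transaction'
  AND now() - state_change > interval '5 minutes';
SQL
}

# Run against the database with: idle_candidates_sql | psql
```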
### Mitigation (10 min)

1. Scale up the connection pool (temporary) by updating the RDS parameter group. Note that `max_connections` is a static parameter in RDS, so the new value only takes effect after a reboot:

   ```bash
   aws rds modify-db-parameter-group \
     --db-parameter-group-name prod-params \
     --parameters "ParameterName=max_connections,ParameterValue=200,ApplyMethod=pending-reboot"
   ```
2. Restart the application (if needed)

   ```bash
   kubectl rollout restart deployment/api
   ```
3. Monitor recovery

   ```bash
   watch -n 5 'psql -c "SELECT count(*) FROM pg_stat_activity;"'
   ```
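Rather than eyeballing `watch` output, the recovery check can be scripted; the threshold of 150 and the polling query are assumptions for illustration:

```shell
#!/usr/bin/env bash
# Returns success once the connection count is back under the threshold.
recovered() {
  local count=$1 threshold=${2:-150}  # 150 is an illustrative default
  [ "$count" -lt "$threshold" ]
}

# Example polling loop (requires psql access to the affected database):
# until recovered "$(psql -Atc 'SELECT count(*) FROM pg_stat_activity;')"; do
#   sleep 5
# done
# echo "connections normalized"
```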
### Root Cause Investigation

Check for:

- Recent deployment (new code with connection leaks)
- Traffic spike (legitimate or DDoS)
- Slow queries holding connections
- Connection pool configuration too small
- Application not releasing connections
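To narrow down which client is holding the connections, a grouping query over `pg_stat_activity` can help; the columns are standard, but the helper name is hypothetical:

```shell
#!/usr/bin/env bash
# Summarize connections by owning application and state; a leaking service
# typically shows up with many 'idle in transaction' rows under its name.
leak_survey_sql() {
  cat <<'SQL'
SELECT application_name, state, count(*) AS connections
FROM pg_stat_activity
GROUP BY application_name, state
ORDER BY connections DESC;
SQL
}

# Run with: leak_survey_sql | psql
```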
### Rollback Steps

If the incident was caused by a deployment:

1. Roll back to the previous version

   ```bash
   kubectl rollout undo deployment/api
   ```

2. Verify the rollout completed

   ```bash
   kubectl rollout status deployment/api
   ```
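The verification step can be scripted by matching kubectl's usual success message; the exact output format is an assumption, so treat this as a sketch:

```shell
#!/usr/bin/env bash
# Succeeds when the piped-in rollout status output reports success.
rollback_ok() {
  grep -q "successfully rolled out"
}

# Example:
#   kubectl rollout status deployment/api | rollback_ok && echo "rollback verified"
```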
### Communication Template

**Initial (within 5 min):**

```
🚨 INCIDENT: Database connection pool exhausted
Status: Investigating
Impact: API errors and slowness
ETA: 15-30 min
Next update: 10 min
```

**Update (every 10 min):**

```
UPDATE: Killed long-running queries
Status: Mitigating
Impact: Still degraded, improving
Actions: Scaling connection pool
Next update: 10 min
```

**Resolution:**

```
✅ RESOLVED: Database connections normalized
Duration: 25 minutes
Root cause: Connection leak in v2.3.4
Fix: Rolled back to v2.3.3
Follow-up: Bug fix PR #1234
Postmortem: [link]
```
### Prevention

- Add connection pool metrics to dashboards
- Implement connection timeout (30s)
- Add connection leak detection in tests
- Set up pre-deployment load testing
- Review connection pool sizing
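The 30s timeout above can also be enforced on the database side. The settings below are real PostgreSQL parameters, but the database name `app` and the values are illustrative:

```shell
#!/usr/bin/env bash
# Emit settings that bound how long a connection can be held:
# statement_timeout caps query runtime, and
# idle_in_transaction_session_timeout reclaims connections stuck
# mid-transaction (the pattern targeted by the kill query above).
timeout_settings_sql() {
  cat <<'SQL'
ALTER DATABASE app SET statement_timeout = '30s';
ALTER DATABASE app SET idle_in_transaction_session_timeout = '5min';
SQL
}

# Apply with: timeout_settings_sql | psql
```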
### Related Runbooks

- Database High CPU
- Slow Database Queries
- Application OOM
### Output Checklist

- Symptoms documented
- Detection criteria
- Step-by-step actions
- Owner assigned
- Rollback procedure
- Communication templates
- Prevention measures