Incident Response
Handle production incidents systematically.
When to Use
- Production is down or degraded
- Critical errors affecting users
- Security incidents
- Data issues
- Performance emergencies
Incident Workflow
DETECT → TRIAGE → MITIGATE → RESOLVE → REVIEW
- Detect & Triage
Quick health checks
curl -s https://api.example.com/health | jq .
kubectl get pods -n production | grep -v Running
Check recent deployments
git log --oneline -5
kubectl rollout history deployment/app
Error rates
grep -c "ERROR" /var/log/app.log
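A raw error count is hard to judge without knowing how busy the log is. The sketch below turns it into a percentage of all log lines. `error_rate` is a hypothetical helper name, and it assumes one log entry per line, matching the same `ERROR` lines that `grep -c` would count.

```shell
# Hypothetical helper: percentage of lines in a log file that contain "ERROR".
# Assumes one log entry per line.
# Usage: error_rate /var/log/app.log
error_rate() {
  awk '
    /ERROR/ { errors++ }
    END {
      # An empty log has no lines to divide by, so report 0.0.
      if (NR == 0) { print "0.0" }
      else         { printf "%.1f\n", 100 * errors / NR }
    }
  ' "$1"
}
```

Comparing the value against a known-good period (for example, yesterday's rotated log) is usually more informative than the absolute number.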
- Mitigate First
Priority: Stop the bleeding first, then look for the root cause
Rollback deployment
kubectl rollout undo deployment/app
Scale up if overloaded
kubectl scale deployment/app --replicas=10
Feature flag disable
curl -X POST https://api.example.com/admin/flags -H 'Content-Type: application/json' -d '{"feature": false}'
Circuit breaker
Block problematic endpoint or dependency
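Under pressure it is easy to run a mitigation against the wrong deployment. One possible safeguard, sketched below, is a wrapper that prints each command by default and only executes it when `DRY_RUN=0` is set. `run` and `mitigate` are hypothetical helper names, and the defaults (`app`, 10 replicas) are taken from the commands above.

```shell
# Sketch: print mitigation commands by default; execute only when DRY_RUN=0.
# run and mitigate are hypothetical helpers, not part of kubectl.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "DRY RUN: $*"
  else
    "$@"
  fi
}

# Usage: mitigate rollback|scale [deployment] [replicas]
mitigate() {
  action=$1
  deployment=${2:-app}
  replicas=${3:-10}
  case "$action" in
    rollback) run kubectl rollout undo "deployment/$deployment" ;;
    scale)    run kubectl scale "deployment/$deployment" "--replicas=$replicas" ;;
    *)        echo "unknown action: $action" >&2; return 2 ;;
  esac
}
```

Running `mitigate rollback` prints the command for review; `DRY_RUN=0 mitigate rollback` executes it.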
- Investigate
Recent logs
kubectl logs -l app=myapp --since=30m --tail=-1 | grep -i error
Resource usage
kubectl top pods -n production
Database connections
SELECT count(*) FROM pg_stat_activity WHERE state = 'active';
Network issues
curl -w "@curl-format.txt" -o /dev/null -s https://api.example.com
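The network check above reads its output format from `curl-format.txt`, which is not shown. The block below creates one plausible version. The field labels and the choice of timings are assumptions; the `%{...}` names are standard curl `--write-out` variables.

```shell
# Assumed contents for curl-format.txt; the original file is not shown.
# Each %{...} is a curl --write-out variable; the labels are illustrative.
# The quoted 'EOF' keeps \n and %{...} literal so curl can interpret them.
cat > curl-format.txt <<'EOF'
   dns_lookup: %{time_namelookup}s\n
  tcp_connect: %{time_connect}s\n
tls_handshake: %{time_appconnect}s\n
   first_byte: %{time_starttransfer}s\n
        total: %{time_total}s\n
  http_status: %{http_code}\n
EOF
```

A large gap between `tcp_connect` and `first_byte` points at the server or its dependencies, while a slow `dns_lookup` or `tcp_connect` points at the network path.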
Severity Levels
| Level | Impact | Response Time | Example |
|---|---|---|---|
| P1 | Complete outage | Immediate | Site down |
| P2 | Major feature broken | 15 min | Payments failing |
| P3 | Minor feature broken | 1 hour | Search slow |
| P4 | Low impact | Next day | UI glitch |
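If alerting or paging scripts need these targets, the levels above can be encoded as a simple lookup. This is a minimal sketch; `response_time` is a hypothetical function name.

```shell
# Sketch: map a severity level to its response-time target from the levels above.
# Usage: response_time P2
response_time() {
  case "$1" in
    P1) echo "Immediate" ;;
    P2) echo "15 min" ;;
    P3) echo "1 hour" ;;
    P4) echo "Next day" ;;
    *)  echo "unknown severity: $1" >&2; return 1 ;;
  esac
}
```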
Communication Template
Incident Update
Status: Investigating | Identified | Mitigated | Resolved
Severity: P1/P2/P3
Started: YYYY-MM-DD HH:MM UTC
Duration: X hours
Summary
[1-2 sentences on what's happening]
Impact
[Who is affected and how]
Current Actions
- [Action 1]
- [Action 2]
Next Update
[Time of next update]
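Filling in the template by hand during an incident invites skipped fields. As a rough sketch, a small function can render it from arguments so every update has the same shape. `incident_update` and its argument order are assumptions made for illustration.

```shell
# Sketch: render the update template from arguments (hypothetical helper).
# Usage: incident_update STATUS SEVERITY STARTED DURATION SUMMARY IMPACT NEXT_UPDATE [ACTION...]
# Note: variables are global, since POSIX sh has no "local".
incident_update() {
  if [ "$#" -lt 7 ]; then
    echo "usage: incident_update STATUS SEVERITY STARTED DURATION SUMMARY IMPACT NEXT_UPDATE [ACTION...]" >&2
    return 2
  fi
  status=$1; severity=$2; started=$3; duration=$4
  summary=$5; impact=$6; next_update=$7
  shift 7
  printf 'Incident Update\n'
  printf 'Status: %s\nSeverity: %s\nStarted: %s\nDuration: %s\n' \
    "$status" "$severity" "$started" "$duration"
  printf 'Summary\n%s\n' "$summary"
  printf 'Impact\n%s\n' "$impact"
  printf 'Current Actions\n'
  # Any remaining arguments are listed as current actions.
  for action in "$@"; do
    printf '%s\n' "- $action"
  done
  printf 'Next Update\n%s\n' "$next_update"
}
```

The output can be piped to whatever channel the team uses for status updates.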
Post-Mortem Template
Incident Post-Mortem
Date: YYYY-MM-DD
Duration: X hours
Severity: P1
Summary
[What happened in 2-3 sentences]
Timeline
- HH:MM - [Event]
- HH:MM - [Event]
Root Cause
[Technical explanation]
Impact
- Users affected: X
- Revenue impact: $Y
- Data loss: None/Describe
Action Items
| Action | Owner | Due Date |
|---|---|---|
| Add monitoring for X | @name | YYYY-MM-DD |
| Improve circuit breaker | @name | YYYY-MM-DD |
Lessons Learned
- [What we learned]
Examples
Input: "API is returning 500 errors"
Action: Check logs, identify the failing component, roll back if there was a recent deploy, then fix the underlying defect
Input: "Database is overloaded"
Action: Kill long-running queries, scale read replicas, optimize or cache hot queries
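For the overloaded-database case, the first step is finding which queries to kill. The sketch below assumes PostgreSQL (as in the `pg_stat_activity` check earlier) and prints SQL that lists active queries running longer than a threshold. `long_queries_sql` is a hypothetical helper, and the 5-minute default is an assumption.

```shell
# Sketch: print SQL listing active PostgreSQL queries older than N minutes.
# long_queries_sql is a hypothetical helper; the default of 5 is an assumption.
# Usage: long_queries_sql [minutes]
long_queries_sql() {
  minutes=${1:-5}
  # Accept digits only, so the value is safe to embed in the SQL text.
  case "$minutes" in
    ''|*[!0-9]*) echo "minutes must be a whole number" >&2; return 1 ;;
  esac
  cat <<EOF
SELECT pid, now() - query_start AS runtime, left(query, 80) AS query
FROM pg_stat_activity
WHERE state = 'active'
  AND now() - query_start > interval '${minutes} minutes'
ORDER BY runtime DESC;
EOF
}

# Review the list first, then cancel one backend by pid, for example:
#   psql -c "$(long_queries_sql 10)"
#   psql -c "SELECT pg_cancel_backend(12345);"   # 12345 is a placeholder pid
```

`pg_cancel_backend` stops the running query and keeps the connection; `pg_terminate_backend` closes the connection as well and is the heavier option.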