Postmortem Writer
Document incidents for learning and improvement.
Postmortem Template
Postmortem: API Outage - Database Connection Pool Exhausted
Date: 2024-01-15 Authors: Jane Doe (On-call), John Smith (DBA) Status: Complete Severity: P1 (Critical)
Summary
On January 15, 2024, our API experienced a complete outage for 25 minutes (14:32 - 14:57 UTC) affecting 100% of users. The root cause was database connection pool exhaustion triggered by a connection leak introduced in deployment v2.3.4.
Impact:
- Duration: 25 minutes
- Users affected: ~50,000
- Requests failed: ~125,000
- Revenue impact: ~$15,000
Timeline (All times UTC)
| Time | Event |
|---|---|
| 14:15 | v2.3.4 deployed to production |
| 14:32 | First CloudWatch alarm: HighErrorRate |
| 14:33 | PagerDuty alert sent to on-call (Jane) |
| 14:35 | Jane acknowledges, begins investigation |
| 14:38 | Identified: Database connection pool at 100% |
| 14:40 | Attempted: Kill long-running queries (no effect) |
| 14:43 | Decision: Rollback to v2.3.3 |
| 14:45 | Rollback initiated |
| 14:47 | Rollback complete, connections dropping |
| 14:50 | Error rate returning to normal |
| 14:57 | All systems recovered, incident closed |
| 15:30 | Postmortem meeting scheduled |
Root Cause
A code change in v2.3.4 introduced a connection leak in the user profile endpoint. The new caching layer was not properly releasing database connections after queries completed.
Code diff: ```diff
- await prisma.user.findUnique({ where: { id } });
- const client = await pool.connect();
- const user = await client.query('SELECT * FROM users WHERE id = $1', [id]);
- // Missing: client.release() ❌ ```
Contributing Factors
-
Insufficient testing: Load tests didn't catch the leak
- Tests only ran for 5 minutes
- Not enough concurrent connections to exhaust pool
-
Missing monitoring: No alerts on connection pool metrics
- Had alarms for query latency
- No alarms for active connections count
-
Inadequate code review: Reviewer didn't spot missing release()
- PR approved without running locally
- No checklist for connection management
-
Deployment process: No gradual rollout
- Deployed to 100% of production immediately
- No canary deployment
What Went Well
- ✅ Fast detection: Alert fired within 3 minutes
- ✅ Clear runbook: DBA runbook had exact steps to follow
- ✅ Quick decision: Made rollback decision in 8 minutes
- ✅ Communication: Status page updated every 5 minutes
- ✅ Rollback capability: Automated rollback took <2 minutes
What Went Wrong
- ❌ Code review missed bug: Connection leak not caught
- ❌ Testing gaps: Load tests insufficient duration
- ❌ No canary: Deployed to all instances at once
- ❌ Late detection: 17 minutes between deploy and alert
Action Items
| Action | Owner | Due Date | Priority | Status |
|---|---|---|---|---|
| Add connection pool metrics to dashboards | Jane | 2024-01-20 | P0 | ✅ Done |
| Create PR checklist for connection management | John | 2024-01-22 | P0 | ✅ Done |
| Extend load tests to 30 minutes minimum | QA Team | 2024-01-25 | P1 | 🔄 In Progress |
| Implement canary deployment (10% → 100%) | DevOps | 2024-02-01 | P1 | 📋 Planned |
| Add connection leak detection to tests | Jane | 2024-01-27 | P1 | 🔄 In Progress |
| Review all DB connection usage patterns | John | 2024-02-05 | P2 | 📋 Planned |
| Improve alert routing (faster escalation) | DevOps | 2024-02-10 | P2 | 📋 Planned |
Lessons Learned
- Code review checklists work: Need specific items for common issues
- Load tests need realistic duration: 5min insufficient for leaks
- Gradual rollouts catch issues: 10% canary would have limited impact
- Monitoring gaps are dangerous: Add metrics before you need them
- Runbooks save time: Clear procedures enabled fast response
Related Incidents
- [2023-11-20] Database CPU spike (similar connection pool issue)
- [2023-08-15] Memory leak in cache layer
Prevention
To prevent similar incidents:
- ✅ Add connection management to code review checklist
- ✅ Monitor connection pool utilization
- ✅ Extend load test duration
- ✅ Implement canary deployments
- ✅ Add automated connection leak detection
Appendix
Monitoring Graphs
[Insert graphs of connection pool, error rates, latency during incident]
Communication Log
[Insert status page updates and customer communication]
Code Fix
PR #1235: Fix connection leak in user profile endpoint ```typescript const client = await pool.connect(); try { const user = await client.query('SELECT * FROM users WHERE id = $1', [id]); return user; } finally { client.release(); // ✅ Always release } ```
Postmortem Best Practices
Blameless Postmortem Guidelines
Do ✅
- Focus on systems and processes, not people
- Use timeline with exact timestamps
- Identify contributing factors, not just root cause
- Create actionable items with owners and dates
- Document what went well (positive reinforcement)
- Share widely for organizational learning
Don't ❌
- Blame individuals or teams
- Hide or minimize the incident
- Skip the postmortem (even for small incidents)
- Create action items without owners
- Forget to follow up on action items
- Make it a blame session
Template Sections
- Summary (2-3 sentences)
- Impact (numbers: users, revenue, duration)
- Timeline (chronological events)
- Root Cause (technical explanation)
- Contributing Factors (broader context)
- What Went Well (positive reinforcement)
- What Went Wrong (improvement areas)
- Action Items (concrete, owned, dated)
- Lessons Learned (key takeaways)
Output Checklist
-
Timeline created
-
Root cause identified
-
Contributing factors documented
-
Action items with owners
-
Lessons learned captured
-
Postmortem meeting held
-
Document shared widely
-
Follow-up scheduled ENDFILE