**Mode:** Cognitive/Prompt-Driven. No standalone utility script; use via agent context.
# Postmortem Writing

Comprehensive guide to writing effective, blameless postmortems that drive organizational learning and prevent incident recurrence.
## When to Use This Skill

- Conducting post-incident reviews
- Writing postmortem documents
- Facilitating blameless postmortem meetings
- Identifying root causes and contributing factors
- Creating actionable follow-up items
- Building organizational learning culture
## Core Concepts

### Blameless Culture

| Blame-Focused | Blameless |
|---|---|
| "Who caused this?" | "What conditions allowed this?" |
| "Someone made a mistake" | "The system allowed this mistake" |
| Punish individuals | Improve systems |
| Hide information | Share learnings |
| Fear of speaking up | Psychological safety |
### Postmortem Triggers

- SEV1 or SEV2 incidents
- Customer-facing outages > 15 minutes
- Data loss or security incidents
- Near-misses that could have been severe
- Novel failure modes
- Incidents requiring unusual intervention
## Quick Start

### Postmortem Timeline

- Day 0: Incident occurs
- Days 1-2: Draft postmortem document
- Days 3-5: Postmortem meeting
- Days 5-7: Finalize document, create tickets
- Week 2+: Action item completion
- Quarterly: Review patterns across incidents
## Templates

### Template 1: Standard Postmortem

**Postmortem: [Incident Title]**

**Date:** 2024-01-15
**Authors:** @alice, @bob
**Status:** Draft | In Review | Final
**Incident Severity:** SEV2
**Incident Duration:** 47 minutes

#### Executive Summary
On January 15, 2024, the payment processing service experienced a 47-minute outage affecting approximately 12,000 customers. The root cause was a database connection pool exhaustion triggered by a configuration change in deployment v2.3.4. The incident was resolved by rolling back to v2.3.3 and increasing connection pool limits.
**Impact:**
- 12,000 customers unable to complete purchases
- Estimated revenue loss: $45,000
- 847 support tickets created
- No data loss or security implications
#### Timeline (All times UTC)
| Time | Event |
|---|---|
| 14:23 | Deployment v2.3.4 completed to production |
| 14:31 | First alert: payment_error_rate > 5% |
| 14:33 | On-call engineer @alice acknowledges alert |
| 14:35 | Initial investigation begins, error rate at 23% |
| 14:41 | Incident declared SEV2, @bob joins |
| 14:45 | Database connection exhaustion identified |
| 14:52 | Decision to rollback deployment |
| 14:58 | Rollback to v2.3.3 initiated |
| 15:10 | Rollback complete, error rate dropping |
| 15:18 | Service fully recovered, incident resolved |
#### Root Cause Analysis

**What Happened**
The v2.3.4 deployment included a change to the database query pattern that inadvertently removed connection pooling for a frequently-called endpoint. Each request opened a new database connection instead of reusing pooled connections.
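To make the failure mode concrete, here is a minimal Java sketch contrasting the two patterns. It is illustrative only: the repository shape, table name, and query are assumptions, not the actual incident code.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import javax.sql.DataSource;

// Illustrative sketch only; the real PaymentRepository differs.
public class PaymentRepository {

    private static final String PENDING_COUNT_SQL =
            "SELECT COUNT(*) FROM payments WHERE status = 'PENDING'";

    private final DataSource dataSource; // pooled DataSource (e.g. HikariCP)

    public PaymentRepository(DataSource dataSource) {
        this.dataSource = dataSource;
    }

    // Intended pattern: borrow a connection from the pool; close() at the
    // end of the try block returns it to the pool for reuse.
    public long countPendingPayments() throws SQLException {
        try (Connection conn = dataSource.getConnection();
             PreparedStatement stmt = conn.prepareStatement(PENDING_COUNT_SQL);
             ResultSet rs = stmt.executeQuery()) {
            return rs.next() ? rs.getLong(1) : 0;
        }
    }

    // Regressed pattern (what v2.3.4 effectively did): every call opens a
    // brand-new physical connection, bypassing the pool. Under production
    // traffic this exhausts the database's connection limit.
    public long countPendingPaymentsRegressed(String jdbcUrl) throws SQLException {
        try (Connection conn = DriverManager.getConnection(jdbcUrl);
             PreparedStatement stmt = conn.prepareStatement(PENDING_COUNT_SQL);
             ResultSet rs = stmt.executeQuery()) {
            return rs.next() ? rs.getLong(1) : 0;
        }
    }
}
```

The diff between the two methods is effectively one line, which is exactly why it slipped through review.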
**Why It Happened**
- **Proximate Cause:** A code change in `PaymentRepository.java` replaced the pooled `DataSource` with direct `DriverManager.getConnection()` calls.
- **Contributing Factors:**
  - Code review did not catch the connection handling change
  - No integration tests specifically for connection pool behavior
  - Staging environment has lower traffic, masking the issue
  - Database connection metrics alert threshold was too high (90%)
- **5 Whys Analysis:**
  1. Why did the service fail? → Database connections exhausted
  2. Why were connections exhausted? → Each request opened a new connection
  3. Why did each request open a new connection? → Code bypassed the connection pool
  4. Why did the code bypass the connection pool? → Developer unfamiliar with codebase patterns
  5. Why was the developer unfamiliar? → No documentation on connection management patterns
**System Diagram**

    [Client] → [Load Balancer] → [Payment Service] → [Database]
                                        ↓
                              Connection Pool (broken)
                                        ↓
                             Direct connections (cause)
#### Detection

**What Worked**
- Error rate alert fired within 8 minutes of deployment
- Grafana dashboard clearly showed connection spike
- On-call response was swift (2-minute acknowledgment)
**What Didn't Work**
- Database connection metric alert threshold too high
- No deployment-correlated alerting
- No canary deployments, which would have caught this earlier
**Detection Gap**
The deployment completed at 14:23, but the first alert didn't fire until 14:31 (8 minutes). A deployment-aware alert could have detected the issue faster.
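One way to close this gap is to check firing alerts against recent deployments automatically. The sketch below is a hypothetical illustration: the `Deployment` record, the 15-minute window, and the correlation logic are assumptions, not an existing tool's API.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Hypothetical sketch of deployment-correlated alerting: when an alert
// fires, look for a deployment that completed shortly beforehand and
// surface it to the responder as the prime suspect.
public class DeploymentCorrelator {

    // How far back to look for a deployment when an alert fires (assumed).
    private static final Duration CORRELATION_WINDOW = Duration.ofMinutes(15);

    record Deployment(String version, Instant completedAt) {}

    static Optional<Deployment> correlate(Instant alertFiredAt, List<Deployment> history) {
        return history.stream()
                .filter(d -> !d.completedAt().isAfter(alertFiredAt))
                .filter(d -> Duration.between(d.completedAt(), alertFiredAt)
                        .compareTo(CORRELATION_WINDOW) <= 0)
                .max(Comparator.comparing(Deployment::completedAt));
    }

    public static void main(String[] args) {
        List<Deployment> history = List.of(
                new Deployment("v2.3.3", Instant.parse("2024-01-14T09:02:00Z")),
                new Deployment("v2.3.4", Instant.parse("2024-01-15T14:23:00Z")));
        Instant alertAt = Instant.parse("2024-01-15T14:31:00Z");

        // Prints: "Alert fired 8 minutes after deployment v2.3.4"
        correlate(alertAt, history).ifPresent(d -> System.out.println(
                "Alert fired " + Duration.between(d.completedAt(), alertAt).toMinutes()
                        + " minutes after deployment " + d.version()));
    }
}
```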
#### Response

**What Worked**
- On-call engineer quickly identified database as the issue
- Rollback decision was made decisively
- Clear communication in incident channel
**What Could Be Improved**
- Took 10 minutes to correlate issue with recent deployment
- Had to manually check deployment history
- Rollback took 12 minutes (could be faster)
#### Impact

**Customer Impact**
- 12,000 unique customers affected
- Average impact duration: 35 minutes
- 847 support tickets (~7% of affected customers)
- Customer satisfaction score dropped 12 points
**Business Impact**
- Estimated revenue loss: $45,000
- Support cost: ~$2,500 (agent time)
- Engineering time: ~8 person-hours
**Technical Impact**
- Database primary experienced elevated load
- Some replica lag during incident
- No permanent damage to systems
#### Lessons Learned

**What Went Well**
- Alerting detected the issue before customer reports
- Team collaborated effectively under pressure
- Rollback procedure worked smoothly
- Communication was clear and timely
**What Went Wrong**
- Code review missed critical change
- Test coverage gap for connection pooling
- Staging environment doesn't reflect production traffic
- Alert thresholds were not tuned properly
**Where We Got Lucky**
- Incident occurred during business hours with full team available
- Database handled the load without failing completely
- No other incidents occurred simultaneously
#### Action Items
| Priority | Action | Owner | Due Date | Ticket |
|---|---|---|---|---|
| P0 | Add integration test for connection pool behavior | @alice | 2024-01-22 | ENG-1234 |
| P0 | Lower database connection alert threshold to 70% | @bob | 2024-01-17 | OPS-567 |
| P1 | Document connection management patterns | @alice | 2024-01-29 | DOC-89 |
| P1 | Implement deployment-correlated alerting | @bob | 2024-02-05 | OPS-568 |
| P2 | Evaluate canary deployment strategy | @charlie | 2024-02-15 | ENG-1235 |
| P2 | Load test staging with production-like traffic | @dave | 2024-02-28 | QA-123 |
#### Appendix

**Supporting Data**

- Error Rate Graph: [Link to Grafana dashboard snapshot]
- Database Connection Graph: [Link to metrics]

**Related Incidents**
- 2023-11-02: Similar connection issue in User Service (POSTMORTEM-42)
**References**
### Template 2: 5 Whys Analysis

**5 Whys Analysis: [Incident]**

#### Problem Statement

The payment service experienced a 47-minute outage due to database connection exhaustion.

#### Analysis
**Why #1: Why did the service fail?**

Answer: Database connections were exhausted, causing all new requests to fail.
Evidence: Metrics showed the connection count at 100/100 (max), with 500+ pending requests.

**Why #2: Why were database connections exhausted?**

Answer: Each incoming request opened a new database connection instead of using the connection pool.
Evidence: The code diff shows direct `DriverManager.getConnection()` calls instead of the pooled `DataSource`.

**Why #3: Why did the code bypass the connection pool?**

Answer: A developer refactored the repository class and inadvertently changed the connection acquisition method.
Evidence: PR #1234 shows the change, made while fixing a different bug.

**Why #4: Why wasn't this caught in code review?**

Answer: The reviewer focused on the functional change (the bug fix) and didn't notice the infrastructure change.
Evidence: Review comments only discuss business logic.

**Why #5: Why isn't there a safety net for this type of change?**

Answer: We lack automated tests that verify connection pool behavior and have no documentation of our connection management patterns.
Evidence: The test suite has no tests for connection handling; the wiki has no article on database connections.
#### Root Causes Identified
- **Primary:** Missing automated tests for infrastructure behavior
- **Secondary:** Insufficient documentation of architectural patterns
- **Tertiary:** Code review checklist doesn't include infrastructure considerations
#### Systemic Improvements
| Root Cause | Improvement | Type |
|---|---|---|
| Missing tests | Add infrastructure behavior tests | Prevention |
| Missing docs | Document connection patterns | Prevention |
| Review gaps | Update review checklist | Detection |
| No canary | Implement canary deployments | Mitigation |
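As a concrete instance of "add infrastructure behavior tests", here is a minimal sketch of a connection-reuse test. It assumes JUnit 5, HikariCP, and an in-memory H2 database on the classpath; the class name and pool settings are illustrative, not the team's actual test suite.

```java
import static org.junit.jupiter.api.Assertions.assertTrue;

import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;
import java.sql.Connection;
import org.junit.jupiter.api.Test;

// Illustrative infrastructure test: verify that repeated requests reuse
// pooled connections rather than opening a new physical connection each
// time. A change like the v2.3.4 regression would fail this test.
class ConnectionPoolBehaviorTest {

    @Test
    void repeatedRequestsStayWithinPoolLimit() throws Exception {
        HikariConfig config = new HikariConfig();
        config.setJdbcUrl("jdbc:h2:mem:pooltest");
        config.setMaximumPoolSize(5); // hard cap on physical connections

        try (HikariDataSource pool = new HikariDataSource(config)) {
            // Simulate far more requests than the pool allows; each one
            // must borrow a pooled connection and return it on close.
            for (int i = 0; i < 100; i++) {
                try (Connection conn = pool.getConnection()) {
                    conn.createStatement().execute("SELECT 1");
                }
            }
            // If code bypassed the pool, total connections would exceed 5.
            assertTrue(pool.getHikariPoolMXBean().getTotalConnections() <= 5);
        }
    }
}
```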
### Template 3: Quick Postmortem (Minor Incidents)

**Quick Postmortem: [Brief Title]**

**Date:** 2024-01-15 | **Duration:** 12 min | **Severity:** SEV3

#### What Happened
API latency spiked to 5s due to a cache-miss storm after a cache flush.
#### Timeline
- 10:00 - Cache flush initiated for config update
- 10:02 - Latency alerts fire
- 10:05 - Identified as cache miss storm
- 10:08 - Enabled cache warming
- 10:12 - Latency normalized
#### Root Cause

A full cache flush for a minor config update caused a thundering herd of cache misses.
#### Fix
- **Immediate:** Enabled cache warming
- **Long-term:** Implement partial cache invalidation (ENG-999)
#### Lessons

Don't fully flush the cache in production; use targeted invalidation, as in the sketch below.
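To illustrate the lesson, here is a minimal Java sketch contrasting a full flush with targeted invalidation, using a hypothetical `ConfigCache`; the incident's real cache layer is not shown in this document.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch; ConfigCache and its keys are illustrative only.
public class ConfigCache {

    private final Map<String, String> entries = new ConcurrentHashMap<>();

    // Anti-pattern: empties everything, so the next burst of requests
    // all miss at once and stampede the backing store.
    public void flushAll() {
        entries.clear();
    }

    // Preferred: evict only the keys the config update actually touched;
    // the rest of the cache keeps absorbing traffic.
    public void invalidate(Iterable<String> changedKeys) {
        for (String key : changedKeys) {
            entries.remove(key);
        }
    }
}
```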
## Facilitation Guide

### Running a Postmortem Meeting

**Meeting Structure (60 minutes)**
1. **Opening (5 min)**
   - Remind everyone of blameless culture
   - "We're here to learn, not to blame"
   - Review meeting norms
2. **Timeline Review (15 min)**
   - Walk through events chronologically
   - Ask clarifying questions
   - Identify gaps in timeline
3. **Analysis Discussion (20 min)**
   - What failed?
   - Why did it fail?
   - What conditions allowed this?
   - What would have prevented it?
4. **Action Items (15 min)**
   - Brainstorm improvements
   - Prioritize by impact and effort
   - Assign owners and due dates
5. **Closing (5 min)**
   - Summarize key learnings
   - Confirm action item owners
   - Schedule follow-up if needed
### Facilitation Tips
- Keep discussion on track
- Redirect blame to systems
- Encourage quiet participants
- Document dissenting views
- Time-box tangents
## Iron Laws

- **ALWAYS** write the postmortem within 48 hours of incident resolution: memory fades, and operational context is permanently lost after that window.
- **NEVER** blame individuals in a postmortem: focus on system conditions, process gaps, and tooling failures that allowed the incident to occur.
- **ALWAYS** trace the root cause through at least 3 levels of "why" before concluding: proximate causes are symptoms, not the systemic roots that need fixing.
- **ALWAYS** assign every action item a specific owner, priority level, and due date: unowned or undated action items are never completed.
- **NEVER** skip the "What Went Well" section: reinforcing positive behaviors and successful responses is as critical as fixing the failures.
## Anti-Patterns to Avoid

| Anti-Pattern | Problem | Better Approach |
|---|---|---|
| Blame game | Shuts down learning | Focus on systems |
| Shallow analysis | Doesn't prevent recurrence | Ask "why" 5 times |
| No action items | Waste of time | Always have concrete next steps |
| Unrealistic actions | Never completed | Scope to achievable tasks |
| No follow-up | Actions forgotten | Track in ticketing system |
## Best Practices

### Do's

- **Start immediately** - Memory fades fast
- **Be specific** - Exact times, exact errors
- **Include graphs** - Visual evidence
- **Assign owners** - No orphan action items
- **Share widely** - Organizational learning

### Don'ts

- **Don't name and shame** - Ever
- **Don't skip small incidents** - They reveal patterns
- **Don't make it a blame doc** - That kills learning
- **Don't create busywork** - Actions should be meaningful
- **Don't skip follow-up** - Verify actions completed
## Resources

- Google SRE - Postmortem Culture
- Etsy's Blameless Postmortems
- PagerDuty Postmortem Guide
## Memory Protocol (MANDATORY)

Before starting: Read `.claude/context/memory/learnings.md`

After completing:

- New pattern -> `.claude/context/memory/learnings.md`
- Issue found -> `.claude/context/memory/issues.md`
- Decision made -> `.claude/context/memory/decisions.md`

ASSUME INTERRUPTION: If it's not in memory, it didn't happen.