**Mode:** Cognitive/Prompt-Driven. No standalone utility script; use via agent context.
# Postmortem Writing

Comprehensive guide to writing effective, blameless postmortems that drive organizational learning and prevent incident recurrence.
## When to Use This Skill

- Conducting post-incident reviews
- Writing postmortem documents
- Facilitating blameless postmortem meetings
- Identifying root causes and contributing factors
- Creating actionable follow-up items
- Building organizational learning culture
## Core Concepts

### Blameless Culture

| Blame-Focused | Blameless |
|---|---|
| "Who caused this?" | "What conditions allowed this?" |
| "Someone made a mistake" | "The system allowed this mistake" |
| Punish individuals | Improve systems |
| Hide information | Share learnings |
| Fear of speaking up | Psychological safety |
### Postmortem Triggers

- SEV1 or SEV2 incidents
- Customer-facing outages > 15 minutes
- Data loss or security incidents
- Near-misses that could have been severe
- Novel failure modes
- Incidents requiring unusual intervention
## Quick Start

### Postmortem Timeline

- Day 0: Incident occurs
- Days 1-2: Draft postmortem document
- Days 3-5: Postmortem meeting
- Days 5-7: Finalize document, create tickets
- Week 2+: Action item completion
- Quarterly: Review patterns across incidents
## Templates

### Template 1: Standard Postmortem

**Postmortem: [Incident Title]**

**Date:** 2024-01-15
**Authors:** @alice, @bob
**Status:** Draft | In Review | Final
**Incident Severity:** SEV2
**Incident Duration:** 47 minutes

#### Executive Summary
On January 15, 2024, the payment processing service experienced a 47-minute outage affecting approximately 12,000 customers. The root cause was a database connection pool exhaustion triggered by a configuration change in deployment v2.3.4. The incident was resolved by rolling back to v2.3.3 and increasing connection pool limits.
**Impact:**
- 12,000 customers unable to complete purchases
- Estimated revenue loss: $45,000
- 847 support tickets created
- No data loss or security implications
#### Timeline (All times UTC)
| Time | Event |
|---|---|
| 14:23 | Deployment v2.3.4 completed to production |
| 14:31 | First alert: payment_error_rate > 5% |
| 14:33 | On-call engineer @alice acknowledges alert |
| 14:35 | Initial investigation begins, error rate at 23% |
| 14:41 | Incident declared SEV2, @bob joins |
| 14:45 | Database connection exhaustion identified |
| 14:52 | Decision to rollback deployment |
| 14:58 | Rollback to v2.3.3 initiated |
| 15:10 | Rollback complete, error rate dropping |
| 15:18 | Service fully recovered, incident resolved |
#### Root Cause Analysis

**What Happened**
The v2.3.4 deployment included a change to the database query pattern that inadvertently removed connection pooling for a frequently-called endpoint. Each request opened a new database connection instead of reusing pooled connections.
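To make the failure mode concrete, here is a minimal Java sketch contrasting the two patterns. It is illustrative only: the repository shape, table name, and query are assumptions, not the actual incident code.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import javax.sql.DataSource;

// Illustrative sketch only; the real PaymentRepository differs.
public class PaymentRepository {

    private static final String PENDING_COUNT_SQL =
            "SELECT COUNT(*) FROM payments WHERE status = 'PENDING'";

    private final DataSource dataSource; // pooled DataSource (e.g. HikariCP)

    public PaymentRepository(DataSource dataSource) {
        this.dataSource = dataSource;
    }

    // Intended pattern: borrow a connection from the pool; close() at the
    // end of the try block returns it to the pool for reuse.
    public long countPendingPayments() throws SQLException {
        try (Connection conn = dataSource.getConnection();
             PreparedStatement stmt = conn.prepareStatement(PENDING_COUNT_SQL);
             ResultSet rs = stmt.executeQuery()) {
            return rs.next() ? rs.getLong(1) : 0;
        }
    }

    // Regressed pattern (what v2.3.4 effectively did): every call opens a
    // brand-new physical connection, bypassing the pool. Under production
    // traffic this exhausts the database's connection limit.
    public long countPendingPaymentsRegressed(String jdbcUrl) throws SQLException {
        try (Connection conn = DriverManager.getConnection(jdbcUrl);
             PreparedStatement stmt = conn.prepareStatement(PENDING_COUNT_SQL);
             ResultSet rs = stmt.executeQuery()) {
            return rs.next() ? rs.getLong(1) : 0;
        }
    }
}
```

The diff between the two methods is effectively one line, which is exactly why it slipped through review.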
**Why It Happened**
- **Proximate Cause:** A code change in `PaymentRepository.java` replaced the pooled `DataSource` with direct `DriverManager.getConnection()` calls.
- **Contributing Factors:**
  - Code review did not catch the connection handling change
  - No integration tests specifically for connection pool behavior
  - Staging environment has lower traffic, masking the issue
  - Database connection metrics alert threshold was too high (90%)
- **5 Whys Analysis:**
  1. Why did the service fail? → Database connections exhausted
  2. Why were connections exhausted? → Each request opened a new connection
  3. Why did each request open a new connection? → Code bypassed the connection pool
  4. Why did the code bypass the connection pool? → Developer unfamiliar with codebase patterns
  5. Why was the developer unfamiliar? → No documentation on connection management patterns
**System Diagram**

    [Client] → [Load Balancer] → [Payment Service] → [Database]
                                        ↓
                              Connection Pool (broken)
                                        ↓
                             Direct connections (cause)
#### Detection

**What Worked**
- Error rate alert fired within 8 minutes of deployment
- Grafana dashboard clearly showed connection spike
- On-call response was swift (2-minute acknowledgment)
**What Didn't Work**
- Database connection metric alert threshold too high
- No deployment-correlated alerting
- No canary deployments, which would have caught this earlier
**Detection Gap**
The deployment completed at 14:23, but the first alert didn't fire until 14:31 (8 minutes). A deployment-aware alert could have detected the issue faster.
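One way to close this gap is to check firing alerts against recent deployments automatically. The sketch below is a hypothetical illustration: the `Deployment` record, the 15-minute window, and the correlation logic are assumptions, not an existing tool's API.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Hypothetical sketch of deployment-correlated alerting: when an alert
// fires, look for a deployment that completed shortly beforehand and
// surface it to the responder as the prime suspect.
public class DeploymentCorrelator {

    // How far back to look for a deployment when an alert fires (assumed).
    private static final Duration CORRELATION_WINDOW = Duration.ofMinutes(15);

    record Deployment(String version, Instant completedAt) {}

    static Optional<Deployment> correlate(Instant alertFiredAt, List<Deployment> history) {
        return history.stream()
                .filter(d -> !d.completedAt().isAfter(alertFiredAt))
                .filter(d -> Duration.between(d.completedAt(), alertFiredAt)
                        .compareTo(CORRELATION_WINDOW) <= 0)
                .max(Comparator.comparing(Deployment::completedAt));
    }

    public static void main(String[] args) {
        List<Deployment> history = List.of(
                new Deployment("v2.3.3", Instant.parse("2024-01-14T09:02:00Z")),
                new Deployment("v2.3.4", Instant.parse("2024-01-15T14:23:00Z")));
        Instant alertAt = Instant.parse("2024-01-15T14:31:00Z");

        // Prints: "Alert fired 8 minutes after deployment v2.3.4"
        correlate(alertAt, history).ifPresent(d -> System.out.println(
                "Alert fired " + Duration.between(d.completedAt(), alertAt).toMinutes()
                        + " minutes after deployment " + d.version()));
    }
}
```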
#### Response

**What Worked**
- On-call engineer quickly identified database as the issue
- Rollback decision was made decisively
- Clear communication in incident channel
**What Could Be Improved**
- Took 10 minutes to correlate issue with recent deployment
- Had to manually check deployment history
- Rollback took 12 minutes (could be faster)
#### Impact

**Customer Impact**
- 12,000 unique customers affected
- Average impact duration: 35 minutes
- 847 support tickets (~7% of affected customers)
- Customer satisfaction score dropped 12 points
**Business Impact**
- Estimated revenue loss: $45,000
- Support cost: ~$2,500 (agent time)
- Engineering time: ~8 person-hours
**Technical Impact**
- Database primary experienced elevated load
- Some replica lag during incident
- No permanent damage to systems
#### Lessons Learned

**What Went Well**
- Alerting detected the issue before customer reports
- Team collaborated effectively under pressure
- Rollback procedure worked smoothly
- Communication was clear and timely
**What Went Wrong**
- Code review missed critical change
- Test coverage gap for connection pooling
- Staging environment doesn't reflect production traffic
- Alert thresholds were not tuned properly
**Where We Got Lucky**
- Incident occurred during business hours with full team available
- Database handled the load without failing completely
- No other incidents occurred simultaneously
#### Action Items
| Priority | Action | Owner | Due Date | Ticket |
|---|---|---|---|---|
| P0 | Add integration test for connection pool behavior | @alice | 2024-01-22 | ENG-1234 |
| P0 | Lower database connection alert threshold to 70% | @bob | 2024-01-17 | OPS-567 |
| P1 | Document connection management patterns | @alice | 2024-01-29 | DOC-89 |
| P1 | Implement deployment-correlated alerting | @bob | 2024-02-05 | OPS-568 |
| P2 | Evaluate canary deployment strategy | @charlie | 2024-02-15 | ENG-1235 |
| P2 | Load test staging with production-like traffic | @dave | 2024-02-28 | QA-123 |
#### Appendix

**Supporting Data**

- Error Rate Graph: [Link to Grafana dashboard snapshot]
- Database Connection Graph: [Link to metrics]

**Related Incidents**
- 2023-11-02: Similar connection issue in User Service (POSTMORTEM-42)
**References**
### Template 2: 5 Whys Analysis

**5 Whys Analysis: [Incident]**

#### Problem Statement

The payment service experienced a 47-minute outage due to database connection exhaustion.

#### Analysis
**Why #1: Why did the service fail?**

Answer: Database connections were exhausted, causing all new requests to fail.
Evidence: Metrics showed the connection count at 100/100 (max), with 500+ pending requests.

**Why #2: Why were database connections exhausted?**

Answer: Each incoming request opened a new database connection instead of using the connection pool.
Evidence: The code diff shows direct `DriverManager.getConnection()` calls instead of the pooled `DataSource`.

**Why #3: Why did the code bypass the connection pool?**

Answer: A developer refactored the repository class and inadvertently changed the connection acquisition method.
Evidence: PR #1234 shows the change, made while fixing a different bug.

**Why #4: Why wasn't this caught in code review?**

Answer: The reviewer focused on the functional change (the bug fix) and didn't notice the infrastructure change.
Evidence: Review comments only discuss business logic.

**Why #5: Why isn't there a safety net for this type of change?**

Answer: We lack automated tests that verify connection pool behavior and have no documentation of our connection management patterns.
Evidence: The test suite has no tests for connection handling; the wiki has no article on database connections.
#### Root Causes Identified
- **Primary:** Missing automated tests for infrastructure behavior
- **Secondary:** Insufficient documentation of architectural patterns
- **Tertiary:** Code review checklist doesn't include infrastructure considerations
#### Systemic Improvements
| Root Cause | Improvement | Type |
|---|---|---|
| Missing tests | Add infrastructure behavior tests | Prevention |
| Missing docs | Document connection patterns | Prevention |
| Review gaps | Update review checklist | Detection |
| No canary | Implement canary deployments | Mitigation |
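As a concrete instance of "add infrastructure behavior tests", here is a minimal sketch of a connection-reuse test. It assumes JUnit 5, HikariCP, and an in-memory H2 database on the classpath; the class name and pool settings are illustrative, not the team's actual test suite.

```java
import static org.junit.jupiter.api.Assertions.assertTrue;

import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;
import java.sql.Connection;
import org.junit.jupiter.api.Test;

// Illustrative infrastructure test: verify that repeated requests reuse
// pooled connections rather than opening a new physical connection each
// time. A change like the v2.3.4 regression would fail this test.
class ConnectionPoolBehaviorTest {

    @Test
    void repeatedRequestsStayWithinPoolLimit() throws Exception {
        HikariConfig config = new HikariConfig();
        config.setJdbcUrl("jdbc:h2:mem:pooltest");
        config.setMaximumPoolSize(5); // hard cap on physical connections

        try (HikariDataSource pool = new HikariDataSource(config)) {
            // Simulate far more requests than the pool allows; each one
            // must borrow a pooled connection and return it on close.
            for (int i = 0; i < 100; i++) {
                try (Connection conn = pool.getConnection()) {
                    conn.createStatement().execute("SELECT 1");
                }
            }
            // If code bypassed the pool, total connections would exceed 5.
            assertTrue(pool.getHikariPoolMXBean().getTotalConnections() <= 5);
        }
    }
}
```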
### Template 3: Quick Postmortem (Minor Incidents)

**Quick Postmortem: [Brief Title]**

**Date:** 2024-01-15 | **Duration:** 12 min | **Severity:** SEV3

#### What Happened
API latency spiked to 5s due to a cache-miss storm after a cache flush.
#### Timeline
- 10:00 - Cache flush initiated for config update
- 10:02 - Latency alerts fire
- 10:05 - Identified as cache miss storm
- 10:08 - Enabled cache warming
- 10:12 - Latency normalized
#### Root Cause

A full cache flush for a minor config update caused a thundering herd of cache misses.
#### Fix
- **Immediate:** Enabled cache warming
- **Long-term:** Implement partial cache invalidation (ENG-999)
#### Lessons

Don't fully flush the cache in production; use targeted invalidation, as in the sketch below.
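To illustrate the lesson, here is a minimal Java sketch contrasting a full flush with targeted invalidation, using a hypothetical `ConfigCache`; the incident's real cache layer is not shown in this document.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch; ConfigCache and its keys are illustrative only.
public class ConfigCache {

    private final Map<String, String> entries = new ConcurrentHashMap<>();

    // Anti-pattern: empties everything, so the next burst of requests
    // all miss at once and stampede the backing store.
    public void flushAll() {
        entries.clear();
    }

    // Preferred: evict only the keys the config update actually touched;
    // the rest of the cache keeps absorbing traffic.
    public void invalidate(Iterable<String> changedKeys) {
        for (String key : changedKeys) {
            entries.remove(key);
        }
    }
}
```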
## Facilitation Guide

### Running a Postmortem Meeting

**Meeting Structure (60 minutes)**
1. **Opening (5 min)**
   - Remind everyone of blameless culture
   - "We're here to learn, not to blame"
   - Review meeting norms
2. **Timeline Review (15 min)**
   - Walk through events chronologically
   - Ask clarifying questions
   - Identify gaps in timeline
3. **Analysis Discussion (20 min)**
   - What failed?
   - Why did it fail?
   - What conditions allowed this?
   - What would have prevented it?
4. **Action Items (15 min)**
   - Brainstorm improvements
   - Prioritize by impact and effort
   - Assign owners and due dates
5. **Closing (5 min)**
   - Summarize key learnings
   - Confirm action item owners
   - Schedule follow-up if needed
### Facilitation Tips
- Keep discussion on track
- Redirect blame to systems
- Encourage quiet participants
- Document dissenting views
- Time-box tangents
## Iron Laws

- **ALWAYS** write the postmortem within 48 hours of incident resolution: memory fades, and operational context is permanently lost after that window.
- **NEVER** blame individuals in a postmortem: focus on system conditions, process gaps, and tooling failures that allowed the incident to occur.
- **ALWAYS** trace the root cause through at least 3 levels of "why" before concluding: proximate causes are symptoms, not the systemic roots that need fixing.
- **ALWAYS** assign every action item a specific owner, priority level, and due date: unowned or undated action items are never completed.
- **NEVER** skip the "What Went Well" section: reinforcing positive behaviors and successful responses is as critical as fixing the failures.
## Anti-Patterns to Avoid

| Anti-Pattern | Problem | Better Approach |
|---|---|---|
| Blame game | Shuts down learning | Focus on systems |
| Shallow analysis | Doesn't prevent recurrence | Ask "why" 5 times |
| No action items | Waste of time | Always have concrete next steps |
| Unrealistic actions | Never completed | Scope to achievable tasks |
| No follow-up | Actions forgotten | Track in ticketing system |
## Best Practices

### Do's

- **Start immediately** - Memory fades fast
- **Be specific** - Exact times, exact errors
- **Include graphs** - Visual evidence
- **Assign owners** - No orphan action items
- **Share widely** - Organizational learning

### Don'ts

- **Don't name and shame** - Ever
- **Don't skip small incidents** - They reveal patterns
- **Don't make it a blame doc** - That kills learning
- **Don't create busywork** - Actions should be meaningful
- **Don't skip follow-up** - Verify actions completed
## Resources

- Google SRE - Postmortem Culture
- Etsy's Blameless Postmortems
- PagerDuty Postmortem Guide
## Memory Protocol (MANDATORY)

Before starting: Read `.claude/context/memory/learnings.md`

After completing:

- New pattern -> `.claude/context/memory/learnings.md`
- Issue found -> `.claude/context/memory/issues.md`
- Decision made -> `.claude/context/memory/decisions.md`

ASSUME INTERRUPTION: If it's not in memory, it didn't happen.