Postmortem Generator
Generate blameless postmortems that prevent repeat incidents. Compile a timeline from alerts, chat logs, and metrics into a structured report with root-cause analysis, contributing factors, and tracked action items, following the Google SRE and Etsy blameless formats.
Use when: "write postmortem", "incident review", "blameless review", "what happened during the outage", "incident report", "post-incident review", or after any SEV1/SEV2 incident.
Commands
1. generate — Create Postmortem from Incident Data
Step 1: Gather Timeline Data
```bash
# PagerDuty incident timeline
curl -s "https://api.pagerduty.com/incidents/$INCIDENT_ID/log_entries" \
  -H "Authorization: Token token=$PD_TOKEN" | python3 -c "
import json, sys

entries = json.load(sys.stdin)['log_entries']
for e in entries:
    ts = e['created_at'][:19]
    entry_type = e['type']
    summary = e.get('summary', e.get('channel', {}).get('summary', ''))
    print(f'{ts} [{entry_type}] {summary}')
"
```
```bash
# Alert history (Prometheus/Alertmanager v2 API)
curl -s "http://alertmanager:9093/api/v2/alerts?filter=incident_id=$INCIDENT_ID" | python3 -c "
import json, sys

alerts = json.load(sys.stdin)
for a in sorted(alerts, key=lambda x: x['startsAt']):
    # In the v2 API, the alert state is nested under status.state
    print(f'{a[\"startsAt\"][:19]} ALERT: {a[\"labels\"][\"alertname\"]} ({a[\"status\"][\"state\"]})')
"
```
```bash
# Git deploys around incident time
git log --since="$INCIDENT_START" --until="$INCIDENT_END" --oneline 2>/dev/null
```
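The three feeds above can then be merged into one chronological timeline before drafting the report. A minimal sketch, assuming each feed has been normalized to `(timestamp, source, description)` tuples with ISO-8601 UTC timestamps (the `merge_timeline` helper and sample events are illustrative, not part of any CLI):

```python
def merge_timeline(*sources):
    """Merge (timestamp, source, description) tuples from several
    feeds into one chronologically sorted timeline. ISO-8601 strings
    sort correctly as plain text."""
    events = [e for src in sources for e in src]
    return sorted(events, key=lambda e: e[0])

# Illustrative events, as printed by the commands above
deploys = [("2026-04-28T14:23:40", "deploy", "abc123 to production")]
alerts = [("2026-04-28T14:31:12", "alert", "API error rate > 5%")]
pagerduty = [("2026-04-28T14:33:05", "pagerduty", "On-call acknowledged")]

for ts, source, desc in merge_timeline(deploys, alerts, pagerduty):
    print(f"{ts} [{source}] {desc}")
```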
Step 2: Analyze Root Cause
Use the "5 Whys" technique:
- Why did the service go down? → Database connection pool exhausted
- Why was the pool exhausted? → Slow queries holding connections
- Why were queries slow? → Missing index on new column
- Why was the index missing? → Migration didn't include it
- Why wasn't this caught? → No query performance tests in CI
Identify:
- Root cause: The deepest "why" that's actionable
- Contributing factors: Things that made it worse (no alerting, manual process, missing runbook)
- Mitigating factors: Things that helped (quick detection, good rollback process)
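The outcome of the analysis can be captured as structured data so later steps (document generation, repeat-cause clustering) can consume it. A minimal sketch using the incident above; the class and field names are hypothetical, not a fixed schema:

```python
from dataclasses import dataclass, field

@dataclass
class RootCauseAnalysis:
    # Each "why" answer, ordered from surface symptom to deepest cause
    five_whys: list[str]
    contributing_factors: list[str] = field(default_factory=list)
    mitigating_factors: list[str] = field(default_factory=list)

    @property
    def root_cause(self) -> str:
        # The deepest "why" is the actionable root cause
        return self.five_whys[-1]

analysis = RootCauseAnalysis(
    five_whys=[
        "Database connection pool exhausted",
        "Slow queries holding connections",
        "Missing index on new column",
        "Migration didn't include it",
        "No query performance tests in CI",
    ],
    mitigating_factors=["Quick detection", "Good rollback process"],
)
print(analysis.root_cause)  # → No query performance tests in CI
```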
Step 3: Generate Postmortem Document
# Incident Postmortem: [Title]
**Date:** [YYYY-MM-DD]
**Duration:** [Xh Ym]
**Severity:** SEV-[1/2/3]
**Author:** [Name]
**Status:** Draft / Reviewed / Complete
## Summary
[2-3 sentences: what happened, impact, how resolved]
## Impact
- **Users affected:** [number or percentage]
- **Revenue impact:** [estimated if applicable]
- **Duration:** [from detection to resolution]
- **Services affected:** [list]
## Timeline (all times UTC)
| Time | Event |
|------|-------|
| 14:23 | Deploy of commit abc123 to production |
| 14:31 | Alert: API error rate > 5% |
| 14:33 | On-call acknowledged, began investigation |
| 14:41 | Identified slow database queries |
| 14:45 | Decision: rollback deploy |
| 14:48 | Rollback complete |
| 14:52 | Error rate returned to baseline |
| 14:55 | Confirmed: all systems nominal |
## Root Cause
[Clear explanation of what broke and why, without blame]
## Contributing Factors
- [Factor 1: e.g., no query performance testing in CI]
- [Factor 2: e.g., alert threshold was too high, delayed detection by 8 min]
- [Factor 3: e.g., runbook for DB issues was outdated]
## What Went Well
- Quick detection (8 min from deploy to alert)
- Rollback was smooth (3 min)
- Good communication in incident channel
## What Went Wrong
- No pre-deploy performance check would have caught the missing index
- Alert threshold of 5% was too high — impact started at 1%
- Took 10 min to identify root cause (no slow query dashboard)
## Action Items
| Priority | Action | Owner | Due | Status |
|----------|--------|-------|-----|--------|
| P1 | Add migration linter to CI (check for missing indexes) | @alice | 2026-05-05 | TODO |
| P1 | Lower error rate alert threshold to 1% | @bob | 2026-05-01 | TODO |
| P2 | Add slow query dashboard to Grafana | @carol | 2026-05-10 | TODO |
| P2 | Update DB incident runbook | @dave | 2026-05-07 | TODO |
| P3 | Add query performance tests to staging deploy | @alice | 2026-05-20 | TODO |
## Lessons Learned
[What did we learn that applies beyond this specific incident?]
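A sketch of filling the template programmatically from gathered data; the `render_postmortem` signature and field names are illustrative, and only two sections are shown for brevity:

```python
def render_postmortem(title, date, duration, severity, timeline, action_items):
    """Render a minimal postmortem in the template above.
    timeline: (time, event) pairs; action_items: (priority, action, owner, due)."""
    lines = [
        f"# Incident Postmortem: {title}",
        f"**Date:** {date}",
        f"**Duration:** {duration}",
        f"**Severity:** SEV-{severity}",
        "## Timeline (all times UTC)",
        "| Time | Event |",
        "|------|-------|",
    ]
    lines += [f"| {t} | {event} |" for t, event in timeline]
    lines += [
        "## Action Items",
        "| Priority | Action | Owner | Due | Status |",
        "|----------|--------|-------|-----|--------|",
    ]
    lines += [f"| {p} | {a} | {o} | {d} | TODO |" for p, a, o, d in action_items]
    return "\n".join(lines)

doc = render_postmortem(
    "API outage", "2026-04-28", "0h 32m", 2,
    timeline=[("14:23", "Deploy of commit abc123 to production")],
    action_items=[("P1", "Add migration linter to CI", "@alice", "2026-05-05")],
)
print(doc)
```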
2. review — Facilitate Blameless Review
Generate review meeting agenda:
- Timeline walkthrough (facts only, no blame)
- What surprised us?
- Where did our assumptions fail?
- What would have prevented this?
- Action item assignment and prioritization
3. track — Follow Up on Action Items
Check status of postmortem action items:
- Which action items from recent postmortems are still open?
- Are we repeating the same root causes? (cluster analysis)
- Average time to close action items by priority
- Incidents that could have been prevented by completed action items
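Answering these questions assumes action items can be read back programmatically. A minimal sketch that parses the Action Items table format from the `generate` template; the regex and the `DONE` status convention are assumptions about how your postmortems are maintained:

```python
import re

# Matches rows like: | P1 | Lower alert threshold | @bob | 2026-05-01 | TODO |
ROW = re.compile(
    r"^\|\s*(P\d)\s*\|\s*(.+?)\s*\|\s*(@\S+)\s*\|\s*([\d-]+)\s*\|\s*(\w+)\s*\|$"
)

def open_action_items(markdown: str):
    """Yield (priority, action, owner, due) for every action-item row
    whose status column is not DONE."""
    for line in markdown.splitlines():
        m = ROW.match(line.strip())
        if m and m.group(5).upper() != "DONE":
            yield m.group(1), m.group(2), m.group(3), m.group(4)

doc = """
| P1 | Lower error rate alert threshold to 1% | @bob | 2026-05-01 | TODO |
| P2 | Update DB incident runbook | @dave | 2026-05-07 | DONE |
"""
for item in open_action_items(doc):
    print(item)
```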