Postmortem Generator
Generate blameless postmortems that prevent repeat incidents. Compile a timeline from alerts, chat logs, and metrics into a structured report with root-cause analysis, contributing factors, and tracked action items, following the Google SRE and Etsy blameless formats.
Use when: "write postmortem", "incident review", "blameless review", "what happened during the outage", "incident report", "post-incident review", or after any SEV1/SEV2 incident.
Commands
1. generate — Create Postmortem from Incident Data
Step 1: Gather Timeline Data
```bash
# PagerDuty incident timeline
curl -s "https://api.pagerduty.com/incidents/$INCIDENT_ID/log_entries" \
  -H "Authorization: Token token=$PD_TOKEN" | python3 -c "
import json, sys

entries = json.load(sys.stdin)['log_entries']
for e in entries:
    ts = e['created_at'][:19]
    entry_type = e['type']
    summary = e.get('summary', e.get('channel', {}).get('summary', ''))
    print(f'{ts} [{entry_type}] {summary}')
"
```
```bash
# Alert history (Prometheus/Alertmanager v2 API)
curl -s "http://alertmanager:9093/api/v2/alerts?filter=incident_id=$INCIDENT_ID" | python3 -c "
import json, sys

alerts = json.load(sys.stdin)
for a in sorted(alerts, key=lambda x: x['startsAt']):
    # In the v2 API, the alert state is nested under status.state
    print(f'{a[\"startsAt\"][:19]} ALERT: {a[\"labels\"][\"alertname\"]} ({a[\"status\"][\"state\"]})')
"
```
```bash
# Git deploys around incident time
git log --since="$INCIDENT_START" --until="$INCIDENT_END" --oneline 2>/dev/null
```
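The three feeds above can then be merged into one chronological timeline before drafting the report. A minimal sketch, assuming each feed has been normalized to `(timestamp, source, description)` tuples with ISO-8601 UTC timestamps (the `merge_timeline` helper and sample events are illustrative, not part of any CLI):

```python
def merge_timeline(*sources):
    """Merge (timestamp, source, description) tuples from several
    feeds into one chronologically sorted timeline. ISO-8601 strings
    sort correctly as plain text."""
    events = [e for src in sources for e in src]
    return sorted(events, key=lambda e: e[0])

# Illustrative events, as printed by the commands above
deploys = [("2026-04-28T14:23:40", "deploy", "abc123 to production")]
alerts = [("2026-04-28T14:31:12", "alert", "API error rate > 5%")]
pagerduty = [("2026-04-28T14:33:05", "pagerduty", "On-call acknowledged")]

for ts, source, desc in merge_timeline(deploys, alerts, pagerduty):
    print(f"{ts} [{source}] {desc}")
```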
Step 2: Analyze Root Cause
Use the "5 Whys" technique:
- Why did the service go down? → Database connection pool exhausted
- Why was the pool exhausted? → Slow queries holding connections
- Why were queries slow? → Missing index on new column
- Why was the index missing? → Migration didn't include it
- Why wasn't this caught? → No query performance tests in CI
Identify:
- Root cause: The deepest "why" that's actionable
- Contributing factors: Things that made it worse (no alerting, manual process, missing runbook)
- Mitigating factors: Things that helped (quick detection, good rollback process)
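The outcome of the analysis can be captured as structured data so later steps (document generation, repeat-cause clustering) can consume it. A minimal sketch using the incident above; the class and field names are hypothetical, not a fixed schema:

```python
from dataclasses import dataclass, field

@dataclass
class RootCauseAnalysis:
    # Each "why" answer, ordered from surface symptom to deepest cause
    five_whys: list[str]
    contributing_factors: list[str] = field(default_factory=list)
    mitigating_factors: list[str] = field(default_factory=list)

    @property
    def root_cause(self) -> str:
        # The deepest "why" is the actionable root cause
        return self.five_whys[-1]

analysis = RootCauseAnalysis(
    five_whys=[
        "Database connection pool exhausted",
        "Slow queries holding connections",
        "Missing index on new column",
        "Migration didn't include it",
        "No query performance tests in CI",
    ],
    mitigating_factors=["Quick detection", "Good rollback process"],
)
print(analysis.root_cause)  # → No query performance tests in CI
```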
Step 3: Generate Postmortem Document
# Incident Postmortem: [Title]
**Date:** [YYYY-MM-DD]
**Duration:** [Xh Ym]
**Severity:** SEV-[1/2/3]
**Author:** [Name]
**Status:** Draft / Reviewed / Complete
## Summary
[2-3 sentences: what happened, impact, how resolved]
## Impact
- **Users affected:** [number or percentage]
- **Revenue impact:** [estimated if applicable]
- **Duration:** [from detection to resolution]
- **Services affected:** [list]
## Timeline (all times UTC)
| Time | Event |
|------|-------|
| 14:23 | Deploy of commit abc123 to production |
| 14:31 | Alert: API error rate > 5% |
| 14:33 | On-call acknowledged, began investigation |
| 14:41 | Identified slow database queries |
| 14:45 | Decision: rollback deploy |
| 14:48 | Rollback complete |
| 14:52 | Error rate returned to baseline |
| 14:55 | Confirmed: all systems nominal |
## Root Cause
[Clear explanation of what broke and why, without blame]
## Contributing Factors
- [Factor 1: e.g., no query performance testing in CI]
- [Factor 2: e.g., alert threshold was too high, delayed detection by 8 min]
- [Factor 3: e.g., runbook for DB issues was outdated]
## What Went Well
- Quick detection (8 min from deploy to alert)
- Rollback was smooth (3 min)
- Good communication in incident channel
## What Went Wrong
- No pre-deploy performance check would have caught the missing index
- Alert threshold of 5% was too high — impact started at 1%
- Took 10 min to identify root cause (no slow query dashboard)
## Action Items
| Priority | Action | Owner | Due | Status |
|----------|--------|-------|-----|--------|
| P1 | Add migration linter to CI (check for missing indexes) | @alice | 2026-05-05 | TODO |
| P1 | Lower error rate alert threshold to 1% | @bob | 2026-05-01 | TODO |
| P2 | Add slow query dashboard to Grafana | @carol | 2026-05-10 | TODO |
| P2 | Update DB incident runbook | @dave | 2026-05-07 | TODO |
| P3 | Add query performance tests to staging deploy | @alice | 2026-05-20 | TODO |
## Lessons Learned
[What did we learn that applies beyond this specific incident?]
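A sketch of filling the template programmatically from gathered data; the `render_postmortem` signature and field names are illustrative, and only two sections are shown for brevity:

```python
def render_postmortem(title, date, duration, severity, timeline, action_items):
    """Render a minimal postmortem in the template above.
    timeline: (time, event) pairs; action_items: (priority, action, owner, due)."""
    lines = [
        f"# Incident Postmortem: {title}",
        f"**Date:** {date}",
        f"**Duration:** {duration}",
        f"**Severity:** SEV-{severity}",
        "## Timeline (all times UTC)",
        "| Time | Event |",
        "|------|-------|",
    ]
    lines += [f"| {t} | {event} |" for t, event in timeline]
    lines += [
        "## Action Items",
        "| Priority | Action | Owner | Due | Status |",
        "|----------|--------|-------|-----|--------|",
    ]
    lines += [f"| {p} | {a} | {o} | {d} | TODO |" for p, a, o, d in action_items]
    return "\n".join(lines)

doc = render_postmortem(
    "API outage", "2026-04-28", "0h 32m", 2,
    timeline=[("14:23", "Deploy of commit abc123 to production")],
    action_items=[("P1", "Add migration linter to CI", "@alice", "2026-05-05")],
)
print(doc)
```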
2. review — Facilitate Blameless Review
Generate review meeting agenda:
- Timeline walkthrough (facts only, no blame)
- What surprised us?
- Where did our assumptions fail?
- What would have prevented this?
- Action item assignment and prioritization
3. track — Follow Up on Action Items
Check status of postmortem action items:
- Which action items from recent postmortems are still open?
- Are we repeating the same root causes? (cluster analysis)
- Average time to close action items by priority
- Incidents that could have been prevented by completed action items
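Answering these questions assumes action items can be read back programmatically. A minimal sketch that parses the Action Items table format from the `generate` template; the regex and the `DONE` status convention are assumptions about how your postmortems are maintained:

```python
import re

# Matches rows like: | P1 | Lower alert threshold | @bob | 2026-05-01 | TODO |
ROW = re.compile(
    r"^\|\s*(P\d)\s*\|\s*(.+?)\s*\|\s*(@\S+)\s*\|\s*([\d-]+)\s*\|\s*(\w+)\s*\|$"
)

def open_action_items(markdown: str):
    """Yield (priority, action, owner, due) for every action-item row
    whose status column is not DONE."""
    for line in markdown.splitlines():
        m = ROW.match(line.strip())
        if m and m.group(5).upper() != "DONE":
            yield m.group(1), m.group(2), m.group(3), m.group(4)

doc = """
| P1 | Lower error rate alert threshold to 1% | @bob | 2026-05-01 | TODO |
| P2 | Update DB incident runbook | @dave | 2026-05-07 | DONE |
"""
for item in open_action_items(doc):
    print(item)
```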