postmortem-writer

Document incidents for learning and improvement.

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Install skill "postmortem-writer" with this command: npx skills add monkey1sai/openai-cli/monkey1sai-openai-cli-postmortem-writer

Postmortem Template

Postmortem: API Outage - Database Connection Pool Exhausted

Date: 2024-01-15
Authors: Jane Doe (On-call), John Smith (DBA)
Status: Complete
Severity: P1 (Critical)

Summary

On January 15, 2024, our API experienced a complete outage for 25 minutes (14:32 - 14:57 UTC) affecting 100% of users. The root cause was database connection pool exhaustion triggered by a connection leak introduced in deployment v2.3.4.

Impact:

  • Duration: 25 minutes
  • Users affected: ~50,000
  • Requests failed: ~125,000
  • Revenue impact: ~$15,000

Timeline (All times UTC)

| Time | Event |
|-------|-------|
| 14:15 | v2.3.4 deployed to production |
| 14:32 | First CloudWatch alarm: HighErrorRate |
| 14:33 | PagerDuty alert sent to on-call (Jane) |
| 14:35 | Jane acknowledges, begins investigation |
| 14:38 | Identified: Database connection pool at 100% |
| 14:40 | Attempted: Kill long-running queries (no effect) |
| 14:43 | Decision: Rollback to v2.3.3 |
| 14:45 | Rollback initiated |
| 14:47 | Rollback complete, connections dropping |
| 14:50 | Error rate returning to normal |
| 14:57 | All systems recovered, incident closed |
| 15:30 | Postmortem meeting scheduled |
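The intervals quoted later in this postmortem (17 minutes from deploy to alert, 25 minutes of outage) can be derived directly from the timeline. The following is a minimal sketch, assuming all timestamps fall on the same UTC day; the `timeline` object and helper names are illustrative and not part of the original template.

```typescript
// Convert an "HH:MM" timestamp into minutes since midnight.
function toMinutes(hhmm: string): number {
  const [hours, minutes] = hhmm.split(":").map(Number);
  return hours * 60 + minutes;
}

// Elapsed minutes between two same-day timestamps (assumption: no midnight rollover).
function minutesBetween(start: string, end: string): number {
  return toMinutes(end) - toMinutes(start);
}

// Key timestamps copied from the timeline table.
const timeline = {
  deployed: "14:15",
  firstAlarm: "14:32",
  acknowledged: "14:35",
  rollbackDecision: "14:43",
  recovered: "14:57",
};

const metrics = {
  deployToAlarm: minutesBetween(timeline.deployed, timeline.firstAlarm),
  alarmToAck: minutesBetween(timeline.firstAlarm, timeline.acknowledged),
  ackToDecision: minutesBetween(timeline.acknowledged, timeline.rollbackDecision),
  outage: minutesBetween(timeline.firstAlarm, timeline.recovered),
};

console.log(metrics);
```

Computing these values from the table, rather than typing them by hand, keeps the summary and the "What Went Well / Wrong" sections consistent with the timeline.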

Root Cause

A code change in v2.3.4 introduced a connection leak in the user profile endpoint. The new caching layer was not properly releasing database connections after queries completed.

Code diff:

```diff
- await prisma.user.findUnique({ where: { id } });
+ const client = await pool.connect();
+ const user = await client.query('SELECT * FROM users WHERE id = $1', [id]);
+ // Missing: client.release() ❌
```

Contributing Factors

  1. Insufficient testing: Load tests didn't catch the leak

    • Tests only ran for 5 minutes
    • Not enough concurrent connections to exhaust pool
  2. Missing monitoring: No alerts on connection pool metrics

    • Had alarms for query latency
    • No alarms for active connections count
  3. Inadequate code review: Reviewer didn't spot missing release()

    • PR approved without running locally
    • No checklist for connection management
  4. Deployment process: No gradual rollout

    • Deployed to 100% of production immediately
    • No canary deployment
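The monitoring gap in factor 2 could be closed with a check on active connections. Below is a rough sketch of such a check. The `PoolSnapshot` shape, the `classifyPool` name, and the 80% warning threshold are assumptions made for illustration. If the pool is node-postgres (which the source does not confirm), its `totalCount`, `idleCount`, and `waitingCount` properties could supply these fields.

```typescript
// Hypothetical snapshot of pool state; field names are illustrative.
interface PoolSnapshot {
  max: number;     // configured pool size
  total: number;   // connections currently open
  idle: number;    // open connections not checked out
  waiting: number; // callers queued for a connection
}

type PoolAlert = "ok" | "warn" | "critical";

// Classify pool health. The warning threshold is an assumed default, not a recommendation.
function classifyPool(snapshot: PoolSnapshot, warnAt = 0.8): PoolAlert {
  const active = snapshot.total - snapshot.idle;
  const utilization = active / snapshot.max;
  // A full pool or any queued caller means requests are already blocked.
  if (utilization >= 1 || snapshot.waiting > 0) return "critical";
  if (utilization >= warnAt) return "warn";
  return "ok";
}

// Example: the state observed at 14:38 in this incident (pool at 100%).
console.log(classifyPool({ max: 20, total: 20, idle: 0, waiting: 12 }));
```

A "warn" level below 100% matters here because a leak fills the pool gradually; alerting only on latency or error rate fires after the pool is already exhausted.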

What Went Well

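  1. Fast response to the alert: On-call was paged 1 minute after the first alarm and acknowledged 3 minutes after it (14:32 to 14:35)
  2. Clear runbook: DBA runbook had exact steps to follow
  3. Quick decision: Rollback decided 8 minutes after acknowledgment (14:35 to 14:43)
  4. Communication: Status page updated every 5 minutes
  5. Rollback capability: Automated rollback completed in about 2 minutes (14:45 to 14:47)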

What Went Wrong

  1. Code review missed bug: Connection leak not caught
  2. Testing gaps: Load tests ran for too short a duration to expose the leak
  3. No canary: Deployed to all instances at once
  4. Late detection: 17 minutes between deploy and alert

Action Items

| Action | Owner | Due Date | Priority | Status |
|--------|-------|----------|----------|--------|
| Add connection pool metrics to dashboards | Jane | 2024-01-20 | P0 | ✅ Done |
| Create PR checklist for connection management | John | 2024-01-22 | P0 | ✅ Done |
| Extend load tests to 30 minutes minimum | QA Team | 2024-01-25 | P1 | 🔄 In Progress |
| Implement canary deployment (10% → 100%) | DevOps | 2024-02-01 | P1 | 📋 Planned |
| Add connection leak detection to tests | Jane | 2024-01-27 | P1 | 🔄 In Progress |
| Review all DB connection usage patterns | John | 2024-02-05 | P2 | 📋 Planned |
| Improve alert routing (faster escalation) | DevOps | 2024-02-10 | P2 | 📋 Planned |
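One possible shape for the "connection leak detection" action item is a test double that counts checked-out connections. This is a sketch under stated assumptions: `TrackingPool`, `countLeaks`, and both handlers are hypothetical names written for this example, and the fake client returns no rows instead of talking to a database.

```typescript
// Minimal client shape mirroring the calls used in the root-cause diff.
interface TestClient {
  query(sql: string, params: unknown[]): Promise<unknown[]>;
  release(): void;
}

// Fake pool that tracks how many clients are checked out and not yet released.
class TrackingPool {
  private checkedOut = 0;

  async connect(): Promise<TestClient> {
    this.checkedOut++;
    let released = false;
    return {
      query: async () => [], // no real database; always returns an empty result
      release: () => {
        if (released) return; // ignore double release
        released = true;
        this.checkedOut--;
      },
    };
  }

  get outstanding(): number {
    return this.checkedOut;
  }
}

type Handler = (pool: TrackingPool, id: number) => Promise<unknown>;

// Reproduces the v2.3.4 bug: the client is never released.
const leakyHandler: Handler = async (pool, id) => {
  const client = await pool.connect();
  return client.query("SELECT * FROM users WHERE id = $1", [id]);
};

// Releases in a finally block, as in the code fix.
const fixedHandler: Handler = async (pool, id) => {
  const client = await pool.connect();
  try {
    return await client.query("SELECT * FROM users WHERE id = $1", [id]);
  } finally {
    client.release();
  }
};

// Run a handler repeatedly and report how many connections it left checked out.
async function countLeaks(handler: Handler, calls: number): Promise<number> {
  const pool = new TrackingPool();
  for (let i = 0; i < calls; i++) {
    await handler(pool, i);
  }
  return pool.outstanding;
}
```

A test would assert that `countLeaks(handler, n)` resolves to `0`. Unlike a load test, this check does not depend on run duration or concurrency, so it catches a leak on the first call.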

Lessons Learned

  1. Code review checklists work: Need specific items for common issues
  2. Load tests need realistic duration: 5 minutes is insufficient to surface leaks
  3. Gradual rollouts catch issues: 10% canary would have limited impact
  4. Monitoring gaps are dangerous: Add metrics before you need them
  5. Runbooks save time: Clear procedures enabled fast response

Related Incidents

  • [2023-11-20] Database CPU spike (similar connection pool issue)
  • [2023-08-15] Memory leak in cache layer

Prevention

To prevent similar incidents:

  1. ✅ Add connection management to code review checklist
  2. ✅ Monitor connection pool utilization
  3. ✅ Extend load test duration
  4. ✅ Implement canary deployments
  5. ✅ Add automated connection leak detection
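For the canary item, the promotion decision can be reduced to comparing the canary's error rate against the baseline. The sketch below is illustrative only: the `WindowStats` shape, the `canaryDecision` name, the minimum sample size, and the allowed error-rate increase are all assumptions, not values from this incident.

```typescript
// Request and error counts observed over the same time window.
interface WindowStats {
  requests: number;
  errors: number;
}

type CanaryDecision = "promote" | "rollback" | "wait";

// Decide whether to promote a canary. Thresholds are assumed defaults for illustration.
function canaryDecision(
  baseline: WindowStats,
  canary: WindowStats,
  minRequests = 500,
  maxErrorRateIncrease = 0.01,
): CanaryDecision {
  // Not enough canary traffic yet to judge (also guards against division by zero).
  if (canary.requests < Math.max(minRequests, 1)) return "wait";

  const baselineRate =
    baseline.requests === 0 ? 0 : baseline.errors / baseline.requests;
  const canaryRate = canary.errors / canary.requests;

  return canaryRate - baselineRate > maxErrorRateIncrease ? "rollback" : "promote";
}

// Example: a canary failing 40% of requests against a healthy baseline.
console.log(
  canaryDecision({ requests: 9000, errors: 9 }, { requests: 1000, errors: 400 }),
);
```

Because a connection leak takes time to exhaust the pool, the observation window for such a gate would need to be long enough for the error rate to diverge; a gate evaluated seconds after deploy would not have caught this incident.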

Appendix

Monitoring Graphs

[Insert graphs of connection pool, error rates, latency during incident]

Communication Log

[Insert status page updates and customer communication]

Code Fix

PR #1235: Fix connection leak in user profile endpoint

```typescript
const client = await pool.connect();
try {
  const user = await client.query('SELECT * FROM users WHERE id = $1', [id]);
  return user;
} finally {
  client.release(); // ✅ Always release
}
```
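To support the "review all DB connection usage patterns" action item, the try/finally pattern could be centralized so individual endpoints cannot forget the release. The following is one possible sketch, not the change made in the PR: `withClient`, `getUser`, and the `DbClient` / `DbPool` interfaces are hypothetical names standing in for the real driver types.

```typescript
// Hypothetical minimal shapes; a real driver's types would replace these.
interface DbClient {
  query(sql: string, params: unknown[]): Promise<unknown>;
  release(): void;
}

interface DbPool {
  connect(): Promise<DbClient>;
}

// Check out a client, run the callback, and always release the client,
// whether the callback resolves or throws.
async function withClient<T>(
  pool: DbPool,
  work: (client: DbClient) => Promise<T>,
): Promise<T> {
  const client = await pool.connect();
  try {
    return await work(client);
  } finally {
    client.release();
  }
}

// Usage: the endpoint never touches release() directly.
async function getUser(pool: DbPool, id: number): Promise<unknown> {
  return withClient(pool, (client) =>
    client.query("SELECT * FROM users WHERE id = $1", [id]),
  );
}
```

With a helper like this, a code review checklist item can be reduced to "no direct `pool.connect()` calls outside the helper", which is easier to verify than reading every code path for a missing release.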

Postmortem Best Practices

Blameless Postmortem Guidelines

Do ✅

  • Focus on systems and processes, not people
  • Use timeline with exact timestamps
  • Identify contributing factors, not just root cause
  • Create actionable items with owners and dates
  • Document what went well (positive reinforcement)
  • Share widely for organizational learning

Don't ❌

  • Blame individuals or teams
  • Hide or minimize the incident
  • Skip the postmortem (even for small incidents)
  • Create action items without owners
  • Forget to follow up on action items
  • Make it a blame session

Template Sections

  1. Summary (2-3 sentences)
  2. Impact (numbers: users, revenue, duration)
  3. Timeline (chronological events)
  4. Root Cause (technical explanation)
  5. Contributing Factors (broader context)
  6. What Went Well (positive reinforcement)
  7. What Went Wrong (improvement areas)
  8. Action Items (concrete, owned, dated)
  9. Lessons Learned (key takeaways)
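A draft can be checked against this section list automatically. Here is a minimal sketch, assuming each section heading sits on its own line (optionally prefixed with `#` or followed by a colon or a parenthetical); `REQUIRED_SECTIONS` and `missingSections` are names invented for this example.

```typescript
// The nine sections listed above.
const REQUIRED_SECTIONS = [
  "Summary",
  "Impact",
  "Timeline",
  "Root Cause",
  "Contributing Factors",
  "What Went Well",
  "What Went Wrong",
  "Action Items",
  "Lessons Learned",
] as const;

// Return the required sections that have no matching heading line in the draft.
function missingSections(draft: string): string[] {
  const headings = draft.split(/\r?\n/).map((line) =>
    line
      .trim()
      .replace(/^#+\s*/, "") // tolerate markdown heading markers
      .replace(/:$/, "")     // tolerate "Impact:"
      .toLowerCase(),
  );
  return REQUIRED_SECTIONS.filter((section) => {
    const name = section.toLowerCase();
    // Match "timeline" exactly or with a suffix such as "timeline (all times utc)".
    return !headings.some((h) => h === name || h.startsWith(name + " "));
  });
}

const draft = [
  "Summary",
  "The API was down for 25 minutes.",
  "Impact:",
  "Timeline (All times UTC)",
  "## Root Cause",
].join("\n");

console.log(missingSections(draft));
```

The check only confirms that headings are present; whether the action items have owners and dates still needs a human review.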

Output Checklist

  • Timeline created
  • Root cause identified
  • Contributing factors documented
  • Action items with owners
  • Lessons learned captured
  • Postmortem meeting held
  • Document shared widely
  • Follow-up scheduled
