# Operating Production Services

Production reliability patterns: measure what matters, learn from failures, improve systematically.
## Quick Reference

| Need | Go To |
|---|---|
| Define reliability targets | SLOs & Error Budgets |
| Write incident report | Postmortem Templates |
| Set up SLO alerting | references/slo-alerting.md |
## SLOs & Error Budgets

### The Hierarchy

SLA (Contract) → SLO (Target) → SLI (Measurement)
### Common SLIs

**Availability**: successful requests / total requests

```promql
sum(rate(http_requests_total{status!~"5.."}[28d])) / sum(rate(http_requests_total[28d]))
```

**Latency**: requests below threshold / total requests

```promql
sum(rate(http_request_duration_seconds_bucket{le="0.5"}[28d])) / sum(rate(http_request_duration_seconds_count[28d]))
```
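Both SLIs are plain good-over-total ratios; a minimal sketch of the same computation outside Prometheus (the function names are illustrative, not part of the queries above):

```python
def availability_sli(successful: int, total: int) -> float:
    """Fraction of requests that succeeded (non-5xx) over the window."""
    return successful / total if total else 1.0

def latency_sli(under_threshold: int, total: int) -> float:
    """Fraction of requests completing below the latency threshold."""
    return under_threshold / total if total else 1.0

# 999 good requests out of 1000 -> an SLI of 0.999
print(availability_sli(999, 1_000))
```

An empty window is treated as meeting the target (ratio 1.0), mirroring how a no-traffic period consumes no error budget.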
### SLO Targets Reality Check

| SLO % | Downtime/Month | Downtime/Year |
|---|---|---|
| 99% | 7.2 hours | 3.65 days |
| 99.9% | 43 minutes | 8.76 hours |
| 99.95% | 22 minutes | 4.38 hours |
| 99.99% | 4.3 minutes | 52 minutes |

Don't aim for 100%. Each nine costs exponentially more.
### Error Budget

Error Budget = 1 - SLO Target

Example: 99.9% SLO = 0.1% error budget = 43 minutes/month
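The downtime figures in the reality-check table follow directly from this formula; a quick sketch to sanity-check them (a 30-day month is assumed):

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime, in minutes, for a given SLO over a rolling window."""
    return (1 - slo) * window_days * 24 * 60

for slo in (0.99, 0.999, 0.9995, 0.9999):
    # 0.999 -> ~43 min/month, matching the table above
    print(f"{slo:.2%} SLO -> {error_budget_minutes(slo):.1f} min/month")
```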
**Policy:**

| Budget Remaining | Action |
|---|---|
| > 50% | Normal velocity |
| 10-50% | Postpone risky changes |
| < 10% | Freeze non-critical changes |
| 0% | Feature freeze, fix reliability |
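The policy table can be encoded directly so deploy tooling can consult it; a sketch using the thresholds above (the function name is hypothetical):

```python
def policy_action(budget_remaining: float) -> str:
    """Map remaining error budget (as a fraction, 0.0-1.0) to the policy action."""
    if budget_remaining > 0.50:
        return "Normal velocity"
    if budget_remaining >= 0.10:
        return "Postpone risky changes"
    if budget_remaining > 0.0:
        return "Freeze non-critical changes"
    return "Feature freeze, fix reliability"

print(policy_action(0.35))  # Postpone risky changes
```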
See references/slo-alerting.md for Prometheus recording rules and multi-window burn rate alerts.
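For context before that reference: a burn rate expresses how fast the budget is being spent relative to plan. A minimal sketch of the arithmetic (illustrative only, not the recording rules from the reference; a 30-day window is assumed):

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """Burn rate 1.0 spends the error budget exactly over the full window;
    10.0 spends it in a tenth of the window."""
    return error_ratio / (1 - slo)

def hours_to_exhaustion(rate: float, window_days: int = 30) -> float:
    """Hours until the budget is gone if the current burn rate holds."""
    return window_days * 24 / rate

r = burn_rate(0.01, 0.999)  # 1% errors against a 99.9% SLO -> burn rate ~10
print(round(r), round(hours_to_exhaustion(r)))  # budget gone in ~72 hours
```

Multi-window alerting pages only when both a long and a short window show a high burn rate, which filters out brief blips while still catching fast burns.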
## Postmortem Templates

### The Blameless Principle

| Blame-Focused | Blameless |
|---|---|
| "Who caused this?" | "What conditions allowed this?" |
| Punish individuals | Improve systems |
| Hide information | Share learnings |
### When to Write Postmortems

- SEV1/SEV2 incidents
- Customer-facing outages > 15 minutes
- Data loss or security incidents
- Near-misses that could have been severe
- Novel failure modes
### Standard Template

```markdown
# Postmortem: [Incident Title]

**Date:** YYYY-MM-DD | **Duration:** X min | **Severity:** SEVX

## Executive Summary

One paragraph: what happened, impact, root cause, resolution.

## Timeline (UTC)

| Time | Event |
|---|---|
| HH:MM | First alert fired |
| HH:MM | On-call acknowledged |
| HH:MM | Root cause identified |
| HH:MM | Fix deployed |
| HH:MM | Service recovered |

## Root Cause Analysis

### 5 Whys

1. Why did the service fail? → [Answer]
2. Why did [1] happen? → [Answer]
3. Why did [2] happen? → [Answer]
4. Why did [3] happen? → [Answer]
5. Why did [4] happen? → [Root cause]

## Impact

- Customers affected: X
- Duration: X minutes
- Revenue impact: $X
- Support tickets: X

## Action Items

| Priority | Action | Owner | Due | Ticket |
|---|---|---|---|---|
| P0 | [Immediate fix] | @name | Date | XXX-123 |
| P1 | [Prevent recurrence] | @name | Date | XXX-124 |
| P2 | [Improve detection] | @name | Date | XXX-125 |
```
### Quick Template (Minor Incidents)

```markdown
# Quick Postmortem: [Title]

**Date:** YYYY-MM-DD | **Duration:** X min | **Severity:** SEV3

## What Happened

One-sentence description.

## Timeline

- HH:MM - Trigger
- HH:MM - Detection
- HH:MM - Resolution

## Root Cause

One sentence.

## Fix

- Immediate: [What was done]
- Long-term: [Ticket XXX-123]
```
## Postmortem Meeting Guide

### Structure (60 min)

1. **Opening (5 min)** - Remind: "We're here to learn, not blame"
2. **Timeline (15 min)** - Walk through events chronologically
3. **Analysis (20 min)** - What failed? Why? What allowed it?
4. **Action Items (15 min)** - Prioritize, assign owners, set dates
5. **Closing (5 min)** - Summarize learnings, confirm owners
### Facilitation Tips

- Redirect blame to systems: "What made this mistake possible?"
- Time-box tangents
- Document dissenting views
- Encourage quiet participants
## Anti-Patterns

| Don't | Do Instead |
|---|---|
| Aim for 100% SLO | Accept that an error budget exists |
| Skip small incidents | Small incidents reveal patterns |
| Orphan action items | Every item needs an owner, a date, and a ticket |
| Blame individuals | Ask "what conditions allowed this?" |
| Create busywork actions | Actions should prevent recurrence |
## Verification

Run: `python scripts/verify.py`
## References

- references/slo-alerting.md - Prometheus rules, burn rate alerts, Grafana dashboards