Observability & Reliability Engineering
Complete system for building observable, reliable services — from structured logging to incident response to SLO-driven development.
Quick Health Check (/16)
Score your current observability posture:
| Signal | Healthy (2) | Weak (1) | Missing (0) |
|---|---|---|---|
| Structured logging | JSON logs with trace_id correlation | Logs exist but unstructured | Console.log / print statements |
| Metrics collection | RED/USE metrics with dashboards | Some metrics, no dashboards | No metrics |
| Distributed tracing | Full request path with sampling | Partial traces, key services only | No tracing |
| Alerting | SLO-based alerts with runbooks | Threshold alerts, some runbooks | No alerts or all-noise |
| Incident response | Defined process with roles + post-mortems | Ad-hoc response, some docs | "Whoever notices fixes it" |
| SLOs defined | SLOs with error budgets tracked weekly | Informal availability targets | No reliability targets |
| On-call rotation | Structured rotation with escalation | Informal "call someone" | No on-call |
| Cost management | Observability budget tracked monthly | Some awareness of costs | No idea what you spend |
12-16: Production-grade. Focus on optimization.
8-11: Foundation exists. Fill the gaps systematically.
4-7: Significant risk. Prioritize alerting + incident response.
0-3: Flying blind. Start with Phase 1 immediately.
Phase 1: Structured Logging
Log Architecture
Application → Structured JSON → Log Router → Storage → Query Engine
                                     ↓
                               Alert Pipeline
Required Fields (Every Log Line)
| Field | Type | Purpose | Example |
|---|---|---|---|
| timestamp | ISO-8601 UTC | When | 2026-02-22T18:30:00.123Z |
| level | enum | Severity | info, warn, error, fatal |
| service | string | Which service | payment-api |
| version | string | Which deploy | v2.3.1 |
| environment | string | Which env | production |
| message | string | What happened | Payment processed successfully |
| trace_id | string | Request correlation | abc123def456 |
| span_id | string | Operation within trace | span_789 |
| duration_ms | number | How long | 142 |
Contextual Fields (Add Per Domain)
# HTTP request context
http:
method: POST
path: /api/v1/orders
status: 201
client_ip: 203.0.113.42 # Anonymize in logs if needed
user_agent: "Mozilla/5.0..."
request_id: "req_abc123"
# Business context
business:
user_id: "usr_456"
tenant_id: "tenant_789"
order_id: "ord_012"
action: "checkout"
amount_cents: 4999
currency: "USD"
# Error context
error:
type: "PaymentDeclinedError"
message: "Card declined: insufficient funds"
code: "CARD_DECLINED"
stack: "..." # Only in non-production or DEBUG level
retry_count: 2
retryable: true
Log Level Decision Tree
Is the process about to crash?
→ FATAL (exit after logging)
Did an operation fail that needs human attention?
→ ERROR (page someone or create ticket)
Did something unexpected happen but we recovered?
→ WARN (review in daily triage)
Is this a normal business event worth recording?
→ INFO (audit trail, business metrics)
Is this useful for debugging but noisy in production?
→ DEBUG (off in prod, on in staging)
Is this only useful when stepping through code?
→ TRACE (never in production)
Log Level Rules
- ERROR means action required — if no one needs to act on it, it's WARN
- INFO is for business events — not internal implementation details
- No logging inside tight loops — aggregate and log a summary (see the sketch after this list)
- Log at boundaries — API entry/exit, queue consume/publish, DB calls
- Never log secrets — API keys, tokens, passwords, PII (see scrubbing below)
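A minimal sketch of the hot-loop rule in TypeScript, using a structured logger like the Pino instance configured below; handleItem is a hypothetical per-item worker standing in for the real loop body:
import pino from 'pino';

const logger = pino();

// Hypothetical per-item work; replace with whatever the loop actually does.
async function handleItem(item: { id: string }): Promise<void> {
  /* ... */
}

async function processBatch(items: { id: string }[]): Promise<void> {
  const summary = { processed: 0, failed: 0, sample_failures: [] as string[] };
  for (const item of items) {
    try {
      await handleItem(item);
      summary.processed++;
    } catch {
      summary.failed++;
      // Keep a few failing IDs for debugging without logging every failure.
      if (summary.sample_failures.length < 5) summary.sample_failures.push(item.id);
    }
  }
  // One structured line for the whole batch instead of one line per item.
  logger.info({ batch_size: items.length, ...summary }, 'batch processed');
}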
PII & Secret Scrubbing
scrub_patterns:
# Always redact
- field_patterns: ["password", "secret", "token", "api_key", "authorization"]
action: replace_with_redacted
# Hash for correlation without exposure
- field_patterns: ["email", "phone", "ssn", "national_id"]
action: sha256_hash
# Mask partially
- field_patterns: ["credit_card", "card_number"]
action: mask_last_4 # "****-****-****-1234"
# IP anonymization
- field_patterns: ["client_ip", "ip_address"]
action: zero_last_octet # 203.0.113.0
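A TypeScript sketch of a scrubber that applies these actions at the logger boundary; the rule set mirrors the config above, and the scrubFields() helper is an illustration, not a specific library API:
import { createHash } from 'node:crypto';

type ScrubAction = 'redact' | 'hash' | 'mask_last_4' | 'zero_last_octet';

const RULES: Array<{ pattern: RegExp; action: ScrubAction }> = [
  { pattern: /password|secret|token|api_key|authorization/i, action: 'redact' },
  { pattern: /email|phone|ssn|national_id/i, action: 'hash' },
  { pattern: /credit_card|card_number/i, action: 'mask_last_4' },
  { pattern: /client_ip|ip_address/i, action: 'zero_last_octet' },
];

const ACTIONS: Record<ScrubAction, (value: string) => string> = {
  redact: () => '[REDACTED]',
  hash: (value) => createHash('sha256').update(value).digest('hex'),
  mask_last_4: (value) => `****-****-****-${value.slice(-4)}`,
  zero_last_octet: (value) => value.replace(/\.\d+$/, '.0'),
};

// Walk a flat log object and scrub any field whose name matches a rule.
export function scrubFields(fields: Record<string, unknown>): Record<string, unknown> {
  const out: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(fields)) {
    const rule = RULES.find((r) => r.pattern.test(key));
    out[key] = rule && typeof value === 'string' ? ACTIONS[rule.action](value) : value;
  }
  return out;
}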
Logger Setup (By Language)
Node.js (Pino):
import pino from 'pino';
import crypto from 'node:crypto';
import { AsyncLocalStorage } from 'node:async_hooks';

const als = new AsyncLocalStorage<Record<string, string>>();

const logger = pino({
  level: process.env.LOG_LEVEL || 'info',
  formatters: {
    level: (label) => ({ level: label }),
  },
  mixin: () => als.getStore() ?? {},
  redact: ['req.headers.authorization', '*.password', '*.token'],
  timestamp: pino.stdTimeFunctions.isoTime,
});

// Middleware: inject context
app.use((req, res, next) => {
  const ctx = {
    trace_id: req.headers['x-trace-id'] || crypto.randomUUID(),
    request_id: crypto.randomUUID(),
    service: 'payment-api',
    version: process.env.APP_VERSION,
  };
  als.run(ctx, () => next());
});
Python (structlog):
import structlog

structlog.configure(
    processors=[
        structlog.contextvars.merge_contextvars,
        structlog.processors.add_log_level,
        structlog.processors.TimeStamper(fmt="iso", utc=True),
        structlog.processors.JSONRenderer(),
    ],
)

log = structlog.get_logger()

# Bind context per-request:
structlog.contextvars.bind_contextvars(trace_id=trace_id, user_id=user_id)
Go (zerolog):
log := zerolog.New(os.Stdout).With().
    Timestamp().
    Str("service", "payment-api").
    Str("version", version).
    Logger()

// Per-request:
reqLog := log.With().Str("trace_id", traceID).Logger()
Log Storage Decision
| Volume | Solution | Retention | Cost |
|---|---|---|---|
| <10 GB/day | Loki + Grafana | 30 days hot, 90 days cold | Low |
| 10-100 GB/day | Elasticsearch / OpenSearch | 14 days hot, 90 days S3 | Medium |
| 100+ GB/day | ClickHouse or Datadog | 7 days hot, 30 days archive | High |
| Budget-constrained | Loki + S3 backend | 90 days all cold | Very low |
10 Logging Anti-Patterns
| # | Anti-Pattern | Fix |
|---|---|---|
| 1 | log.error(err) with no context | Always include: what operation, what input, what state |
| 2 | Logging request/response bodies | Log only in DEBUG; redact sensitive fields |
| 3 | String concatenation in log messages | Use structured fields: log.info("processed", { order_id, amount }) |
| 4 | Catch-and-log-and-rethrow | Log at the boundary where you handle it, not every layer |
| 5 | Different log formats per service | Standardize schema across all services |
| 6 | No log rotation / retention policy | Set max size + TTL; archive to cold storage |
| 7 | Logging inside hot paths | Aggregate: log summary every N items or every interval |
| 8 | Missing correlation IDs | Propagate trace_id from first entry point through all services |
| 9 | Boolean log levels (verbose: true) | Use standard levels with configurable minimum |
| 10 | Logging PII in plain text | Implement scrubbing at the logger level |
Phase 2: Metrics Collection
The RED Method (Request-Driven Services)
For every service endpoint, track:
| Metric | What | Prometheus Example |
|---|---|---|
| Rate | Requests per second | http_requests_total{method, path, status} |
| Errors | Failed requests per second | http_requests_total{status=~"5.."} / total |
| Duration | Latency distribution | http_request_duration_seconds{method, path} (histogram) |
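A sketch of RED instrumentation for an Express service, assuming the prom-client library; Rate and Errors both derive from the request counter, and Duration comes from the histogram:
import express from 'express';
import { Counter, Histogram, register } from 'prom-client';

const app = express();

const requestsTotal = new Counter({
  name: 'http_requests_total',
  help: 'Total HTTP requests',
  labelNames: ['method', 'path', 'status'],
});

const requestDuration = new Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request latency in seconds',
  labelNames: ['method', 'path', 'status'],
  buckets: [0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5],
});

// Record one counter increment and one duration observation per request.
app.use((req, res, next) => {
  const end = requestDuration.startTimer();
  res.on('finish', () => {
    const labels = {
      method: req.method,
      path: req.route?.path ?? req.path, // route pattern when matched, to limit cardinality
      status: String(res.statusCode),
    };
    requestsTotal.inc(labels);
    end(labels);
  });
  next();
});

// Expose metrics for Prometheus to scrape.
app.get('/metrics', async (_req, res) => {
  res.set('Content-Type', register.contentType);
  res.send(await register.metrics());
});
The error rate is then sum(rate(http_requests_total{status=~"5.."}[5m])) divided by the total request rate, the same expression used in the SLO and alert examples later.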
The USE Method (Infrastructure Resources)
For every resource (CPU, memory, disk, network):
| Metric | What | Example |
|---|---|---|
| Utilization | % resource busy | CPU usage 78% |
| Saturation | Queue depth / backpressure | 12 requests queued |
| Errors | Resource errors | 3 disk I/O errors |
Golden Signals (Google SRE)
| Signal | Meaning | Source |
|---|---|---|
| Latency | Time to serve requests | RED Duration |
| Traffic | Demand on the system | RED Rate |
| Errors | Rate of failed requests | RED Errors |
| Saturation | How "full" the service is | USE Saturation |
Metric Types & When to Use Each
| Type | Use Case | Example |
|---|---|---|
| Counter | Things that only go up | Total requests, errors, bytes sent |
| Gauge | Current value that goes up/down | Active connections, queue depth, temperature |
| Histogram | Distribution of values | Request latency, response size |
| Summary | Pre-calculated percentiles | Client-side latency (when you need exact percentiles) |
Rule: Use histograms over summaries in most cases — they're aggregatable across instances.
Naming Conventions
# Pattern: <namespace>_<subsystem>_<name>_<unit>
http_server_request_duration_seconds
http_server_requests_total
db_pool_connections_active
queue_messages_pending
cache_hit_ratio
# Rules:
# 1. Use snake_case
# 2. Include unit suffix (_seconds, _bytes, _total)
# 3. _total suffix for counters
# 4. Don't include label names in metric name
# 5. Use base units (seconds not milliseconds, bytes not kilobytes)
Label Design Rules
| Rule | Why | Example |
|---|---|---|
| Keep cardinality <100 per label | High cardinality kills performance | status="200" not status="200 OK" |
| No user IDs as labels | Unbounded cardinality | Use log correlation instead |
| No request paths with IDs | /api/users/123 creates millions of series | Normalize: /api/users/:id |
| Max 5-7 labels per metric | Each combo = a time series | {method, path, status, service} |
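A small sketch of the path-normalization rule above: collapse variable segments before using the path as a label value. The regexes are illustrative and should match your own URL scheme:
// Collapse high-cardinality path segments into stable placeholders
// so /api/users/123 and /api/users/456 share one time series.
const ID_PATTERNS: Array<[RegExp, string]> = [
  [/\/\d+(?=\/|$)/g, '/:id'], // numeric IDs
  [/\/[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}(?=\/|$)/gi, '/:uuid'], // UUIDs
];

export function normalizePath(path: string): string {
  return ID_PATTERNS.reduce((p, [pattern, replacement]) => p.replace(pattern, replacement), path);
}

// normalizePath('/api/users/123/orders') => '/api/users/:id/orders'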
Instrumentation Checklist
application_metrics:
# HTTP layer
- http_request_duration_seconds: histogram {method, path, status}
- http_request_size_bytes: histogram {method, path}
- http_response_size_bytes: histogram {method, path}
- http_requests_in_flight: gauge
# Business logic
- orders_processed_total: counter {status, payment_method}
- order_value_dollars: histogram {payment_method}
- user_signups_total: counter {source}
# Dependencies
- db_query_duration_seconds: histogram {query_type, table}
- db_connections_active: gauge {pool}
- db_connections_idle: gauge {pool}
- cache_requests_total: counter {result: hit|miss}
- external_api_duration_seconds: histogram {service, endpoint}
- external_api_errors_total: counter {service, error_type}
# Queue / async
- queue_messages_published_total: counter {queue}
- queue_messages_consumed_total: counter {queue, status}
- queue_processing_duration_seconds: histogram {queue}
- queue_depth: gauge {queue}
- queue_consumer_lag: gauge {queue, consumer_group}
infrastructure_metrics:
# Node exporter / cAdvisor provides these automatically
- cpu_usage_percent: gauge {instance}
- memory_usage_bytes: gauge {instance}
- disk_usage_bytes: gauge {instance, mount}
- disk_io_seconds: counter {instance, device}
- network_bytes: counter {instance, direction}
- container_cpu_usage: gauge {pod, container}
- container_memory_usage: gauge {pod, container}
Stack Recommendations
| Component | Options | Recommendation |
|---|---|---|
| Collection | Prometheus, OTEL Collector, Datadog Agent | Prometheus (free) or OTEL Collector (vendor-neutral) |
| Storage | Prometheus, Thanos, Mimir, VictoriaMetrics | VictoriaMetrics (best cost/perf) or Mimir (Grafana ecosystem) |
| Visualization | Grafana, Datadog, New Relic | Grafana (free, extensible) |
| Alerting | Alertmanager, Grafana Alerting, PagerDuty | Alertmanager + PagerDuty routing |
Phase 3: Distributed Tracing
Trace Architecture
Client Request
  → API Gateway (root span)
      → Auth Service (child span)
      → Order Service (child span)
          → Database Query (child span)
          → Payment Service (child span)
              → Stripe API (child span)
      → Notification Service (child span)
          → Email Provider (child span)
OpenTelemetry Setup
Auto-instrumentation (Node.js):
// tracing.ts — import BEFORE anything else
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';

const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({
    url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT || 'http://localhost:4318/v1/traces',
  }),
  instrumentations: [
    getNodeAutoInstrumentations({
      '@opentelemetry/instrumentation-http': { ignoreIncomingPaths: ['/health', '/ready'] },
      '@opentelemetry/instrumentation-express': { enabled: true },
    }),
  ],
  serviceName: process.env.OTEL_SERVICE_NAME || 'payment-api',
});

sdk.start();
Custom spans for business logic:
import { trace, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('payment-service');

async function processPayment(order: Order) {
  return tracer.startActiveSpan('process-payment', async (span) => {
    span.setAttributes({
      'order.id': order.id,
      'order.amount_cents': order.amountCents,
      'payment.method': order.paymentMethod,
    });
    try {
      const result = await chargeCard(order);
      span.setAttributes({ 'payment.status': result.status });
      return result;
    } catch (err) {
      span.setStatus({ code: SpanStatusCode.ERROR, message: err.message });
      span.recordException(err);
      throw err;
    } finally {
      span.end();
    }
  });
}
Sampling Strategies
| Strategy | When | Config |
|---|---|---|
| Always On | Dev/staging, low traffic (<100 rps) | ratio: 1.0 |
| Probabilistic | Moderate traffic (100-1000 rps) | ratio: 0.1 (10%) |
| Rate-limited | High traffic (>1000 rps) | max_traces_per_second: 100 |
| Tail-based | Want all errors + slow requests | Collector-side: keep if error OR duration > p99 |
| Parent-based | Respect upstream decisions | If parent sampled, child sampled |
Recommendation: Start with parent-based + probabilistic (10%). Add tail-based at the collector to capture all errors.
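In OpenTelemetry terms, that recommendation looks roughly like the following sketch for the Node SDK; tail-based sampling is configured in the collector (its tail-sampling processor), not in application code:
import { NodeSDK } from '@opentelemetry/sdk-node';
import { ParentBasedSampler, TraceIdRatioBasedSampler } from '@opentelemetry/sdk-trace-base';

// Respect the upstream sampling decision; sample 10% of new root traces.
const sdk = new NodeSDK({
  sampler: new ParentBasedSampler({
    root: new TraceIdRatioBasedSampler(0.1),
  }),
  // ...exporter and instrumentation config as in the setup above
});

sdk.start();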
Context Propagation
| Header | Standard | Format |
|---|---|---|
| traceparent | W3C Trace Context | 00-{trace_id}-{span_id}-{flags} |
| tracestate | W3C Trace Context | Vendor-specific key-value pairs |
| b3 | Zipkin B3 | {trace_id}-{span_id}-{sampled} |
Rule: Use W3C Trace Context (traceparent) as primary. Support B3 for legacy Zipkin systems.
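Wiring both header formats into the Node SDK looks roughly like this sketch; the B3 fallback assumes the @opentelemetry/propagator-b3 package is installed:
import { NodeSDK } from '@opentelemetry/sdk-node';
import { CompositePropagator, W3CTraceContextPropagator } from '@opentelemetry/core';
import { B3Propagator } from '@opentelemetry/propagator-b3';

const sdk = new NodeSDK({
  // traceparent/tracestate first; fall back to b3 for legacy Zipkin callers.
  textMapPropagator: new CompositePropagator({
    propagators: [new W3CTraceContextPropagator(), new B3Propagator()],
  }),
});

sdk.start();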
Trace Storage
| Volume | Solution | Retention |
|---|---|---|
| <50 GB/day | Jaeger + Elasticsearch | 7 days |
| 50-500 GB/day | Tempo + S3 | 14 days |
| 500+ GB/day | Tempo + S3 with aggressive sampling | 7 days |
| Budget-constrained | Jaeger + Badger (local disk) | 3 days |
Phase 4: SLOs, SLIs & Error Budgets
SLI Selection by Service Type
| Service Type | Primary SLI | Secondary SLI | Measurement |
|---|---|---|---|
| API / Web | Availability + Latency | Error rate | Server-side + synthetic |
| Data pipeline | Freshness + Correctness | Throughput | Pipeline timestamps + checksums |
| Storage | Durability + Availability | Latency | Checksums + uptime monitoring |
| Streaming | Throughput + Latency | Message loss rate | Consumer lag + e2e latency |
| Batch jobs | Success rate + Freshness | Duration | Job scheduler metrics |
SLO Definition Template
slo:
name: "Payment API Availability"
service: payment-api
owner: payments-team
sli:
type: availability
definition: "Proportion of non-5xx responses"
measurement: |
sum(rate(http_requests_total{service="payment-api",status!~"5.."}[5m]))
/
sum(rate(http_requests_total{service="payment-api"}[5m]))
target: 99.95% # 21.9 min downtime/month
window: rolling_30d
error_budget:
total_minutes: 21.9 # per 30 days
burn_rate_alerts:
- severity: critical
burn_rate: 14.4x # Budget consumed in ~2 days
short_window: 5m
long_window: 1h
- severity: warning
burn_rate: 6x # Budget consumed in 5 days
short_window: 30m
long_window: 6h
- severity: ticket
burn_rate: 1x # Budget consumed in 30 days
short_window: 6h
long_window: 3d
consequences:
budget_remaining_above_50pct: "Normal development velocity"
budget_remaining_20_to_50pct: "Prioritize reliability work"
budget_remaining_below_20pct: "Feature freeze; reliability only"
budget_exhausted: "All hands on reliability until budget recovers"
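The budget and burn-rate figures above are plain arithmetic; a small sketch of the math (a strict 30-day window gives 21.6 minutes, while the 21.9 figure corresponds to an average calendar month of about 30.44 days):
// Error budget for an availability SLO over a rolling window.
function errorBudgetMinutes(sloTarget: number, windowDays: number): number {
  return (1 - sloTarget) * windowDays * 24 * 60;
}

// A burn rate of N consumes the budget N times faster than an exactly-on-target
// service, so the whole budget is gone after window / N.
function hoursToExhaustion(burnRate: number, windowDays: number): number {
  return (windowDays * 24) / burnRate;
}

console.log(errorBudgetMinutes(0.9995, 30).toFixed(1)); // "21.6" minutes
console.log(hoursToExhaustion(14.4, 30).toFixed(0));    // "50" hours (~2 days), matching the critical alert above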
Common SLO Targets
| Service Tier | Availability | p50 Latency | p99 Latency | Monthly Downtime |
|---|---|---|---|---|
| Tier 0 (payments, auth) | 99.99% | <100ms | <500ms | 4.3 min |
| Tier 1 (core API) | 99.95% | <200ms | <1s | 21.9 min |
| Tier 2 (non-critical) | 99.9% | <500ms | <2s | 43.8 min |
| Tier 3 (internal tools) | 99.5% | <1s | <5s | 3.6 hours |
| Batch / pipeline | 99% (success rate) | N/A | N/A | N/A |
Error Budget Tracking
# Weekly error budget review template
error_budget_review:
week: "2026-W08"
service: payment-api
slo_target: 99.95%
budget:
total_minutes_this_period: 21.9
consumed_minutes: 8.2
remaining_minutes: 13.7
remaining_percent: 62.6%
incidents_consuming_budget:
- date: "2026-02-18"
duration_minutes: 5.1
cause: "Database connection pool exhaustion"
preventable: true
action: "Increase pool size + add saturation alert"
- date: "2026-02-20"
duration_minutes: 3.1
cause: "Upstream payment provider timeout"
preventable: false
action: "Add circuit breaker with fallback"
velocity_decision: "Normal — 62.6% budget remaining"
reliability_work_this_week:
- "Add connection pool saturation alert"
- "Implement circuit breaker for payment provider"
Phase 5: Alert Design
Alert Quality Principles
- Every alert must be actionable — if no one needs to act, it's not an alert
- Every alert needs a runbook — linked directly in the alert annotation
- Symptom-based over cause-based — alert on "users can't checkout" not "CPU high"
- Multi-window burn rate — not static thresholds (see SLO alerts above)
- Alert on absence, not just presence — "no orders in 15 min" catches silent failures
Alert Severity Levels
| Severity | Response Time | Channel | Who | Example |
|---|---|---|---|---|
| P0 — Critical | <5 min | Page (PagerDuty/Opsgenie) | On-call engineer | Payment system down |
| P1 — High | <30 min | Page during business hours, Slack 24/7 | On-call | Error rate >5% for 10 min |
| P2 — Medium | <4 hours | Slack channel | Team | p99 latency degraded 2x |
| P3 — Low | Next business day | Ticket auto-created | Team backlog | Disk usage >80% |
| Info | N/A | Dashboard only | No one | Deploy completed |
Alerting Anti-Patterns
| Anti-Pattern | Problem | Fix |
|---|---|---|
| Static CPU/memory thresholds | Noisy, not user-impacting | Use SLO-based burn rate alerts |
| Alert per instance | 50 instances = 50 alerts for same issue | Aggregate: alert on service-level error rate |
| No deduplication | Same alert fires 100 times | Group by service + alert name; set repeat interval |
| Missing runbook | Engineer gets paged, doesn't know what to do | Every alert links to a runbook |
| Threshold too sensitive | Fires on brief spikes | Use for: 5m to require sustained condition |
| Too many P0s | Alert fatigue → ignoring real incidents | Audit monthly; demote or remove noisy alerts |
Alert Template (Prometheus Alertmanager)
groups:
- name: payment-api-slo
rules:
- alert: PaymentAPIHighErrorRate
expr: |
(
sum(rate(http_requests_total{service="payment-api",status=~"5.."}[5m]))
/
sum(rate(http_requests_total{service="payment-api"}[5m]))
) > 0.01
for: 5m
labels:
severity: critical
service: payment-api
team: payments
annotations:
summary: "Payment API error rate {{ $value | humanizePercentage }} (>1%)"
description: "5xx error rate has exceeded 1% for 5 minutes"
runbook: "https://wiki.internal/runbooks/payment-api-errors"
dashboard: "https://grafana.internal/d/payment-api"
- alert: PaymentAPINoTraffic
expr: |
sum(rate(http_requests_total{service="payment-api"}[15m])) == 0
for: 5m
labels:
severity: critical
service: payment-api
annotations:
summary: "Payment API receiving zero traffic for 5 minutes"
runbook: "https://wiki.internal/runbooks/payment-api-no-traffic"
- alert: PaymentAPILatencyHigh
expr: |
histogram_quantile(0.99,
sum(rate(http_request_duration_seconds_bucket{service="payment-api"}[5m])) by (le)
) > 2
for: 10m
labels:
severity: warning
annotations:
summary: "Payment API p99 latency {{ $value }}s (>2s for 10min)"
runbook: "https://wiki.internal/runbooks/payment-api-latency"
Runbook Template
# Runbook: PaymentAPIHighErrorRate
## What This Alert Means
The payment API is returning >1% 5xx errors over a 5-minute window.
Users are likely failing to complete checkouts.
## Impact
- Users cannot process payments
- Revenue loss: ~$X per minute (based on average traffic)
- SLO: Payment API availability (target: 99.95%)
## Immediate Actions
1. Check the error dashboard: [link]
2. Check recent deploys: `kubectl rollout history deployment/payment-api`
3. Check upstream dependencies:
- Database: [dashboard link]
- Stripe API: [status page]
- Redis cache: [dashboard link]
4. Check application logs:
kubectl logs -l app=payment-api --since=10m | jq 'select(.level=="error")'
## Common Causes & Fixes
| Cause | Diagnosis | Fix |
|-------|-----------|-----|
| Bad deploy | Errors started at deploy time | `kubectl rollout undo deployment/payment-api` |
| DB connection exhaustion | `db_connections_active` at max | Restart pods (rolling) + increase pool size |
| Stripe outage | Stripe status page red | Enable fallback payment processor |
| Memory leak | Memory climbing, OOMKilled events | Rolling restart + investigate |
## Escalation
- If unresolved after 15 min: page payment team lead
- If revenue impact >$10K: page VP Engineering
- If Stripe outage: communicate to support team for customer messaging
## Resolution
- Confirm error rate <0.1% for 10 min
- Post in #incidents: root cause + duration + impact
- Schedule post-mortem if downtime >5 min
Phase 6: Dashboard Architecture
Dashboard Hierarchy
L1: Executive / Business Dashboard (non-technical stakeholders)
↓
L2: Service Overview Dashboard (on-call, quick triage)
↓
L3: Service Deep-Dive Dashboard (debugging specific service)
↓
L4: Infrastructure Dashboard (resource-level details)
L1: Business Dashboard
panels:
- title: "Revenue per Minute"
type: stat
query: "sum(rate(orders_total{status='completed'}[5m])) * avg(order_value_dollars)"
- title: "Active Users (5min)"
type: stat
query: "count(count by (user_id) (http_requests_total{...}[5m]))"
- title: "Checkout Success Rate"
type: gauge
query: "sum(rate(checkout_total{status='success'}[1h])) / sum(rate(checkout_total[1h]))"
thresholds: [95, 98, 99.5]
- title: "Error Budget Remaining"
type: gauge
query: "1 - (error_budget_consumed / error_budget_total)"
L2: Service Overview Dashboard
Every service gets one of these with identical layout:
row_1_traffic:
- "Request Rate (rps)" — timeseries, by status code
- "Error Rate (%)" — timeseries, threshold line at SLO
- "Active Requests" — gauge
row_2_latency:
- "Latency Distribution" — heatmap
- "p50 / p95 / p99" — timeseries, threshold lines
- "Latency by Endpoint" — table, sorted by p99
row_3_dependencies:
- "Downstream Latency" — timeseries per dependency
- "Downstream Error Rate" — timeseries per dependency
- "Database Query Duration" — timeseries by query type
row_4_resources:
- "CPU Usage" — timeseries per pod
- "Memory Usage" — timeseries per pod
- "Pod Restarts" — stat
row_5_business:
- "Business Metric 1" — service-specific
- "Business Metric 2" — service-specific
Dashboard Rules
- Time range default: last 1 hour — most debugging happens in recent time
- Variable selectors at top: environment, service, instance
- Consistent color coding: green=good, yellow=degraded, red=bad across all dashboards
- Link alerts to dashboards — every alert annotation includes dashboard URL
- No more than 15 panels per dashboard — split into L3 if needed
- Include "as of" timestamp — so screenshots in incidents are unambiguous
- Dashboard as code — store Grafana JSON in git, provision via API
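For the "dashboard as code" rule, a minimal provisioning sketch against Grafana's HTTP API; the hostname, token variable, and file path are placeholders, and many teams use Terraform or Grafana's file-based provisioning instead:
import { readFile } from 'node:fs/promises';

// Push a dashboard JSON model (exported from Grafana, stored in git) to the API.
async function provisionDashboard(path: string): Promise<void> {
  const dashboard = JSON.parse(await readFile(path, 'utf8'));
  const res = await fetch('https://grafana.internal/api/dashboards/db', {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      Authorization: `Bearer ${process.env.GRAFANA_TOKEN}`, // service account token
    },
    body: JSON.stringify({ dashboard, overwrite: true, message: 'provisioned from git' }),
  });
  if (!res.ok) throw new Error(`Grafana API returned ${res.status}`);
}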
Phase 7: Incident Response
Incident Severity Classification
| Severity | Criteria | Response | Communication |
|---|
| SEV-1 | Service down, data loss risk, security breach | All hands, war room | Status page update every 15 min |
| SEV-2 | Degraded service, SLO at risk, partial outage | On-call + backup | Status page update every 30 min |
| SEV-3 | Minor degradation, workaround exists | On-call during hours | Internal Slack update |
| SEV-4 | Cosmetic, low impact | Next sprint | None |
Incident Roles
| Role | Responsibility | Who |
|---|---|---|
| Incident Commander (IC) | Owns the incident. Coordinates. Makes decisions. | On-call lead |
| Technical Lead | Diagnoses and fixes. Communicates technical status to IC. | Senior engineer |
| Communications Lead | Updates status page, Slack, stakeholders. | Product/support |
| Scribe | Documents timeline, actions, decisions in real-time. | Anyone available |
Incident Response Workflow
1. DETECT
- Alert fires → on-call paged
- Customer report → support escalates
- Internal discovery → engineer reports
2. TRIAGE (first 5 minutes)
- Confirm the issue is real (not false alert)
- Classify severity (SEV-1 through SEV-4)
- Open incident channel: #inc-YYYY-MM-DD-short-description
- Assign roles (IC, Tech Lead, Comms)
3. MITIGATE (next 5-30 minutes)
- Goal: STOP THE BLEEDING, not find root cause
- Options (try in order):
a. Rollback last deploy
b. Scale up / restart pods
c. Toggle feature flag off
d. Redirect traffic / enable fallback
e. Manual data fix
- Document every action with timestamp
4. STABILIZE
- Confirm mitigation is working (metrics back to normal)
- Monitor for 15-30 min for recurrence
- Update status page: "Monitoring fix"
5. RESOLVE
- Confirm all metrics healthy for 30+ min
- Update status page: "Resolved"
- Schedule post-mortem (within 48 hours for SEV-1/2)
- Send internal summary to stakeholders
Incident Channel Template
📋 Incident: Payment API 5xx Errors
🔴 Severity: SEV-2
🕐 Started: 2026-02-22 14:23 UTC
👤 IC: @alice
🔧 Tech Lead: @bob
📢 Comms: @charlie
Status: MITIGATING
Impact: ~5% of checkout requests failing
Customer-facing: Yes
Timeline:
14:23 — Alert fired: PaymentAPIHighErrorRate
14:25 — IC assigned: @alice, confirmed real via dashboard
14:28 — Tech Lead: error logs show connection pool exhaustion post-deploy
14:31 — Rolled back deployment v2.3.1 → v2.3.0
14:35 — Error rate dropping, monitoring
14:50 — Error rate <0.1%, marking resolved
Phase 8: Post-Mortem Framework
Blameless Post-Mortem Template
post_mortem:
title: "Payment API Connection Pool Exhaustion"
date: "2026-02-22"
severity: SEV-2
duration: 27 minutes (14:23 — 14:50 UTC)
authors: ["@alice", "@bob"]
reviewers: ["@engineering-leads"]
status: action_items_in_progress
summary: |
A deployment at 14:15 introduced a connection leak in the payment API.
Connection pool was exhausted by 14:23, causing 5xx errors for ~5% of
checkout requests. Rolled back at 14:31; recovered by 14:50.
impact:
user_impact: "~340 users saw checkout failures over 27 minutes"
revenue_impact: "$2,100 estimated (based on average order value × failed checkouts)"
slo_impact: "Consumed 5.1 min of 21.9 min monthly error budget (23%)"
data_impact: "No data loss. 12 orders failed; users could retry successfully."
timeline:
- time: "14:15"
event: "Deploy v2.3.1 rolled out (3/3 pods updated)"
- time: "14:23"
event: "PaymentAPIHighErrorRate alert fired"
- time: "14:25"
event: "IC assigned, confirmed via dashboard"
- time: "14:28"
event: "Root cause identified: new ORM query not releasing connections"
- time: "14:31"
event: "Rollback initiated: v2.3.1 → v2.3.0"
- time: "14:35"
event: "Error rate declining"
- time: "14:50"
event: "Resolved: error rate <0.1% sustained"
root_cause: |
The v2.3.1 deploy introduced a new database query in the order validation
path. The query used a raw connection instead of the pool's managed client,
so connections were acquired but never released. Under load, the pool
exhausted within 8 minutes.
contributing_factors:
- "No integration test for connection pool behavior under load"
- "Connection pool saturation metric existed but had no alert"
- "Code review didn't catch raw connection usage"
what_went_well:
- "Alert fired within 8 minutes of deploy"
- "IC assigned in 2 minutes"
- "Root cause identified in 3 minutes (clear in logs)"
- "Rollback executed cleanly"
what_went_wrong:
- "8-minute detection gap after deploy"
- "No canary deployment to catch before full rollout"
- "Connection pool saturation had no alert"
action_items:
- action: "Add connection pool saturation alert (>80% for 2 min)"
owner: "@bob"
priority: P1
due: "2026-02-25"
status: in_progress
ticket: "ENG-1234"
- action: "Enable canary deployments for payment-api"
owner: "@alice"
priority: P1
due: "2026-03-01"
ticket: "ENG-1235"
- action: "Add linting rule: no raw DB connections in application code"
owner: "@charlie"
priority: P2
due: "2026-03-07"
ticket: "ENG-1236"
- action: "Load test payment-api connection pool in staging"
owner: "@bob"
priority: P2
due: "2026-03-07"
ticket: "ENG-1237"
lessons_learned:
- "Resource saturation metrics need alerts, not just dashboards"
- "Canary deployments are mandatory for Tier 0 services"
- "ORM abstractions don't guarantee connection safety — review raw queries"
Post-Mortem Meeting Agenda (60 minutes)
1. (5 min) Context setting — IC reads the summary
2. (15 min) Timeline walkthrough — what happened, when, by whom
3. (15 min) Root cause deep-dive — 5 Whys exercise
4. (5 min) What went well — celebrate good response
5. (15 min) Action items — assign owners, priorities, due dates
6. (5 min) Wrap-up — review date for action item check-in
5 Whys Exercise
Problem: 5xx errors in payment API
Why 1: Database connections were exhausted
Why 2: A new query acquired connections without releasing them
Why 3: The query used a raw connection instead of the pool manager
Why 4: The ORM's raw query API doesn't auto-release (by design)
Why 5: We don't have a linting rule or code review checklist item for this
Root cause: Missing guard against raw connection usage in application code
Systemic fix: Linting rule + connection pool saturation alerting
Phase 9: On-Call Operations
On-Call Structure
on_call:
rotation: weekly
handoff_day: Monday 10:00 UTC
primary:
response_time: 5 minutes (SEV-1/2), 30 minutes (SEV-3)
escalation_after: 15 minutes no-ack
secondary:
response_time: 15 minutes (SEV-1), 1 hour (SEV-2/3)
escalation_after: 30 minutes no-ack
manager_escalation:
trigger: SEV-1 unresolved after 30 minutes
handoff_checklist:
- Review open incidents and active alerts
- Check error budget status for all services
- Read post-mortems from previous week
- Verify PagerDuty schedule and contact info
- Test alert routing (send test page)
On-Call Health Metrics
| Metric | Healthy | Needs Attention | Unhealthy |
|---|---|---|---|
| Pages per week | <5 | 5-15 | >15 |
| After-hours pages per week | <2 | 2-5 | >5 |
| False positive rate | <10% | 10-30% | >30% |
| Mean time to acknowledge | <5 min | 5-15 min | >15 min |
| Mean time to resolve | <30 min | 30-120 min | >120 min |
| Toil ratio (manual vs automated) | <30% | 30-60% | >60% |
Weekly On-Call Review Template
on_call_review:
week: "2026-W08"
engineer: "@bob"
incidents:
total: 7
sev_1: 0
sev_2: 1
sev_3: 4
false_positives: 2
after_hours: 3
time_spent:
incident_response: "4.5 hours"
toil_automation: "2 hours"
runbook_updates: "1 hour"
improvements_made:
- "Silenced noisy disk alert on dev servers"
- "Added auto-remediation for pod restart threshold"
improvements_needed:
- "Cache expiry alert fires every Tuesday at 03:00 — needs investigation"
- "Payment retry logic needs circuit breaker (caused 3 alerts)"
handoff_notes: |
Watch payment-api p99 latency — it's been creeping up since Wednesday.
Stripe changed their sandbox endpoints; staging may throw errors.
Phase 10: Chaos Engineering & Reliability Testing
Chaos Principles
- Start with a hypothesis: "If X fails, the system should Y"
- Run in production (start small — one instance, one AZ)
- Minimize blast radius with automatic rollback
- Build confidence incrementally: staging → canary → production
Chaos Experiment Template
chaos_experiment:
name: "Payment DB failover"
hypothesis: "If the primary database becomes unavailable, traffic should
failover to the replica within 30 seconds with <1% error rate spike"
steady_state:
- metric: "checkout_success_rate"
expected: ">99.5%"
- metric: "db_query_duration_p99"
expected: "<200ms"
injection:
type: "network_partition"
target: "payment-db-primary"
duration: "5 minutes"
blast_radius: "single AZ"
abort_conditions:
- "checkout_success_rate < 95% for > 60 seconds"
- "revenue_per_minute drops > 50%"
- "any SEV-1 incident declared"
results:
failover_time: "22 seconds"
error_spike: "0.3% for 25 seconds"
hypothesis_confirmed: true
follow_up_actions:
- "Document failover behavior in runbook"
- "Add failover time as SLI (target: <30s)"
Chaos Engineering Maturity Levels
| Level | What You Test | Tools |
|---|---|---|
| 1: Manual | Kill a pod, see what happens | kubectl delete pod |
| 2: Automated | Scheduled pod kills, network delays | Chaos Monkey, Litmus |
| 3: Game Days | Multi-failure scenarios with team exercise | Custom scripts + coordination |
| 4: Continuous | Automated chaos in production with auto-rollback | Gremlin, Chaos Mesh |
Phase 11: Observability Cost Optimization
Cost Drivers (Ranked)
| # | Driver | Typical % of Bill | Optimization |
|---|---|---|---|
| 1 | Log volume | 40-60% | Reduce verbosity, drop DEBUG, sample repetitive |
| 2 | Metric cardinality | 15-25% | Drop unused metrics, limit labels |
| 3 | Trace volume | 10-20% | Sampling, tail-based sampling |
| 4 | Retention | 10-15% | Tiered storage (hot → warm → cold) |
| 5 | Query cost | 5-10% | Optimize dashboard queries, set max scan limits |
Cost Reduction Checklist
cost_optimization:
logs:
- action: "Drop DEBUG/TRACE in production"
savings: "30-50% of log volume"
- action: "Sample health check logs (1:100)"
savings: "5-15% of log volume"
- action: "Deduplicate identical error bursts"
savings: "10-20% during incidents"
- action: "Move logs older than 7 days to S3/cold storage"
savings: "60-80% of storage cost"
- action: "Drop request/response body logging"
savings: "20-40% of log volume"
metrics:
- action: "Audit unused metrics (no dashboard, no alert)"
savings: "10-30% of series"
- action: "Reduce histogram bucket count (default 11 → 8)"
savings: "~27% of histogram series"
- action: "Remove high-cardinality labels"
savings: "Variable — can be massive"
- action: "Increase scrape interval for non-critical metrics (15s → 60s)"
savings: "75% of data points for those metrics"
traces:
- action: "Implement tail-based sampling"
savings: "80-95% of trace volume"
- action: "Drop internal health check traces"
savings: "5-20% of trace volume"
- action: "Reduce span attribute size (truncate long strings)"
savings: "10-30% of trace storage"
general:
- action: "Review and right-size retention policies quarterly"
- action: "Set query timeouts and result limits on dashboards"
- action: "Use recording rules for expensive queries"
Monthly Cost Review Template
observability_cost_review:
month: "February 2026"
total_cost: "$X,XXX"
breakdown:
logs: { volume: "X TB", cost: "$X", pct: "X%" }
metrics: { series: "X million", cost: "$X", pct: "X%" }
traces: { volume: "X TB", cost: "$X", pct: "X%" }
infrastructure: { instances: X, cost: "$X", pct: "X%" }
cost_per:
request: "$0.000X"
service: "$X average"
engineer: "$X per engineer"
optimizations_applied: []
optimizations_planned: []
budget_status: "on_track | over_budget | under_budget"
Phase 12: Advanced Patterns
Correlation: Connecting the Three Pillars
Every log line includes: trace_id, span_id
Every trace span includes: service, operation
Every metric includes: service label
Correlation paths:
Alert fires (metric) → Click → Dashboard (metric) → Filter by time window
→ Trace search (same service + time) → Find failing trace
→ Logs (filter by trace_id) → See exact error
Support ticket (user report) → Find request_id in logs
→ Extract trace_id → View full trace → Identify slow span
→ Check span's service metrics → Confirm pattern
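A sketch of the log-to-trace half of that correlation: stamp trace_id and span_id from the active OpenTelemetry span onto every Pino log line (this composes with the AsyncLocalStorage mixin from Phase 1):
import pino from 'pino';
import { trace } from '@opentelemetry/api';

const logger = pino({
  mixin() {
    // If a span is active for the current request, stamp its IDs onto the log line.
    const span = trace.getActiveSpan();
    if (!span) return {};
    const { traceId, spanId } = span.spanContext();
    return { trace_id: traceId, span_id: spanId };
  },
});

logger.info({ order_id: 'ord_012' }, 'payment processed'); // carries trace_id/span_id when traced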
Synthetic Monitoring
synthetic_checks:
- name: "Checkout flow"
type: browser
frequency: 5m
locations: [us-east, eu-west, ap-southeast]
steps:
- navigate: "https://app.example.com/products"
- click: "Add to Cart"
- click: "Checkout"
- assert: "Order confirmation page loads in <3s"
alert_on: "2 consecutive failures from same location"
- name: "API health"
type: api
frequency: 1m
endpoints:
- url: "https://api.example.com/health"
expected_status: 200
max_latency_ms: 500
- url: "https://api.example.com/v1/products?limit=1"
expected_status: 200
max_latency_ms: 1000
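A minimal version of the API check as a standalone TypeScript script (Node 18+, where fetch is global); the URL and thresholds mirror the config above, and the scheduler that runs it (cron, CI job, checker service) handles alerting on a non-zero exit:
// Probe an endpoint, assert status and latency, exit non-zero on failure.
async function check(url: string, expectedStatus: number, maxLatencyMs: number): Promise<void> {
  const start = Date.now();
  const res = await fetch(url, { signal: AbortSignal.timeout(maxLatencyMs * 2) });
  const latencyMs = Date.now() - start;
  if (res.status !== expectedStatus || latencyMs > maxLatencyMs) {
    throw new Error(`${url}: status=${res.status} latency=${latencyMs}ms`);
  }
  console.log(JSON.stringify({ check: url, status: res.status, latency_ms: latencyMs, ok: true }));
}

check('https://api.example.com/health', 200, 500).catch((err) => {
  console.error(JSON.stringify({ ok: false, error: String(err) }));
  process.exit(1);
});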
Feature Flag Observability
# Correlate feature flags with metrics
feature_flag_monitoring:
- flag: "new_checkout_flow"
metrics_to_compare:
- "checkout_conversion_rate" # by flag variant
- "checkout_error_rate"
- "checkout_latency_p99"
alerts:
- "If error rate for new variant > 2x control, auto-disable flag"
Observability Maturity Model
| Dimension | Level 1 | Level 2 | Level 3 | Level 4 |
|---|---|---|---|---|
| Logging | Unstructured logs | Structured JSON, centralized | Correlated with traces | Automated log analysis |
| Metrics | Basic infra metrics | RED/USE for services | SLO-based with error budgets | Predictive (anomaly detection) |
| Tracing | No tracing | Key services instrumented | Full distributed tracing | Trace-driven testing |
| Alerting | Static thresholds | Multi-signal alerts | Burn-rate based on SLOs | Auto-remediation |
| Incident Response | Ad hoc | Defined process + roles | Post-mortems with action tracking | Chaos engineering in prod |
| Culture | "Ops team handles it" | Shared ownership (you build it, you run it) | SLO-driven development velocity | Reliability as a feature |
Quality Scoring Rubric (0-100)
| Dimension | Weight | 0 | 5 | 10 |
|---|---|---|---|---|
| Logging quality | 15% | Unstructured, no correlation | Structured JSON, missing fields | Full schema, trace correlation, PII scrubbing |
| Metrics coverage | 15% | No metrics | RED or USE, not both | RED + USE + business metrics + custom |
| Tracing completeness | 10% | No tracing | Key services | Full path, sampling strategy, tail-based |
| SLO maturity | 15% | No reliability targets | Informal targets | SLOs with error budgets, burn-rate alerts, weekly review |
| Alert quality | 15% | Noisy/missing | Actionable, some runbooks | SLO-based, full runbooks, low false positive |
| Incident response | 10% | Ad hoc | Defined process | Full process, roles, post-mortems, chaos engineering |
| Dashboard design | 10% | No dashboards | Basic panels | Hierarchical L1-L4, consistent, linked to alerts |
| Cost efficiency | 10% | Unknown cost | Tracked | Optimized, reviewed monthly, within budget |
90-100: World-class. Teach others.
70-89: Production-ready. Fill specific gaps.
50-69: Functional but fragile.
<50: Significant reliability risk.
10 Observability Commandments
- Structured or it didn't happen — unstructured logs are technical debt
- Correlate everything — trace_id connects logs, traces, and metrics
- Alert on symptoms, not causes — users don't care about CPU, they care about latency
- Every alert gets a runbook — no runbook = no alert
- SLOs drive velocity — error budgets decide when to ship vs stabilize
- Dashboards have hierarchy — executives don't need pod CPU graphs
- Blameless post-mortems always — blame prevents learning
- Cost is a feature — observability that bankrupts you isn't observability
- You build it, you run it — the team that ships code owns its observability
- Practice failure — chaos engineering builds confidence
12 Natural Language Commands
| Command | What It Does |
|---|---|
| "Audit our observability" | Run the /16 health check, score each dimension, prioritize gaps |
| "Design logging for [service]" | Generate structured log schema with context fields for the service |
| "Set up metrics for [service]" | Create RED + USE + business metric instrumentation plan |
| "Create SLOs for [service]" | Define SLIs, targets, error budgets, and burn-rate alert rules |
| "Design alerts for [service]" | Create alert rules with severity, thresholds, and runbook templates |
| "Build dashboard for [service]" | Design L2 service overview dashboard with panel specifications |
| "Write a runbook for [alert]" | Generate structured runbook with diagnosis steps and fixes |
| "Run post-mortem for [incident]" | Generate blameless post-mortem document with timeline and action items |
| "Set up on-call for [team]" | Design rotation, escalation policy, handoff checklist |
| "Plan chaos experiment for [scenario]" | Design experiment with hypothesis, injection, abort conditions |
| "Optimize observability costs" | Audit current spend, identify top savings, create reduction plan |
| "Design tracing for [system]" | Create OpenTelemetry instrumentation plan with sampling strategy |
⚡ Level Up Your Observability
This skill gives you the methodology. For industry-specific implementation patterns:
🔗 More Free Skills by AfrexAI
afrexai-devops-engine — CI/CD, infrastructure, deployment strategies
afrexai-api-architect — API design, security, versioning
afrexai-database-engineering — Schema design, query optimization, migrations
afrexai-code-reviewer — Code review methodology with SPEAR framework
afrexai-prompt-engineering — System prompt design, testing, optimization
Browse all AfrexAI skills: clawhub.com | Full storefront