afrexai-observability-engine

Complete observability & reliability engineering system. Use when designing monitoring, implementing structured logging, setting up distributed tracing, building alerting systems, creating SLO/SLI frameworks, running incident response, conducting post-mortems, or auditing system reliability. Covers all three pillars (logs/metrics/traces), alert design, dashboard architecture, on-call operations, chaos engineering, and cost optimization.

Safety Notice

This listing is from the official public ClawHub registry. Review SKILL.md and referenced scripts before running.

Copy this and send it to your AI assistant to learn

Install skill "afrexai-observability-engine" with this command: npx skills add 1kalin/afrexai-observability-engine

Observability & Reliability Engineering

Complete system for building observable, reliable services — from structured logging to incident response to SLO-driven development.


Quick Health Check (/16)

Score your current observability posture:

| Signal | Healthy (2) | Weak (1) | Missing (0) |
|---|---|---|---|
| Structured logging | JSON logs with trace_id correlation | Logs exist but unstructured | Console.log / print statements |
| Metrics collection | RED/USE metrics with dashboards | Some metrics, no dashboards | No metrics |
| Distributed tracing | Full request path with sampling | Partial traces, key services only | No tracing |
| Alerting | SLO-based alerts with runbooks | Threshold alerts, some runbooks | No alerts or all-noise |
| Incident response | Defined process with roles + post-mortems | Ad-hoc response, some docs | "Whoever notices fixes it" |
| SLOs defined | SLOs with error budgets tracked weekly | Informal availability targets | No reliability targets |
| On-call rotation | Structured rotation with escalation | Informal "call someone" | No on-call |
| Cost management | Observability budget tracked monthly | Some awareness of costs | No idea what you spend |

  • 12-16: Production-grade. Focus on optimization.
  • 8-11: Foundation exists. Fill the gaps systematically.
  • 4-7: Significant risk. Prioritize alerting + incident response.
  • 0-3: Flying blind. Start with Phase 1 immediately.


Phase 1: Structured Logging

Log Architecture

Application → Structured JSON → Log Router → Storage → Query Engine
                                    ↓
                              Alert Pipeline

Required Fields (Every Log Line)

| Field | Type | Purpose | Example |
|---|---|---|---|
| timestamp | ISO-8601 UTC | When | 2026-02-22T18:30:00.123Z |
| level | enum | Severity | info, warn, error, fatal |
| service | string | Which service | payment-api |
| version | string | Which deploy | v2.3.1 |
| environment | string | Which env | production |
| message | string | What happened | Payment processed successfully |
| trace_id | string | Request correlation | abc123def456 |
| span_id | string | Operation within trace | span_789 |
| duration_ms | number | How long | 142 |
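
Putting the required fields together, a single log line built from the examples above would look like this (values illustrative):

{
  "timestamp": "2026-02-22T18:30:00.123Z",
  "level": "info",
  "service": "payment-api",
  "version": "v2.3.1",
  "environment": "production",
  "message": "Payment processed successfully",
  "trace_id": "abc123def456",
  "span_id": "span_789",
  "duration_ms": 142
}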

Contextual Fields (Add Per Domain)

# HTTP request context
http:
  method: POST
  path: /api/v1/orders
  status: 201
  client_ip: 203.0.113.42  # Anonymize in logs if needed
  user_agent: "Mozilla/5.0..."
  request_id: "req_abc123"

# Business context
business:
  user_id: "usr_456"
  tenant_id: "tenant_789"
  order_id: "ord_012"
  action: "checkout"
  amount_cents: 4999
  currency: "USD"

# Error context
error:
  type: "PaymentDeclinedError"
  message: "Card declined: insufficient funds"
  code: "CARD_DECLINED"
  stack: "..." # Only in non-production or DEBUG level
  retry_count: 2
  retryable: true

Log Level Decision Tree

Is the process about to crash?
  → FATAL (exit after logging)

Did an operation fail that needs human attention?
  → ERROR (page someone or create ticket)

Did something unexpected happen but we recovered?
  → WARN (review in daily triage)

Is this a normal business event worth recording?
  → INFO (audit trail, business metrics)

Is this useful for debugging but noisy in production?
  → DEBUG (off in prod, on in staging)

Is this only useful when stepping through code?
  → TRACE (never in production)

Log Level Rules

  1. ERROR means action required — if no one needs to act on it, it's WARN
  2. INFO is for business events — not internal implementation details
  3. No logging inside tight loops — aggregate and log a summary (see the sketch after this list)
  4. Log at boundaries — API entry/exit, queue consume/publish, DB calls
  5. Never log secrets — API keys, tokens, passwords, PII (see scrubbing below)
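
A minimal TypeScript sketch of rule 3, assuming a hypothetical saveRecord helper and a Pino-style logger (see the setup below):

import pino from 'pino';

const logger = pino();

// Assumed persistence helper, stubbed for illustration only.
async function saveRecord(_record: Record<string, unknown>): Promise<void> {}

async function importRecords(records: Record<string, unknown>[]) {
  let imported = 0;
  let failed = 0;
  const start = Date.now();

  for (const record of records) {
    try {
      await saveRecord(record);
      imported++;
    } catch {
      failed++; // no per-item log line inside the hot loop
    }
  }

  // One summary line at the boundary instead of N lines in the loop.
  logger.info(
    { imported, failed, total: records.length, duration_ms: Date.now() - start },
    'record import completed'
  );
}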

PII & Secret Scrubbing

scrub_patterns:
  # Always redact
  - field_patterns: ["password", "secret", "token", "api_key", "authorization"]
    action: replace_with_redacted
  
  # Hash for correlation without exposure
  - field_patterns: ["email", "phone", "ssn", "national_id"]
    action: sha256_hash
  
  # Mask partially
  - field_patterns: ["credit_card", "card_number"]
    action: mask_last_4  # "****-****-****-1234"
  
  # IP anonymization
  - field_patterns: ["client_ip", "ip_address"]
    action: zero_last_octet  # 203.0.113.0
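
A minimal TypeScript sketch of a scrubber implementing these four actions (field names follow the config above; illustrative, not a drop-in logger plugin):

import { createHash } from 'node:crypto';

const REDACT = ['password', 'secret', 'token', 'api_key', 'authorization'];
const HASH = ['email', 'phone', 'ssn', 'national_id'];
const MASK = ['credit_card', 'card_number'];
const IP = ['client_ip', 'ip_address'];

function scrub(fields: Record<string, unknown>): Record<string, unknown> {
  const out: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(fields)) {
    const k = key.toLowerCase();
    if (REDACT.some((p) => k.includes(p))) {
      out[key] = '[REDACTED]';
    } else if (HASH.some((p) => k.includes(p))) {
      // Hash so values can still be correlated across log lines without exposure.
      out[key] = createHash('sha256').update(String(value)).digest('hex');
    } else if (MASK.some((p) => k.includes(p))) {
      // Keep only the last four digits.
      out[key] = '****-****-****-' + String(value).slice(-4);
    } else if (IP.some((p) => k.includes(p))) {
      // Zero the last octet of IPv4 addresses.
      out[key] = String(value).replace(/\.\d+$/, '.0');
    } else {
      out[key] = value;
    }
  }
  return out;
}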

Logger Setup (By Language)

Node.js (Pino):

import pino from 'pino';
import crypto from 'node:crypto'; // for crypto.randomUUID on older Node versions
import { AsyncLocalStorage } from 'node:async_hooks';

const als = new AsyncLocalStorage<Record<string, string>>();

const logger = pino({
  level: process.env.LOG_LEVEL || 'info',
  formatters: {
    level: (label) => ({ level: label }),
  },
  mixin: () => als.getStore() ?? {},
  redact: ['req.headers.authorization', '*.password', '*.token'],
  timestamp: pino.stdTimeFunctions.isoTime,
});

// Middleware: inject context
app.use((req, res, next) => {
  const ctx = {
    trace_id: req.headers['x-trace-id'] || crypto.randomUUID(),
    request_id: crypto.randomUUID(),
    service: 'payment-api',
    version: process.env.APP_VERSION,
  };
  als.run(ctx, () => next());
});

Python (structlog):

import structlog
structlog.configure(
    processors=[
        structlog.contextvars.merge_contextvars,
        structlog.processors.add_log_level,
        structlog.processors.TimeStamper(fmt="iso", utc=True),
        structlog.processors.JSONRenderer(),
    ],
)
log = structlog.get_logger()
# Bind context per-request:
structlog.contextvars.bind_contextvars(trace_id=trace_id, user_id=user_id)

Go (zerolog):

log := zerolog.New(os.Stdout).With().
    Timestamp().
    Str("service", "payment-api").
    Str("version", version).
    Logger()
// Per-request:
reqLog := log.With().Str("trace_id", traceID).Logger()

Log Storage Decision

| Volume | Solution | Retention | Cost |
|---|---|---|---|
| <10 GB/day | Loki + Grafana | 30 days hot, 90 days cold | Low |
| 10-100 GB/day | Elasticsearch / OpenSearch | 14 days hot, 90 days S3 | Medium |
| 100+ GB/day | ClickHouse or Datadog | 7 days hot, 30 days archive | High |
| Budget-constrained | Loki + S3 backend | 90 days all cold | Very low |

10 Logging Anti-Patterns

| # | Anti-Pattern | Fix |
|---|---|---|
| 1 | log.error(err) with no context | Always include: what operation, what input, what state |
| 2 | Logging request/response bodies | Log only in DEBUG; redact sensitive fields |
| 3 | String concatenation in log messages | Use structured fields: log.info("processed", { order_id, amount }) |
| 4 | Catch-and-log-and-rethrow | Log at the boundary where you handle it, not every layer |
| 5 | Different log formats per service | Standardize schema across all services |
| 6 | No log rotation / retention policy | Set max size + TTL; archive to cold storage |
| 7 | Logging inside hot paths | Aggregate: log summary every N items or every interval |
| 8 | Missing correlation IDs | Propagate trace_id from first entry point through all services |
| 9 | Boolean log levels (verbose: true) | Use standard levels with configurable minimum |
| 10 | Logging PII in plain text | Implement scrubbing at the logger level |

Phase 2: Metrics Collection

The RED Method (Request-Driven Services)

For every service endpoint, track:

| Metric | What | Prometheus Example |
|---|---|---|
| Rate | Requests per second | http_requests_total{method, path, status} |
| Errors | Failed requests per second | http_requests_total{status=~"5.."} / total |
| Duration | Latency distribution | http_request_duration_seconds{method, path} (histogram) |

The USE Method (Infrastructure Resources)

For every resource (CPU, memory, disk, network):

| Metric | What | Example |
|---|---|---|
| Utilization | % resource busy | CPU usage 78% |
| Saturation | Queue depth / backpressure | 12 requests queued |
| Errors | Resource errors | 3 disk I/O errors |

Golden Signals (Google SRE)

| Signal | Meaning | Source |
|---|---|---|
| Latency | Time to serve requests | RED Duration |
| Traffic | Demand on the system | RED Rate |
| Errors | Rate of failed requests | RED Errors |
| Saturation | How "full" the service is | USE Saturation |

Metric Types & When to Use Each

| Type | Use Case | Example |
|---|---|---|
| Counter | Things that only go up | Total requests, errors, bytes sent |
| Gauge | Current value that goes up/down | Active connections, queue depth, temperature |
| Histogram | Distribution of values | Request latency, response size |
| Summary | Pre-calculated percentiles | Client-side latency (when you need exact percentiles) |

Rule: Use histograms over summaries in most cases — they're aggregatable across instances.
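
A short sketch using the Node prom-client library, registering one metric of each common type (bucket boundaries are illustrative):

import { Counter, Gauge, Histogram, register } from 'prom-client';

// Counter: only ever increases.
const requestsTotal = new Counter({
  name: 'http_server_requests_total',
  help: 'Total HTTP requests',
  labelNames: ['method', 'path', 'status'],
});

// Gauge: current value, can go up or down.
const inFlight = new Gauge({
  name: 'http_requests_in_flight',
  help: 'Requests currently being handled',
});

// Histogram: latency distribution, aggregatable across instances.
const requestDuration = new Histogram({
  name: 'http_server_request_duration_seconds',
  help: 'Request duration in seconds',
  labelNames: ['method', 'path', 'status'],
  buckets: [0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5],
});

// Expose for scraping, e.g. in Express:
// app.get('/metrics', async (_req, res) => {
//   res.set('Content-Type', register.contentType);
//   res.send(await register.metrics());
// });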

Naming Conventions

# Pattern: <namespace>_<subsystem>_<name>_<unit>
http_server_request_duration_seconds
http_server_requests_total
db_pool_connections_active
queue_messages_pending
cache_hit_ratio

# Rules:
# 1. Use snake_case
# 2. Include unit suffix (_seconds, _bytes, _total)
# 3. _total suffix for counters
# 4. Don't include label names in metric name
# 5. Use base units (seconds not milliseconds, bytes not kilobytes)

Label Design Rules

| Rule | Why | Example |
|---|---|---|
| Keep cardinality <100 per label | High cardinality kills performance | status="200" not status="200 OK" |
| No user IDs as labels | Unbounded cardinality | Use log correlation instead |
| No request paths with IDs | /api/users/123 creates millions of series | Normalize: /api/users/:id |
| Max 5-7 labels per metric | Each combo = a time series | {method, path, status, service} |
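
To keep path labels bounded in Express, one option (a sketch; req.route is only populated once a route has matched) is to record the route pattern rather than the raw URL, reusing the requestsTotal counter from the sketch above:

import express from 'express';

const app = express();

app.use((req, res, next) => {
  res.on('finish', () => {
    // Record "/api/users/:id" instead of "/api/users/123" to keep label values bounded.
    const path = req.route?.path ?? 'unmatched';
    requestsTotal.inc({ method: req.method, path, status: String(res.statusCode) });
  });
  next();
});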

Instrumentation Checklist

application_metrics:
  # HTTP layer
  - http_request_duration_seconds: histogram {method, path, status}
  - http_request_size_bytes: histogram {method, path}
  - http_response_size_bytes: histogram {method, path}
  - http_requests_in_flight: gauge
  
  # Business logic
  - orders_processed_total: counter {status, payment_method}
  - order_value_dollars: histogram {payment_method}
  - user_signups_total: counter {source}
  
  # Dependencies
  - db_query_duration_seconds: histogram {query_type, table}
  - db_connections_active: gauge {pool}
  - db_connections_idle: gauge {pool}
  - cache_requests_total: counter {result: hit|miss}
  - external_api_duration_seconds: histogram {service, endpoint}
  - external_api_errors_total: counter {service, error_type}
  
  # Queue / async
  - queue_messages_published_total: counter {queue}
  - queue_messages_consumed_total: counter {queue, status}
  - queue_processing_duration_seconds: histogram {queue}
  - queue_depth: gauge {queue}
  - queue_consumer_lag: gauge {queue, consumer_group}

infrastructure_metrics:
  # Node exporter / cAdvisor provides these automatically
  - cpu_usage_percent: gauge {instance}
  - memory_usage_bytes: gauge {instance}
  - disk_usage_bytes: gauge {instance, mount}
  - disk_io_seconds: counter {instance, device}
  - network_bytes: counter {instance, direction}
  - container_cpu_usage: gauge {pod, container}
  - container_memory_usage: gauge {pod, container}

Stack Recommendations

| Component | Options | Recommendation |
|---|---|---|
| Collection | Prometheus, OTEL Collector, Datadog Agent | Prometheus (free) or OTEL Collector (vendor-neutral) |
| Storage | Prometheus, Thanos, Mimir, VictoriaMetrics | VictoriaMetrics (best cost/perf) or Mimir (Grafana ecosystem) |
| Visualization | Grafana, Datadog, New Relic | Grafana (free, extensible) |
| Alerting | Alertmanager, Grafana Alerting, PagerDuty | Alertmanager + PagerDuty routing |

Phase 3: Distributed Tracing

Trace Architecture

Client Request
  → API Gateway (root span)
    → Auth Service (child span)
    → Order Service (child span)
      → Database Query (child span)
      → Payment Service (child span)
        → Stripe API (child span)
    → Notification Service (child span)
      → Email Provider (child span)

OpenTelemetry Setup

Auto-instrumentation (Node.js):

// tracing.ts — import BEFORE anything else
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';

const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({
    url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT || 'http://localhost:4318/v1/traces',
  }),
  instrumentations: [getNodeAutoInstrumentations({
    '@opentelemetry/instrumentation-http': { ignoreIncomingPaths: ['/health', '/ready'] },
    '@opentelemetry/instrumentation-express': { enabled: true },
  })],
  serviceName: process.env.OTEL_SERVICE_NAME || 'payment-api',
});
sdk.start();

Custom spans for business logic:

import { trace, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('payment-service');

async function processPayment(order: Order) {
  return tracer.startActiveSpan('process-payment', async (span) => {
    span.setAttributes({
      'order.id': order.id,
      'order.amount_cents': order.amountCents,
      'payment.method': order.paymentMethod,
    });
    try {
      const result = await chargeCard(order);
      span.setAttributes({ 'payment.status': result.status });
      return result;
    } catch (err) {
      span.setStatus({ code: SpanStatusCode.ERROR, message: err.message });
      span.recordException(err);
      throw err;
    } finally {
      span.end();
    }
  });
}

Sampling Strategies

| Strategy | When | Config |
|---|---|---|
| Always On | Dev/staging, low traffic (<100 rps) | ratio: 1.0 |
| Probabilistic | Moderate traffic (100-1000 rps) | ratio: 0.1 (10%) |
| Rate-limited | High traffic (>1000 rps) | max_traces_per_second: 100 |
| Tail-based | Want all errors + slow requests | Collector-side: keep if error OR duration > p99 |
| Parent-based | Respect upstream decisions | If parent sampled, child sampled |

Recommendation: Start with parent-based + probabilistic (10%). Add tail-based at the collector to capture all errors.
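
With the OpenTelemetry Node SDK, that head-sampling recommendation looks roughly like this (tail-based sampling lives in the Collector, not the SDK):

import { NodeSDK } from '@opentelemetry/sdk-node';
import { ParentBasedSampler, TraceIdRatioBasedSampler } from '@opentelemetry/sdk-trace-base';

const sdk = new NodeSDK({
  // Respect the upstream decision; sample 10% of new root traces.
  sampler: new ParentBasedSampler({
    root: new TraceIdRatioBasedSampler(0.1),
  }),
  // ...exporter and instrumentations as in the setup above
});
sdk.start();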

Context Propagation

| Header | Standard | Format |
|---|---|---|
| traceparent | W3C Trace Context | 00-{trace_id}-{span_id}-{flags} |
| tracestate | W3C Trace Context | Vendor-specific key-value pairs |
| b3 | Zipkin B3 | {trace_id}-{span_id}-{sampled} |

Rule: Use W3C Trace Context (traceparent) as primary. Support B3 for legacy Zipkin systems.
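
If you need to accept both formats during a migration, a composite propagator can be registered (a sketch; package names per recent OpenTelemetry JS releases):

import { propagation } from '@opentelemetry/api';
import { CompositePropagator, W3CTraceContextPropagator } from '@opentelemetry/core';
import { B3Propagator } from '@opentelemetry/propagator-b3';

// W3C traceparent first (primary); B3 kept for legacy Zipkin callers.
propagation.setGlobalPropagator(
  new CompositePropagator({
    propagators: [new W3CTraceContextPropagator(), new B3Propagator()],
  })
);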

Trace Storage

| Volume | Solution | Retention |
|---|---|---|
| <50 GB/day | Jaeger + Elasticsearch | 7 days |
| 50-500 GB/day | Tempo + S3 | 14 days |
| 500+ GB/day | Tempo + S3 with aggressive sampling | 7 days |
| Budget-constrained | Jaeger + Badger (local disk) | 3 days |

Phase 4: SLOs, SLIs & Error Budgets

SLI Selection by Service Type

| Service Type | Primary SLI | Secondary SLI | Measurement |
|---|---|---|---|
| API / Web | Availability + Latency | Error rate | Server-side + synthetic |
| Data pipeline | Freshness + Correctness | Throughput | Pipeline timestamps + checksums |
| Storage | Durability + Availability | Latency | Checksums + uptime monitoring |
| Streaming | Throughput + Latency | Message loss rate | Consumer lag + e2e latency |
| Batch jobs | Success rate + Freshness | Duration | Job scheduler metrics |

SLO Definition Template

slo:
  name: "Payment API Availability"
  service: payment-api
  owner: payments-team
  
  sli:
    type: availability
    definition: "Proportion of non-5xx responses"
    measurement: |
      sum(rate(http_requests_total{service="payment-api",status!~"5.."}[5m]))
      /
      sum(rate(http_requests_total{service="payment-api"}[5m]))
    
  target: 99.95%  # 21.9 min downtime/month
  window: rolling_30d
  
  error_budget:
    total_minutes: 21.9  # per 30 days
    burn_rate_alerts:
      - severity: critical
        burn_rate: 14.4x  # Budget exhausted in ~2 days
        short_window: 5m
        long_window: 1h
      - severity: warning
        burn_rate: 6x    # Budget consumed in 5 days
        short_window: 30m
        long_window: 6h
      - severity: ticket
        burn_rate: 1x    # Budget consumed in 30 days
        short_window: 6h
        long_window: 3d
  
  consequences:
    budget_remaining_above_50pct: "Normal development velocity"
    budget_remaining_20_to_50pct: "Prioritize reliability work"
    budget_remaining_below_20pct: "Feature freeze; reliability only"
    budget_exhausted: "All hands on reliability until budget recovers"

Common SLO Targets

| Service Tier | Availability | p50 Latency | p99 Latency | Monthly Downtime |
|---|---|---|---|---|
| Tier 0 (payments, auth) | 99.99% | <100ms | <500ms | 4.3 min |
| Tier 1 (core API) | 99.95% | <200ms | <1s | 21.9 min |
| Tier 2 (non-critical) | 99.9% | <500ms | <2s | 43.8 min |
| Tier 3 (internal tools) | 99.5% | <1s | <5s | 3.6 hours |
| Batch / pipeline | 99% (success rate) | N/A | N/A | N/A |
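
The downtime column is plain arithmetic on the availability target. A small helper for sanity-checking a proposed SLO (the table's monthly figures round to an average calendar month of roughly 30.44 days, so they run slightly higher than a strict 30-day window):

// Allowed downtime for a given availability target over a rolling window.
function errorBudgetMinutes(availabilityPercent: number, windowDays = 30): number {
  return windowDays * 24 * 60 * (1 - availabilityPercent / 100);
}

console.log(errorBudgetMinutes(99.99).toFixed(1)); // "4.3"
console.log(errorBudgetMinutes(99.95).toFixed(1)); // "21.6" (~21.9 over an average month)
console.log(errorBudgetMinutes(99.9).toFixed(1));  // "43.2" (~43.8 over an average month)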

Error Budget Tracking

# Weekly error budget review template
error_budget_review:
  week: "2026-W08"
  service: payment-api
  slo_target: 99.95%
  
  budget:
    total_minutes_this_period: 21.9
    consumed_minutes: 8.2
    remaining_minutes: 13.7
    remaining_percent: 62.6%
    
  incidents_consuming_budget:
    - date: "2026-02-18"
      duration_minutes: 5.1
      cause: "Database connection pool exhaustion"
      preventable: true
      action: "Increase pool size + add saturation alert"
    - date: "2026-02-20"
      duration_minutes: 3.1
      cause: "Upstream payment provider timeout"
      preventable: false
      action: "Add circuit breaker with fallback"
  
  velocity_decision: "Normal — 62.6% budget remaining"
  reliability_work_this_week:
    - "Add connection pool saturation alert"
    - "Implement circuit breaker for payment provider"

Phase 5: Alert Design

Alert Quality Principles

  1. Every alert must be actionable — if no one needs to act, it's not an alert
  2. Every alert needs a runbook — linked directly in the alert annotation
  3. Symptom-based over cause-based — alert on "users can't checkout" not "CPU high"
  4. Multi-window burn rate — not static thresholds (see SLO alerts above)
  5. Alert on absence, not just presence — "no orders in 15 min" catches silent failures

Alert Severity Levels

| Severity | Response Time | Channel | Who | Example |
|---|---|---|---|---|
| P0 — Critical | <5 min | Page (PagerDuty/Opsgenie) | On-call engineer | Payment system down |
| P1 — High | <30 min | Page during business hours, Slack 24/7 | On-call | Error rate >5% for 10 min |
| P2 — Medium | <4 hours | Slack channel | Team | p99 latency degraded 2x |
| P3 — Low | Next business day | Ticket auto-created | Team backlog | Disk usage >80% |
| Info | N/A | Dashboard only | No one | Deploy completed |

Alerting Anti-Patterns

| Anti-Pattern | Problem | Fix |
|---|---|---|
| Static CPU/memory thresholds | Noisy, not user-impacting | Use SLO-based burn rate alerts |
| Alert per instance | 50 instances = 50 alerts for same issue | Aggregate: alert on service-level error rate |
| No deduplication | Same alert fires 100 times | Group by service + alert name; set repeat interval |
| Missing runbook | Engineer gets paged, doesn't know what to do | Every alert links to a runbook |
| Threshold too sensitive | Fires on brief spikes | Use for: 5m to require a sustained condition |
| Too many P0s | Alert fatigue → ignoring real incidents | Audit monthly; demote or remove noisy alerts |

Alert Template (Prometheus Alertmanager)

groups:
  - name: payment-api-slo
    rules:
      - alert: PaymentAPIHighErrorRate
        expr: |
          (
            sum(rate(http_requests_total{service="payment-api",status=~"5.."}[5m]))
            /
            sum(rate(http_requests_total{service="payment-api"}[5m]))
          ) > 0.01
        for: 5m
        labels:
          severity: critical
          service: payment-api
          team: payments
        annotations:
          summary: "Payment API error rate {{ $value | humanizePercentage }} (>1%)"
          description: "5xx error rate has exceeded 1% for 5 minutes"
          runbook: "https://wiki.internal/runbooks/payment-api-errors"
          dashboard: "https://grafana.internal/d/payment-api"
          
      - alert: PaymentAPINoTraffic
        expr: |
          sum(rate(http_requests_total{service="payment-api"}[15m])) == 0
        for: 5m
        labels:
          severity: critical
          service: payment-api
        annotations:
          summary: "Payment API receiving zero traffic for 5 minutes"
          runbook: "https://wiki.internal/runbooks/payment-api-no-traffic"

      - alert: PaymentAPILatencyHigh
        expr: |
          histogram_quantile(0.99, 
            sum(rate(http_request_duration_seconds_bucket{service="payment-api"}[5m])) by (le)
          ) > 2
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Payment API p99 latency {{ $value }}s (>2s for 10min)"
          runbook: "https://wiki.internal/runbooks/payment-api-latency"

Runbook Template

# Runbook: PaymentAPIHighErrorRate

## What This Alert Means
The payment API is returning >1% 5xx errors over a 5-minute window.
Users are likely failing to complete checkouts.

## Impact
- Users cannot process payments
- Revenue loss: ~$X per minute (based on average traffic)
- SLO: Payment API availability (target: 99.95%)

## Immediate Actions
1. Check the error dashboard: [link]
2. Check recent deploys: `kubectl rollout history deployment/payment-api`
3. Check upstream dependencies:
   - Database: [dashboard link]
   - Stripe API: [status page]
   - Redis cache: [dashboard link]
4. Check application logs:

kubectl logs -l app=payment-api --since=10m | jq 'select(.level=="error")'


## Common Causes & Fixes
| Cause | Diagnosis | Fix |
|-------|-----------|-----|
| Bad deploy | Errors started at deploy time | `kubectl rollout undo deployment/payment-api` |
| DB connection exhaustion | `db_connections_active` at max | Restart pods (rolling) + increase pool size |
| Stripe outage | Stripe status page red | Enable fallback payment processor |
| Memory leak | Memory climbing, OOMKilled events | Rolling restart + investigate |

## Escalation
- If unresolved after 15 min: page payment team lead
- If revenue impact >$10K: page VP Engineering
- If Stripe outage: communicate to support team for customer messaging

## Resolution
- Confirm error rate <0.1% for 10 min
- Post in #incidents: root cause + duration + impact
- Schedule post-mortem if downtime >5 min

Phase 6: Dashboard Architecture

Dashboard Hierarchy

L1: Executive / Business Dashboard (non-technical stakeholders)
  ↓
L2: Service Overview Dashboard (on-call, quick triage)
  ↓
L3: Service Deep-Dive Dashboard (debugging specific service)
  ↓
L4: Infrastructure Dashboard (resource-level details)

L1: Business Dashboard

panels:
  - title: "Revenue per Minute"
    type: stat
    query: "sum(rate(orders_total{status='completed'}[5m])) * avg(order_value_dollars)"
  - title: "Active Users (5min)"
    type: stat
    query: "count(count by (user_id) (http_requests_total{...}[5m]))"
  - title: "Checkout Success Rate"
    type: gauge
    query: "sum(rate(checkout_total{status='success'}[1h])) / sum(rate(checkout_total[1h]))"
    thresholds: [95, 98, 99.5]
  - title: "Error Budget Remaining"
    type: gauge
    query: "1 - (error_budget_consumed / error_budget_total)"

L2: Service Overview Dashboard

Every service gets one of these with identical layout:

row_1_traffic:
  - "Request Rate (rps)" — timeseries, by status code
  - "Error Rate (%)" — timeseries, threshold line at SLO
  - "Active Requests" — gauge

row_2_latency:
  - "Latency Distribution" — heatmap
  - "p50 / p95 / p99" — timeseries, threshold lines
  - "Latency by Endpoint" — table, sorted by p99

row_3_dependencies:
  - "Downstream Latency" — timeseries per dependency
  - "Downstream Error Rate" — timeseries per dependency
  - "Database Query Duration" — timeseries by query type

row_4_resources:
  - "CPU Usage" — timeseries per pod
  - "Memory Usage" — timeseries per pod
  - "Pod Restarts" — stat

row_5_business:
  - "Business Metric 1" — service-specific
  - "Business Metric 2" — service-specific

Dashboard Rules

  1. Time range default: last 1 hour — most debugging happens in recent time
  2. Variable selectors at top: environment, service, instance
  3. Consistent color coding: green=good, yellow=degraded, red=bad across all dashboards
  4. Link alerts to dashboards — every alert annotation includes dashboard URL
  5. No more than 15 panels per dashboard — split into L3 if needed
  6. Include "as of" timestamp — so screenshots in incidents are unambiguous
  7. Dashboard as code — store Grafana JSON in git, provision via API (see the sketch below)
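
A minimal sketch of rule 7 against Grafana's dashboard HTTP API (the POST /api/dashboards/db endpoint and payload shape are Grafana's documented API; the URL, token, and file path are placeholders):

import { readFile } from 'node:fs/promises';

// Push a dashboard JSON file stored in git to Grafana.
async function provisionDashboard(filePath: string): Promise<void> {
  const dashboard = JSON.parse(await readFile(filePath, 'utf8'));

  const res = await fetch(`${process.env.GRAFANA_URL}/api/dashboards/db`, {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      Authorization: `Bearer ${process.env.GRAFANA_API_TOKEN}`,
    },
    body: JSON.stringify({ dashboard, overwrite: true, message: 'provisioned from git' }),
  });

  if (!res.ok) throw new Error(`Grafana API returned ${res.status}`);
}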

Phase 7: Incident Response

Incident Severity Classification

| Severity | Criteria | Response | Communication |
|---|---|---|---|
| SEV-1 | Service down, data loss risk, security breach | All hands, war room | Status page update every 15 min |
| SEV-2 | Degraded service, SLO at risk, partial outage | On-call + backup | Status page update every 30 min |
| SEV-3 | Minor degradation, workaround exists | On-call during hours | Internal Slack update |
| SEV-4 | Cosmetic, low impact | Next sprint | None |

Incident Roles

| Role | Responsibility | Who |
|---|---|---|
| Incident Commander (IC) | Owns the incident. Coordinates. Makes decisions. | On-call lead |
| Technical Lead | Diagnoses and fixes. Communicates technical status to IC. | Senior engineer |
| Communications Lead | Updates status page, Slack, stakeholders. | Product/support |
| Scribe | Documents timeline, actions, decisions in real-time. | Anyone available |

Incident Response Workflow

1. DETECT
   - Alert fires → on-call paged
   - Customer report → support escalates
   - Internal discovery → engineer reports
   
2. TRIAGE (first 5 minutes)
   - Confirm the issue is real (not false alert)
   - Classify severity (SEV-1 through SEV-4)
   - Open incident channel: #inc-YYYY-MM-DD-short-description
   - Assign roles (IC, Tech Lead, Comms)
   
3. MITIGATE (next 5-30 minutes)
   - Goal: STOP THE BLEEDING, not find root cause
   - Options (try in order):
     a. Rollback last deploy
     b. Scale up / restart pods
     c. Toggle feature flag off
     d. Redirect traffic / enable fallback
     e. Manual data fix
   - Document every action with timestamp
   
4. STABILIZE
   - Confirm mitigation is working (metrics back to normal)
   - Monitor for 15-30 min for recurrence
   - Update status page: "Monitoring fix"
   
5. RESOLVE
   - Confirm all metrics healthy for 30+ min
   - Update status page: "Resolved"
   - Schedule post-mortem (within 48 hours for SEV-1/2)
   - Send internal summary to stakeholders

Incident Channel Template

📋 Incident: Payment API 5xx Errors
🔴 Severity: SEV-2
🕐 Started: 2026-02-22 14:23 UTC
👤 IC: @alice
🔧 Tech Lead: @bob
📢 Comms: @charlie

Status: MITIGATING
Impact: ~5% of checkout requests failing
Customer-facing: Yes

Timeline:
14:23 — Alert fired: PaymentAPIHighErrorRate
14:25 — IC assigned: @alice, confirmed real via dashboard
14:28 — Tech Lead: error logs show connection pool exhaustion post-deploy
14:31 — Rolled back deployment v2.3.1 → v2.3.0
14:35 — Error rate dropping, monitoring
14:50 — Error rate <0.1%, marking resolved

Phase 8: Post-Mortem Framework

Blameless Post-Mortem Template

post_mortem:
  title: "Payment API Connection Pool Exhaustion"
  date: "2026-02-22"
  severity: SEV-2
  duration: 27 minutes (14:23 — 14:50 UTC)
  authors: ["@alice", "@bob"]
  reviewers: ["@engineering-leads"]
  status: action_items_in_progress
  
  summary: |
    A deployment at 14:15 introduced a connection leak in the payment API.
    Connection pool was exhausted by 14:23, causing 5xx errors for ~5% of
    checkout requests. Rolled back at 14:31; recovered by 14:50.
  
  impact:
    user_impact: "~340 users saw checkout failures over 27 minutes"
    revenue_impact: "$2,100 estimated (based on average order value × failed checkouts)"
    slo_impact: "Consumed 5.1 min of 21.9 min monthly error budget (23%)"
    data_impact: "No data loss. 12 orders failed; users could retry successfully."
  
  timeline:
    - time: "14:15"
      event: "Deploy v2.3.1 rolled out (3/3 pods updated)"
    - time: "14:23"
      event: "PaymentAPIHighErrorRate alert fired"
    - time: "14:25"
      event: "IC assigned, confirmed via dashboard"
    - time: "14:28"
      event: "Root cause identified: new ORM query not releasing connections"
    - time: "14:31"
      event: "Rollback initiated: v2.3.1 → v2.3.0"
    - time: "14:35"
      event: "Error rate declining"
    - time: "14:50"
      event: "Resolved: error rate <0.1% sustained"
  
  root_cause: |
    The v2.3.1 deploy introduced a new database query in the order validation
    path. The query used a raw connection instead of the pool's managed client,
    so connections were acquired but never released. Under load, the pool
    exhausted within 8 minutes.
  
  contributing_factors:
    - "No integration test for connection pool behavior under load"
    - "Connection pool saturation metric existed but had no alert"
    - "Code review didn't catch raw connection usage"
  
  what_went_well:
    - "Alert fired within 8 minutes of deploy"
    - "IC assigned in 2 minutes"
    - "Root cause identified in 3 minutes (clear in logs)"
    - "Rollback executed cleanly"
  
  what_went_wrong:
    - "8-minute detection gap after deploy"
    - "No canary deployment to catch before full rollout"
    - "Connection pool saturation had no alert"
  
  action_items:
    - action: "Add connection pool saturation alert (>80% for 2 min)"
      owner: "@bob"
      priority: P1
      due: "2026-02-25"
      status: in_progress
      ticket: "ENG-1234"
    - action: "Enable canary deployments for payment-api"
      owner: "@alice"
      priority: P1
      due: "2026-03-01"
      ticket: "ENG-1235"
    - action: "Add linting rule: no raw DB connections in application code"
      owner: "@charlie"
      priority: P2
      due: "2026-03-07"
      ticket: "ENG-1236"
    - action: "Load test payment-api connection pool in staging"
      owner: "@bob"
      priority: P2
      due: "2026-03-07"
      ticket: "ENG-1237"
  
  lessons_learned:
    - "Resource saturation metrics need alerts, not just dashboards"
    - "Canary deployments are mandatory for Tier 0 services"
    - "ORM abstractions don't guarantee connection safety — review raw queries"

Post-Mortem Meeting Agenda (60 minutes)

1. (5 min) Context setting — IC reads the summary
2. (15 min) Timeline walkthrough — what happened, when, by whom
3. (15 min) Root cause deep-dive — 5 Whys exercise
4. (5 min) What went well — celebrate good response
5. (15 min) Action items — assign owners, priorities, due dates
6. (5 min) Wrap-up — review date for action item check-in

5 Whys Exercise

Problem: 5xx errors in payment API

Why 1: Database connections were exhausted
Why 2: A new query acquired connections without releasing them
Why 3: The query used a raw connection instead of the pool manager
Why 4: The ORM's raw query API doesn't auto-release (by design)
Why 5: We don't have a linting rule or code review checklist item for this

Root cause: Missing guard against raw connection usage in application code
Systemic fix: Linting rule + connection pool saturation alerting

Phase 9: On-Call Operations

On-Call Structure

on_call:
  rotation: weekly
  handoff_day: Monday 10:00 UTC
  
  primary:
    response_time: 5 minutes (SEV-1/2), 30 minutes (SEV-3)
    escalation_after: 15 minutes no-ack
    
  secondary:
    response_time: 15 minutes (SEV-1), 1 hour (SEV-2/3)
    escalation_after: 30 minutes no-ack
    
  manager_escalation:
    trigger: SEV-1 unresolved after 30 minutes
    
  handoff_checklist:
    - Review open incidents and active alerts
    - Check error budget status for all services
    - Read post-mortems from previous week
    - Verify PagerDuty schedule and contact info
    - Test alert routing (send test page)

On-Call Health Metrics

| Metric | Healthy | Needs Attention | Unhealthy |
|---|---|---|---|
| Pages per week | <5 | 5-15 | >15 |
| After-hours pages per week | <2 | 2-5 | >5 |
| False positive rate | <10% | 10-30% | >30% |
| Mean time to acknowledge | <5 min | 5-15 min | >15 min |
| Mean time to resolve | <30 min | 30-120 min | >120 min |
| Toil ratio (manual vs automated) | <30% | 30-60% | >60% |

Weekly On-Call Review Template

on_call_review:
  week: "2026-W08"
  engineer: "@bob"
  
  incidents:
    total: 7
    sev_1: 0
    sev_2: 1
    sev_3: 4
    false_positives: 2
    after_hours: 3
    
  time_spent:
    incident_response: "4.5 hours"
    toil_automation: "2 hours"
    runbook_updates: "1 hour"
    
  improvements_made:
    - "Silenced noisy disk alert on dev servers"
    - "Added auto-remediation for pod restart threshold"
    
  improvements_needed:
    - "Cache expiry alert fires every Tuesday at 03:00 — needs investigation"
    - "Payment retry logic needs circuit breaker (caused 3 alerts)"
    
  handoff_notes: |
    Watch payment-api p99 latency — it's been creeping up since Wednesday.
    Stripe changed their sandbox endpoints; staging may throw errors.

Phase 10: Chaos Engineering & Reliability Testing

Chaos Principles

  1. Start with a hypothesis: "If X fails, the system should Y"
  2. Run in production (start small — one instance, one AZ)
  3. Minimize blast radius with automatic rollback
  4. Build confidence incrementally: staging → canary → production

Chaos Experiment Template

chaos_experiment:
  name: "Payment DB failover"
  hypothesis: "If the primary database becomes unavailable, traffic should
    failover to the replica within 30 seconds with <1% error rate spike"
  
  steady_state:
    - metric: "checkout_success_rate"
      expected: ">99.5%"
    - metric: "db_query_duration_p99"
      expected: "<200ms"
  
  injection:
    type: "network_partition"
    target: "payment-db-primary"
    duration: "5 minutes"
    blast_radius: "single AZ"
  
  abort_conditions:
    - "checkout_success_rate < 95% for > 60 seconds"
    - "revenue_per_minute drops > 50%"
    - "any SEV-1 incident declared"
  
  results:
    failover_time: "22 seconds"
    error_spike: "0.3% for 25 seconds"
    hypothesis_confirmed: true
    
  follow_up_actions:
    - "Document failover behavior in runbook"
    - "Add failover time as SLI (target: <30s)"

Chaos Engineering Maturity Levels

| Level | What You Test | Tools |
|---|---|---|
| 1: Manual | Kill a pod, see what happens | kubectl delete pod |
| 2: Automated | Scheduled pod kills, network delays | Chaos Monkey, Litmus |
| 3: Game Days | Multi-failure scenarios with team exercise | Custom scripts + coordination |
| 4: Continuous | Automated chaos in production with auto-rollback | Gremlin, Chaos Mesh |

Phase 11: Observability Cost Optimization

Cost Drivers (Ranked)

| # | Driver | Typical % of Bill | Optimization |
|---|---|---|---|
| 1 | Log volume | 40-60% | Reduce verbosity, drop DEBUG, sample repetitive |
| 2 | Metric cardinality | 15-25% | Drop unused metrics, limit labels |
| 3 | Trace volume | 10-20% | Sampling, tail-based sampling |
| 4 | Retention | 10-15% | Tiered storage (hot → warm → cold) |
| 5 | Query cost | 5-10% | Optimize dashboard queries, set max scan limits |

Cost Reduction Checklist

cost_optimization:
  logs:
    - action: "Drop DEBUG/TRACE in production"
      savings: "30-50% of log volume"
    - action: "Sample health check logs (1:100)"
      savings: "5-15% of log volume"
    - action: "Deduplicate identical error bursts"
      savings: "10-20% during incidents"
    - action: "Move logs older than 7 days to S3/cold storage"
      savings: "60-80% of storage cost"
    - action: "Drop request/response body logging"
      savings: "20-40% of log volume"
  
  metrics:
    - action: "Audit unused metrics (no dashboard, no alert)"
      savings: "10-30% of series"
    - action: "Reduce histogram bucket count (default 11 → 8)"
      savings: "~27% of histogram series"
    - action: "Remove high-cardinality labels"
      savings: "Variable — can be massive"
    - action: "Increase scrape interval for non-critical metrics (15s → 60s)"
      savings: "75% of data points for those metrics"
  
  traces:
    - action: "Implement tail-based sampling"
      savings: "80-95% of trace volume"
    - action: "Drop internal health check traces"
      savings: "5-20% of trace volume"
    - action: "Reduce span attribute size (truncate long strings)"
      savings: "10-30% of trace storage"
  
  general:
    - action: "Review and right-size retention policies quarterly"
    - action: "Set query timeouts and result limits on dashboards"
    - action: "Use recording rules for expensive queries"

Monthly Cost Review Template

observability_cost_review:
  month: "February 2026"
  total_cost: "$X,XXX"
  
  breakdown:
    logs: { volume: "X TB", cost: "$X", pct: "X%" }
    metrics: { series: "X million", cost: "$X", pct: "X%" }
    traces: { volume: "X TB", cost: "$X", pct: "X%" }
    infrastructure: { instances: X, cost: "$X", pct: "X%" }
  
  cost_per:
    request: "$0.000X"
    service: "$X average"
    engineer: "$X per engineer"
  
  optimizations_applied: []
  optimizations_planned: []
  budget_status: "on_track | over_budget | under_budget"

Phase 12: Advanced Patterns

Correlation: Connecting the Three Pillars

Every log line includes: trace_id, span_id
Every trace span includes: service, operation
Every metric includes: service label

Correlation paths:
  Alert fires (metric) → Click → Dashboard (metric) → Filter by time window
    → Trace search (same service + time) → Find failing trace
    → Logs (filter by trace_id) → See exact error
    
  Support ticket (user report) → Find request_id in logs
    → Extract trace_id → View full trace → Identify slow span
    → Check span's service metrics → Confirm pattern
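
One way to get trace_id and span_id onto every log line automatically is to read the active OpenTelemetry span inside the logger's mixin. The sketch below combines the Pino setup from Phase 1 with the tracing setup from Phase 3:

import pino from 'pino';
import { trace } from '@opentelemetry/api';

const logger = pino({
  mixin() {
    // Attach the current trace context to every log line when a span is active.
    const ctx = trace.getActiveSpan()?.spanContext();
    return ctx ? { trace_id: ctx.traceId, span_id: ctx.spanId } : {};
  },
});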

Synthetic Monitoring

synthetic_checks:
  - name: "Checkout flow"
    type: browser
    frequency: 5m
    locations: [us-east, eu-west, ap-southeast]
    steps:
      - navigate: "https://app.example.com/products"
      - click: "Add to Cart"
      - click: "Checkout"
      - assert: "Order confirmation page loads in <3s"
    alert_on: "2 consecutive failures from same location"
    
  - name: "API health"
    type: api
    frequency: 1m
    endpoints:
      - url: "https://api.example.com/health"
        expected_status: 200
        max_latency_ms: 500
      - url: "https://api.example.com/v1/products?limit=1"
        expected_status: 200
        max_latency_ms: 1000

Feature Flag Observability

# Correlate feature flags with metrics
feature_flag_monitoring:
  - flag: "new_checkout_flow"
    metrics_to_compare:
      - "checkout_conversion_rate" # by flag variant
      - "checkout_error_rate"
      - "checkout_latency_p99"
    alerts:
      - "If error rate for new variant > 2x control, auto-disable flag"

Observability Maturity Model

| Dimension | Level 1 | Level 2 | Level 3 | Level 4 |
|---|---|---|---|---|
| Logging | Unstructured logs | Structured JSON, centralized | Correlated with traces | Automated log analysis |
| Metrics | Basic infra metrics | RED/USE for services | SLO-based with error budgets | Predictive (anomaly detection) |
| Tracing | No tracing | Key services instrumented | Full distributed tracing | Trace-driven testing |
| Alerting | Static thresholds | Multi-signal alerts | Burn-rate based on SLOs | Auto-remediation |
| Incident Response | Ad hoc | Defined process + roles | Post-mortems with action tracking | Chaos engineering in prod |
| Culture | "Ops team handles it" | Shared ownership (you build it, you run it) | SLO-driven development velocity | Reliability as a feature |

Quality Scoring Rubric (0-100)

| Dimension | Weight | 0 | 5 | 10 |
|---|---|---|---|---|
| Logging quality | 15% | Unstructured, no correlation | Structured JSON, missing fields | Full schema, trace correlation, PII scrubbing |
| Metrics coverage | 15% | No metrics | RED or USE, not both | RED + USE + business metrics + custom |
| Tracing completeness | 10% | No tracing | Key services | Full path, sampling strategy, tail-based |
| SLO maturity | 15% | No reliability targets | Informal targets | SLOs with error budgets, burn-rate alerts, weekly review |
| Alert quality | 15% | Noisy/missing | Actionable, some runbooks | SLO-based, full runbooks, low false positive |
| Incident response | 10% | Ad hoc | Defined process | Full process, roles, post-mortems, chaos engineering |
| Dashboard design | 10% | No dashboards | Basic panels | Hierarchical L1-L4, consistent, linked to alerts |
| Cost efficiency | 10% | Unknown cost | Tracked | Optimized, reviewed monthly, within budget |

  • 90-100: World-class. Teach others.
  • 70-89: Production-ready. Fill specific gaps.
  • 50-69: Functional but fragile.
  • <50: Significant reliability risk.


10 Observability Commandments

  1. Structured or it didn't happen — unstructured logs are technical debt
  2. Correlate everything — trace_id connects logs, traces, and metrics
  3. Alert on symptoms, not causes — users don't care about CPU, they care about latency
  4. Every alert gets a runbook — no runbook = no alert
  5. SLOs drive velocity — error budgets decide when to ship vs stabilize
  6. Dashboards have hierarchy — executives don't need pod CPU graphs
  7. Blameless post-mortems always — blame prevents learning
  8. Cost is a feature — observability that bankrupts you isn't observability
  9. You build it, you run it — the team that ships code owns its observability
  10. Practice failure — chaos engineering builds confidence

12 Natural Language Commands

| Command | What It Does |
|---|---|
| "Audit our observability" | Run the /16 health check, score each dimension, prioritize gaps |
| "Design logging for [service]" | Generate structured log schema with context fields for the service |
| "Set up metrics for [service]" | Create RED + USE + business metric instrumentation plan |
| "Create SLOs for [service]" | Define SLIs, targets, error budgets, and burn-rate alert rules |
| "Design alerts for [service]" | Create alert rules with severity, thresholds, and runbook templates |
| "Build dashboard for [service]" | Design L2 service overview dashboard with panel specifications |
| "Write a runbook for [alert]" | Generate structured runbook with diagnosis steps and fixes |
| "Run post-mortem for [incident]" | Generate blameless post-mortem document with timeline and action items |
| "Set up on-call for [team]" | Design rotation, escalation policy, handoff checklist |
| "Plan chaos experiment for [scenario]" | Design experiment with hypothesis, injection, abort conditions |
| "Optimize observability costs" | Audit current spend, identify top savings, create reduction plan |
| "Design tracing for [system]" | Create OpenTelemetry instrumentation plan with sampling strategy |

⚡ Level Up Your Observability

This skill gives you the methodology. For industry-specific implementation patterns:

🔗 More Free Skills by AfrexAI

  • afrexai-devops-engine — CI/CD, infrastructure, deployment strategies
  • afrexai-api-architect — API design, security, versioning
  • afrexai-database-engineering — Schema design, query optimization, migrations
  • afrexai-code-reviewer — Code review methodology with SPEAR framework
  • afrexai-prompt-engineering — System prompt design, testing, optimization

Browse all AfrexAI skills: clawhub.com | Full storefront

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.
