Operating Production Services

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Install skill "operating-production-services" with this command: npx skills add mjunaidca/mjs-agent-skills/mjunaidca-mjs-agent-skills-operating-production-services

Production reliability patterns: measure what matters, learn from failures, improve systematically.

Quick Reference

| Need | Go To |
|------|-------|
| Define reliability targets | SLOs & Error Budgets |
| Write incident report | Postmortem Templates |
| Set up SLO alerting | references/slo-alerting.md |

SLOs & Error Budgets

The Hierarchy

SLA (Contract) → SLO (Target) → SLI (Measurement)

Common SLIs

Availability: successful requests / total requests

sum(rate(http_requests_total{status!~"5.."}[28d])) / sum(rate(http_requests_total[28d]))

Latency: requests below threshold / total requests

sum(rate(http_request_duration_seconds_bucket{le="0.5"}[28d])) / sum(rate(http_request_duration_seconds_count[28d]))
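Both SLIs above are ratios of good events to total events. As a minimal sketch of the same arithmetic outside Prometheus, assuming you already have raw counts for the window (the function names and the choice to treat zero traffic as fully compliant are illustrative assumptions, not part of the skill):

```python
def availability_sli(total_requests: int, server_errors: int) -> float:
    """Fraction of requests that did not end in a 5xx response."""
    if total_requests == 0:
        # Assumption: a window with no traffic counts as fully available.
        return 1.0
    return (total_requests - server_errors) / total_requests


def latency_sli(total_requests: int, requests_within_threshold: int) -> float:
    """Fraction of requests served at or below the latency threshold."""
    if total_requests == 0:
        # Assumption: a window with no traffic counts as fully compliant.
        return 1.0
    return requests_within_threshold / total_requests


if __name__ == "__main__":
    # Hypothetical counts for one 28-day window.
    print(f"availability: {availability_sli(1_000_000, 500):.4%}")
    print(f"latency:      {latency_sli(1_000_000, 950_000):.4%}")
```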

SLO Targets Reality Check

| SLO % | Downtime/Month | Downtime/Year |
|-------|----------------|---------------|
| 99% | 7.2 hours | 3.65 days |
| 99.9% | 43 minutes | 8.76 hours |
| 99.95% | 22 minutes | 4.38 hours |
| 99.99% | 4.3 minutes | 52 minutes |

Don't aim for 100%. Each additional nine cuts the allowed downtime by a factor of ten, and the engineering cost of reaching it rises steeply.
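The downtime figures above follow from multiplying the period length by the allowed failure fraction. A small sketch that reproduces them, assuming a 30-day month and a 365-day year (the function and constant names are illustrative):

```python
from datetime import timedelta

MONTH = timedelta(days=30)   # assumption: 30-day month
YEAR = timedelta(days=365)   # assumption: non-leap year


def allowed_downtime(slo_percent: float, period: timedelta) -> timedelta:
    """Downtime permitted in `period` while still meeting the SLO."""
    if not 0 < slo_percent <= 100:
        raise ValueError("slo_percent must be in (0, 100]")
    return period * (1 - slo_percent / 100)


if __name__ == "__main__":
    for slo in (99.0, 99.9, 99.95, 99.99):
        per_month = allowed_downtime(slo, MONTH).total_seconds() / 60
        per_year = allowed_downtime(slo, YEAR).total_seconds() / 3600
        print(f"{slo}%: {per_month:.1f} min/month, {per_year:.2f} h/year")
```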

Error Budget

Error Budget = 1 - SLO Target

Example: 99.9% SLO = 0.1% error budget = 43 minutes/month
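To track how much of the budget is left, one option is to compare failed requests against the number of failures the SLO allows. This is a request-based sketch with hypothetical function names; a time-based budget (minutes of downtime) would follow the same pattern:

```python
def error_budget(slo_target: float) -> float:
    """Error budget as a fraction, e.g. 0.999 -> roughly 0.001."""
    return 1 - slo_target


def budget_remaining(slo_target: float, total_requests: int,
                     failed_requests: int) -> float:
    """Fraction of the error budget still unspent.

    1.0 means untouched, 0.0 means exhausted, negative means overspent.
    """
    allowed_failures = error_budget(slo_target) * total_requests
    if allowed_failures == 0:
        # Assumption: with no budget (100% SLO or no traffic), report zero.
        return 0.0
    return 1 - failed_requests / allowed_failures


if __name__ == "__main__":
    # Hypothetical window: 1,000,000 requests, 400 failures, 99.9% SLO.
    print(f"{budget_remaining(0.999, 1_000_000, 400):.0%} of budget remaining")
```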

Policy:

| Budget Remaining | Action |
|------------------|--------|
| > 50% | Normal velocity |
| 10-50% | Postpone risky changes |
| < 10% | Freeze non-critical changes |
| 0% | Feature freeze, fix reliability |
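The policy can be encoded as a simple lookup so that dashboards or deploy gates apply it consistently. This sketch assumes the top band means "above 50%", that exactly 50% and exactly 10% fall in the "postpone" band, and that an overspent (negative) budget is treated like 0%; adjust the boundaries to match your own policy:

```python
def budget_policy(remaining: float) -> str:
    """Map the remaining error budget (as a fraction) to the policy action."""
    if remaining <= 0:
        return "Feature freeze, fix reliability"
    if remaining < 0.10:
        return "Freeze non-critical changes"
    if remaining <= 0.50:
        return "Postpone risky changes"
    return "Normal velocity"


if __name__ == "__main__":
    for remaining in (0.80, 0.30, 0.05, 0.0):
        print(f"{remaining:.0%} remaining -> {budget_policy(remaining)}")
```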

See references/slo-alerting.md for Prometheus recording rules and multi-window burn rate alerts.
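For orientation, burn rate is the observed error ratio divided by the error budget: a burn rate of 1 spends the budget exactly over the SLO period. A multi-window alert fires only when both a long and a short window exceed the threshold. The sketch below is illustrative and is not taken from references/slo-alerting.md; the default threshold of 14.4 is a commonly cited value for a 1-hour window on a 30-day SLO and is an assumption here:

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    budget = 1 - slo_target
    if budget <= 0:
        raise ValueError("an SLO of 100% leaves no error budget to burn")
    return error_ratio / budget


def should_page(long_window_error_ratio: float,
                short_window_error_ratio: float,
                slo_target: float,
                threshold: float = 14.4) -> bool:
    """Fire only if both windows burn faster than the threshold.

    The long window shows the burn is significant; the short window
    shows it is still happening, which lets the alert reset quickly.
    """
    return (burn_rate(long_window_error_ratio, slo_target) >= threshold
            and burn_rate(short_window_error_ratio, slo_target) >= threshold)


if __name__ == "__main__":
    # Hypothetical error ratios: 2% over the long window, 3% over the short one.
    print(should_page(0.02, 0.03, slo_target=0.999))
```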

Postmortem Templates

The Blameless Principle

| Blame-Focused | Blameless |
|---------------|-----------|
| "Who caused this?" | "What conditions allowed this?" |
| Punish individuals | Improve systems |
| Hide information | Share learnings |

When to Write Postmortems

  • SEV1/SEV2 incidents

  • Customer-facing outages > 15 minutes

  • Data loss or security incidents

  • Near-misses that could have been severe

  • Novel failure modes

Standard Template

Postmortem: [Incident Title]

Date: YYYY-MM-DD | Duration: X min | Severity: SEVX

Executive Summary

One paragraph: what happened, impact, root cause, resolution.

Timeline (UTC)

| Time | Event |
|------|-------|
| HH:MM | First alert fired |
| HH:MM | On-call acknowledged |
| HH:MM | Root cause identified |
| HH:MM | Fix deployed |
| HH:MM | Service recovered |

Root Cause Analysis

5 Whys

  1. Why did service fail? → [Answer]
  2. Why did [1] happen? → [Answer]
  3. Why did [2] happen? → [Answer]
  4. Why did [3] happen? → [Answer]
  5. Why did [4] happen? → [Root cause]

Impact

  • Customers affected: X
  • Duration: X minutes
  • Revenue impact: $X
  • Support tickets: X

Action Items

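| Priority | Action | Owner | Due | Ticket |
|----------|--------|-------|-----|--------|
| P0 | [Immediate fix] | @name | Date | XXX-123 |
| P1 | [Prevent recurrence] | @name | Date | XXX-124 |
| P2 | [Improve detection] | @name | Date | XXX-125 |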
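The Duration field in the template header can be derived from the Timeline entries rather than estimated. A minimal sketch, assuming HH:MM timestamps in UTC and an incident that crosses midnight at most once (the timestamps shown are made-up placeholders):

```python
from datetime import datetime, timedelta


def minutes_between(start: str, end: str) -> int:
    """Whole minutes from `start` to `end`, both given as HH:MM."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    if delta < timedelta(0):
        # Assumption: the incident crossed midnight exactly once.
        delta += timedelta(days=1)
    return int(delta.total_seconds() // 60)


if __name__ == "__main__":
    # Hypothetical timeline entries.
    timeline = {
        "First alert fired": "14:02",
        "On-call acknowledged": "14:06",
        "Service recovered": "14:47",
    }
    first_alert = timeline["First alert fired"]
    print("Time to acknowledge:",
          minutes_between(first_alert, timeline["On-call acknowledged"]), "min")
    print("Duration:",
          minutes_between(first_alert, timeline["Service recovered"]), "min")
```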

Quick Template (Minor Incidents)

Quick Postmortem: [Title]

Date: YYYY-MM-DD | Duration: X min | Severity: SEV3

What Happened

One sentence description.

Timeline

  • HH:MM - Trigger
  • HH:MM - Detection
  • HH:MM - Resolution

Root Cause

One sentence.

Fix

  • Immediate: [What was done]
  • Long-term: [Ticket XXX-123]

Postmortem Meeting Guide

Structure (60 min)

  • Opening (5 min) - Remind: "We're here to learn, not blame"

  • Timeline (15 min) - Walk through events chronologically

  • Analysis (20 min) - What failed? Why? What allowed it?

  • Action Items (15 min) - Prioritize, assign owners, set dates

  • Closing (5 min) - Summarize learnings, confirm owners

Facilitation Tips

  • Redirect blame to systems: "What made this mistake possible?"

  • Time-box tangents

  • Document dissenting views

  • Encourage quiet participants

Anti-Patterns

| Don't | Do Instead |
|-------|------------|
| Aim for 100% SLO | Accept error budget exists |
| Skip small incidents | Small incidents reveal patterns |
| Orphan action items | Every item needs owner + date + ticket |
| Blame individuals | Ask "what conditions allowed this?" |
| Create busywork actions | Actions should prevent recurrence |
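The "owner + date + ticket" rule is easy to check mechanically before a postmortem is closed. The following is a hypothetical sketch (the ActionItem class and its field names are assumptions for illustration, and it needs Python 3.9+); it is not the repository's scripts/verify.py:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ActionItem:
    priority: str
    action: str
    owner: Optional[str] = None
    due: Optional[str] = None
    ticket: Optional[str] = None


def missing_fields(item: ActionItem) -> list[str]:
    """Names of the required fields that are empty or unset."""
    return [name for name in ("owner", "due", "ticket")
            if not getattr(item, name)]


def find_orphans(items: list[ActionItem]) -> list[tuple[str, list[str]]]:
    """Pairs of (action, missing field names) for every incomplete item."""
    return [(item.action, missing_fields(item))
            for item in items if missing_fields(item)]


if __name__ == "__main__":
    # Hypothetical action items; the second one has no due date or ticket.
    items = [
        ActionItem("P0", "Roll back config", "@alice", "2024-01-15", "XXX-123"),
        ActionItem("P1", "Add canary stage", owner="@bob"),
    ]
    for action, missing in find_orphans(items):
        print(f"Orphan action item '{action}' is missing: {', '.join(missing)}")
```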

Verification

Run: python scripts/verify.py

References

  • references/slo-alerting.md - Prometheus rules, burn rate alerts, Grafana dashboards
