slo-implementation

Framework for defining and implementing Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets.

Install skill "slo-implementation" with this command: npx skills add anton-abyzov/specweave/anton-abyzov-specweave-slo-implementation

SLO Implementation

Purpose

Implement measurable reliability targets using SLIs, SLOs, and error budgets to balance reliability with innovation velocity.

When to Use

  • Define service reliability targets

  • Measure user-perceived reliability

  • Implement error budgets

  • Create SLO-based alerts

  • Track reliability goals

SLI/SLO/SLA Hierarchy

SLA (Service Level Agreement)
  ↓ Contract with customers
SLO (Service Level Objective)
  ↓ Internal reliability target
SLI (Service Level Indicator)
  ↓ Actual measurement

Defining SLIs

Common SLI Types

  1. Availability SLI

Successful requests / Total requests

sum(rate(http_requests_total{status!~"5.."}[28d])) / sum(rate(http_requests_total[28d]))

  2. Latency SLI

Requests below latency threshold / Total requests

sum(rate(http_request_duration_seconds_bucket{le="0.5"}[28d])) / sum(rate(http_request_duration_seconds_count[28d]))

  3. Durability SLI

Successful writes / Total writes

sum(storage_writes_successful_total) / sum(storage_writes_total)
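The three ratios above share one shape: good events divided by total events. A minimal Python sketch of that calculation, assuming the counts have already been pulled from a metrics backend. The function name `sli_ratio` and the sample counts are hypothetical, and treating an empty window as fully compliant is a choice made here, not something the source specifies.

```python
def sli_ratio(good_events: float, total_events: float) -> float:
    """Return the fraction of good events in a measurement window."""
    if good_events < 0 or total_events < 0:
        raise ValueError("event counts must be non-negative")
    if good_events > total_events:
        raise ValueError("good_events cannot exceed total_events")
    if total_events == 0:
        # Assumption: no traffic means nothing failed, so report 100%.
        return 1.0
    return good_events / total_events


# Hypothetical counts for a 28-day window.
availability = sli_ratio(good_events=999_500, total_events=1_000_000)
latency = sli_ratio(good_events=992_000, total_events=1_000_000)

print(f"availability SLI: {availability:.4%}")
print(f"latency SLI:      {latency:.4%}")
```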

Setting SLO Targets

Availability SLO Examples

SLO %     Downtime/Month (30 days)   Downtime/Year (365 days)
99%       7.2 hours                  3.65 days
99.9%     43.2 minutes               8.76 hours
99.95%    21.6 minutes               4.38 hours
99.99%    4.32 minutes               52.56 minutes
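The downtime figures follow directly from the target: allowed downtime is the period length multiplied by (1 - SLO). A short Python sketch that reproduces the table, assuming a 30-day month and a 365-day year (the helper name `allowed_downtime` is illustrative only):

```python
from datetime import timedelta


def allowed_downtime(slo_percent: float, period: timedelta) -> timedelta:
    """Return the downtime permitted by an availability SLO over a period."""
    if not 0 < slo_percent <= 100:
        raise ValueError("slo_percent must be in (0, 100]")
    return period * (1 - slo_percent / 100)


MONTH = timedelta(days=30)   # assumption: 30-day month
YEAR = timedelta(days=365)   # assumption: non-leap year

for target in (99.0, 99.9, 99.95, 99.99):
    per_month = allowed_downtime(target, MONTH).total_seconds() / 60
    per_year = allowed_downtime(target, YEAR).total_seconds() / 3600
    print(f"{target}%: {per_month:.2f} min/month, {per_year:.2f} h/year")
```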

Choose Appropriate SLOs

Consider:

  • User expectations

  • Business requirements

  • Current performance

  • Cost of reliability

  • Competitor benchmarks

Example SLOs:

slos:
  - name: api_availability
    target: 99.9
    window: 28d
    sli: |
      sum(rate(http_requests_total{status!~"5.."}[28d]))
      /
      sum(rate(http_requests_total[28d]))

  - name: api_latency_p95
    target: 99
    window: 28d
    sli: |
      sum(rate(http_request_duration_seconds_bucket{le="0.5"}[28d]))
      /
      sum(rate(http_request_duration_seconds_count[28d]))

Error Budget Calculation

Error Budget Formula

Error Budget = 1 - SLO Target

Example:

  • SLO: 99.9% availability

  • Error Budget: 0.1% = 43.2 minutes/month

  • Current Error: 0.05% = 21.6 minutes/month

  • Remaining Budget: 50%
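The arithmetic in this example can be checked with a few lines of Python. This is a sketch under the same numbers as above (99.9% target, 0.05% observed error rate, 30-day month); the function name and the returned field names are hypothetical.

```python
def error_budget_status(slo_target: float, observed_error_rate: float,
                        period_minutes: float = 30 * 24 * 60) -> dict:
    """Summarize error budget usage; both rates are fractions (0.999, 0.0005)."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    budget = 1 - slo_target                      # Error Budget = 1 - SLO Target
    consumed = observed_error_rate / budget      # share of the budget used
    return {
        "budget_minutes": budget * period_minutes,
        "consumed_minutes": observed_error_rate * period_minutes,
        "remaining_percent": (1 - consumed) * 100,
    }


status = error_budget_status(slo_target=0.999, observed_error_rate=0.0005)
print(status)
```

A negative `remaining_percent` means the budget has been overspent for the period.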

Error Budget Policy

error_budget_policy:
  - remaining_budget: 100%
    action: Normal development velocity
  - remaining_budget: 50%
    action: Consider postponing risky changes
  - remaining_budget: 10%
    action: Freeze non-critical changes
  - remaining_budget: 0%
    action: Feature freeze, focus on reliability
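One way to apply this policy programmatically is a threshold lookup. The sketch below assumes each entry means "at or below this much remaining budget", which the policy does not state explicitly; the names `POLICY` and `policy_action` are hypothetical.

```python
# Thresholds mirror the error_budget_policy entries above: (remaining %, action),
# ordered from most to least severe.
POLICY = [
    (0.0, "Feature freeze, focus on reliability"),
    (10.0, "Freeze non-critical changes"),
    (50.0, "Consider postponing risky changes"),
    (100.0, "Normal development velocity"),
]


def policy_action(remaining_percent: float) -> str:
    """Return the action for the lowest threshold that covers the remaining budget."""
    for threshold, action in POLICY:
        if remaining_percent <= threshold:
            return action
    # Values above 100% should not occur; fall back to the least restrictive action.
    return POLICY[-1][1]


for remaining in (65.0, 50.0, 8.0, -3.0):
    print(f"{remaining:>6.1f}% remaining -> {policy_action(remaining)}")
```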

SLO Implementation

Prometheus Recording Rules

SLI Recording Rules

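groups:
  - name: sli_rules
    interval: 30s
    rules:
      # Availability SLI
      - record: sli:http_availability:ratio
        expr: |
          sum(rate(http_requests_total{status!~"5.."}[28d]))
          /
          sum(rate(http_requests_total[28d]))

      # Latency SLI (requests <= 500ms)
      - record: sli:http_latency:ratio
        expr: |
          sum(rate(http_request_duration_seconds_bucket{le="0.5"}[28d]))
          /
          sum(rate(http_request_duration_seconds_count[28d]))

  - name: slo_rules
    interval: 5m
    rules:
      # SLO compliance (1 = meeting SLO, 0 = violating)
      - record: slo:http_availability:compliance
        expr: sli:http_availability:ratio >= bool 0.999

      - record: slo:http_latency:compliance
        expr: sli:http_latency:ratio >= bool 0.99

      # Error budget remaining (percentage)
      - record: slo:http_availability:error_budget_remaining
        expr: |
          (sli:http_availability:ratio - 0.999) / (1 - 0.999) * 100

      # Error budget burn rate (1 = consuming budget exactly at the sustainable rate)
      - record: slo:http_availability:burn_rate_5m
        expr: |
          (1 - (
            sum(rate(http_requests_total{status!~"5.."}[5m]))
            /
            sum(rate(http_requests_total[5m]))
          )) / (1 - 0.999)

      # The alerting rules also reference 30m, 1h, and 6h burn rates.
      # They follow the same pattern over longer windows.
      - record: slo:http_availability:burn_rate_30m
        expr: |
          (1 - (
            sum(rate(http_requests_total{status!~"5.."}[30m]))
            /
            sum(rate(http_requests_total[30m]))
          )) / (1 - 0.999)

      - record: slo:http_availability:burn_rate_1h
        expr: |
          (1 - (
            sum(rate(http_requests_total{status!~"5.."}[1h]))
            /
            sum(rate(http_requests_total[1h]))
          )) / (1 - 0.999)

      - record: slo:http_availability:burn_rate_6h
        expr: |
          (1 - (
            sum(rate(http_requests_total{status!~"5.."}[6h]))
            /
            sum(rate(http_requests_total[6h]))
          )) / (1 - 0.999)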

SLO Alerting Rules

groups:
  - name: slo_alerts
    interval: 1m
    rules:
      # Fast burn: 14.4x rate, 1 hour window
      # Consumes 2% error budget in 1 hour
      - alert: SLOErrorBudgetBurnFast
        expr: |
          slo:http_availability:burn_rate_1h > 14.4
          and
          slo:http_availability:burn_rate_5m > 14.4
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Fast error budget burn detected"
          description: "Error budget burning at {{ $value }}x rate"

      # Slow burn: 6x rate, 6 hour window
      # Consumes 5% error budget in 6 hours
      - alert: SLOErrorBudgetBurnSlow
        expr: |
          slo:http_availability:burn_rate_6h > 6
          and
          slo:http_availability:burn_rate_30m > 6
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Slow error budget burn detected"
          description: "Error budget burning at {{ $value }}x rate"

      # Error budget exhausted
      - alert: SLOErrorBudgetExhausted
        expr: slo:http_availability:error_budget_remaining < 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "SLO error budget exhausted"
          description: "Error budget remaining: {{ $value }}%"
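The 14.4x and 6x thresholds come from a simple relation: the share of budget consumed equals burn rate times alert window divided by the SLO period. The Python sketch below checks the "2% in 1 hour" and "5% in 6 hours" figures, assuming a 30-day SLO period. That period is an assumption: the recording rules in this document use a 28-day window, under which the same thresholds consume slightly more budget. Function names are hypothetical.

```python
def budget_consumed(burn_rate: float, window_hours: float,
                    period_days: float = 30) -> float:
    """Fraction of the error budget consumed at a burn rate over a window."""
    return burn_rate * window_hours / (period_days * 24)


def burn_rate_threshold(budget_fraction: float, window_hours: float,
                        period_days: float = 30) -> float:
    """Burn rate that consumes budget_fraction of the budget within the window."""
    return budget_fraction * period_days * 24 / window_hours


print(f"fast burn: {budget_consumed(14.4, 1):.2%} of budget in 1 hour")
print(f"slow burn: {budget_consumed(6, 6):.2%} of budget in 6 hours")
print(f"28-day period, fast burn: {budget_consumed(14.4, 1, period_days=28):.2%}")
```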

SLO Dashboard

Grafana Dashboard Structure:

┌────────────────────────────────────┐
│ SLO Compliance (Current)           │
│ ✓ 99.95% (Target: 99.9%)           │
├────────────────────────────────────┤
│ Error Budget Remaining: 65%        │
│ ████████░░ 65%                     │
├────────────────────────────────────┤
│ SLI Trend (28 days)                │
│ [Time series graph]                │
├────────────────────────────────────┤
│ Burn Rate Analysis                 │
│ [Burn rate by time window]         │
└────────────────────────────────────┘

Example Queries:

# Current SLO compliance (SLI as a percentage, to compare against the 99.9% target)
sli:http_availability:ratio * 100

# Error budget remaining
slo:http_availability:error_budget_remaining

# Days until error budget exhausted (at the burn rate observed over the 28-day window)
(slo:http_availability:error_budget_remaining / 100) * 28 / ((1 - sli:http_availability:ratio) / (1 - 0.999))
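The exhaustion estimate divides the remaining share of the budget, scaled to the 28-day window, by the burn rate. A hedged Python sketch of the same arithmetic follows; the function name and sample inputs are hypothetical, and unlike the query it handles the zero-error and already-exhausted cases explicitly.

```python
import math


def days_until_exhausted(remaining_percent: float, sli_ratio: float,
                         slo_target: float = 0.999,
                         window_days: float = 28) -> float:
    """Estimate days until the error budget runs out at the observed burn rate."""
    if remaining_percent <= 0:
        return 0.0                      # budget already spent
    error_rate = 1 - sli_ratio
    if error_rate <= 0:
        return math.inf                 # no errors observed, budget is not burning
    burn_rate = error_rate / (1 - slo_target)
    return (remaining_percent / 100) * window_days / burn_rate


# Half the budget left while burning at half the sustainable rate.
estimate = days_until_exhausted(remaining_percent=50, sli_ratio=0.9995)
print(f"estimated days until exhaustion: {estimate:.1f}")
```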

Multi-Window Burn Rate Alerts

# Combination of short and long windows reduces false positives
rules:
  - alert: SLOBurnRateHigh
    expr: |
      (
        slo:http_availability:burn_rate_1h > 14.4
        and
        slo:http_availability:burn_rate_5m > 14.4
      )
      or
      (
        slo:http_availability:burn_rate_6h > 6
        and
        slo:http_availability:burn_rate_30m > 6
      )
    labels:
      severity: critical
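The logic of this rule can be unit-tested outside Prometheus. The sketch below mirrors the expression in Python: an alert fires only when both the long and the short window of a pair exceed the threshold, so a brief spike that has already subsided does not page. The function name and the dictionary keys are hypothetical.

```python
from typing import Mapping


def burn_rate_alert(burn_rates: Mapping[str, float]) -> bool:
    """Return True when either window pair exceeds its burn-rate threshold."""
    fast = burn_rates["1h"] > 14.4 and burn_rates["5m"] > 14.4
    slow = burn_rates["6h"] > 6 and burn_rates["30m"] > 6
    return fast or slow


# A spike that already ended: the 1h rate is still high, the 5m rate has recovered.
recovered_spike = {"5m": 0.8, "30m": 5.0, "1h": 16.0, "6h": 3.0}
# A sustained incident: both fast windows are above the threshold.
ongoing_incident = {"5m": 20.0, "30m": 18.0, "1h": 15.0, "6h": 4.0}

print(burn_rate_alert(recovered_spike))
print(burn_rate_alert(ongoing_incident))
```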

SLO Review Process

Weekly Review

  • Current SLO compliance

  • Error budget status

  • Trend analysis

  • Incident impact

Monthly Review

  • SLO achievement

  • Error budget usage

  • Incident postmortems

  • SLO adjustments

Quarterly Review

  • SLO relevance

  • Target adjustments

  • Process improvements

  • Tooling enhancements

Best Practices

  • Start with user-facing services

  • Use multiple SLIs (availability, latency, etc.)

  • Set achievable SLOs (don't aim for 100%)

  • Implement multi-window alerts to reduce noise

  • Track error budget consistently

  • Review SLOs regularly

  • Document SLO decisions

  • Align with business goals

  • Automate SLO reporting

  • Use SLOs for prioritization

Related Skills

  • prometheus-configuration: for metric collection

  • grafana-dashboards: for SLO visualization
