error-budget-tracker

Track SLO error budgets across services. Calculate remaining budget from SLI metrics, alert on budget burn rate, recommend development vs reliability investment, and generate error budget reports for stakeholder review.

Safety Notice

This listing is from the official public ClawHub registry. Review SKILL.md and referenced scripts before running.

Install skill "error-budget-tracker" with this command: npx skills add charlie-morrison/error-budget-tracker

Error Budget Tracker

Make SLOs actionable. Track error budget consumption across services, calculate burn rates, predict when budgets will exhaust, and provide clear guidance on whether to ship features or invest in reliability — turning abstract availability targets into concrete engineering decisions.

Use when: "track error budget", "SLO status", "how much error budget is left", "should we freeze deploys", "reliability vs velocity", "SLI/SLO review", or during service review meetings.

Commands

1. track — Calculate Current Error Budget

Step 1: Define SLOs

# SLO definitions (store in repo as slo.yaml)
services:
  api-gateway:
    slos:
      - name: Availability
        target: 99.9%          # 43.2 min downtime budget per 30-day window
        sli: "1 - (sum(rate(http_requests_total{status=~'5..'}[5m])) / sum(rate(http_requests_total[5m])))"
        window: 30d             # Rolling 30-day window
      - name: Latency P99
        target: 99%             # 99% of requests under 500ms
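        sli: "sum(rate(http_request_duration_seconds_bucket{le='0.5'}[5m])) / sum(rate(http_request_duration_seconds_count[5m]))"  # fraction of requests under 500 ms; assumes the histogram has an le=0.5 bucket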
        window: 30d
  payment-service:
    slos:
      - name: Availability
        target: 99.95%         # 21.6 min downtime budget per 30-day window
        sli: "..."
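
The downtime figures in the comments above follow directly from the target and the window length. A minimal sketch of that arithmetic, assuming a time-based budget where the whole window counts as eligible uptime (the function name `downtime_budget_minutes` is illustrative, not part of the skill):

```python
def downtime_budget_minutes(target_pct: float, window_days: int = 30) -> float:
    """Minutes of full outage allowed by an availability target over a window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (1 - target_pct / 100)


for target in (99.9, 99.95):
    minutes = downtime_budget_minutes(target)
    print(f"{target}% over 30d: {minutes:.1f} min")
```

For a 30-day window this gives 43.2 minutes at 99.9% and 21.6 minutes at 99.95%.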

Step 2: Query Current SLI Values

# Prometheus — current availability over rolling window
curl -s "$PROMETHEUS_URL/api/v1/query" --data-urlencode \
  "query=1 - (sum(increase(http_requests_total{service='api-gateway',status=~'5..'}[30d])) / sum(increase(http_requests_total{service='api-gateway'}[30d])))" | \
  python3 -c "
import json, sys
result = json.load(sys.stdin)
if result['data']['result']:
    sli = float(result['data']['result'][0]['value'][1])
    slo = 0.999
    budget_total = 1 - slo  # 0.001 = 0.1% of requests may fail
    # Consumed share = observed error rate / allowed error rate (can exceed 100%)
    budget_consumed = (1 - sli) / budget_total * 100
    budget_remaining = max(0, 100 - budget_consumed)
    
    print(f'SLI (30d): {sli*100:.3f}%')
    print(f'SLO target: {slo*100:.1f}%')
    print(f'Error budget: {budget_remaining:.1f}% remaining')
    
    # Convert to minutes
    minutes_total = 30 * 24 * 60 * (1 - slo)  # 43.2 min for 99.9%
    minutes_used = minutes_total * (budget_consumed / 100)
    minutes_left = max(0, minutes_total - minutes_used)
    print(f'Budget in minutes: {minutes_left:.1f} min remaining of {minutes_total:.1f} min')
    
    status = '🟢' if budget_remaining > 50 else '🟡' if budget_remaining > 20 else '🔴'
    print(f'Status: {status}')
"

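The inline script clamps the remaining budget at zero, while the sample report in Step 4 shows an overspent service as a negative percentage. A minimal sketch of the same arithmetic as a reusable function that keeps the sign, assuming the SLI and SLO are passed as fractions (the name `error_budget` and the returned keys are illustrative):

```python
def error_budget(sli: float, slo: float) -> dict:
    """Budget consumed and remaining as percentages; remaining may go negative."""
    budget_total = 1 - slo                         # allowed error rate
    consumed_pct = (1 - sli) / budget_total * 100  # observed / allowed
    remaining_pct = 100 - consumed_pct
    # Same status thresholds as the inline script: >50 green, >20 yellow, else red
    status = "🟢" if remaining_pct > 50 else "🟡" if remaining_pct > 20 else "🔴"
    return {
        "consumed_pct": consumed_pct,
        "remaining_pct": remaining_pct,
        "status": status,
    }


result = error_budget(sli=0.9985, slo=0.999)
print(f"{result['remaining_pct']:.0f}% remaining {result['status']}")
```

With an SLI of 99.85% against a 99.9% target, the observed error rate is 1.5 times the allowed rate, so the function reports 150% consumed and -50% remaining.
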
Step 3: Calculate Burn Rate

def calculate_burn_rate(budget_consumed_pct, days_elapsed, window_days=30):
    """How fast is the error budget being consumed?"""
    daily_burn = budget_consumed_pct / max(days_elapsed, 1)
    days_remaining = (100 - budget_consumed_pct) / daily_burn if daily_burn > 0 else float('inf')
    
    # Burn rate relative to expected (even burn = 1.0)
    expected_daily = 100 / window_days
    burn_rate = daily_burn / expected_daily
    
    return {
        'daily_burn_pct': daily_burn,
        'burn_rate': burn_rate,  # 1.0 = on track, >1 = burning fast
        'days_until_exhaustion': days_remaining,
        'alert': 'CRITICAL' if burn_rate > 10 else 'HIGH' if burn_rate > 5 else 'WARNING' if burn_rate > 2 else 'OK'
    }
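
To turn the days-until-exhaustion figure into a calendar date for a report, the same burn arithmetic can be combined with `datetime`. This is a sketch under the assumption of a constant daily burn; the function name `project_exhaustion` and the example date are hypothetical:

```python
import math
from datetime import date, timedelta


def project_exhaustion(budget_consumed_pct, days_elapsed, today, window_days=30):
    """Project the date the budget runs out if the current daily burn continues."""
    daily_burn = budget_consumed_pct / max(days_elapsed, 1)
    if daily_burn <= 0:
        return {"burn_rate": 0.0, "exhaustion_date": None}
    burn_rate = daily_burn / (100 / window_days)  # 1.0 = even burn over the window
    days_left = max(0.0, 100 - budget_consumed_pct) / daily_burn
    # Round down to whole days so the projection errs on the early side
    exhaustion = today + timedelta(days=math.floor(days_left))
    return {"burn_rate": burn_rate, "exhaustion_date": exhaustion}


projection = project_exhaustion(50, 10, today=date(2026, 4, 10))
print(f"{projection['burn_rate']:.1f}x, exhausted on {projection['exhaustion_date']}")
```

In this example, 50% consumed in 10 days is 5% per day, which is 1.5 times the even-burn pace and leaves 10 days of budget.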

Step 4: Generate Report

# Error Budget Report — April 2026

## Executive Summary
- 3/5 services within budget ✅
- 1 service approaching exhaustion ⚠️
- 1 service budget exhausted 🔴 — deploy freeze recommended

## Service Status
| Service | SLO | SLI (30d) | Budget Left | Burn Rate | Action |
|---------|-----|-----------|-------------|-----------|--------|
| api-gateway | 99.9% | 99.978% | 78% 🟢 | 0.7× | Ship features |
| payment | 99.95% | 99.9675% | 35% 🟡 | 1.3× | Caution |
| search | 99.5% | 99.56% | 12% 🔴 | 2.8× | Reliability sprint |
| auth | 99.99% | 99.9995% | 95% 🟢 | 0.2× | Ship features |
| notifications | 99.9% | 99.85% | -50% 🔴 | 3.5× | Deploy freeze |

## Recommendations
### notifications (BUDGET EXHAUSTED)
- Freeze non-critical deploys until budget recovers
- Dedicate 1 engineer to reliability for 2 weeks
- Root cause: 3 incidents on Apr 12, 18, 23 consumed 150% of budget
- Projected recovery: 12 days if no further incidents

### search (LOW BUDGET)
- Defer risky refactors until next month
- At the current 2.8× burn rate (about 9.3% of budget per day), the remaining 12% is exhausted in roughly 1.3 days
- Root cause: elevated latency from new search index migration
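
The Service Status rows can be rendered from computed values rather than written by hand. A minimal sketch, reusing the status thresholds from Step 2 (the helper names `status_icon` and `render_row` are illustrative, not part of the skill):

```python
def status_icon(remaining_pct: float) -> str:
    """Same thresholds as Step 2: >50 green, >20 yellow, otherwise red."""
    return "🟢" if remaining_pct > 50 else "🟡" if remaining_pct > 20 else "🔴"


def render_row(service, slo_pct, sli_pct, remaining_pct, burn_rate, action):
    """Format one row of the Service Status table."""
    return (
        f"| {service} | {slo_pct}% | {sli_pct}% "
        f"| {remaining_pct:.0f}% {status_icon(remaining_pct)} "
        f"| {burn_rate:.1f}× | {action} |"
    )


print(render_row("notifications", 99.9, 99.85, -50, 3.5, "Deploy freeze"))
```

Called with the notifications values from the table above, this reproduces that row.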

2. alert — Set Up Budget Burn Alerts

Generate multi-window burn rate alerts, following the approach in Google's Site Reliability Workbook. The burn rates below assume a 30-day SLO window:

  • 2% budget consumed in 1 hour → page (14.4× burn rate)
  • 5% budget consumed in 6 hours → page (6× burn rate)
  • 10% budget consumed in 3 days → ticket (1× burn rate)
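
Each burn rate in the list follows from the budget share and the alert window: burn rate = budget share × SLO window / alert window. A short sketch of that derivation, assuming a 30-day SLO window (the function names are illustrative):

```python
def burn_rate_for(budget_fraction: float, alert_window_hours: float,
                  slo_window_days: int = 30) -> float:
    """Burn rate that consumes `budget_fraction` of the budget in the alert window."""
    return budget_fraction * (slo_window_days * 24) / alert_window_hours


def error_rate_threshold(burn_rate: float, slo: float) -> float:
    """Error rate at which the given burn rate is reached for this SLO."""
    return burn_rate * (1 - slo)


for fraction, hours in ((0.02, 1), (0.05, 6), (0.10, 72)):
    rate = burn_rate_for(fraction, hours)
    threshold = error_rate_threshold(rate, slo=0.999)
    print(f"{fraction:.0%} in {hours}h: {rate:.1f}x burn, error rate > {threshold:.4%}")
```

For a 99.9% SLO, the 14.4× page threshold corresponds to an error rate above 1.44% sustained over the one-hour window.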

3. policy — Generate Error Budget Policy

Create a formal error budget policy document:

  • What happens at each budget threshold (100%, 75%, 50%, 25%, 0%)
  • Who has authority to freeze deploys
  • How to request budget exceptions
  • How budget resets (rolling window vs calendar month)
  • How to adjust SLOs based on historical data
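
A policy with thresholds like these can be encoded as a small lookup so reports and tooling apply it consistently. The sketch below is only an illustration: the action text for each band is hypothetical and would come from the team's own policy document.

```python
# (minimum remaining budget %, action) pairs, checked from highest to lowest.
# The actions are hypothetical placeholders, not prescribed by the skill.
POLICY_BANDS = [
    (75, "Ship features normally"),
    (50, "Ship features, review risky changes"),
    (25, "Prioritize reliability work"),
    (0, "Limit deploys to low-risk changes"),
]
EXHAUSTED_ACTION = "Freeze non-critical deploys"


def policy_action(remaining_pct: float) -> str:
    """Return the action for the first band whose floor the budget exceeds."""
    for floor, action in POLICY_BANDS:
        if remaining_pct > floor:
            return action
    return EXHAUSTED_ACTION  # 0% or below: budget exhausted


for remaining in (95, 35, 12, -50):
    print(f"{remaining}% remaining: {policy_action(remaining)}")
```

Keeping the bands in one data structure makes threshold changes a one-line edit that the policy document and the tooling can share.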
