Toil Tracker

Find the manual work that's eating your engineering time. Toil is repetitive, automatable, tactical work that scales with service size and has no lasting value. Identify it, measure it, prioritize what to automate first, and track reduction over time.

Use when: "how much toil do we have", "what should we automate", "toil budget", "manual operational work", "repetitive tasks", "SRE toil reduction", or during quarterly planning to justify automation projects.

Commands

1. `survey` — Catalog Toil Sources

Step 1: Identify Toil Categories

Interview the team or analyze work tracking systems. Common toil categories:

| Category | Examples | Signal |
|----------|----------|--------|
| Deploys | Manual deploy steps, config changes, rollbacks | "Someone has to click..." |
| Tickets | Password resets, access requests, cert renewals | "Every week we get..." |
| Monitoring | False alerts, manual alert triage, dashboard watching | "We page about this but..." |
| Scaling | Manual capacity adjustments, resource provisioning | "When traffic spikes we..." |
| Data | Manual data fixes, migrations, backfills | "Users file tickets to..." |
| Maintenance | Dependency updates, cert rotations, key rotations | "Every quarter we have to..." |
| Onboarding | Setting up dev environments, granting access | "New hire setup takes..." |

Step 2: Quantify Each Toil Source

# Analyze ticket systems for repetitive patterns
# Jira/Linear — find recurring ticket types
# Example: count tickets by label/type in last quarter

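# Analyze on-call alerts for noise
# -g stops curl from treating the [] in statuses[] as a glob pattern.
# The API pages its results (limit=100 is the maximum page size), so a busy
# quarter needs further requests with an increasing offset.
curl -sg "https://api.pagerduty.com/incidents?since=2026-01-01&until=2026-04-01&statuses[]=resolved&limit=100" \
  -H "Authorization: Token token=$PD_TOKEN" | python3 -c "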
import json, sys, collections
incidents = json.load(sys.stdin)['incidents']
by_service = collections.Counter(i['service']['summary'] for i in incidents)
print('Incidents by service (potential toil):')
for service, count in by_service.most_common(10):
    print(f'  {count:>4}x  {service}')
"

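For the ticket-system side, one option is to export last quarter's tickets to CSV and count them by label. The sketch below assumes a hypothetical export with a `label` column; the column name and the sample rows are illustrative and do not reflect a specific Jira or Linear export format.

```python
import collections
import csv
import io


def count_tickets_by_label(csv_file, label_column="label"):
    """Count rows per label in a CSV export (the column name is an assumption)."""
    counts = collections.Counter()
    for row in csv.DictReader(csv_file):
        # Rows with an empty or missing label are grouped together.
        label = (row.get(label_column) or "").strip() or "(unlabeled)"
        counts[label] += 1
    return counts


# Inline sample standing in for a real export file.
sample_export = io.StringIO(
    "id,label,created\n"
    "1,access-request,2026-01-05\n"
    "2,access-request,2026-01-12\n"
    "3,cert-renewal,2026-02-01\n"
    "4,,2026-02-03\n"
)

ticket_counts = count_tickets_by_label(sample_export)
for label, count in ticket_counts.most_common():
    print(f"  {count:>4}x  {label}")
```

Labels that recur every week are the likeliest toil candidates; a large "(unlabeled)" bucket suggests the ticket data needs cleanup before it can be trusted.
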
For each toil source, estimate:

Frequency: How often does this happen? (daily, weekly, per-deploy)
Duration: How long does it take each time? (minutes, hours)
People involved: How many engineers touch this?
Scaling: Does it grow with service count, traffic, or team size?
Risk: What happens if someone does it wrong?
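
These estimates can be recorded in a small structure and converted to quarterly figures. The following is a minimal sketch, assuming a 13-week quarter; the `ToilEstimate` class and its field names are hypothetical, and the output keys match the ones the budget calculation in Step 3 reads.

```python
from dataclasses import dataclass

WEEKS_PER_QUARTER = 13  # assumption: a 13-week quarter


@dataclass
class ToilEstimate:
    task: str
    occurrences_per_week: float
    minutes_per_occurrence: float
    people_involved: int = 1
    scales_with: str = ""    # free-text note: service count, traffic, team size
    risk_if_wrong: str = ""  # free-text note on the cost of a manual mistake

    def to_budget_item(self):
        """Convert weekly estimates to per-quarter keys for the budget step."""
        return {
            "task": self.task,
            "frequency_per_quarter": self.occurrences_per_week * WEEKS_PER_QUARTER,
            "hours_per_occurrence": self.minutes_per_occurrence / 60,
            "people_involved": self.people_involved,
        }


access_requests = ToilEstimate(
    "Access requests", occurrences_per_week=20, minutes_per_occurrence=15
)
item = access_requests.to_budget_item()
quarterly_hours = (
    item["frequency_per_quarter"]
    * item["hours_per_occurrence"]
    * item["people_involved"]
)
print(f"{item['task']}: {quarterly_hours:.1f}h per quarter")
```

Scaling and risk stay as free-text notes here because they feed prioritization rather than the hour count.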

Step 3: Calculate Toil Budget

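def calculate_toil_budget(toil_items, team_size, hours_per_quarter=520):
    """
    Sum quarterly toil hours and compare them with team capacity.

    Each item needs 'frequency_per_quarter', 'hours_per_occurrence', and
    'people_involved'. The default of 520 hours is 13 weeks x 40 hours.
    Google SRE recommends: max 50% of SRE time on toil.
    """
    team_capacity = team_size * hours_per_quarter
    if team_capacity <= 0:
        raise ValueError('team_size and hours_per_quarter must be positive')

    total_toil_hours = 0
    for item in toil_items:
        # Hours are counted once per engineer who takes part in each occurrence.
        quarterly_hours = item['frequency_per_quarter'] * item['hours_per_occurrence'] * item['people_involved']
        item['quarterly_hours'] = quarterly_hours  # stored on the item for ranking
        total_toil_hours += quarterly_hours

    toil_percentage = (total_toil_hours / team_capacity) * 100

    # Under 30% is healthy, 30% to 50% needs watching, 50% or more is over budget.
    if toil_percentage < 30:
        status = '🟢 Healthy'
    elif toil_percentage < 50:
        status = '🟡 Watch'
    else:
        status = '🔴 Over budget'

    return {
        'total_toil_hours': total_toil_hours,
        'team_capacity_hours': team_capacity,
        'toil_percentage': toil_percentage,
        'status': status,
        'items_ranked': sorted(toil_items, key=lambda x: -x['quarterly_hours']),
    }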
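
A worked example can serve as a hand check on the arithmetic. The sketch below repeats the same formula inline so it runs on its own; the three items and the team of 6 are illustrative figures, not measurements.

```python
HOURS_PER_QUARTER = 520  # 13 weeks x 40 hours, the default used above
WEEKS_PER_QUARTER = 13

# Illustrative items using the same keys as the budget function.
toil_items = [
    {"task": "Access requests", "frequency_per_quarter": 260,
     "hours_per_occurrence": 0.25, "people_involved": 1},
    {"task": "Cert renewals", "frequency_per_quarter": 12,
     "hours_per_occurrence": 2.0, "people_involved": 1},
    {"task": "Customer data fixes", "frequency_per_quarter": 65,
     "hours_per_occurrence": 0.5, "people_involved": 1},
]
team_size = 6

total_toil_hours = sum(
    i["frequency_per_quarter"] * i["hours_per_occurrence"] * i["people_involved"]
    for i in toil_items
)
team_capacity = team_size * HOURS_PER_QUARTER
toil_percentage = total_toil_hours / team_capacity * 100
hours_per_person_week = total_toil_hours / team_size / WEEKS_PER_QUARTER

print(f"Total toil: {total_toil_hours:.1f}h of {team_capacity}h capacity")
print(f"Toil share: {toil_percentage:.1f}%")
print(f"Per person: {hours_per_person_week:.2f}h/week")
```

Reporting hours per person per week alongside the percentage tends to make the number easier to discuss in planning.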

Step 4: Generate Report

# Toil Report — Q2 2026

## Summary
- Team size: 6 SREs
- Total toil: 1,053h/quarter (13.5h/person/week)
- Toil budget: 34% of capacity 🟡 (target: <30%)

## Top Toil Sources (ranked by hours)
| Rank | Category | Task | Freq | Duration | Hours/Q | Automatable? |
|------|----------|------|------|----------|---------|-------------|
| 1 | Tickets | Access requests | 20/week | 15 min | 65h | ✅ Self-serve portal |
| 2 | Deploys | Manual prod deploy | 3/week | 45 min | 58.5h | ✅ CI/CD pipeline |
| 3 | Monitoring | False alert triage | 10/week | 20 min | 43h | ✅ Tune thresholds |
| 4 | Data | Customer data fixes | 5/week | 30 min | 32.5h | ✅ Admin tool |
| 5 | Maintenance | Cert renewals | 12/quarter | 2h | 24h | ✅ auto-renew |

## Automation ROI
| Project | Est. Effort | Toil Saved/Q | Payback |
|---------|------------|-------------|---------|
| Self-serve access portal | 80h | 65h | 1.2 quarters |
| CD pipeline | 120h | 58.5h | 2.1 quarters |
| Alert tuning sprint | 20h | 43h | 0.5 quarters |
| Admin data tool | 60h | 32.5h | 1.8 quarters |
| Auto cert renewal | 8h | 24h | 0.3 quarters |

## Recommendation
Start with auto cert renewal (fastest payback and lowest effort) and alert tuning (pays back in about half a quarter). Then tackle the self-serve access portal. Defer the CD pipeline to Q3 (high effort but high payoff).
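
The Payback column is the estimated effort divided by the hours saved per quarter. A minimal sketch that recomputes it from the figures in the Automation ROI table and orders the projects by payback (the function and variable names are illustrative):

```python
# (project, estimated effort in hours, toil saved per quarter in hours)
projects = [
    ("Self-serve access portal", 80, 65),
    ("CD pipeline", 120, 58.5),
    ("Alert tuning sprint", 20, 43),
    ("Admin data tool", 60, 32.5),
    ("Auto cert renewal", 8, 24),
]


def payback_quarters(effort_hours, saved_per_quarter):
    """Quarters until the automation effort is repaid by the hours saved."""
    if saved_per_quarter <= 0:
        return float("inf")  # never pays back
    return effort_hours / saved_per_quarter


ranked = sorted(projects, key=lambda p: payback_quarters(p[1], p[2]))
for name, effort, saved in ranked:
    print(f"{name:<26} {payback_quarters(effort, saved):.2f} quarters")
```

Sorting by payback alone ignores absolute hours saved, so a large project with a slow payback may still deserve a slot in a later quarter.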

2. `prioritize` — Rank Automation Candidates

Score each toil source by:

Hours saved per quarter (impact)
Automation effort (cost)
Risk of manual error (safety)
Growth rate (will it get worse?)

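Calculate ROI = hours_saved_per_quarter / automation_hours. This is the inverse of the payback period in quarters, so a higher value means the automation repays its effort sooner.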
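
One way to fold risk and growth into the ranking is to scale ROI by a multiplier. The sketch below is an assumption rather than a prescribed formula: the 1 to 5 rating scale, the weights, and the ratings given to the two sample projects are all hypothetical.

```python
def priority_score(hours_saved_per_quarter, automation_hours,
                   risk=1, growth=1, risk_weight=0.25, growth_weight=0.25):
    """ROI scaled up by risk and growth ratings (1 = low, 5 = high).

    The rating scale and both weights are illustrative assumptions.
    """
    if automation_hours <= 0:
        raise ValueError("automation_hours must be positive")
    roi = hours_saved_per_quarter / automation_hours
    # A rating of 1 leaves ROI unchanged; each extra point adds one weight step.
    multiplier = 1 + risk_weight * (risk - 1) + growth_weight * (growth - 1)
    return roi * multiplier


# Hypothetical ratings for two candidates.
candidates = {
    "Alert tuning sprint": priority_score(43, 20, risk=2, growth=3),
    "CD pipeline": priority_score(58.5, 120, risk=5, growth=4),
}

for name, score in sorted(candidates.items(), key=lambda kv: -kv[1]):
    print(f"{name:<22} score {score:.2f}")
```

With these ratings the multiplier lifts the CD pipeline's score well above its raw ROI, though not enough to overtake alert tuning.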

3. `track` — Monitor Toil Reduction Over Time

Compare toil hours quarter-over-quarter:

Total toil hours trending up or down?
Which automation projects delivered expected savings?
New toil sources appearing?
Toil percentage within SRE budget (< 50%, ideally under the 30% healthy threshold)?
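
The comparison can be sketched as a diff between two per-task snapshots. The function name, the snapshot shape (`{task: quarterly_hours}`), and the figures below are hypothetical.

```python
def compare_quarters(previous, current):
    """Return per-task hour changes between two {task: hours} snapshots."""
    changes = {}
    for task in sorted(set(previous) | set(current)):
        before = previous.get(task, 0.0)
        after = current.get(task, 0.0)
        changes[task] = {
            "before": before,
            "after": after,
            "delta": after - before,
            "new": task not in previous,  # toil source that appeared this quarter
        }
    return changes


# Illustrative snapshots for two consecutive quarters.
q1 = {"Access requests": 65.0, "Cert renewals": 24.0, "False alert triage": 43.0}
q2 = {"Access requests": 60.0, "Cert renewals": 2.0,
      "False alert triage": 30.0, "Log cleanup": 10.0}

report = compare_quarters(q1, q2)
total_delta = sum(q2.values()) - sum(q1.values())

for task, change in report.items():
    flag = " (new)" if change["new"] else ""
    print(f"{task:<20} {change['delta']:+.1f}h{flag}")
print(f"Total change: {total_delta:+.1f}h")
```

Comparing each task's delta with the savings its automation project promised shows which projects delivered.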
