Toil Tracker
Find the manual work that's eating your engineering time. Toil is repetitive, automatable, tactical work that scales with service size and has no lasting value. Identify it, measure it, prioritize what to automate first, and track reduction over time.
Use when: "how much toil do we have", "what should we automate", "toil budget", "manual operational work", "repetitive tasks", "SRE toil reduction", or during quarterly planning to justify automation projects.
Commands
1. survey — Catalog Toil Sources
Step 1: Identify Toil Categories
Interview the team or analyze work tracking systems. Common toil categories:
| Category | Examples | Signal |
|---|---|---|
| Deploys | Manual deploy steps, config changes, rollbacks | "Someone has to click..." |
| Tickets | Password resets, access requests, cert renewals | "Every week we get..." |
| Monitoring | False alerts, manual alert triage, dashboard watching | "We page about this but..." |
| Scaling | Manual capacity adjustments, resource provisioning | "When traffic spikes we..." |
| Data | Manual data fixes, migrations, backfills | "Users file tickets to..." |
| Maintenance | Dependency updates, cert rotations, key rotations | "Every quarter we have to..." |
| Onboarding | Setting up dev environments, granting access | "New hire setup takes..." |
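One way to analyze a work tracking system for these categories is to count tickets per label in an export. The sketch below is illustrative only: the CSV layout, the `labels` column, the `;` separator, and the function name `count_recurring_tickets` are all assumptions, not part of any Jira or Linear API.

```python
import csv
import io
from collections import Counter


def count_recurring_tickets(csv_text, label_column="labels", min_count=3):
    """Count tickets per label and keep labels seen at least min_count times."""
    counts = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Assumed format: a ticket may carry several labels separated by ';'.
        for label in row[label_column].split(";"):
            label = label.strip().lower()
            if label:
                counts[label] += 1
    # most_common() sorts by count, highest first.
    return [(label, n) for label, n in counts.most_common() if n >= min_count]


# Hypothetical export; real exports will have different columns.
sample_export = """key,summary,labels
OPS-1,Reset password for alice,password-reset
OPS-2,Grant repo access,access-request
OPS-3,Reset password for bob,password-reset
OPS-4,Renew TLS cert,cert-renewal
OPS-5,Reset password for carol,password-reset; access-request
OPS-6,Grant db access,access-request
"""

recurring = count_recurring_tickets(sample_export, min_count=2)
for label, n in recurring:
    print(f"{n:>4}x {label}")
```

Labels that recur above the threshold are candidates for the Tickets row in the table above. One-off labels are usually not toil.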
Step 2: Quantify Each Toil Source
# Analyze ticket systems for repetitive patterns
# Jira/Linear — find recurring ticket types
# Example: count tickets by label/type in last quarter
# Analyze on-call alerts for noise
# -g stops curl from treating [] as a glob pattern; limit=100 is the largest page size
curl -sg "https://api.pagerduty.com/incidents?since=2026-01-01&until=2026-04-01&statuses[]=resolved&limit=100" \
  -H "Authorization: Token token=$PD_TOKEN" | python3 -c "
import json, sys, collections
# Only the first page is counted here; paginate with offset for complete totals.
incidents = json.load(sys.stdin)['incidents']
by_service = collections.Counter(i['service']['summary'] for i in incidents)
print('Incidents by service (potential toil):')
for service, count in by_service.most_common(10):
    print(f' {count:>4}x {service}')
"
For each toil source, estimate:
- Frequency: How often does this happen? (daily, weekly, per-deploy)
- Duration: How long does it take each time? (minutes, hours)
- People involved: How many engineers touch this?
- Scaling: Does it grow with service count, traffic, or team size?
- Risk: What happens if someone does it wrong?
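The estimates above can be captured in a small record so that "20/week, 15 min" style observations turn into comparable quarterly hours. This is a minimal sketch: `ToilItem` and `PERIODS_PER_QUARTER` are hypothetical names, and the 13-week quarter and 65 working days are assumptions. The output keys match the ones read by `calculate_toil_budget` in Step 3.

```python
from dataclasses import dataclass

# Assumed conversions: 13 weeks and 65 working days per quarter.
PERIODS_PER_QUARTER = {"day": 65, "week": 13, "month": 3, "quarter": 1}


@dataclass
class ToilItem:
    task: str
    count: float      # occurrences per period
    period: str       # one of the PERIODS_PER_QUARTER keys
    minutes: float    # duration of one occurrence
    people: int = 1   # engineers involved each time

    def to_budget_dict(self):
        """Convert to the dict shape used by calculate_toil_budget."""
        return {
            "task": self.task,
            "frequency_per_quarter": self.count * PERIODS_PER_QUARTER[self.period],
            "hours_per_occurrence": self.minutes / 60,
            "people_involved": self.people,
        }


item = ToilItem("Access requests", count=20, period="week", minutes=15)
d = item.to_budget_dict()
quarterly_hours = (
    d["frequency_per_quarter"] * d["hours_per_occurrence"] * d["people_involved"]
)
print(f"{item.task}: {quarterly_hours}h per quarter")
```

Scaling and risk are left out of the record because they are qualitative. They could be added as free-text fields if the team wants them in the same place.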
Step 3: Calculate Toil Budget
def calculate_toil_budget(toil_items, team_size, hours_per_quarter=520):
    """
    Sum quarterly toil hours and compare them with team capacity.
    Google SRE recommends: max 50% of SRE time on toil.
    hours_per_quarter defaults to 520 (13 weeks x 40 hours).
    """
    total_toil_hours = 0
    for item in toil_items:
        # Hours per quarter = occurrences x hours each x engineers involved
        quarterly_hours = item['frequency_per_quarter'] * item['hours_per_occurrence'] * item['people_involved']
        total_toil_hours += quarterly_hours
        item['quarterly_hours'] = quarterly_hours  # annotates the input item in place
    team_capacity = team_size * hours_per_quarter
    toil_percentage = (total_toil_hours / team_capacity) * 100
    return {
        'total_toil_hours': total_toil_hours,
        'team_capacity_hours': team_capacity,
        'toil_percentage': toil_percentage,
        # Under 30% is healthy, 30-50% needs watching, 50% or more is over budget
        'status': '🟢 Healthy' if toil_percentage < 30 else '🟡 Watch' if toil_percentage < 50 else '🔴 Over budget',
        'items_ranked': sorted(toil_items, key=lambda x: -x['quarterly_hours']),
    }
Step 4: Generate Report
# Toil Report — Q2 2026
## Summary
- Team size: 6 SREs
- Total toil: 1,053h/quarter (13.5h/person/week)
- Toil budget: 34% of capacity 🟡 (target: <30%)
## Top Toil Sources (ranked by hours)
| Rank | Category | Task | Freq | Duration | Hours/Q | Automatable? |
|------|----------|------|------|----------|---------|-------------|
| 1 | Tickets | Access requests | 20/week | 15 min | 65h | ✅ Self-serve portal |
| 2 | Deploys | Manual prod deploy | 3/week | 45 min × 2 people | 58.5h | ✅ CI/CD pipeline |
| 3 | Monitoring | False alert triage | 10/week | 20 min | 43h | ✅ Tune thresholds |
| 4 | Data | Customer data fixes | 5/week | 30 min | 32.5h | ✅ Admin tool |
| 5 | Maintenance | Cert renewals | 12/quarter | 2h | 24h | ✅ auto-renew |
## Automation ROI
| Project | Est. Effort | Toil Saved/Q | Payback |
|---------|------------|-------------|---------|
| Self-serve access portal | 80h | 65h | 1.2 quarters |
| CD pipeline | 120h | 58.5h | 2.1 quarters |
| Alert tuning sprint | 20h | 43h | 0.5 quarters |
| Admin data tool | 60h | 32.5h | 1.8 quarters |
| Auto cert renewal | 8h | 24h | 0.3 quarters |
## Recommendation
Start with auto cert renewal (fastest payback and lowest effort) and the alert tuning sprint (pays back in about half a quarter). Then tackle the self-serve access portal. Defer the CD pipeline to Q3 (highest effort, but it removes the second-largest toil source).
2. prioritize — Rank Automation Candidates
Score each toil source by:
- Hours saved per quarter (impact)
- Automation effort (cost)
- Risk of manual error (safety)
- Growth rate (will it get worse?)
Calculate ROI = hours_saved_per_quarter / automation_hours. The payback period in quarters is the inverse: automation_hours / hours_saved_per_quarter.
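The ROI calculation can be sketched as follows, using the effort and savings figures from the Automation ROI table in the sample report. The function name `rank_by_roi` and the tuple layout are illustrative assumptions. Risk and growth rate are not scored here because the document does not define weights for them.

```python
def rank_by_roi(projects):
    """Rank (name, effort_hours, saved_per_quarter) tuples by ROI, highest first."""
    ranked = []
    for name, effort_hours, saved_per_quarter in projects:
        ranked.append({
            "project": name,
            # ROI: toil hours saved each quarter per hour of automation effort
            "roi": saved_per_quarter / effort_hours,
            # Payback: quarters until the saved hours cover the effort
            "payback_quarters": round(effort_hours / saved_per_quarter, 1),
        })
    return sorted(ranked, key=lambda p: -p["roi"])


# Figures taken from the sample report's Automation ROI table.
projects = [
    ("Self-serve access portal", 80, 65),
    ("CD pipeline", 120, 58.5),
    ("Alert tuning sprint", 20, 43),
    ("Admin data tool", 60, 32.5),
    ("Auto cert renewal", 8, 24),
]

ranked = rank_by_roi(projects)
for p in ranked:
    print(f"{p['project']:<26} ROI {p['roi']:.2f}  payback {p['payback_quarters']} quarters")
```

Sorting uses the unrounded ROI so that near-ties are not reordered by rounding.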
3. track — Monitor Toil Reduction Over Time
Compare toil hours quarter-over-quarter:
- Total toil hours trending up or down?
- Which automation projects delivered expected savings?
- New toil sources appearing?
- Toil percentage within SRE budget (< 50%)?
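The questions above can be answered from two per-task snapshots of toil hours. This is a minimal sketch with hypothetical data: the function names, the task figures, and the expected-savings values are made up for illustration. The default capacity of 3120 hours assumes 6 engineers at 520 hours each, as in the sample report.

```python
def compare_quarters(previous, current, team_capacity_hours=3120, budget_pct=50.0):
    """Compare two {task: toil_hours} snapshots from consecutive quarters."""
    prev_total = sum(previous.values())
    curr_total = sum(current.values())
    pct = curr_total / team_capacity_hours * 100
    if curr_total < prev_total:
        trend = "down"
    elif curr_total > prev_total:
        trend = "up"
    else:
        trend = "flat"
    return {
        "total_delta_hours": curr_total - prev_total,
        "trend": trend,
        "new_sources": sorted(set(current) - set(previous)),
        "eliminated": sorted(set(previous) - set(current)),
        "toil_percentage": round(pct, 1),
        "within_budget": pct < budget_pct,
    }


def savings_delivered(expected_savings, previous, current):
    """Pair each project's expected saving with the observed drop in hours."""
    return {
        task: {"expected": expected,
               "actual": previous.get(task, 0) - current.get(task, 0)}
        for task, expected in expected_savings.items()
    }


# Hypothetical snapshots, not real measurements.
q1 = {"Access requests": 65, "Cert renewals": 24, "False alert triage": 43}
q2 = {"Access requests": 65, "False alert triage": 20, "Log cleanup": 10}

summary = compare_quarters(q1, q2)
delivered = savings_delivered({"Cert renewals": 24, "False alert triage": 43}, q1, q2)
print(summary)
print(delivered)
```

A project whose actual saving falls well short of the expected one is worth a follow-up, since the remaining hours may point to a manual step the automation missed.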