Introduction
Toil is manual, repetitive, automatable work that scales linearly with service growth. It’s the operational burden that keeps engineers from doing valuable engineering work. Reducing toil is essential for scaling both systems and teams effectively.
What is Toil?
Google’s SRE Definition
Toil has the following characteristics:
- Manual - Requires human action
- Repetitive - Done over and over
- Automatable - Could be automated
- Tactical - Reactive, interrupt-driven
- No enduring value - Doesn’t improve the system
- Scales linearly - Grows with service growth
Toil vs Engineering Work
Toil (eliminate this):
- Manually restarting failed services
- Copy-pasting deployment commands
- Manually creating user accounts
- Responding to repeated alerts
- Manual data backups
- Ticket-driven provisioning
Engineering work (more of this):
- Building automation
- Improving architecture
- Writing code
- Capacity planning
- Performance optimization
- Creating self-service tools
Why Toil Matters
Impact on teams:
- Burnout and low morale
- No time for innovation
- Poor work-life balance
- Limited career growth
- High turnover
Impact on business:
- Slower feature delivery
- Increased incidents
- Cannot scale operations
- Higher operational costs
- Technical debt accumulates
The 50% rule: SRE teams should spend no more than 50% of their time on toil. The other 50% should be engineering work that reduces future toil.
Identifying Toil
Toil Inventory Exercise
Step 1: Track your time
Keep a log for 2 weeks:
Date | Task | Time | Category
-----------|-------------------------|------|----------
2025-10-15 | Restart crashed pods | 30m | Toil
2025-10-15 | Deploy new feature | 2h | Engineering
2025-10-15 | Manually scale database | 45m | Toil
2025-10-15 | Respond to same alert | 20m | Toil
2025-10-16 | Design autoscaling | 3h | Engineering
2025-10-16 | Manual backup restore | 1h | Toil
Step 2: Categorize tasks
# Simple toil calculator
tasks = [
    {"name": "Restart services", "time_per_week": 2.0, "automatable": True},
    {"name": "Deploy manually", "time_per_week": 3.0, "automatable": True},
    {"name": "Create user accounts", "time_per_week": 1.5, "automatable": True},
    {"name": "Respond to alerts", "time_per_week": 4.0, "automatable": True},
    {"name": "Manual testing", "time_per_week": 2.5, "automatable": True},
]
total_toil = sum(t["time_per_week"] for t in tasks if t["automatable"])
work_hours = 40
toil_percentage = (total_toil / work_hours) * 100
print(f"Total toil: {total_toil} hours/week ({toil_percentage:.1f}%)")
# Output: Total toil: 13.0 hours/week (32.5%)
Step 3: Prioritize by impact
Calculate toil score:
Toil Score = (Minutes per Occurrence) × (Occurrences per Week) × (Pain Factor, 1-3)
Task | Time | Freq/Week | Pain | Score |
---|---|---|---|---|
Manual deployments | 30m | 10 | High (3) | 900 |
Restart crashed pods | 15m | 20 | Med (2) | 600 |
Create DB users | 20m | 5 | Low (1) | 100 |
Manual backups | 45m | 1 | Med (2) | 90 |
Priority: Automate manual deployments first.
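The scoring can be done with a short script (times in minutes per occurrence, pain on the 1-3 scale):

```python
# Toil score = minutes per occurrence x occurrences/week x pain factor (1-3)
tasks = [
    {"name": "Manual deployments",   "minutes": 30, "freq_per_week": 10, "pain": 3},
    {"name": "Restart crashed pods", "minutes": 15, "freq_per_week": 20, "pain": 2},
    {"name": "Create DB users",      "minutes": 20, "freq_per_week": 5,  "pain": 1},
    {"name": "Manual backups",       "minutes": 45, "freq_per_week": 1,  "pain": 2},
]

for t in tasks:
    t["score"] = t["minutes"] * t["freq_per_week"] * t["pain"]

# Highest score first: automate that one first
for t in sorted(tasks, key=lambda t: t["score"], reverse=True):
    print(f"{t['name']:22} score={t['score']}")
```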
Common Sources of Toil
Infrastructure:
- Manual server provisioning
- Hand-rolled deployments
- Manual scaling operations
- Manual certificate renewals
- Manual backup/restore
Operations:
- Responding to repeated alerts
- Manual health checks
- Log diving for common issues
- Restarting failed services
- Clearing disk space
Development:
- Manual testing
- Code review reminders
- Manual dependency updates
- Environment setup
- Build and deployment
Support:
- Password resets
- Permission grants
- Account provisioning
- Data exports
- Configuration changes
Measuring Toil
Key Metrics
metrics:
  - name: toil_percentage
    calculation: "toil_hours / total_work_hours * 100"
    target: "<50%"
    measure: "Weekly"
  - name: toil_by_category
    categories: ["incidents", "deployments", "provisioning", "other"]
    measure: "Weekly"
  - name: automation_roi
    calculation: "time_saved / time_invested"
    target: ">5x"
    measure: "Per project"
  - name: mean_time_to_automate
    calculation: "time from identification to automation"
    target: "<30 days"
    measure: "Per toil item"
Tracking Dashboard
Grafana/Datadog dashboard:
{
  "dashboard": {
    "title": "Toil Tracking",
    "panels": [
      {
        "title": "Toil Percentage by Week",
        "type": "graph",
        "target": "toil_hours / 40 * 100"
      },
      {
        "title": "Toil by Category",
        "type": "pie",
        "categories": ["Incidents", "Deployments", "Provisioning", "Other"]
      },
      {
        "title": "Top Toil Tasks",
        "type": "table",
        "columns": ["Task", "Hours/Week", "Frequency", "Status"]
      },
      {
        "title": "Automation ROI",
        "type": "stat",
        "calculation": "total_time_saved / total_automation_effort"
      }
    ]
  }
}
Toil Budget
Set team limits:
team: platform-sre
toil_budget:
  max_percentage: 50
  current: 35
  status: "healthy"
weekly_breakdown:
  engineering_work: "26 hours (65%)"
  toil: "14 hours (35%)"
  toil_by_category:
    incident_response: "6 hours"
    deployments: "4 hours"
    provisioning: "2 hours"
    other: "2 hours"
actions:
  - status: "on_track"
    message: "Toil within acceptable limits"
alerts:
  - threshold: 50
    action: "Stop new projects, focus on automation"
  - threshold: 60
    action: "Emergency automation sprint"
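A check against those thresholds might be sketched like this (function name and messages are illustrative):

```python
def toil_budget_status(toil_hours, total_hours, max_pct=50):
    """Return (toil percentage, recommended action) for a toil budget check."""
    pct = toil_hours / total_hours * 100
    if pct >= 60:
        action = "Emergency automation sprint"
    elif pct >= max_pct:
        action = "Stop new projects, focus on automation"
    else:
        action = "On track: toil within acceptable limits"
    return pct, action

# 14 toil hours out of a 40-hour week, as in the budget above
pct, action = toil_budget_status(toil_hours=14, total_hours=40)
print(f"{pct:.0f}% toil -> {action}")  # 35% toil -> On track: toil within acceptable limits
```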
Elimination Strategies
1. Automation
ROI calculation:
def automation_roi(
    manual_time_per_occurrence,
    occurrences_per_week,
    automation_build_time,
    maintenance_time_per_week=0,
):
    """
    Calculate the return on investment for automating a task.
    All times are in hours. Returns a dict with the breakeven time,
    annual ROI, and a go/no-go recommendation.
    """
    weekly_savings = manual_time_per_occurrence * occurrences_per_week
    net_weekly_savings = weekly_savings - maintenance_time_per_week
    if net_weekly_savings <= 0:
        # Maintenance eats all the savings: never breaks even
        return {'breakeven_weeks': float('inf'),
                'annual_roi': 0.0,
                'worth_automating': False}
    breakeven_weeks = automation_build_time / net_weekly_savings
    annual_roi = (net_weekly_savings * 52) / automation_build_time
    return {
        'breakeven_weeks': breakeven_weeks,
        'annual_roi': annual_roi,
        'worth_automating': breakeven_weeks < 12,  # less than ~3 months
    }

# Example: Manual deployment
result = automation_roi(
    manual_time_per_occurrence=30 / 60,  # 30 minutes
    occurrences_per_week=10,             # 10 deploys/week
    automation_build_time=16,            # 2 days to build
    maintenance_time_per_week=0.5,       # 30 min/week maintenance
)
print(f"Breakeven: {result['breakeven_weeks']:.1f} weeks")
print(f"Annual ROI: {result['annual_roi']:.1f}x")
print(f"Worth it: {result['worth_automating']}")
# Output:
# Breakeven: 3.6 weeks
# Annual ROI: 14.6x
# Worth it: True
Automation priorities:
High frequency, high time - Automate first
- Example: Daily deployments taking 30 min each
High frequency, low time - Good candidate
- Example: Restarting services (5 min, 20x/week)
Low frequency, high time - Maybe automate
- Example: Quarterly disaster recovery test (4 hours)
Low frequency, low time - Keep manual
- Example: Annual SSL cert renewal (15 minutes)
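These quadrants can be encoded as a tiny helper; the 5×/week and 30-minute cutoffs below are illustrative assumptions, not fixed rules:

```python
def automation_priority(minutes_per_occurrence, freq_per_week,
                        high_freq=5, high_time=30):
    """Map a task onto the four frequency/time quadrants.

    The cutoffs (high_freq occurrences/week, high_time minutes) are
    assumptions for illustration; tune them to your team.
    """
    high_f = freq_per_week >= high_freq
    high_t = minutes_per_occurrence >= high_time
    if high_f and high_t:
        return "Automate first"
    if high_f:
        return "Good candidate"
    if high_t:
        return "Maybe automate"
    return "Keep manual"

print(automation_priority(30, 10))    # daily deployments -> Automate first
print(automation_priority(15, 0.02))  # annual cert renewal -> Keep manual
```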
2. Self-Service Tools
Internal developer platform:
# Example: Self-service database provisioning
# Before (toil):
process: "Slack message → SRE creates DB → Grants permissions → Notifies dev"
time: 2 hours
friction: High
# After (self-service):
process: "Developer runs: kubectl apply -f db-request.yaml"
time: 5 minutes (automated)
friction: Low
Implementation:
#!/bin/bash
# File: create-database.sh
# Self-service database creation tool
DB_NAME=$1
OWNER=$2

if [[ -z "$DB_NAME" || -z "$OWNER" ]]; then
    echo "Usage: create-database.sh <db-name> <owner-email>"
    exit 1
fi

# Validate the request
if [[ ! "$OWNER" =~ @company\.com$ ]]; then
    echo "Error: Owner must be a @company.com email"
    exit 1
fi
# Create database via IaC
cat <<EOF > terraform/databases/${DB_NAME}.tf
resource "aws_db_instance" "${DB_NAME}" {
  identifier     = "${DB_NAME}"
  engine         = "postgres"
  instance_class = "db.t3.micro"
  tags = {
    Owner     = "${OWNER}"
    CreatedBy = "self-service"
  }
}
EOF
# Apply terraform
cd terraform/databases
terraform apply -auto-approve
# Grant permissions
OWNER_USER=$(echo "$OWNER" | cut -d@ -f1)
psql -c "GRANT ALL ON DATABASE ${DB_NAME} TO ${OWNER_USER};"
# Notify via Slack
curl -X POST $SLACK_WEBHOOK -d "{
\"text\": \"✅ Database ${DB_NAME} created for ${OWNER}\"
}"
echo "Database ${DB_NAME} ready! Connection info sent to ${OWNER}"
3. Improved Tooling
Before:
# Manual deployment (20 steps, error-prone)
ssh server1
cd /app
git pull
npm install
npm run build
pm2 restart app
# ... repeat for 10 servers
After:
# One-command deployment
deploy-app production v2.3.0
Tool characteristics:
- Idempotent - Safe to run multiple times
- Validated - Checks prerequisites
- Logged - Audit trail
- Atomic - All or nothing
- Rollback-able - Easy to undo
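A sketch of a deploy wrapper with those properties; the hook functions `current_version_fn`, `do_deploy_fn`, and `rollback_fn` are hypothetical stand-ins for real deploy steps:

```python
import logging

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")
log = logging.getLogger("deploy")

def deploy(env, version, current_version_fn, do_deploy_fn, rollback_fn):
    """Idempotent, validated, logged, rollback-able deploy step (sketch)."""
    if env not in ("staging", "production"):      # validated: check prerequisites
        raise ValueError(f"unknown environment: {env}")
    if current_version_fn() == version:           # idempotent: no-op if already done
        log.info("%s already at %s, nothing to do", env, version)
        return True
    log.info("deploying %s to %s", version, env)  # logged: audit trail
    try:
        do_deploy_fn(version)                     # atomic: one all-or-nothing step
    except Exception:
        log.exception("deploy failed, rolling back")
        rollback_fn()                             # rollback-able: easy to undo
        return False
    return True

# Demo with stub hooks: already at the requested version, so this is a no-op
deploy("production", "v1.0", lambda: "v1.0", lambda v: None, lambda: None)
```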
4. Process Improvements
Example: Alert fatigue
Before:
- 200 alerts/day
- 10% actionable
- 2 hours/day responding
Process improvements:
- Tune thresholds - Reduce noise
- Group related alerts - One page, not 10
- Add context - Include runbook link in alert
- Auto-remediate - Script fixes common issues
After:
- 20 alerts/day
- 80% actionable
- 30 min/day responding
Configuration:
# Alert tuning example
alerts:
  - name: HighMemoryUsage
    # Before: threshold: 70%
    threshold: 85%   # Reduced noise
    for: 10m         # Require a sustained issue
    annotations:
      runbook: "https://runbook/memory-issues"
      description: "Memory >85% for 10 min. Check runbook before paging."
  - name: DiskSpaceLow
    # Auto-remediation
    threshold: 80%
    actions:
      - run: "/scripts/cleanup-logs.sh"
      - notify: "slack"
      - page: "only_if_script_fails"
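The `page: only_if_script_fails` action above is pseudo-configuration; one way to get that behavior is a small handler (invoked, say, from an alert webhook receiver) that runs the cleanup script and escalates only on failure. The script path and `page_fn` callback are assumptions:

```python
import subprocess

def handle_disk_alert(cleanup_cmd=("/scripts/cleanup-logs.sh",), page_fn=None):
    """Run auto-remediation; page a human only if the script fails.

    cleanup_cmd (a hypothetical script path) and page_fn (e.g. a PagerDuty
    call) are assumptions, not part of any alerting product's API.
    """
    try:
        result = subprocess.run(list(cleanup_cmd), capture_output=True, text=True)
        ok, detail = result.returncode == 0, result.stderr.strip()
    except OSError as exc:  # script missing or not executable
        ok, detail = False, str(exc)
    if ok:
        return "remediated"          # fixed silently, no page
    if page_fn:
        page_fn(f"auto-remediation failed: {detail}")
    return "paged"
```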
5. Eliminate Root Causes
Reactive approach (toil):
- Service crashes → Restart manually
- Disk fills up → Clean manually
- Cert expires → Renew manually
Proactive approach (engineering):
- Service crashes → Fix bug, add health checks, auto-restart
- Disk fills up → Log rotation, monitoring, auto-cleanup
- Cert expires → Automated renewal, monitoring
Example: Pod crashes
Toil approach:
# Manually restart pods when they crash
# Time: 10 min, 5x/week = 50 min/week toil
kubectl delete pod crash-pod-xyz
Engineering approach:
# 1. Fix the application bug causing crashes
# 2. Add proper health checks
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
# 3. Configure automatic restart policy
restartPolicy: Always
# 4. Add resource limits to prevent OOM kills
resources:
  limits:
    memory: "512Mi"
  requests:
    memory: "256Mi"
# 5. Add monitoring to catch issues early
# Result: pods self-heal, no manual intervention
Automation Examples
Example 1: Automated Deployments
Before (manual):
# 15 steps, 30 minutes, error-prone
ssh prod-server-1
cd /app
git pull origin main
npm install
npm run build
pm2 restart app
# Test manually
# Repeat for 9 more servers
# Update load balancer
# Update documentation
After (automated):
# .github/workflows/deploy.yml
name: Deploy to Production
on:
  push:
    branches: [main]
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Run tests
        run: npm test
      - name: Build
        run: npm run build
      - name: Deploy with Ansible
        run: |
          ansible-playbook -i inventory/production deploy.yml
      - name: Smoke tests
        run: |
          curl -f https://api.company.com/health
      - name: Notify Slack
        run: |
          curl -X POST $SLACK_WEBHOOK -d '{"text":"✅ Deployed to production"}'
Time savings:
- Manual: 30 min × 10 deploys/week = 5 hours/week
- Automated: 5 min monitoring × 10 deploys/week = 50 min/week
- Savings: ~4.2 hours/week (≈18 hours/month)
Example 2: Auto-Remediation
Scenario: Disk space cleanup
Toil (manual):
# Every week, manually clean up disk space
# Time: 45 min/week
ssh prod-server-1
df -h # Check disk usage
find /var/log -name "*.log" -mtime +30 -delete
find /tmp -mtime +7 -delete
docker system prune -af
Automation:
#!/bin/bash
# File: /etc/cron.daily/cleanup-disk.sh
# Runs daily via cron
THRESHOLD=80
DISK_USAGE=$(df / | tail -1 | awk '{print $5}' | sed 's/%//')

if [ "$DISK_USAGE" -gt "$THRESHOLD" ]; then
    echo "Disk usage ${DISK_USAGE}% exceeds ${THRESHOLD}%"

    # Clean old logs
    find /var/log -name "*.log" -mtime +30 -delete
    echo "Cleaned old log files"

    # Clean temp files
    find /tmp -mtime +7 -delete
    echo "Cleaned temp files"

    # Docker cleanup
    docker system prune -af --volumes
    echo "Cleaned Docker resources"

    # Check whether cleanup was enough
    NEW_USAGE=$(df / | tail -1 | awk '{print $5}' | sed 's/%//')
    if [ "$NEW_USAGE" -gt "$THRESHOLD" ]; then
        # Still high, alert a human
        curl -X POST $SLACK_WEBHOOK -d "{
            \"text\": \"⚠️ Disk cleanup ran but usage still ${NEW_USAGE}%\"
        }"
    else
        echo "Disk usage now ${NEW_USAGE}%"
    fi
fi
Kubernetes approach:
# Automated with log rotation and volume limits
apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  containers:
    - name: app
      volumeMounts:
        - name: logs
          mountPath: /var/log
  volumes:
    - name: logs
      emptyDir:
        sizeLimit: 1Gi  # Automatic size limit
---
# Variant: log-rotation sidecar
apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  containers:
    - name: app
      # Main container (spec elided)
    - name: log-rotator
      image: blacklabelops/logrotate
      volumeMounts:
        - name: logs
          mountPath: /var/log
  volumes:
    - name: logs
      emptyDir: {}
Time savings: 45 min/week → 0 min/week = 3 hours/month
Example 3: Chatops for Common Tasks
Toil: Responding to “Can you check…” requests
Solution: Chatbot in Slack
# Slack bot for common operations
import os
import subprocess

from slack_bolt import App

app = App(token=os.environ["SLACK_BOT_TOKEN"])

@app.command("/check-service")
def check_service(ack, command, say):
    ack()
    service = command['text']
    # Run health check
    result = subprocess.run(
        f"kubectl get pods -l app={service}",
        shell=True, capture_output=True, text=True,
    )
    say(f"```{result.stdout}```")

@app.command("/deploy")
def deploy_service(ack, command, say):
    ack()
    # Parse: /deploy myapp v2.3.0 production
    parts = command['text'].split()
    if len(parts) != 3:
        say("Usage: /deploy <service> <version> <environment>")
        return
    service, version, env = parts
    # Trigger deployment
    say(f"🚀 Deploying {service}@{version} to {env}...")
    result = subprocess.run(
        f"./deploy.sh {service} {version} {env}",
        shell=True, capture_output=True, text=True,
    )
    if result.returncode == 0:
        say("✅ Deployment successful!")
    else:
        say(f"❌ Deployment failed:\n```{result.stderr}```")

@app.command("/scale")
def scale_service(ack, command, say):
    ack()
    # Parse: /scale myapp 10
    parts = command['text'].split()
    if len(parts) != 2:
        say("Usage: /scale <service> <replicas>")
        return
    service, replicas = parts
    subprocess.run(
        f"kubectl scale deployment/{service} --replicas={replicas}",
        shell=True,
    )
    say(f"📈 Scaled {service} to {replicas} replicas")

app.start(port=3000)
Time savings:
- Before: 5 min per request × 20 requests/week = 100 min/week
- After: Instant self-service
- Savings: 1.7 hours/week (6.7 hours/month)
Example 4: Infrastructure as Code
Toil: Manual server provisioning
Before:
1. Fill out request form
2. Wait for approval (1-3 days)
3. SRE manually creates server via web console
4. SRE manually configures networking
5. SRE manually installs software
6. SRE manually configures monitoring
7. SRE updates documentation
8. Total time: 2-4 hours of SRE time + 1-3 days waiting
After (Terraform):
# File: servers.tf
module "application_server" {
  source = "./modules/server"

  name               = "app-server-prod-1"
  instance_type      = "t3.large"
  environment        = "production"
  monitoring_enabled = true
  backup_enabled     = true

  tags = {
    Team  = "platform"
    Owner = "[email protected]"
  }
}
# Apply with: terraform apply
# Time: 5 minutes (automated)
Self-service request:
# Developer makes PR to add their server
# PR approved → GitHub Actions applies Terraform
# Server provisioned automatically in 5 minutes
Time savings:
- Before: 2 hours × 5 requests/week = 10 hours/week
- After: 15 min review × 5 requests/week = 1.25 hours/week
- Savings: 8.75 hours/week (35 hours/month)
Implementation Roadmap
Phase 1: Measure (Week 1-2)
**Goals:**
- Quantify current toil
- Identify top toil sources
- Build team awareness
**Actions:**
- [ ] Track time for 2 weeks
- [ ] Categorize all tasks (toil vs engineering)
- [ ] Calculate toil percentage
- [ ] Identify top 10 toil tasks
- [ ] Present findings to team
**Deliverable:** Toil inventory spreadsheet
Phase 2: Quick Wins (Week 3-6)
**Goals:**
- Build automation momentum
- Demonstrate ROI
- Get team buy-in
**Actions:**
- [ ] Pick 3 high-frequency, high-pain tasks
- [ ] Automate or eliminate them
- [ ] Measure time savings
- [ ] Share success stories
**Example quick wins:**
- Automated deployment script
- Self-service user provisioning
- Alert auto-remediation for common issues
Phase 3: Systematic Reduction (Month 2-3)
**Goals:**
- Reduce toil to <50%
- Build sustainable practices
- Create reusable tools
**Actions:**
- [ ] Automate top 10 toil tasks
- [ ] Build internal developer platform
- [ ] Implement IaC for all infrastructure
- [ ] Create runbooks with auto-remediation
- [ ] Set up toil tracking dashboard
**Deliverable:** 50% reduction in toil percentage
Phase 4: Continuous Improvement (Ongoing)
**Goals:**
- Maintain low toil levels
- Prevent new toil
- Scale operations without scaling toil
**Actions:**
- [ ] Weekly toil reviews in team meetings
- [ ] Automation-first mindset for new work
- [ ] Toil budget enforcement
- [ ] Quarterly toil audits
- [ ] Share automation across teams
**Metrics:**
- Toil percentage stays <50%
- Automation ROI >5x
- Team satisfaction improving
Organizational Support
Building the Case for Toil Reduction
Executive presentation:
# Toil Reduction Initiative
## Current State
- Engineers spend 60% of time on toil
- 40% time on engineering projects
- High team turnover (3 people left citing "boring work")
- Slower feature delivery
## Proposal
- Invest 3 months in toil automation
- Goal: Reduce toil to <50%
- 20% time allocated to automation projects
## Expected ROI
- Free up 15 hours/week per engineer (≈780 hours/year)
- Faster incident response
- Higher team morale
- Ability to scale without hiring
## Cost-Benefit
- Investment: $50K (3 months × 2 engineers)
- Annual savings: $200K (freed capacity)
- ROI: 4x in first year
- Payback period: 3 months
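The cost-benefit arithmetic above can be checked with a couple of lines:

```python
def payback(investment, annual_savings):
    """Return (first-year ROI multiple, payback period in months)."""
    roi = annual_savings / investment
    months = investment / (annual_savings / 12)
    return roi, months

# Numbers from the proposal above
roi, months = payback(investment=50_000, annual_savings=200_000)
print(f"ROI: {roi:.0f}x, payback: {months:.0f} months")  # ROI: 4x, payback: 3 months
```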
Team Practices
Make toil visible:
# Sprint planning
toil_capacity: 20 hours (50% of sprint)
engineering_capacity: 20 hours (50% of sprint)
sprint_backlog:
  toil:
    - On-call rotation (8 hours)
    - Incident response (6 hours)
    - Manual deployments (4 hours)
    - Support tickets (2 hours)
  engineering:
    - Automate deployment (8 hours)
    - Build self-service tool (8 hours)
    - Refactor monitoring (4 hours)
Toil reduction sprints:
- Dedicate entire sprint to automation
- No feature work, only toil elimination
- Run quarterly
Celebrate wins:
# Team meeting: Toil Win of the Week
🏆 This week: Alice automated database backups
- Time saved: 2 hours/week
- Annual savings: 104 hours
- ROI: Built in 8 hours, 13x return
Next targets:
- Manual certificate renewals (Bob investigating)
- Log analysis for common errors (Carol prototyping)
Common Pitfalls
Pitfall 1: Automating Without Understanding
Problem:
# "Let's automate this!"
# Creates script that breaks production
Solution:
- Understand the manual process fully
- Document edge cases
- Test thoroughly in non-prod
- Gradual rollout
- Keep manual process as fallback
Pitfall 2: Over-Engineering
Problem:
Task takes 10 minutes/month to do manually
Spend 3 months building complex automation system
Solution:
- Calculate ROI before building
- Start simple (bash script before building platform)
- Iterate based on actual usage
Pitfall 3: Automation Debt
Problem:
Built 20 automation scripts
No documentation
No maintenance
Scripts break, create more toil
Solution:
- Document all automation
- Version control everything
- Assign ownership
- Regular maintenance
- Treat automation as production code
Pitfall 4: Ignoring Toil Budget
Problem:
"We're too busy to automate" (while drowning in toil)
Toil grows unchecked to 80% of time
Solution:
- Enforce 50% toil budget
- Stop feature work when over budget
- Make toil reduction mandatory
Measuring Success
Key Performance Indicators
kpis:
  - name: toil_percentage
    current: 60%
    target: "<50%"
    trend: "decreasing"
  - name: automation_count
    description: "Number of tasks automated this quarter"
    current: 15
    target: 20
  - name: time_saved
    description: "Hours saved per week from automation"
    current: 25
    target: 40
  - name: team_satisfaction
    measurement: "Quarterly survey score"
    current: 6.5/10
    target: ">8/10"
  - name: incident_mttr
    description: "Mean time to resolution"
    current: "45 min"
    target: "<30 min"
    note: "Improved through automation"
Before/After Comparison
## 6-Month Results
### Time Allocation
Before:
- Toil: 60% (24 hours/week)
- Engineering: 40% (16 hours/week)
After:
- Toil: 35% (14 hours/week)
- Engineering: 65% (26 hours/week)
**Result: 10 hours/week freed up per engineer**
### Specific Improvements
| Task | Before | After | Savings |
|------|--------|-------|---------|
| Deployments | 5h/week | 30m/week | 4.5h/week |
| Server provisioning | 8h/week | 1h/week | 7h/week |
| Certificate renewals | 2h/week | 0 (automated) | 2h/week |
| Log analysis | 4h/week | 30m/week | 3.5h/week |
| Disk cleanup | 1h/week | 0 (automated) | 1h/week |
**Total: 18 hours/week saved across team**
### Business Impact
- Feature delivery velocity: +40%
- Incidents per month: -30%
- Time to resolve incidents: -35%
- Team turnover: 0 (was 3/year)
- Employee satisfaction: 6.5 → 8.2/10
Tools and Resources
Automation Tools
Infrastructure:
- Terraform / Pulumi - Infrastructure as Code
- Ansible / Chef - Configuration management
- Kubernetes operators - Self-healing systems
CI/CD:
- GitHub Actions / GitLab CI
- Jenkins / CircleCI
- ArgoCD / Flux - GitOps
Monitoring & Auto-remediation:
- Prometheus + Alertmanager
- PagerDuty / Opsgenie
- Rundeck - Automation platform
Chatops:
- Hubot / Slack Bolt SDK
- Errbot
- Custom Slack/Discord bots
Useful Scripts
#!/bin/bash
# File: calculate-toil.sh
# Calculate toil from the time-tracking CSV (columns: date,task,hours,category)
CSV_FILE="time-tracking.csv"

total_hours=$(awk -F',' '{sum+=$3} END {print sum}' "$CSV_FILE")
toil_hours=$(awk -F',' 'tolower($4)=="toil" {sum+=$3} END {print sum}' "$CSV_FILE")
toil_percentage=$(echo "scale=1; $toil_hours / $total_hours * 100" | bc)

echo "Total hours: $total_hours"
echo "Toil hours: $toil_hours"
echo "Toil percentage: $toil_percentage%"

if (( $(echo "$toil_percentage > 50" | bc -l) )); then
    echo "⚠️ Toil exceeds 50% target!"
else
    echo "✅ Toil within acceptable range"
fi
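For anything beyond a quick check, the same calculation is less fragile with Python's csv module, which handles quoted fields that break awk. The column layout (date, task, hours, category) is the same assumption as in the shell script:

```python
import csv
import io

def toil_report(rows):
    """Summarise (date, task, hours, category) records into toil totals."""
    total = toil = 0.0
    for date, task, hours, category in rows:
        h = float(hours)
        total += h
        if category.strip().lower() == "toil":
            toil += h
    pct = toil / total * 100 if total else 0.0
    return total, toil, pct

# Demo with an inline sample; in practice pass csv.reader(open("time-tracking.csv"))
sample = io.StringIO(
    "2025-10-15,Restart crashed pods,0.5,Toil\n"
    "2025-10-15,Deploy new feature,2,Engineering\n"
)
total, toil, pct = toil_report(csv.reader(sample))
print(f"Toil: {toil}/{total} hours ({pct:.1f}%)")  # Toil: 0.5/2.5 hours (20.0%)
```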
Conclusion
Toil reduction is not a one-time project—it’s a continuous practice. Key takeaways:
- Measure first - You can’t improve what you don’t measure
- Calculate ROI - Automate high-impact tasks first
- Start small - Quick wins build momentum
- Enforce budgets - Keep toil below 50%
- Celebrate successes - Share wins, build automation culture
- Think long-term - Prevent toil, don’t just treat symptoms
- Make it cultural - Automation-first mindset for all work
Remember: Time spent on toil is time not spent on innovation, improvement, and engineering. Every hour of toil eliminated is an hour gained for valuable work.
“The best time to automate toil was yesterday. The second best time is today.”