Introduction

Toil is manual, repetitive, automatable work that scales linearly with service growth. It’s the operational burden that keeps engineers from doing valuable engineering work. Reducing toil is essential for scaling both systems and teams effectively.

What is Toil?

Google’s SRE Definition

Toil has the following characteristics:

  1. Manual - Requires human action
  2. Repetitive - Done over and over
  3. Automatable - Could be automated
  4. Tactical - Reactive, interrupt-driven
  5. No enduring value - Doesn’t improve the system
  6. Scales linearly - Grows with service growth
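
A quick way to apply these criteria is a simple checklist. The sketch below is illustrative only (the Task fields and the scoring helper are assumptions, not part of any standard):

# Toil checklist sketch: the field names and the counting helper are
# illustrative, not an official SRE metric.
from dataclasses import dataclass

@dataclass
class Task:
    name: str
    manual: bool              # requires a human to run it
    repetitive: bool          # done over and over
    automatable: bool         # a machine could do it
    tactical: bool            # reactive / interrupt-driven
    enduring_value: bool      # leaves the system permanently better
    scales_with_growth: bool  # workload grows as the service grows

def toil_signals(task: Task) -> int:
    """Count how many toil characteristics a task exhibits."""
    return sum([
        task.manual,
        task.repetitive,
        task.automatable,
        task.tactical,
        not task.enduring_value,
        task.scales_with_growth,
    ])

restart = Task("Restart crashed pods", manual=True, repetitive=True,
               automatable=True, tactical=True, enduring_value=False,
               scales_with_growth=True)
print(toil_signals(restart))  # 6 of 6 -> clearly toil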

Toil vs Engineering Work

Toil (eliminate this):

  • Manually restarting failed services
  • Copy-pasting deployment commands
  • Manually creating user accounts
  • Responding to repeated alerts
  • Manual data backups
  • Ticket-driven provisioning

Engineering work (more of this):

  • Building automation
  • Improving architecture
  • Writing code
  • Capacity planning
  • Performance optimization
  • Creating self-service tools

Why Toil Matters

Impact on teams:

  • Burnout and low morale
  • No time for innovation
  • Poor work-life balance
  • Limited career growth
  • High turnover

Impact on business:

  • Slower feature delivery
  • Increased incidents
  • Cannot scale operations
  • Higher operational costs
  • Technical debt accumulates

The 50% rule: SRE teams should spend no more than 50% of their time on toil. The other 50% should be engineering work that reduces future toil.

Identifying Toil

Toil Inventory Exercise

Step 1: Track your time

Keep a log for 2 weeks:

Date       | Task                    | Time | Category
-----------|-------------------------|------|----------
2025-10-15 | Restart crashed pods    | 30m  | Toil
2025-10-15 | Deploy new feature      | 2h   | Engineering
2025-10-15 | Manually scale database | 45m  | Toil
2025-10-15 | Respond to same alert   | 20m  | Toil
2025-10-16 | Design autoscaling      | 3h   | Engineering
2025-10-16 | Manual backup restore   | 1h   | Toil

Step 2: Categorize tasks

# Simple toil calculator
tasks = [
    {"name": "Restart services", "time_per_week": 2.0, "automatable": True},
    {"name": "Deploy manually", "time_per_week": 3.0, "automatable": True},
    {"name": "Create user accounts", "time_per_week": 1.5, "automatable": True},
    {"name": "Respond to alerts", "time_per_week": 4.0, "automatable": True},
    {"name": "Manual testing", "time_per_week": 2.5, "automatable": True},
]

total_toil = sum(t["time_per_week"] for t in tasks if t["automatable"])
work_hours = 40
toil_percentage = (total_toil / work_hours) * 100

print(f"Total toil: {total_toil} hours/week ({toil_percentage:.1f}%)")
# Output: Total toil: 13.0 hours/week (32.5%)

Step 3: Prioritize by impact

Calculate toil score:

Toil Score = (Time per occurrence, in minutes) × (Frequency per week) × (Pain factor)

Task                 | Time | Freq/Week | Pain     | Score
---------------------|------|-----------|----------|------
Manual deployments   | 30m  | 10        | High (3) | 900
Restart crashed pods | 15m  | 20        | Med (2)  | 600
Create DB users      | 20m  | 5         | Low (1)  | 100
Manual backups       | 45m  | 1         | Med (2)  | 90

Priority: Automate manual deployments first.
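
The same scoring takes a few lines of Python. This is a minimal sketch using the numbers from the table above (time measured in minutes per occurrence):

# Toil scoring sketch: score = minutes per occurrence x occurrences/week x pain.
# The task list and pain weights mirror the table above.
tasks = [
    {"name": "Manual deployments",   "minutes": 30, "per_week": 10, "pain": 3},
    {"name": "Restart crashed pods", "minutes": 15, "per_week": 20, "pain": 2},
    {"name": "Create DB users",      "minutes": 20, "per_week": 5,  "pain": 1},
    {"name": "Manual backups",       "minutes": 45, "per_week": 1,  "pain": 2},
]

for t in tasks:
    t["score"] = t["minutes"] * t["per_week"] * t["pain"]

for t in sorted(tasks, key=lambda t: t["score"], reverse=True):
    print(f'{t["name"]:22} score={t["score"]}')
# Manual deployments come out on top, so automate those first.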

Common Sources of Toil

Infrastructure:

  • Manual server provisioning
  • Hand-rolled deployments
  • Manual scaling operations
  • Manual certificate renewals
  • Manual backup/restore

Operations:

  • Responding to repeated alerts
  • Manual health checks
  • Log diving for common issues
  • Restarting failed services
  • Clearing disk space

Development:

  • Manual testing
  • Code review reminders
  • Manual dependency updates
  • Environment setup
  • Build and deployment

Support:

  • Password resets
  • Permission grants
  • Account provisioning
  • Data exports
  • Configuration changes

Measuring Toil

Key Metrics

metrics:
  - name: toil_percentage
    calculation: "toil_hours / total_work_hours * 100"
    target: "<50%"
    measure: "Weekly"

  - name: toil_by_category
    categories: ["incidents", "deployments", "provisioning", "other"]
    measure: "Weekly"

  - name: automation_roi
    calculation: "time_saved / time_invested"
    target: ">5x"
    measure: "Per project"

  - name: mean_time_to_automate
    calculation: "time from identification to automation"
    target: "<30 days"
    measure: "Per toil item"

Tracking Dashboard

Grafana/Datadog dashboard:

{
  "dashboard": {
    "title": "Toil Tracking",
    "panels": [
      {
        "title": "Toil Percentage by Week",
        "type": "graph",
        "target": "toil_hours / 40 * 100"
      },
      {
        "title": "Toil by Category",
        "type": "pie",
        "categories": ["Incidents", "Deployments", "Provisioning", "Other"]
      },
      {
        "title": "Top Toil Tasks",
        "type": "table",
        "columns": ["Task", "Hours/Week", "Frequency", "Status"]
      },
      {
        "title": "Automation ROI",
        "type": "stat",
        "calculation": "total_time_saved / total_automation_effort"
      }
    ]
  }
}

Toil Budget

Set team limits:

team: platform-sre
toil_budget:
  max_percentage: 50
  current: 35
  status: "healthy"

weekly_breakdown:
  engineering_work: 26 hours (65%)
  toil: 14 hours (35%)
  toil_by_category:
    incident_response: 6 hours
    deployments: 4 hours
    provisioning: 2 hours
    other: 2 hours

actions:
  - status: "on_track"
    message: "Toil within acceptable limits"

alerts:
  - threshold: 50
    action: "Stop new projects, focus on automation"
  - threshold: 60
    action: "Emergency automation sprint"

Elimination Strategies

1. Automation

ROI calculation:

def automation_roi(
    manual_time_per_occurrence,
    occurrences_per_week,
    automation_build_time,
    maintenance_time_per_week=0
):
    """
    Calculate return on investment for automation

    Returns a dict: breakeven_weeks, annual_roi, worth_automating
    """
    weekly_savings = manual_time_per_occurrence * occurrences_per_week
    weekly_cost = maintenance_time_per_week
    net_weekly_savings = weekly_savings - weekly_cost

    if net_weekly_savings <= 0:
        # Never breaks even
        return {
            'breakeven_weeks': float('inf'),
            'annual_roi': 0.0,
            'worth_automating': False,
        }

    breakeven_weeks = automation_build_time / net_weekly_savings
    annual_roi = (net_weekly_savings * 52) / automation_build_time

    return {
        'breakeven_weeks': breakeven_weeks,
        'annual_roi': annual_roi,
        'worth_automating': breakeven_weeks < 12  # Less than 3 months
    }

# Example: Manual deployment
result = automation_roi(
    manual_time_per_occurrence=30/60,  # 30 minutes
    occurrences_per_week=10,           # 10 deploys/week
    automation_build_time=16,          # 2 days to build
    maintenance_time_per_week=0.5      # 30 min/week maintenance
)

print(f"Breakeven: {result['breakeven_weeks']:.1f} weeks")
print(f"Annual ROI: {result['annual_roi']:.1f}x")
print(f"Worth it: {result['worth_automating']}")
# Output:
# Breakeven: 3.6 weeks
# Annual ROI: 14.6x
# Worth it: True

Automation priorities:

  1. High frequency, high time - Automate first

    • Example: Daily deployments taking 30 min each
  2. High frequency, low time - Good candidate

    • Example: Restarting services (5 min, 20x/week)
  3. Low frequency, high time - Maybe automate

    • Example: Quarterly disaster recovery test (4 hours)
  4. Low frequency, low time - Keep manual

    • Example: Annual SSL cert renewal (15 minutes)
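
As a rough helper, the same priority matrix can be encoded as a small function. The cutoffs below (5 occurrences/week, 30 minutes per occurrence) are illustrative assumptions, not fixed rules:

# Frequency/time quadrant sketch. The cutoffs (5 occurrences/week,
# 30 minutes per occurrence) are illustrative assumptions.
def automation_priority(per_week: float, minutes_each: float) -> str:
    high_freq = per_week >= 5
    high_time = minutes_each >= 30
    if high_freq and high_time:
        return "Automate first"
    if high_freq:
        return "Good candidate"
    if high_time:
        return "Maybe automate"
    return "Keep manual"

print(automation_priority(7, 30))      # daily deployments   -> Automate first
print(automation_priority(20, 5))      # service restarts    -> Good candidate
print(automation_priority(0.08, 240))  # quarterly DR test   -> Maybe automate
print(automation_priority(0.02, 15))   # annual cert renewal -> Keep manual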

2. Self-Service Tools

Internal developer platform:

# Example: Self-service database provisioning

# Before (toil):
process: "Slack message → SRE creates DB → Grants permissions → Notifies dev"
time: 2 hours
friction: High

# After (self-service):
process: "Developer runs: kubectl apply -f db-request.yaml"
time: 5 minutes (automated)
friction: Low

Implementation:

#!/bin/bash
# Self-service database creation tool
# File: create-database.sh

DB_NAME=$1
OWNER=$2

if [[ -z "$DB_NAME" || -z "$OWNER" ]]; then
    echo "Usage: create-database.sh <db-name> <owner-email>"
    exit 1
fi

# Validate request
if [[ ! "$OWNER" =~ @company.com$ ]]; then
    echo "Error: Owner must be @company.com email"
    exit 1
fi

# Create database via IaC
cat <<EOF > terraform/databases/${DB_NAME}.tf
resource "aws_db_instance" "${DB_NAME}" {
  identifier = "${DB_NAME}"
  engine     = "postgres"
  instance_class = "db.t3.micro"

  tags = {
    Owner = "${OWNER}"
    CreatedBy = "self-service"
  }
}
EOF

# Apply terraform
cd terraform/databases
terraform apply -auto-approve

# Grant permissions
OWNER_USER=$(echo $OWNER | cut -d@ -f1)
psql -c "GRANT ALL ON DATABASE ${DB_NAME} TO ${OWNER_USER};"

# Notify via Slack
curl -X POST $SLACK_WEBHOOK -d "{
  \"text\": \"✅ Database ${DB_NAME} created for ${OWNER}\"
}"

echo "Database ${DB_NAME} ready! Connection info sent to ${OWNER}"

3. Improved Tooling

Before:

# Manual deployment (20 steps, error-prone)
ssh server1
cd /app
git pull
npm install
npm run build
pm2 restart app
# ... repeat for 10 servers

After:

# One-command deployment
deploy-app production v2.3.0

Tool characteristics:

  • Idempotent - Safe to run multiple times
  • Validated - Checks prerequisites
  • Logged - Audit trail
  • Atomic - All or nothing
  • Rollback-able - Easy to undo
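
A skeleton of what those characteristics look like in a deployment wrapper. This is only a sketch: preflight_checks, apply_release, and rollback are hypothetical hooks you would wire to your own tooling.

# Deployment wrapper sketch illustrating the tool characteristics above.
# preflight_checks(), apply_release() and rollback() are hypothetical hooks.
import logging

logging.basicConfig(level=logging.INFO)  # Logged: every action is recorded (point at your audit sink)
log = logging.getLogger("deploy")

def deploy(service: str, version: str, current_version: str,
           preflight_checks, apply_release, rollback) -> bool:
    if current_version == version:            # Idempotent: re-running is a no-op
        log.info("%s already at %s, nothing to do", service, version)
        return True
    if not preflight_checks(service, version):  # Validated: check prerequisites
        log.error("preflight failed for %s %s", service, version)
        return False
    try:
        apply_release(service, version)       # Atomic: apply the release as one unit
        log.info("deployed %s %s", service, version)
        return True
    except Exception:
        log.exception("deploy failed, rolling back")
        rollback(service, current_version)    # Rollback-able: easy to undo
        return False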

4. Process Improvements

Example: Alert fatigue

Before:

  • 200 alerts/day
  • 10% actionable
  • 2 hours/day responding

Process improvements:

  1. Tune thresholds - Reduce noise
  2. Group related alerts - One page, not 10
  3. Add context - Include runbook link in alert
  4. Auto-remediate - Script fixes common issues

After:

  • 20 alerts/day
  • 80% actionable
  • 30 min/day responding

Configuration:

# Alert tuning example
alerts:
  - name: HighMemoryUsage
    # Before: threshold: 70%
    threshold: 85%  # Reduced noise
    for: 10m  # Require sustained issue
    annotations:
      runbook: "https://runbook/memory-issues"
      description: "Memory >85% for 10 min. Check runbook before paging."

  - name: DiskSpaceLow
    # Auto-remediation
    threshold: 80%
    actions:
      - run: "/scripts/cleanup-logs.sh"
      - notify: "slack"
      - page: "only_if_script_fails"

5. Eliminate Root Causes

Reactive approach (toil):

  • Service crashes → Restart manually
  • Disk fills up → Clean manually
  • Cert expires → Renew manually

Proactive approach (engineering):

  • Service crashes → Fix bug, add health checks, auto-restart
  • Disk fills up → Log rotation, monitoring, auto-cleanup
  • Cert expires → Automated renewal, monitoring

Example: Pod crashes

Toil approach:

# Manually restart pods when they crash
# Time: 10 min, 5x/week = 50 min/week toil
kubectl delete pod crash-pod-xyz

Engineering approach:

# 1. Fix application bug causing crashes
# 2. Add proper health checks
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10

# 3. Configure automatic restart policy
restartPolicy: Always

# 4. Add resource limits to prevent OOM
resources:
  limits:
    memory: "512Mi"
  requests:
    memory: "256Mi"

# 5. Add monitoring to catch issues early
# Result: Pods self-heal, no manual intervention

Automation Examples

Example 1: Automated Deployments

Before (manual):

# 15 steps, 30 minutes, error-prone
ssh prod-server-1
cd /app
git pull origin main
npm install
npm run build
pm2 restart app
# Test manually
# Repeat for 9 more servers
# Update load balancer
# Update documentation

After (automated):

# .github/workflows/deploy.yml
name: Deploy to Production

on:
  push:
    branches: [main]

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Install dependencies
        run: npm ci

      - name: Run tests
        run: npm test

      - name: Build
        run: npm run build

      - name: Deploy with Ansible
        run: |
          ansible-playbook -i inventory/production deploy.yml

      - name: Smoke tests
        run: |
          curl -f https://api.company.com/health

      - name: Notify Slack
        run: |
          curl -X POST $SLACK_WEBHOOK -d '{"text":"✅ Deployed to production"}'

Time savings:

  • Manual: 30 min × 10 deploys/week = 5 hours/week
  • Automated: 5 min monitoring × 10 deploys/week = 50 min/week
  • Savings: ~4.2 hours/week (roughly 17 hours/month)

Example 2: Auto-Remediation

Scenario: Disk space cleanup

Toil (manual):

# Every week, manually clean up disk space
# Time: 45 min/week

ssh prod-server-1
df -h  # Check disk usage
find /var/log -name "*.log" -mtime +30 -delete
find /tmp -mtime +7 -delete
docker system prune -af

Automation:

#!/bin/bash
# File: /etc/cron.daily/cleanup-disk  (no ".sh" suffix: run-parts skips filenames containing dots)
# Runs daily via cron

THRESHOLD=80
DISK_USAGE=$(df / | tail -1 | awk '{print $5}' | sed 's/%//')

if [ $DISK_USAGE -gt $THRESHOLD ]; then
    echo "Disk usage ${DISK_USAGE}% exceeds ${THRESHOLD}%"

    # Clean old logs
    find /var/log -name "*.log" -mtime +30 -delete
    echo "Cleaned old log files"

    # Clean temp files
    find /tmp -mtime +7 -delete
    echo "Cleaned temp files"

    # Docker cleanup
    docker system prune -af --volumes
    echo "Cleaned Docker resources"

    # Check if successful
    NEW_USAGE=$(df / | tail -1 | awk '{print $5}' | sed 's/%//')

    if [ $NEW_USAGE -gt $THRESHOLD ]; then
        # Still high, alert human
        curl -X POST $SLACK_WEBHOOK -d "{
            \"text\": \"⚠️ Disk cleanup ran but usage still ${NEW_USAGE}%\"
        }"
    else
        echo "Disk usage now ${NEW_USAGE}%"
    fi
fi

Kubernetes approach:

# Automated with log rotation and volume limits
apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  containers:
  - name: app
    image: myapp:latest  # placeholder application image
    volumeMounts:
    - name: logs
      mountPath: /var/log

  volumes:
  - name: logs
    emptyDir:
      sizeLimit: 1Gi  # Automatic limit

---
# Log rotation sidecar
apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  containers:
  - name: app
    image: myapp:latest  # placeholder for the main application container
    volumeMounts:
    - name: logs
      mountPath: /var/log

  - name: log-rotator
    image: blacklabelops/logrotate
    volumeMounts:
    - name: logs
      mountPath: /var/log

  volumes:
  - name: logs
    emptyDir:
      sizeLimit: 1Gi

Time savings: 45 min/week → 0 min/week = 3 hours/month

Example 3: Chatops for Common Tasks

Toil: Responding to “Can you check…” requests

Solution: Chatbot in Slack

# Slack bot for common operations
import os
import subprocess

from slack_bolt import App

app = App(token=os.environ["SLACK_BOT_TOKEN"])

@app.command("/check-service")
def check_service(ack, command, say):
    ack()
    # NOTE: sanitize/allowlist user input before passing it to shell commands
    service = command['text'].strip()

    # Run health check
    result = subprocess.run(
        f"kubectl get pods -l app={service}",
        shell=True,
        capture_output=True,
        text=True
    )

    say(f"```{result.stdout}```")

@app.command("/deploy")
def deploy_service(ack, command, say):
    ack()

    # Parse: /deploy myapp v2.3.0 production
    parts = command['text'].split()
    if len(parts) != 3:
        say("Usage: /deploy <service> <version> <environment>")
        return

    service, version, env = parts

    # Trigger deployment
    say(f"🚀 Deploying {service}@{version} to {env}...")

    result = subprocess.run(
        f"./deploy.sh {service} {version} {env}",
        shell=True,
        capture_output=True,
        text=True
    )

    if result.returncode == 0:
        say(f"✅ Deployment successful!")
    else:
        say(f"❌ Deployment failed:\n```{result.stderr}```")

@app.command("/scale")
def scale_service(ack, command, say):
    ack()

    # Parse: /scale myapp 10
    parts = command['text'].split()
    if len(parts) != 2:
        say("Usage: /scale <service> <replicas>")
        return
    service, replicas = parts

    subprocess.run(
        f"kubectl scale deployment/{service} --replicas={replicas}",
        shell=True
    )

    say(f"📈 Scaled {service} to {replicas} replicas")

app.start(port=3000)

Time savings:

  • Before: 5 min per request × 20 requests/week = 100 min/week
  • After: Instant self-service
  • Savings: 1.7 hours/week (6.7 hours/month)

Example 4: Infrastructure as Code

Toil: Manual server provisioning

Before:

1. Fill out request form
2. Wait for approval (1-3 days)
3. SRE manually creates server via web console
4. SRE manually configures networking
5. SRE manually installs software
6. SRE manually configures monitoring
7. SRE updates documentation

Total time: 2-4 hours of SRE time + 1-3 days waiting

After (Terraform):

# File: servers.tf
module "application_server" {
  source = "./modules/server"

  name          = "app-server-prod-1"
  instance_type = "t3.large"
  environment   = "production"

  monitoring_enabled = true
  backup_enabled     = true

  tags = {
    Team  = "platform"
    Owner = "[email protected]"
  }
}

# Apply with: terraform apply
# Time: 5 minutes (automated)

Self-service request:

# Developer makes PR to add their server
# PR approved → GitHub Actions applies Terraform
# Server provisioned automatically in 5 minutes

Time savings:

  • Before: 2 hours × 5 requests/week = 10 hours/week
  • After: 15 min review × 5 requests/week = 1.25 hours/week
  • Savings: 8.75 hours/week (35 hours/month)

Implementation Roadmap

Phase 1: Measure (Week 1-2)

**Goals:**
- Quantify current toil
- Identify top toil sources
- Build team awareness

**Actions:**
- [ ] Track time for 2 weeks
- [ ] Categorize all tasks (toil vs engineering)
- [ ] Calculate toil percentage
- [ ] Identify top 10 toil tasks
- [ ] Present findings to team

**Deliverable:** Toil inventory spreadsheet

Phase 2: Quick Wins (Week 3-6)

**Goals:**
- Build automation momentum
- Demonstrate ROI
- Get team buy-in

**Actions:**
- [ ] Pick 3 high-frequency, high-pain tasks
- [ ] Automate or eliminate them
- [ ] Measure time savings
- [ ] Share success stories

**Example quick wins:**
- Automated deployment script
- Self-service user provisioning
- Alert auto-remediation for common issues

Phase 3: Systematic Reduction (Month 2-3)

**Goals:**
- Reduce toil to <50%
- Build sustainable practices
- Create reusable tools

**Actions:**
- [ ] Automate top 10 toil tasks
- [ ] Build internal developer platform
- [ ] Implement IaC for all infrastructure
- [ ] Create runbooks with auto-remediation
- [ ] Set up toil tracking dashboard

**Deliverable:** 50% reduction in toil percentage

Phase 4: Continuous Improvement (Ongoing)

**Goals:**
- Maintain low toil levels
- Prevent new toil
- Scale operations without scaling toil

**Actions:**
- [ ] Weekly toil reviews in team meetings
- [ ] Automation-first mindset for new work
- [ ] Toil budget enforcement
- [ ] Quarterly toil audits
- [ ] Share automation across teams

**Metrics:**
- Toil percentage stays <50%
- Automation ROI >5x
- Team satisfaction improving

Organizational Support

Building the Case for Toil Reduction

Executive presentation:

# Toil Reduction Initiative

## Current State
- Engineers spend 60% of time on toil
- 40% time on engineering projects
- High team turnover (3 people left citing "boring work")
- Slower feature delivery

## Proposal
- Invest 3 months in toil automation
- Goal: Reduce toil to <50%
- 20% time allocated to automation projects

## Expected ROI
- Free up 15 hours/week per engineer (~4.5 person-months/year)
- Faster incident response
- Higher team morale
- Ability to scale without hiring

## Cost-Benefit
- Investment: $50K (3 months × 2 engineers)
- Annual savings: $200K (freed capacity)
- ROI: 4x in first year
- Payback period: 3 months

Team Practices

Make toil visible:

# Sprint planning
toil_capacity: 20 hours (50% of sprint)
engineering_capacity: 20 hours (50% of sprint)

sprint_backlog:
  toil:
    - On-call rotation (8 hours)
    - Incident response (6 hours)
    - Manual deployments (4 hours)
    - Support tickets (2 hours)

  engineering:
    - Automate deployment (8 hours)
    - Build self-service tool (8 hours)
    - Refactor monitoring (4 hours)

Toil reduction sprints:

  • Dedicate entire sprint to automation
  • No feature work, only toil elimination
  • Run quarterly

Celebrate wins:

# Team meeting: Toil Win of the Week

🏆 This week: Alice automated database backups
- Time saved: 2 hours/week
- Annual savings: 104 hours
- ROI: Built in 8 hours, 13x return

Next targets:
- Manual certificate renewals (Bob investigating)
- Log analysis for common errors (Carol prototyping)

Common Pitfalls

Pitfall 1: Automating Without Understanding

Problem:

# "Let's automate this!"
# Creates script that breaks production

Solution:

  1. Understand the manual process fully
  2. Document edge cases
  3. Test thoroughly in non-prod
  4. Gradual rollout
  5. Keep manual process as fallback
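
A sketch of step 5, keeping the manual process as a fallback while the automation earns trust (automation_enabled, automate, and notify_human are illustrative placeholders):

# Gradual-rollout sketch: run the automation behind a flag and fall back to
# the documented manual process if it fails. The flag and helpers are
# illustrative placeholders.
def handle_task(task, automation_enabled: bool, automate, notify_human):
    if not automation_enabled:
        notify_human(f"Please handle manually: {task}")
        return "manual"
    try:
        automate(task)
        return "automated"
    except Exception as exc:
        # Automation failed: fall back to the manual runbook rather than retrying blindly
        notify_human(f"Automation failed for {task} ({exc}); follow the runbook")
        return "manual-fallback"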

Pitfall 2: Over-Engineering

Problem:

Task takes 10 minutes/month to do manually
Spend 3 months building complex automation system

Solution:

  • Calculate ROI before building
  • Start simple (bash script before building platform)
  • Iterate based on actual usage

Pitfall 3: Automation Debt

Problem:

Built 20 automation scripts
No documentation
No maintenance
Scripts break, create more toil

Solution:

  • Document all automation
  • Version control everything
  • Assign ownership
  • Regular maintenance
  • Treat automation as production code

Pitfall 4: Ignoring Toil Budget

Problem:

"We're too busy to automate" (while drowning in toil)
Toil grows unchecked to 80% of time

Solution:

  • Enforce 50% toil budget
  • Stop feature work when over budget
  • Make toil reduction mandatory

Measuring Success

Key Performance Indicators

kpis:
  - name: toil_percentage
    current: 60%
    target: "<50%"
    trend: "decreasing"

  - name: automation_count
    description: "Number of tasks automated this quarter"
    current: 15
    target: 20

  - name: time_saved
    description: "Hours saved per week from automation"
    current: 25
    target: 40

  - name: team_satisfaction
    measurement: "Quarterly survey score"
    current: 6.5/10
    target: ">8/10"

  - name: incident_mttr
    description: "Mean time to resolution"
    current: "45 min"
    target: "<30 min"
    note: "Improved through automation"

Before/After Comparison

## 6-Month Results

### Time Allocation
Before:
- Toil: 60% (24 hours/week)
- Engineering: 40% (16 hours/week)

After:
- Toil: 35% (14 hours/week)
- Engineering: 65% (26 hours/week)

**Result: 10 hours/week freed up per engineer**

### Specific Improvements
| Task | Before | After | Savings |
|------|--------|-------|---------|
| Deployments | 5h/week | 30m/week | 4.5h/week |
| Server provisioning | 8h/week | 1h/week | 7h/week |
| Certificate renewals | 2h/week | 0 (automated) | 2h/week |
| Log analysis | 4h/week | 30m/week | 3.5h/week |
| Disk cleanup | 1h/week | 0 (automated) | 1h/week |

**Total: 18 hours/week saved across team**

### Business Impact
- Feature delivery velocity: +40%
- Incidents per month: -30%
- Time to resolve incidents: -35%
- Team turnover: 0 (was 3/year)
- Employee satisfaction: 6.5 → 8.2/10

Tools and Resources

Automation Tools

Infrastructure:

  • Terraform / Pulumi - Infrastructure as Code
  • Ansible / Chef - Configuration management
  • Kubernetes operators - Self-healing systems

CI/CD:

  • GitHub Actions / GitLab CI
  • Jenkins / CircleCI
  • ArgoCD / Flux - GitOps

Monitoring & Auto-remediation:

  • Prometheus + Alertmanager
  • PagerDuty / Opsgenie
  • Rundeck - Automation platform

Chatops:

  • Hubot / Slack Bolt SDK
  • Custom Slack/Discord bots

Useful Scripts

#!/bin/bash
# Calculate toil from time tracking
# File: calculate-toil.sh
# Expects CSV rows of: date,task,hours,category

CSV_FILE="time-tracking.csv"

total_hours=$(awk -F',' '{sum+=$3} END {print sum}' $CSV_FILE)
toil_hours=$(awk -F',' '$4=="toil" {sum+=$3} END {print sum}' $CSV_FILE)

toil_percentage=$(echo "scale=1; $toil_hours / $total_hours * 100" | bc)

echo "Total hours: $total_hours"
echo "Toil hours: $toil_hours"
echo "Toil percentage: $toil_percentage%"

if (( $(echo "$toil_percentage > 50" | bc -l) )); then
    echo "⚠️  Toil exceeds 50% target!"
else
    echo "✅ Toil within acceptable range"
fi

Conclusion

Toil reduction is not a one-time project—it’s a continuous practice. Key takeaways:

  1. Measure first - You can’t improve what you don’t measure
  2. Calculate ROI - Automate high-impact tasks first
  3. Start small - Quick wins build momentum
  4. Enforce budgets - Keep toil below 50%
  5. Celebrate successes - Share wins, build automation culture
  6. Think long-term - Prevent toil, don’t just treat symptoms
  7. Make it cultural - Automation-first mindset for all work

Remember: Time spent on toil is time not spent on innovation, improvement, and engineering. Every hour of toil eliminated is an hour gained for valuable work.

“The best time to automate toil was yesterday. The second best time is today.”