Incident: Disk Space Exhaustion Causes Node Failures
Incident Summary

Date: 2025-07-22
Time: 11:20 UTC
Duration: 3 hours 45 minutes
Severity: SEV-2 (High)
Impact: Progressive service degradation with intermittent failures
Quick Facts

Users Affected: ~40% experiencing intermittent errors
Services Affected: Multiple microservices across 3 Kubernetes nodes
Nodes Failed: 3 out of 8 worker nodes
Pods Evicted: 47 pods due to disk pressure
SLO Impact: 35% of monthly error budget consumed

Timeline

11:20:00 - Prometheus alert: Node disk usage >85% on node-worker-3
11:22:00 - On-call engineer (Tom) acknowledged alert
11:25:00 - Checked node: 92% disk usage, mostly logs
11:28:00 - Second alert: node-worker-5 also >85%
11:30:00 - Third alert: node-worker-7 >85%
11:32:00 - Senior SRE (Rachel) joined investigation
11:35:00 - Pattern identified: All nodes running logging-agent pod
11:38:00 - First node reached 98% disk usage
11:40:00 - Kubelet started evicting pods due to disk pressure
11:42:00 - 12 pods evicted from node-worker-3
11:45:00 - User reports: Intermittent 503 errors
11:47:00 - Incident escalated to SEV-2
11:50:00 - Identified root cause: Log rotation not working for logging-agent
11:52:00 - Emergency: Manual log cleanup on affected nodes
11:58:00 - First node cleaned: 92% → 45% disk usage
12:05:00 - Second node cleaned: 88% → 40% disk usage
12:10:00 - Third node cleaned: 95% → 42% disk usage
12:15:00 - All evicted pods rescheduled and running
12:30:00 - Deployed fix for log rotation issue
12:45:00 - Monitoring shows disk usage stabilizing
13:00:00 - Implemented automated log cleanup job
13:30:00 - Added improved monitoring and alerts
14:15:00 - Verified all nodes healthy, services normal
15:05:00 - Incident marked as resolved

Root Cause Analysis

What Happened

A logging agent (Fluentd) was deployed on all Kubernetes nodes to collect and forward logs to Elasticsearch. Due to a configuration error, log rotation was not working properly, causing log files to grow indefinitely.
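As a reference point for the mitigation added at 13:00, below is a minimal sketch of an automated log cleanup job: once disk usage on the log filesystem crosses a threshold, it deletes the oldest rotated log files until usage falls back under a target. The log directory, thresholds, and file-naming pattern here are illustrative assumptions, not the configuration actually deployed.

```python
#!/usr/bin/env python3
"""Illustrative sketch of an automated log cleanup job.

Assumptions (not the actual production values): rotated logs live under
/var/log/containers, cleanup starts above 80% disk usage, and stops once
usage drops below 60%.
"""

import shutil
from pathlib import Path

LOG_DIR = Path("/var/log/containers")  # assumed log location
USAGE_THRESHOLD = 0.80                 # assumed: start cleaning above 80% usage
TARGET_USAGE = 0.60                    # assumed: stop once usage drops below 60%


def disk_usage_fraction(path: Path) -> float:
    """Return the used/total fraction for the filesystem containing `path`."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def rotated_log_files(log_dir: Path) -> list[Path]:
    """Return rotated log files (e.g. app.log.1, app.log.2.gz), oldest first."""
    candidates = [p for p in log_dir.glob("*.log.*") if p.is_file()]
    return sorted(candidates, key=lambda p: p.stat().st_mtime)


def cleanup(log_dir: Path = LOG_DIR) -> None:
    """Delete the oldest rotated logs until disk usage is back under target."""
    if disk_usage_fraction(log_dir) < USAGE_THRESHOLD:
        return  # disk usage is healthy; nothing to do
    for path in rotated_log_files(log_dir):
        path.unlink(missing_ok=True)
        if disk_usage_fraction(log_dir) < TARGET_USAGE:
            break


if __name__ == "__main__":
    cleanup()
```

In practice a job like this would typically run as a Kubernetes CronJob or a DaemonSet sidecar with the host log directory mounted, alongside a proper fix to the agent's own rotation settings rather than as a substitute for it.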
…