Incident Summary
Date: 2025-07-22
Time: 11:20 UTC
Duration: 3 hours 45 minutes
Severity: SEV-2 (High)
Impact: Progressive service degradation with intermittent failures
Quick Facts
- Users Affected: ~40% experiencing intermittent errors
- Services Affected: Multiple microservices across 3 Kubernetes nodes
- Nodes Failed: 3 out of 8 worker nodes
- Pods Evicted: 47 pods due to disk pressure
- SLO Impact: 35% of monthly error budget consumed
Timeline
- 11:20:00 - Prometheus alert: Node disk usage >85% on node-worker-3
- 11:22:00 - On-call engineer (Tom) acknowledged alert
- 11:25:00 - Checked node: 92% disk usage, mostly logs
- 11:28:00 - Second alert: node-worker-5 also >85%
- 11:30:00 - Third alert: node-worker-7 >85%
- 11:32:00 - Senior SRE (Rachel) joined investigation
- 11:35:00 - Pattern identified: All nodes running logging-agent pod
- 11:38:00 - First node reached 98% disk usage
- 11:40:00 - Kubelet started evicting pods due to disk pressure
- 11:42:00 - 12 pods evicted from node-worker-3
- 11:45:00 - User reports: Intermittent 503 errors
- 11:47:00 - Incident escalated to SEV-2
- 11:50:00 - Identified root cause: Log rotation not working for logging-agent
- 11:52:00 - Emergency: Manual log cleanup on affected nodes
- 11:58:00 - First node cleaned: 92% → 45% disk usage
- 12:05:00 - Second node cleaned: 88% → 40% disk usage
- 12:10:00 - Third node cleaned: 95% → 42% disk usage
- 12:15:00 - All evicted pods rescheduled and running
- 12:30:00 - Deployed fix for log rotation issue
- 12:45:00 - Monitoring shows disk usage stabilizing
- 13:00:00 - Implemented automated log cleanup job
- 13:30:00 - Added improved monitoring and alerts
- 14:15:00 - Verified all nodes healthy, services normal
- 15:05:00 - Incident marked as resolved
Root Cause Analysis
What Happened
A logging agent (Fluentd) was deployed on all Kubernetes nodes to collect and forward logs to Elasticsearch. Due to a configuration error, its buffer files were never rotated or size-limited, so they grew indefinitely on the host disk.
The cascade:
- July 15 - Logging agent v2.4.0 deployed with new configuration
- July 15-22 - Log files accumulated without rotation
- July 22 11:20 - First node reached 85% disk usage
- July 22 11:40 - Nodes hit 98%, kubelet started evicting pods
- July 22 11:45 - Service degradation due to reduced capacity
The Configuration Bug
Problematic configuration:
# fluentd-daemonset.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluentd
  namespace: logging
spec:
  template:
    spec:
      containers:
      - name: fluentd
        image: fluent/fluentd:v1.16-1
        volumeMounts:
        - name: varlog
          mountPath: /var/log
        - name: fluentd-buffer
          mountPath: /var/fluentd/buffer      # BUG: Unbounded growth!
      volumes:
      - name: varlog
        hostPath:
          path: /var/log
      - name: fluentd-buffer
        hostPath:
          path: /mnt/fluentd-buffer           # Using host disk without limits
Fluentd configuration (fluent.conf):
<source>
@type tail
path /var/log/containers/*.log
pos_file /var/fluentd/buffer/fluentd-containers.log.pos
tag kubernetes.*
read_from_head true
<parse>
@type json
</parse>
</source>
<match kubernetes.**>
@type elasticsearch
host elasticsearch.logging.svc.cluster.local
port 9200
logstash_format true
# BUG: Buffer settings without size limits!
<buffer>
@type file
path /var/fluentd/buffer/buffer.*
flush_interval 5s
retry_max_times 3
# Missing: chunk_limit_size, total_limit_size
</buffer>
</match>
What went wrong:
- No buffer size limit - Fluentd buffer could grow indefinitely
- Elasticsearch backpressure - When ES was slow, buffer accumulated
- No log rotation - Old buffer files never cleaned up
- Position file growth - .pos files tracking log positions grew large
- Failed retries stored - Failed log shipments saved to disk
Disk Usage Breakdown
node-worker-3 at incident peak (98% full):
Filesystem Size Used Avail Use% Mounted on
/dev/sda1 100G 98G 2G 98% /
Breakdown:
/mnt/fluentd-buffer/       82GB   ← The problem!
├── buffer.*.log           75GB   (unsent logs)
├── *.pos files             5GB   (position tracking)
└── retry buffers           2GB   (failed retries)
/var/log/                   8GB
├── containers/*.log        5GB   (normal)
└── pods/*.log              3GB   (normal)
/var/lib/docker/            6GB   (normal)
/var/lib/kubelet/           2GB   (normal)
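A breakdown like this can be reproduced on a node with du (a quick sketch; the paths are the ones involved in this incident):
# Largest directories on the root filesystem, two levels deep, same filesystem only (-x)
sudo du -x -h -d 2 / 2>/dev/null | sort -rh | head -20
# Drill into the suspected offenders
sudo du -sh /mnt/fluentd-buffer/* /var/log/* 2>/dev/null | sort -rh | head -20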
Why Elasticsearch Was Slow
During the investigation, we discovered the Elasticsearch cluster was overloaded:
# Elasticsearch cluster status
curl -X GET "elasticsearch:9200/_cluster/health?pretty"
{
"status": "yellow",
"active_shards": 45,
"relocating_shards": 5,
"initializing_shards": 2,
"unassigned_shards": 8,
"number_of_pending_tasks": 127 โ High!
}
# Indexing rate
curl -X GET "elasticsearch:9200/_stats/indexing?pretty"
{
"indexing": {
"index_total": 1250000000,
"index_time_in_millis": 28800000,
"index_failed": 45000 โ Failures!
}
}
Elasticsearch problems:
- Too many small shards (inefficient)
- No index lifecycle management (ILM)
- Heap size too small for index rate
- All 7 days of logs in “hot” tier
This created a feedback loop:
ES slow → Fluentd buffers → Disk fills → Pods evicted → More logs → ES slower
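A few Elasticsearch cat APIs are handy for confirming this side of the loop (a diagnostic sketch; run from anywhere that can reach the cluster):
# Pending cluster tasks (matches number_of_pending_tasks above)
curl -s "elasticsearch:9200/_cat/pending_tasks?v" | head
# Write thread pool pressure: queued and rejected indexing requests
curl -s "elasticsearch:9200/_cat/thread_pool/write?v&h=node_name,active,queue,rejected"
# Shards per index - many tiny shards was one of the problems listed above
curl -s "elasticsearch:9200/_cat/shards?v" | awk 'NR>1 {print $1}' | sort | uniq -c | sort -rn | head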
Immediate Fix
Step 1: Emergency Log Cleanup
# SSH to affected nodes
for node in node-worker-3 node-worker-5 node-worker-7; do
echo "Cleaning $node..."
ssh $node << 'EOF'
# Check disk usage
df -h /
# Stop fluentd temporarily
systemctl stop fluentd
# Clean old buffer files (older than 1 hour)
find /mnt/fluentd-buffer -name "buffer.*.log" -mmin +60 -delete
find /mnt/fluentd-buffer -name "*.pos" -size +100M -delete
# Compress old container logs
find /var/log/containers -name "*.log" -mtime +1 -exec gzip {} \;
# Check disk after cleanup
df -h /
# Restart fluentd
systemctl start fluentd
EOF
done
# Results:
# node-worker-3: 98% → 45%
# node-worker-5: 88% → 40%
# node-worker-7: 95% → 42%
Step 2: Fix Fluentd Configuration
Updated configuration with limits:
<match kubernetes.**>
@type elasticsearch
host elasticsearch.logging.svc.cluster.local
port 9200
logstash_format true
<buffer>
@type file
path /var/fluentd/buffer/buffer.*
flush_interval 5s
retry_max_times 3
# FIX: Add size limits
chunk_limit_size 8MB # Max size per chunk
total_limit_size 2GB # Max total buffer size
overflow_action drop_oldest_chunk # Drop old data if full
</buffer>
</match>
Updated DaemonSet with emptyDir (ephemeral):
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluentd
  namespace: logging
spec:
  template:
    spec:
      containers:
      - name: fluentd
        image: fluent/fluentd:v1.16-1
        volumeMounts:
        - name: varlog
          mountPath: /var/log
          readOnly: true
        - name: fluentd-buffer
          mountPath: /var/fluentd/buffer
        resources:
          limits:
            memory: 512Mi
          requests:
            memory: 256Mi
      volumes:
      - name: varlog
        hostPath:
          path: /var/log
      - name: fluentd-buffer
        emptyDir:
          sizeLimit: 5Gi                      # FIX: Limit buffer to 5GB per node
Deploy updated configuration:
# Update ConfigMap with new fluent.conf
kubectl create configmap fluentd-config \
--from-file=fluent.conf \
--dry-run=client -o yaml | kubectl apply -f -
# Update DaemonSet
kubectl apply -f fluentd-daemonset.yaml
# Rolling restart
kubectl rollout restart daemonset/fluentd -n logging
# Monitor rollout
kubectl rollout status daemonset/fluentd -n logging
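To confirm the new limits are holding after the rollout (kubectl exec against the DaemonSet picks one of its pods):
# Buffer directory should stay well under the 5Gi emptyDir sizeLimit
kubectl exec -n logging daemonset/fluentd -- du -sh /var/fluentd/buffer
# Node disk should be recovering
kubectl top nodes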
Long-term Prevention
1. Automated Log Cleanup
Deployed log cleanup CronJob:
apiVersion: batch/v1
kind: CronJob
metadata:
  name: node-log-cleanup
  namespace: kube-system
spec:
  schedule: "0 */6 * * *"   # Every 6 hours
  jobTemplate:
    spec:
      template:
        spec:
          hostPID: true
          hostNetwork: true
          containers:
          - name: log-cleanup
            image: alpine:3.18
            command:
            - /bin/sh
            - -c
            - |
              echo "Starting log cleanup on $(hostname)"
              # Compress logs older than 1 day
              find /host/var/log/containers -name "*.log" -mtime +1 -exec gzip {} \;
              # Delete compressed logs older than 3 days
              find /host/var/log/containers -name "*.log.gz" -mtime +3 -delete
              # Vacuum old journal logs on the host (hostPID + privileged let us
              # nsenter PID 1's namespaces, where journalctl is available)
              nsenter -t 1 -m -u -i -n -- journalctl --vacuum-time=7d || true
              # Report disk usage
              echo "Disk usage after cleanup:"
              df -h /host/var/log
            volumeMounts:
            - name: varlog
              mountPath: /host/var/log
            securityContext:
              privileged: true
          volumes:
          - name: varlog
            hostPath:
              path: /var/log
          restartPolicy: OnFailure
          tolerations:
          - effect: NoSchedule
            operator: Exists
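To test the job without waiting for the schedule, it can be triggered manually:
kubectl create job --from=cronjob/node-log-cleanup node-log-cleanup-manual -n kube-system
kubectl logs -f job/node-log-cleanup-manual -n kube-system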
2. Disk Usage Monitoring
Enhanced Prometheus alerts:
# Prometheus alert rules
groups:
- name: disk_space
  rules:
  - alert: NodeDiskPressure
    expr: |
      (node_filesystem_avail_bytes{mountpoint="/"} /
       node_filesystem_size_bytes{mountpoint="/"}) < 0.15
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Node {{ $labels.instance }} disk usage >85%"
      description: "Only {{ $value | humanizePercentage }} disk space available"
  - alert: NodeDiskPressureCritical
    expr: |
      (node_filesystem_avail_bytes{mountpoint="/"} /
       node_filesystem_size_bytes{mountpoint="/"}) < 0.05
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: "Node {{ $labels.instance }} disk usage >95%"
      description: "Critical: Only {{ $value | humanizePercentage }} disk space available"
  - alert: NodeDiskFilling
    expr: |
      predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[6h], 24*3600) < 0
    for: 1h
    labels:
      severity: warning
    annotations:
      summary: "Node {{ $labels.instance }} disk will fill in <24h"
      description: "Disk predicted to fill based on 6h trend"
  - alert: FluentdBufferHigh
    expr: |
      node_filesystem_avail_bytes{mountpoint="/", job="node-exporter"} < 10737418240   # 10GB free
    labels:
      severity: warning
    annotations:
      summary: "Fluentd buffer may be accumulating (low free disk used as a proxy)"
Grafana dashboard:
{
"dashboard": {
"title": "Node Disk Usage",
"panels": [
{
"title": "Disk Usage by Node",
"targets": [{
"expr": "100 - (node_filesystem_avail_bytes{mountpoint='/'} / node_filesystem_size_bytes{mountpoint='/'} * 100)"
}],
"thresholds": [
{"value": 80, "color": "yellow"},
{"value": 90, "color": "red"}
]
},
{
"title": "Disk Usage Trend",
"targets": [{
"expr": "predict_linear(node_filesystem_avail_bytes{mountpoint='/'}[1h], 24*3600)"
}]
},
{
"title": "Top Directories by Size",
"targets": [{
"expr": "node_directory_size_bytes"
}]
}
]
}
}
3. Elasticsearch Improvements
Implement Index Lifecycle Management (ILM):
// Elasticsearch ILM policy
PUT _ilm/policy/logs_policy
{
"policy": {
"phases": {
"hot": {
"min_age": "0ms",
"actions": {
"rollover": {
"max_size": "50gb",
"max_age": "1d"
},
"set_priority": {
"priority": 100
}
}
},
"warm": {
"min_age": "3d",
"actions": {
"shrink": {
"number_of_shards": 1
},
"forcemerge": {
"max_num_segments": 1
},
"set_priority": {
"priority": 50
}
}
},
"cold": {
"min_age": "7d",
"actions": {
"freeze": {},
"set_priority": {
"priority": 0
}
}
},
"delete": {
"min_age": "30d",
"actions": {
"delete": {}
}
}
}
}
}
// Apply to index template
PUT _index_template/logs_template
{
"index_patterns": ["logs-*"],
"template": {
"settings": {
"index.lifecycle.name": "logs_policy",
"index.lifecycle.rollover_alias": "logs",
"number_of_shards": 3,
"number_of_replicas": 1
}
}
}
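Once indices start rolling over, the policy can be verified from Kibana Dev Tools (or curl):
// Which ILM phase each logs index is in, and any ILM errors
GET logs-*/_ilm/explain
// Shard count and on-disk size per index
GET _cat/indices/logs-*?v&h=index,pri,rep,docs.count,store.size&s=index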
Elasticsearch monitoring:
- alert: ElasticsearchIndexingBacklog
  expr: elasticsearch_indices_indexing_index_current > 1000
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "Elasticsearch has {{ $value }} pending indexing operations"
- alert: ElasticsearchDiskUsage
  expr: |
    100 - (elasticsearch_filesystem_data_available_bytes /
           elasticsearch_filesystem_data_size_bytes * 100) > 85
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Elasticsearch disk usage >85%"
4. Kubernetes Best Practices
Set resource limits on DaemonSets:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluentd
spec:
  template:
    spec:
      containers:
      - name: fluentd
        resources:
          requests:
            memory: "256Mi"
            cpu: "200m"
            ephemeral-storage: "2Gi"   # Request storage
          limits:
            memory: "512Mi"
            cpu: "500m"
            ephemeral-storage: "5Gi"   # Limit storage
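Exceeding an ephemeral-storage limit causes the kubelet to evict that pod, which keeps the blast radius to the offending workload. A namespace-wide default can also be set so nothing claims unbounded node disk by accident; a minimal sketch (the LimitRange name is illustrative):
apiVersion: v1
kind: LimitRange
metadata:
  name: default-ephemeral-storage
  namespace: logging
spec:
  limits:
  - type: Container
    defaultRequest:
      ephemeral-storage: "1Gi"   # applied when a container sets no request
    default:
      ephemeral-storage: "4Gi"   # applied when a container sets no limit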
Kubelet eviction thresholds:
# kubelet configuration
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionHard:
  memory.available: "100Mi"
  nodefs.available: "10%"      # Hard-evict at 90% disk usage
  nodefs.inodesFree: "5%"
  imagefs.available: "15%"
evictionSoft:
  memory.available: "200Mi"
  nodefs.available: "15%"      # Soft-evict (after grace period) at 85% disk usage
  nodefs.inodesFree: "10%"
  imagefs.available: "20%"
evictionSoftGracePeriod:
  memory.available: "1m30s"
  nodefs.available: "2m"
  nodefs.inodesFree: "2m"
  imagefs.available: "2m"
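To confirm which thresholds a node is actually running with (node name from this incident; requires API server proxy access):
kubectl get --raw "/api/v1/nodes/node-worker-3/proxy/configz" \
  | jq '.kubeletconfig | {evictionHard, evictionSoft, evictionSoftGracePeriod}'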
5. Log Sampling and Filtering
Reduce log volume with sampling:
# fluentd configuration with sampling
<filter kubernetes.**>
@type sampling
sample_rate 10 # Keep only 10% of debug logs
tag_key level
tag_pattern /DEBUG/
</filter>
<filter kubernetes.**>
@type grep
<exclude>
key log
pattern /healthcheck|heartbeat/ # Exclude health checks
</exclude>
</filter>
<filter kubernetes.**>
@type record_transformer
enable_ruby true
<record>
# Add metadata but keep messages short
log ${record["log"][0..1000]} # Truncate to 1000 chars
</record>
</filter>
Lessons Learned
What Went Well ✅
- Alert fired early - Caught at 85% before critical failure
- Manual cleanup effective - Emergency response freed space quickly
- No data loss - All critical logs already in Elasticsearch
- Good pod distribution - Only 3/8 nodes affected
- Kubernetes resilience - Automatically rescheduled evicted pods
- Team response - Quick identification of root cause
What Went Wrong ❌
- No buffer limits - Fluentd allowed unlimited disk usage
- Using hostPath - Direct host disk access without limits
- No log rotation - Buffer files accumulated indefinitely
- ES backpressure ignored - No alerting on indexing backlog
- No disk trend monitoring - Problem built up over 7 days
- Testing gap - Didn’t test behavior when ES was slow
- No cleanup automation - Manual process required
Surprises 😮
- How fast nodes filled - 85% to 98% in 20 minutes
- Buffer file size - A single node accumulated 82GB
- Cascading effect - Node problems generated more logs, making things worse
- Pod eviction impact - Users noticed immediately
- Elasticsearch was the bottleneck - It presented as a logging-agent issue, but ES backpressure caused the buffers to grow
Action Items
Completed ✅
Action | Owner | Completed |
---|---|---|
Emergency disk cleanup on affected nodes | SRE Team | 2025-07-22 |
Add buffer size limits to Fluentd | SRE Team | 2025-07-22 |
Change to emptyDir with size limits | SRE Team | 2025-07-22 |
Deploy log cleanup CronJob | Platform Team | 2025-07-22 |
Add disk space trend monitoring | SRE Team | 2025-07-23 |
In Progress 🔄
Action | Owner | Target Date |
---|---|---|
Implement Elasticsearch ILM policies | Platform Team | 2025-07-30 |
Add log sampling for debug logs | Dev Team | 2025-08-05 |
Review all DaemonSets for resource limits | SRE Team | 2025-08-10 |
Planned ⏳
Action | Owner | Target Date |
---|---|---|
Increase node disk size from 100GB to 200GB | Infrastructure Team | 2025-08-15 |
Implement dedicated logging nodes | Platform Team | 2025-09-01 |
Add log volume budgets per namespace | Platform Team | 2025-09-15 |
Chaos testing: Simulate slow Elasticsearch | SRE Team | 2025-10-01 |
Technical Deep Dive
Kubernetes Disk Pressure Eviction
How Kubernetes handles disk pressure:
1. Kubelet monitors disk usage every 10 seconds
2. When threshold exceeded:
   ├── Set node condition: DiskPressure=True
   ├── Stop scheduling new pods on node
   └── Start evicting pods
3. Pod eviction priority (lowest to highest):
   ├── BestEffort pods (no resources set)
   ├── Burstable pods (requests < limits)
   └── Guaranteed pods (requests = limits)
4. Within same QoS class:
   ├── Pods exceeding resource requests
   └── Pods under resource requests (by creation time)
During our incident:
# Node status showed disk pressure
kubectl describe node node-worker-3
Conditions:
Type Status LastHeartbeatTime Reason
---- ------ ----------------- ------
DiskPressure True Jul 22 11:40 UTC KubeletHasDiskPressure
Events:
Type Reason Message
---- ------ -------
Warning EvictionStarted Evicting pod nginx-7d8b49c8-xyz (BestEffort)
Warning EvictionStarted Evicting pod api-worker-abc-123 (Burstable)
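One way to keep critical services out of the early eviction rounds is to give them Guaranteed QoS, i.e. requests equal to limits. A minimal sketch (the pod and image names are illustrative):
apiVersion: v1
kind: Pod
metadata:
  name: api-worker
spec:
  containers:
  - name: api-worker
    image: example/api-worker:1.0
    resources:
      requests:
        cpu: "500m"
        memory: "512Mi"
      limits:
        cpu: "500m"       # requests == limits -> Guaranteed QoS
        memory: "512Mi"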
Log Volume Calculations
Our log volume:
Cluster setup:
- 8 nodes
- 150 pods average per node
- 1200 total pods
Log generation per pod:
- Average: 10 KB/sec per pod
- Total: 1200 × 10 KB/sec = 12 MB/sec
- Per day: 12 MB/sec × 86400 ≈ 1036 GB/day ≈ 1 TB/day!
Elasticsearch storage:
- 30-day retention
- Needed: 30 TB
- Had: 10 TB (problem!)
- With ILM (compression + warm tier): ~12 TB needed
Fluentd Buffer Mechanics
How Fluentd buffering works:
1. Read log → 2. Buffer → 3. Send to destination
Buffer types:
├── Memory buffer (fast, but limited)
└── File buffer (slower, but persistent)
File buffer process:
1. Chunk created (default 8MB)
2. Chunk filled with log lines
3. Chunk "staged" when full or flush_interval
4. Chunk sent to destination
5. Chunk deleted on success
Problem scenario:
1. Chunk staged → 2. Send fails → 3. Retry → 4. Keep failing
   → Chunks accumulate on disk!
Our issue:
Normal operation:
- Chunk created: 8MB every 30 seconds
- Send time: 2 seconds
- Chunks on disk: 1-2 (16MB)
During ES slowdown:
- Chunk created: 8MB every 30 seconds
- Send time: 120 seconds (slow!)
- Chunks on disk: 60+ (480MB)
After 7 days:
- Failed chunks accumulated: 82GB
- Retry queue: 10,000+ files
- Disk: 98% full
Appendix
Useful Commands
Check disk usage on nodes:
# Via kubectl
kubectl top nodes
# Detailed on specific node
kubectl get --raw /api/v1/nodes/node-worker-3/proxy/stats/summary | jq '.node.fs'
# SSH to node
ssh node-worker-3
df -h
du -sh /var/log/* | sort -h
du -sh /mnt/fluentd-buffer/*
Check pod evictions:
# See evicted pods
kubectl get pods --all-namespaces -o json | \
jq -r '.items[] | select(.status.reason == "Evicted") |
"\(.metadata.namespace)/\(.metadata.name)"'
# Count evictions
kubectl get events --all-namespaces | grep -i evicted | wc -l
Monitor Fluentd buffer:
# Check buffer size
kubectl exec -it fluentd-xyz -n logging -- \
du -sh /var/fluentd/buffer
# Check buffer file count
kubectl exec -it fluentd-xyz -n logging -- \
find /var/fluentd/buffer -name "buffer.*" | wc -l
# Check Fluentd metrics
kubectl exec -it fluentd-xyz -n logging -- \
curl localhost:24220/api/plugins.json | jq '.plugins[] | select(.type == "output")'
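The plugins.json endpoint above is only served if monitor_agent is enabled in the Fluentd config (an assumption - it is not shown in the fluent.conf earlier):
<source>
  @type monitor_agent
  bind 0.0.0.0
  port 24220
</source>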
Emergency cleanup script:
#!/bin/bash
# emergency-disk-cleanup.sh
NODE=$1
THRESHOLD=90
current_usage=$(ssh $NODE "df / | tail -1 | awk '{print \$5}' | sed 's/%//'")
if [ "$current_usage" -gt "$THRESHOLD" ]; then
echo "Node $NODE at ${current_usage}%, cleaning..."
ssh $NODE << 'EOF'
# Stop log collection temporarily
systemctl stop fluentd
# Clean old logs
find /var/log/containers -name "*.log" -mtime +1 -delete
find /mnt/fluentd-buffer -type f -mmin +60 -delete
# Clear journal logs
journalctl --vacuum-time=3d
# Restart fluentd
systemctl start fluentd
# Report results
df -h /
EOF
fi
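Example usage (assumes SSH access to the nodes from the operator host):
chmod +x emergency-disk-cleanup.sh
./emergency-disk-cleanup.sh node-worker-3   # repeat per affected node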
Incident Commander: Rachel Foster
Contributors: Tom Brady (On-call), Sam Mitchell (Platform), Diana Prince (Dev)
Postmortem Completed: 2025-07-23
Next Review: 2025-08-23 (1-month follow-up)