Incident Summary

Date: 2025-07-22
Time: 11:20 UTC
Duration: 3 hours 45 minutes
Severity: SEV-2 (High)
Impact: Progressive service degradation with intermittent failures

Quick Facts

  • Users Affected: ~40% experiencing intermittent errors
  • Services Affected: Multiple microservices across 3 Kubernetes nodes
  • Nodes Failed: 3 out of 8 worker nodes
  • Pods Evicted: 47 pods due to disk pressure
  • SLO Impact: 35% of monthly error budget consumed

Timeline

  • 11:20:00 - Prometheus alert: Node disk usage >85% on node-worker-3
  • 11:22:00 - On-call engineer (Tom) acknowledged alert
  • 11:25:00 - Checked node: 92% disk usage, mostly logs
  • 11:28:00 - Second alert: node-worker-5 also >85%
  • 11:30:00 - Third alert: node-worker-7 >85%
  • 11:32:00 - Senior SRE (Rachel) joined investigation
  • 11:35:00 - Pattern identified: All nodes running logging-agent pod
  • 11:38:00 - First node reached 98% disk usage
  • 11:40:00 - Kubelet started evicting pods due to disk pressure
  • 11:42:00 - 12 pods evicted from node-worker-3
  • 11:45:00 - User reports: Intermittent 503 errors
  • 11:47:00 - Incident escalated to SEV-2
  • 11:50:00 - Identified root cause: Log rotation not working for logging-agent
  • 11:52:00 - Emergency: Manual log cleanup on affected nodes
  • 11:58:00 - First node cleaned: 98% → 45% disk usage
  • 12:05:00 - Second node cleaned: 88% → 40% disk usage
  • 12:10:00 - Third node cleaned: 95% → 42% disk usage
  • 12:15:00 - All evicted pods rescheduled and running
  • 12:30:00 - Deployed fix for log rotation issue
  • 12:45:00 - Monitoring shows disk usage stabilizing
  • 13:00:00 - Implemented automated log cleanup job
  • 13:30:00 - Added improved monitoring and alerts
  • 14:15:00 - Verified all nodes healthy, services normal
  • 15:05:00 - Incident marked as resolved

Root Cause Analysis

What Happened

A logging agent (Fluentd) was deployed on all Kubernetes nodes to collect and forward logs to Elasticsearch. Due to a configuration error, log rotation was not working properly, causing log files to grow indefinitely.

The cascade:

  1. July 15 - Logging agent v2.4.0 deployed with new configuration
  2. July 15-22 - Log files accumulated without rotation
  3. July 22 11:20 - First node reached 85% disk usage
  4. July 22 11:40 - Nodes hit 98%, kubelet started evicting pods
  5. July 22 11:45 - Service degradation due to reduced capacity

The Configuration Bug

Problematic configuration:

# fluentd-daemonset.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluentd
  namespace: logging
spec:
  template:
    spec:
      containers:
      - name: fluentd
        image: fluent/fluentd:v1.16-1
        volumeMounts:
        - name: varlog
          mountPath: /var/log
        - name: fluentd-buffer
          mountPath: /var/fluentd/buffer  # BUG: Unbounded growth!
      volumes:
      - name: varlog
        hostPath:
          path: /var/log
      - name: fluentd-buffer
        hostPath:
          path: /mnt/fluentd-buffer  # Using host disk without limits

Fluentd configuration (fluent.conf):

<source>
  @type tail
  path /var/log/containers/*.log
  pos_file /var/fluentd/buffer/fluentd-containers.log.pos
  tag kubernetes.*
  read_from_head true
  <parse>
    @type json
  </parse>
</source>

<match kubernetes.**>
  @type elasticsearch
  host elasticsearch.logging.svc.cluster.local
  port 9200
  logstash_format true

  # BUG: Buffer settings without size limits!
  <buffer>
    @type file
    path /var/fluentd/buffer/buffer.*
    flush_interval 5s
    retry_max_times 3
    # Missing: chunk_limit_size, total_limit_size
  </buffer>
</match>

What went wrong:

  1. No buffer size limit - Fluentd buffer could grow indefinitely
  2. Elasticsearch backpressure - When ES was slow, buffer accumulated
  3. No log rotation - Old buffer files never cleaned up
  4. Position file growth - .pos files tracking log positions grew large
  5. Failed retries stored - Failed log shipments saved to disk

Disk Usage Breakdown

node-worker-3 at incident peak (98% full):

Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1       100G   98G   2G  98% /

Breakdown:
/mnt/fluentd-buffer/       82GB  ← The problem!
  ├─ buffer.*.log          75GB  (unsent logs)
  ├─ *.pos files           5GB   (position tracking)
  └─ retry buffers         2GB   (failed retries)

/var/log/                  8GB
  ├─ containers/*.log      5GB   (normal)
  └─ pods/*.log            3GB   (normal)

/var/lib/docker/           6GB   (normal)
/var/lib/kubelet/          2GB   (normal)

Why Elasticsearch Was Slow

During the investigation, we discovered the Elasticsearch cluster was overloaded:

# Elasticsearch cluster status
curl -X GET "elasticsearch:9200/_cluster/health?pretty"

{
  "status": "yellow",
  "active_shards": 45,
  "relocating_shards": 5,
  "initializing_shards": 2,
  "unassigned_shards": 8,
  "number_of_pending_tasks": 127  โ† High!
}

# Indexing rate
curl -X GET "elasticsearch:9200/_stats/indexing?pretty"

{
  "indexing": {
    "index_total": 1250000000,
    "index_time_in_millis": 28800000,
    "index_failed": 45000  โ† Failures!
  }
}

Elasticsearch problems:

  • Too many small shards (inefficient)
  • No index lifecycle management (ILM)
  • Heap size too small for index rate
  • All 7 days of logs in “hot” tier

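These symptoms can be confirmed with Elasticsearch's _cat APIs; a few read-only checks of the kind used during the investigation (same cluster hostname as the curl examples above):

# Shard count and size per index (many small shards is the red flag)
curl -s "elasticsearch:9200/_cat/indices?v&h=index,pri,rep,docs.count,store.size&s=store.size:desc"

# Heap and CPU pressure per node
curl -s "elasticsearch:9200/_cat/nodes?v&h=name,heap.percent,ram.percent,cpu"

# Pending cluster tasks (127 at the time of the incident)
curl -s "elasticsearch:9200/_cat/pending_tasks?v"
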
This created a feedback loop:

ES slow → Fluentd buffers → Disk fills → Pods evicted → More logs → ES slower

Immediate Fix

Step 1: Emergency Log Cleanup

# SSH to affected nodes
for node in node-worker-3 node-worker-5 node-worker-7; do
  echo "Cleaning $node..."

  ssh $node << 'EOF'
    # Check disk usage
    df -h /

    # Stop fluentd temporarily
    systemctl stop fluentd

    # Clean old buffer files (older than 1 hour)
    find /mnt/fluentd-buffer -name "buffer.*.log" -mmin +60 -delete
    find /mnt/fluentd-buffer -name "*.pos" -size +100M -delete

    # Compress old container logs
    find /var/log/containers -name "*.log" -mtime +1 -exec gzip {} \;

    # Check disk after cleanup
    df -h /

    # Restart fluentd
    systemctl start fluentd
EOF
done

# Results:
# node-worker-3: 98% → 45%
# node-worker-5: 88% → 40%
# node-worker-7: 95% → 42%

Step 2: Fix Fluentd Configuration

Updated configuration with limits:

<match kubernetes.**>
  @type elasticsearch
  host elasticsearch.logging.svc.cluster.local
  port 9200
  logstash_format true

  <buffer>
    @type file
    path /var/fluentd/buffer/buffer.*
    flush_interval 5s
    retry_max_times 3

    # FIX: Add size limits
    chunk_limit_size 8MB          # Max size per chunk
    total_limit_size 2GB          # Max total buffer size
    overflow_action drop_oldest_chunk  # Drop old data if full
  </buffer>
</match>

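Before rolling out, the new fluent.conf can be syntax-checked inside an already-running agent pod, which has all referenced plugins installed (pod name here is hypothetical):

# Copy the candidate config into a running fluentd pod and dry-run it
kubectl cp fluent.conf logging/fluentd-xyz:/tmp/fluent.conf
kubectl exec -n logging fluentd-xyz -- fluentd --dry-run -c /tmp/fluent.conf
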
Updated DaemonSet with emptyDir (ephemeral):

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluentd
  namespace: logging
spec:
  template:
    spec:
      containers:
      - name: fluentd
        image: fluent/fluentd:v1.16-1
        volumeMounts:
        - name: varlog
          mountPath: /var/log
          readOnly: true
        - name: fluentd-buffer
          mountPath: /var/fluentd/buffer
        resources:
          limits:
            memory: 512Mi
          requests:
            memory: 256Mi
      volumes:
      - name: varlog
        hostPath:
          path: /var/log
      - name: fluentd-buffer
        emptyDir:
          sizeLimit: 5Gi  # FIX: Limit buffer to 5GB per node

Deploy updated configuration:

# Update ConfigMap with new fluent.conf
kubectl create configmap fluentd-config \
  --from-file=fluent.conf \
  --dry-run=client -o yaml | kubectl apply -f -

# Update DaemonSet
kubectl apply -f fluentd-daemonset.yaml

# Rolling restart
kubectl rollout restart daemonset/fluentd -n logging

# Monitor rollout
kubectl rollout status daemonset/fluentd -n logging

Long-term Prevention

1. Automated Log Cleanup

Deployed log cleanup CronJob:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: node-log-cleanup
  namespace: kube-system
spec:
  schedule: "0 */6 * * *"  # Every 6 hours
  jobTemplate:
    spec:
      template:
        spec:
          hostPID: true
          hostNetwork: true
          containers:
          - name: log-cleanup
            image: alpine:3.18
            command:
            - /bin/sh
            - -c
            - |
              #!/bin/sh
              echo "Starting log cleanup on $(hostname)"

              # Compress logs older than 1 day
              find /host/var/log/containers -name "*.log" -mtime +1 -exec gzip {} \;

              # Delete compressed logs older than 3 days
              find /host/var/log/containers -name "*.log.gz" -mtime +3 -delete

              # Prune old journal files (alpine has no journalctl; delete
              # archived host journal files older than 7 days instead)
              find /host/var/log/journal -type f -name "*.journal" -mtime +7 -delete 2>/dev/null || true

              # Report disk usage
              echo "Disk usage after cleanup:"
              df -h /host/var/log

            volumeMounts:
            - name: varlog
              mountPath: /host/var/log
            securityContext:
              privileged: true
          volumes:
          - name: varlog
            hostPath:
              path: /var/log
          restartPolicy: OnFailure
          tolerations:
          - effect: NoSchedule
            operator: Exists

2. Disk Usage Monitoring

Enhanced Prometheus alerts:

# Prometheus alert rules
groups:
  - name: disk_space
    rules:
      - alert: NodeDiskPressure
        expr: |
          (node_filesystem_avail_bytes{mountpoint="/"} /
           node_filesystem_size_bytes{mountpoint="/"}) < 0.15
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Node {{ $labels.instance }} disk usage >85%"
          description: "Only {{ $value | humanizePercentage }} disk space available"

      - alert: NodeDiskPressureCritical
        expr: |
          (node_filesystem_avail_bytes{mountpoint="/"} /
           node_filesystem_size_bytes{mountpoint="/"}) < 0.05
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Node {{ $labels.instance }} disk usage >95%"
          description: "Critical: Only {{ $value | humanizePercentage }} disk space available"

      - alert: NodeDiskFilling
        expr: |
          predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[6h], 24*3600) < 0
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Node {{ $labels.instance }} disk will fill in <24h"
          description: "Disk predicted to fill based on 6h trend"

      - alert: FluentdBufferHigh
        expr: |
          node_filesystem_avail_bytes{mountpoint="/",
            job="node-exporter"} < 10737418240  # 10GB
        labels:
          severity: warning
        annotations:
          summary: "Fluentd buffer may be accumulating"

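The rule file can be validated before Prometheus picks it up (file name is an assumption; reload requires the lifecycle API to be enabled):

# Syntax-check the alert rules
promtool check rules disk_space_rules.yml

# Reload Prometheus without a restart (needs --web.enable-lifecycle)
curl -X POST http://prometheus:9090/-/reload
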
Grafana dashboard:

{
  "dashboard": {
    "title": "Node Disk Usage",
    "panels": [
      {
        "title": "Disk Usage by Node",
        "targets": [{
          "expr": "100 - (node_filesystem_avail_bytes{mountpoint='/'} / node_filesystem_size_bytes{mountpoint='/'} * 100)"
        }],
        "thresholds": [
          {"value": 80, "color": "yellow"},
          {"value": 90, "color": "red"}
        ]
      },
      {
        "title": "Disk Usage Trend",
        "targets": [{
          "expr": "predict_linear(node_filesystem_avail_bytes{mountpoint='/'}[1h], 24*3600)"
        }]
      },
      {
        "title": "Top Directories by Size",
        "targets": [{
          "expr": "node_directory_size_bytes"
        }]
      }
    ]
  }
}

3. Elasticsearch Improvements

Implement Index Lifecycle Management (ILM):

// Elasticsearch ILM policy
PUT _ilm/policy/logs_policy
{
  "policy": {
    "phases": {
      "hot": {
        "min_age": "0ms",
        "actions": {
          "rollover": {
            "max_size": "50gb",
            "max_age": "1d"
          },
          "set_priority": {
            "priority": 100
          }
        }
      },
      "warm": {
        "min_age": "3d",
        "actions": {
          "shrink": {
            "number_of_shards": 1
          },
          "forcemerge": {
            "max_num_segments": 1
          },
          "set_priority": {
            "priority": 50
          }
        }
      },
      "cold": {
        "min_age": "7d",
        "actions": {
          "freeze": {},
          "set_priority": {
            "priority": 0
          }
        }
      },
      "delete": {
        "min_age": "30d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}

// Apply to index template
PUT _index_template/logs_template
{
  "index_patterns": ["logs-*"],
  "template": {
    "settings": {
      "index.lifecycle.name": "logs_policy",
      "index.lifecycle.rollover_alias": "logs",
      "number_of_shards": 3,
      "number_of_replicas": 1
    }
  }
}

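After applying the policy and template, ILM progress can be spot-checked per index; a couple of read-only queries (index pattern as above):

# Confirm the policy is attached and see each index's current ILM phase
curl -s "elasticsearch:9200/logs-*/_ilm/explain?pretty"

# Verify rollover is producing new indices at the expected cadence
curl -s "elasticsearch:9200/_cat/indices/logs-*?v&h=index,creation.date.string,pri,store.size&s=creation.date"
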
Elasticsearch monitoring:

- alert: ElasticsearchIndexingBacklog
  expr: elasticsearch_indices_indexing_index_current > 1000
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "Elasticsearch has {{ $value }} pending indexing operations"

- alert: ElasticsearchDiskUsage
  expr: |
    100 - (elasticsearch_filesystem_data_available_bytes /
           elasticsearch_filesystem_data_size_bytes * 100) > 85
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Elasticsearch disk usage >85%"

4. Kubernetes Best Practices

Set resource limits on DaemonSets:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluentd
spec:
  template:
    spec:
      containers:
      - name: fluentd
        resources:
          requests:
            memory: "256Mi"
            cpu: "200m"
            ephemeral-storage: "2Gi"  # Request storage
          limits:
            memory: "512Mi"
            cpu: "500m"
            ephemeral-storage: "5Gi"  # Limit storage

Kubelet eviction thresholds:

# kubelet configuration
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionHard:
  memory.available: "100Mi"
  nodefs.available: "10%"      # Evict at 90% disk usage
  nodefs.inodesFree: "5%"
  imagefs.available: "15%"
evictionSoft:
  memory.available: "200Mi"
  nodefs.available: "15%"      # Warn at 85% disk usage
  nodefs.inodesFree: "10%"
  imagefs.available: "20%"
evictionSoftGracePeriod:
  memory.available: "1m30s"
  nodefs.available: "2m"
  nodefs.inodesFree: "2m"
  imagefs.available: "2m"

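The thresholds a node is actually running with can be confirmed via the kubelet's configz endpoint through the API server (node name as above):

# Read the kubelet's effective configuration, including eviction thresholds
kubectl get --raw "/api/v1/nodes/node-worker-3/proxy/configz" | \
  jq '.kubeletconfig | {evictionHard, evictionSoft}'
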
5. Log Sampling and Filtering

Reduce log volume with sampling:

# fluentd configuration with sampling
<filter kubernetes.**>
  @type sampling
  sample_rate 10  # Keep only 10% of debug logs
  tag_key level
  tag_pattern /DEBUG/
</filter>

<filter kubernetes.**>
  @type grep
  <exclude>
    key log
    pattern /healthcheck|heartbeat/  # Exclude health checks
  </exclude>
</filter>

<filter kubernetes.**>
  @type record_transformer
  enable_ruby true
  <record>
    # Add metadata but keep messages short
    log ${record["log"][0..1000]}  # Truncate to 1000 chars
  </record>
</filter>

Lessons Learned

What Went Well ✓

  1. Alert fired early - Caught at 85% before critical failure
  2. Manual cleanup effective - Emergency response freed space quickly
  3. No data loss - All critical logs already in Elasticsearch
  4. Good pod distribution - Only 3/8 nodes affected
  5. Kubernetes resilience - Automatically rescheduled evicted pods
  6. Team response - Quick identification of root cause

What Went Wrong ✗

  1. No buffer limits - Fluentd allowed unlimited disk usage
  2. Using hostPath - Direct host disk access without limits
  3. No log rotation - Buffer files accumulated indefinitely
  4. ES backpressure ignored - No alerting on indexing backlog
  5. No disk trend monitoring - Problem built up over 7 days
  6. Testing gap - Didn’t test behavior when ES was slow
  7. No cleanup automation - Manual process required

Surprises 😮

  1. How fast nodes filled - 85% to 98% in 20 minutes
  2. Buffer files size - Single node accumulated 82GB
  3. Cascading effect - Node issues caused more logs, making it worse
  4. Pod eviction impact - Users noticed immediately
  5. Elasticsearch was the bottleneck - It presented as a logging-agent issue, but ES backpressure caused the buffer buildup

Action Items

Completed ✅

Action                                    | Owner         | Completed
Emergency disk cleanup on affected nodes  | SRE Team      | 2025-07-22
Add buffer size limits to Fluentd         | SRE Team      | 2025-07-22
Change to emptyDir with size limits       | SRE Team      | 2025-07-22
Deploy log cleanup CronJob                | Platform Team | 2025-07-22
Add disk space trend monitoring           | SRE Team      | 2025-07-23

In Progress 🔄

Action                                    | Owner         | Target Date
Implement Elasticsearch ILM policies      | Platform Team | 2025-07-30
Add log sampling for debug logs           | Dev Team      | 2025-08-05
Review all DaemonSets for resource limits | SRE Team      | 2025-08-10

Planned ⏳

Action                                      | Owner               | Target Date
Increase node disk size from 100GB to 200GB | Infrastructure Team | 2025-08-15
Implement dedicated logging nodes           | Platform Team       | 2025-09-01
Add log volume budgets per namespace        | Platform Team       | 2025-09-15
Chaos testing: Simulate slow Elasticsearch  | SRE Team            | 2025-10-01

Technical Deep Dive

Kubernetes Disk Pressure Eviction

How Kubernetes handles disk pressure:

1. Kubelet monitors disk usage every 10 seconds
2. When threshold exceeded:
   ├─ Set node condition: DiskPressure=True
   ├─ Stop scheduling new pods on node
   └─ Start evicting pods

3. Pod eviction priority (lowest to highest):
   ├─ BestEffort pods (no resources set)
   ├─ Burstable pods (requests < limits)
   └─ Guaranteed pods (requests = limits)

4. Within same QoS class:
   ├─ Pods exceeding resource requests
   └─ Pods under resource requests (by creation time)

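The QoS class Kubernetes assigned to a pod, and therefore its eviction order, can be checked directly (pod name taken from the events below, node name as above):

# Show a single pod's QoS class
kubectl get pod api-worker-abc-123 -o jsonpath='{.status.qosClass}'

# List pods on the affected node with their QoS classes (eviction candidates)
kubectl get pods --all-namespaces --field-selector spec.nodeName=node-worker-3 \
  -o custom-columns=NAME:.metadata.name,QOS:.status.qosClass
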
During our incident:

# Node status showed disk pressure
kubectl describe node node-worker-3

Conditions:
  Type             Status  LastHeartbeatTime   Reason
  ----             ------  -----------------   ------
  DiskPressure     True    Jul 22 11:40 UTC    KubeletHasDiskPressure

Events:
  Type     Reason            Message
  ----     ------            -------
  Warning  EvictionStarted   Evicting pod nginx-7d8b49c8-xyz (BestEffort)
  Warning  EvictionStarted   Evicting pod api-worker-abc-123 (Burstable)

Log Volume Calculations

Our log volume:

Cluster setup:
- 8 nodes
- 150 pods average per node
- 1200 total pods

Log generation per pod:
- Average: 10 KB/sec per pod
- Total: 1200 × 10 KB/sec = 12 MB/sec
- Per day: 12 MB/sec × 86400 = 1036 GB/day ≈ 1 TB/day!

Elasticsearch storage:
- 30-day retention
- Needed: 30 TB
- Had: 10 TB (problem!)
- With ILM (compression + warm tier): ~12 TB needed

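The same estimate as a quick shell calculation (per-pod rate is the average above; treat it as an assumption for other clusters):

# Back-of-the-envelope log volume and retention sizing
awk -v pods=1200 -v kb_per_sec=10 'BEGIN {
  gb_day = pods * kb_per_sec * 86400 / 1e6     # KB/s -> GB/day (decimal units)
  printf "Daily volume: %.0f GB (~%.1f TB)\n", gb_day, gb_day / 1000
  printf "30-day retention: %.1f TB\n", gb_day * 30 / 1000
}'
# Daily volume: 1037 GB (~1.0 TB)
# 30-day retention: 31.1 TB
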
Fluentd Buffer Mechanics

How Fluentd buffering works:

1. Read log → 2. Buffer → 3. Send to destination

Buffer types:
├─ Memory buffer (fast, but limited)
└─ File buffer (slower, but persistent)

File buffer process:
1. Chunk created (default 8MB)
2. Chunk filled with log lines
3. Chunk "staged" when full or flush_interval
4. Chunk sent to destination
5. Chunk deleted on success

Problem scenario:
1. Chunk staged → 2. Send fails → 3. Retry → 4. Keep failing
   → Chunks accumulate on disk!

Our issue:

Normal operation:
- Chunk created: 8MB every 30 seconds
- Send time: 2 seconds
- Chunks on disk: 1-2 (16MB)

During ES slowdown:
- Chunk created: 8MB every 30 seconds
- Send time: 120 seconds (slow!)
- Chunks on disk: 60+ (480MB)

After 7 days:
- Failed chunks accumulated: 82GB
- Retry queue: 10,000+ files
- Disk: 98% full

Appendix

Useful Commands

Check disk usage on nodes:

# Via kubectl
kubectl top nodes

# Detailed on specific node
kubectl get --raw /api/v1/nodes/node-worker-3/proxy/stats/summary | jq '.node.fs'

# SSH to node
ssh node-worker-3
df -h
du -sh /var/log/* | sort -h
du -sh /mnt/fluentd-buffer/*

Check pod evictions:

# See evicted pods
kubectl get pods --all-namespaces -o json | \
  jq -r '.items[] | select(.status.reason == "Evicted") |
    "\(.metadata.namespace)/\(.metadata.name)"'

# Count evictions
kubectl get events --all-namespaces | grep -i evicted | wc -l

Monitor Fluentd buffer:

# Check buffer size
kubectl exec -it fluentd-xyz -n logging -- \
  du -sh /var/fluentd/buffer

# Check buffer file count
kubectl exec -it fluentd-xyz -n logging -- \
  find /var/fluentd/buffer -name "buffer.*" | wc -l

# Check Fluentd metrics
kubectl exec -it fluentd-xyz -n logging -- \
  curl localhost:24220/api/plugins.json | jq '.plugins[] | select(.type == "output")'

Emergency cleanup script:

#!/bin/bash
# emergency-disk-cleanup.sh

NODE=$1
THRESHOLD=90

current_usage=$(ssh $NODE "df / | tail -1 | awk '{print \$5}' | sed 's/%//'")

if [ "$current_usage" -gt "$THRESHOLD" ]; then
  echo "Node $NODE at ${current_usage}%, cleaning..."

  ssh $NODE << 'EOF'
    # Stop log collection temporarily
    systemctl stop fluentd

    # Clean old logs
    find /var/log/containers -name "*.log" -mtime +1 -delete
    find /mnt/fluentd-buffer -type f -mmin +60 -delete

    # Clear journal logs
    journalctl --vacuum-time=3d

    # Restart fluentd
    systemctl start fluentd

    # Report results
    df -h /
EOF
fi

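Example invocation against one of the affected nodes:

# Clean only if the node is above the 90% threshold
./emergency-disk-cleanup.sh node-worker-3
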
Incident Commander: Rachel Foster
Contributors: Tom Brady (On-call), Sam Mitchell (Platform), Diana Prince (Dev)
Postmortem Completed: 2025-07-23
Next Review: 2025-08-23 (1 month follow-up)