Incident Summary

Date: 2025-07-22
Time: 11:20 UTC
Duration: 3 hours 45 minutes
Severity: SEV-2 (High)
Impact: Progressive service degradation with intermittent failures

Quick Facts

  • Users Affected: ~40% experiencing intermittent errors
  • Services Affected: Multiple microservices across 3 Kubernetes nodes
  • Nodes Failed: 3 out of 8 worker nodes
  • Pods Evicted: 47 pods due to disk pressure
  • SLO Impact: 35% of monthly error budget consumed

Timeline

  • 11:20:00 - Prometheus alert: Node disk usage >85% on node-worker-3
  • 11:22:00 - On-call engineer (Tom) acknowledged alert
  • 11:25:00 - Checked node: 92% disk usage, mostly logs
  • 11:28:00 - Second alert: node-worker-5 also >85%
  • 11:30:00 - Third alert: node-worker-7 >85%
  • 11:32:00 - Senior SRE (Rachel) joined investigation
  • 11:35:00 - Pattern identified: All nodes running logging-agent pod
  • 11:38:00 - First node reached 98% disk usage
  • 11:40:00 - Kubelet started evicting pods due to disk pressure
  • 11:42:00 - 12 pods evicted from node-worker-3
  • 11:45:00 - User reports: Intermittent 503 errors
  • 11:47:00 - Incident escalated to SEV-2
  • 11:50:00 - Identified root cause: Log rotation not working for logging-agent
  • 11:52:00 - Emergency: Manual log cleanup on affected nodes
  • 11:58:00 - First node cleaned: 98% → 45% disk usage
  • 12:05:00 - Second node cleaned: 88% → 40% disk usage
  • 12:10:00 - Third node cleaned: 95% → 42% disk usage
  • 12:15:00 - All evicted pods rescheduled and running
  • 12:30:00 - Deployed fix for log rotation issue
  • 12:45:00 - Monitoring shows disk usage stabilizing
  • 13:00:00 - Implemented automated log cleanup job
  • 13:30:00 - Added improved monitoring and alerts
  • 14:15:00 - Verified all nodes healthy, services normal
  • 15:05:00 - Incident marked as resolved

Root Cause Analysis

What Happened

A logging agent (Fluentd) was deployed on all Kubernetes nodes to collect and forward logs to Elasticsearch. Due to a configuration error, log rotation was not working properly, causing log files to grow indefinitely.

The cascade:

  1. July 15 - Logging agent v2.4.0 deployed with new configuration
  2. July 15-22 - Log files accumulated without rotation
  3. July 22 11:20 - First node reached 85% disk usage
  4. July 22 11:40 - Nodes hit 98%, kubelet started evicting pods
  5. July 22 11:45 - Service degradation due to reduced capacity

The Configuration Bug

Problematic configuration:

# fluentd-daemonset.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluentd
  namespace: logging
spec:
  template:
    spec:
      containers:
      - name: fluentd
        image: fluent/fluentd:v1.16-1
        volumeMounts:
        - name: varlog
          mountPath: /var/log
        - name: fluentd-buffer
          mountPath: /var/fluentd/buffer  # BUG: Unbounded growth!
      volumes:
      - name: varlog
        hostPath:
          path: /var/log
      - name: fluentd-buffer
        hostPath:
          path: /mnt/fluentd-buffer  # Using host disk without limits

Fluentd configuration (fluent.conf):

<source>
  @type tail
  path /var/log/containers/*.log
  pos_file /var/fluentd/buffer/fluentd-containers.log.pos
  tag kubernetes.*
  read_from_head true
  <parse>
    @type json
  </parse>
</source>

<match kubernetes.**>
  @type elasticsearch
  host elasticsearch.logging.svc.cluster.local
  port 9200
  logstash_format true

  # BUG: Buffer settings without size limits!
  <buffer>
    @type file
    path /var/fluentd/buffer/buffer.*
    flush_interval 5s
    retry_max_times 3
    # Missing: chunk_limit_size, total_limit_size
  </buffer>
</match>

What went wrong:

  1. No buffer size limit - Fluentd buffer could grow indefinitely
  2. Elasticsearch backpressure - When ES was slow, buffer accumulated
  3. No log rotation - Old buffer files never cleaned up
  4. Position file growth - .pos files tracking log positions grew large
  5. Failed retries stored - Failed log shipments saved to disk

Disk Usage Breakdown

node-worker-3 at incident peak (98% full):

Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1       100G   98G   2G  98% /

Breakdown:
/mnt/fluentd-buffer/       82GB  ← The problem!
  ├─ buffer.*.log          75GB  (unsent logs)
  ├─ *.pos files           5GB   (position tracking)
  └─ retry buffers         2GB   (failed retries)

/var/log/                  8GB
  ├─ containers/*.log      5GB   (normal)
  └─ pods/*.log            3GB   (normal)

/var/lib/docker/           6GB   (normal)
/var/lib/kubelet/          2GB   (normal)

Why Elasticsearch Was Slow

During the investigation, we discovered the Elasticsearch cluster was overloaded:

# Elasticsearch cluster status
curl -X GET "elasticsearch:9200/_cluster/health?pretty"

{
  "status": "yellow",
  "active_shards": 45,
  "relocating_shards": 5,
  "initializing_shards": 2,
  "unassigned_shards": 8,
  "number_of_pending_tasks": 127  โ† High!
}

# Indexing rate
curl -X GET "elasticsearch:9200/_stats/indexing?pretty"

{
  "indexing": {
    "index_total": 1250000000,
    "index_time_in_millis": 28800000,
    "index_failed": 45000  โ† Failures!
  }
}

Elasticsearch problems:

  • Too many small shards (inefficient)
  • No index lifecycle management (ILM)
  • Heap size too small for index rate
  • All 7 days of logs in “hot” tier

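These symptoms can be confirmed with Elasticsearch's _cat APIs; a few read-only checks of the kind used during the investigation (same cluster hostname as the curl examples above):

# Shard count and size per index (many small shards is the red flag)
curl -s "elasticsearch:9200/_cat/indices?v&h=index,pri,rep,docs.count,store.size&s=store.size:desc"

# Heap and CPU pressure per node
curl -s "elasticsearch:9200/_cat/nodes?v&h=name,heap.percent,ram.percent,cpu"

# Pending cluster tasks (127 at the time of the incident)
curl -s "elasticsearch:9200/_cat/pending_tasks?v"
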
This created a feedback loop:

ES slow → Fluentd buffers → Disk fills → Pods evicted → More logs → ES slower

Immediate Fix

Step 1: Emergency Log Cleanup

# SSH to affected nodes
for node in node-worker-3 node-worker-5 node-worker-7; do
  echo "Cleaning $node..."

  ssh $node << 'EOF'
    # Check disk usage
    df -h /

    # Stop fluentd temporarily
    systemctl stop fluentd

    # Clean old buffer files (older than 1 hour)
    find /mnt/fluentd-buffer -name "buffer.*.log" -mmin +60 -delete
    find /mnt/fluentd-buffer -name "*.pos" -size +100M -delete

    # Compress old container logs
    find /var/log/containers -name "*.log" -mtime +1 -exec gzip {} \;

    # Check disk after cleanup
    df -h /

    # Restart fluentd
    systemctl start fluentd
EOF
done

# Results:
# node-worker-3: 98% → 45%
# node-worker-5: 88% → 40%
# node-worker-7: 95% → 42%

Step 2: Fix Fluentd Configuration

Updated configuration with limits:

<match kubernetes.**>
  @type elasticsearch
  host elasticsearch.logging.svc.cluster.local
  port 9200
  logstash_format true

  <buffer>
    @type file
    path /var/fluentd/buffer/buffer.*
    flush_interval 5s
    retry_max_times 3

    # FIX: Add size limits
    chunk_limit_size 8MB          # Max size per chunk
    total_limit_size 2GB          # Max total buffer size
    overflow_action drop_oldest_chunk  # Drop old data if full
  </buffer>
</match>

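Before rolling out, the new fluent.conf can be syntax-checked inside an already-running agent pod, which has all referenced plugins installed (pod name here is hypothetical):

# Copy the candidate config into a running fluentd pod and dry-run it
kubectl cp fluent.conf logging/fluentd-xyz:/tmp/fluent.conf
kubectl exec -n logging fluentd-xyz -- fluentd --dry-run -c /tmp/fluent.conf
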
Updated DaemonSet with emptyDir (ephemeral):

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluentd
  namespace: logging
spec:
  template:
    spec:
      containers:
      - name: fluentd
        image: fluent/fluentd:v1.16-1
        volumeMounts:
        - name: varlog
          mountPath: /var/log
          readOnly: true
        - name: fluentd-buffer
          mountPath: /var/fluentd/buffer
        resources:
          limits:
            memory: 512Mi
          requests:
            memory: 256Mi
      volumes:
      - name: varlog
        hostPath:
          path: /var/log
      - name: fluentd-buffer
        emptyDir:
          sizeLimit: 5Gi  # FIX: Limit buffer to 5GB per node

Deploy updated configuration:

# Update ConfigMap with new fluent.conf
kubectl create configmap fluentd-config \
  --from-file=fluent.conf \
  --dry-run=client -o yaml | kubectl apply -f -

# Update DaemonSet
kubectl apply -f fluentd-daemonset.yaml

# Rolling restart
kubectl rollout restart daemonset/fluentd -n logging

# Monitor rollout
kubectl rollout status daemonset/fluentd -n logging

Long-term Prevention

1. Automated Log Cleanup

Deployed log cleanup CronJob:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: node-log-cleanup
  namespace: kube-system
spec:
  schedule: "0 */6 * * *"  # Every 6 hours
  jobTemplate:
    spec:
      template:
        spec:
          hostPID: true
          hostNetwork: true
          containers:
          - name: log-cleanup
            image: alpine:3.18
            command:
            - /bin/sh
            - -c
            - |
              #!/bin/sh
              echo "Starting log cleanup on $(hostname)"

              # Compress logs older than 1 day
              find /host/var/log/containers -name "*.log" -mtime +1 -exec gzip {} \;

              # Delete compressed logs older than 3 days
              find /host/var/log/containers -name "*.log.gz" -mtime +3 -delete

              # Prune old journal files (alpine has no journalctl; delete
              # archived host journal files older than 7 days instead)
              find /host/var/log/journal -type f -name "*.journal" -mtime +7 -delete 2>/dev/null || true

              # Report disk usage
              echo "Disk usage after cleanup:"
              df -h /host/var/log

            volumeMounts:
            - name: varlog
              mountPath: /host/var/log
            securityContext:
              privileged: true
          volumes:
          - name: varlog
            hostPath:
              path: /var/log
          restartPolicy: OnFailure
          tolerations:
          - effect: NoSchedule
            operator: Exists

2. Disk Usage Monitoring

Enhanced Prometheus alerts:

# Prometheus alert rules
groups:
  - name: disk_space
    rules:
      - alert: NodeDiskPressure
        expr: |
          (node_filesystem_avail_bytes{mountpoint="/"} /
           node_filesystem_size_bytes{mountpoint="/"}) < 0.15
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Node {{ $labels.instance }} disk usage >85%"
          description: "Only {{ $value | humanizePercentage }} disk space available"

      - alert: NodeDiskPressureCritical
        expr: |
          (node_filesystem_avail_bytes{mountpoint="/"} /
           node_filesystem_size_bytes{mountpoint="/"}) < 0.05
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Node {{ $labels.instance }} disk usage >95%"
          description: "Critical: Only {{ $value | humanizePercentage }} disk space available"

      - alert: NodeDiskFilling
        expr: |
          predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[6h], 24*3600) < 0
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Node {{ $labels.instance }} disk will fill in <24h"
          description: "Disk predicted to fill based on 6h trend"

      - alert: FluentdBufferHigh
        expr: |
          node_filesystem_avail_bytes{mountpoint="/",
            job="node-exporter"} < 10737418240  # 10GB
        labels:
          severity: warning
        annotations:
          summary: "Fluentd buffer may be accumulating"

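The rule file can be validated before Prometheus picks it up (file name is an assumption; reload requires the lifecycle API to be enabled):

# Syntax-check the alert rules
promtool check rules disk_space_rules.yml

# Reload Prometheus without a restart (needs --web.enable-lifecycle)
curl -X POST http://prometheus:9090/-/reload
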
Grafana dashboard:

{
  "dashboard": {
    "title": "Node Disk Usage",
    "panels": [
      {
        "title": "Disk Usage by Node",
        "targets": [{
          "expr": "100 - (node_filesystem_avail_bytes{mountpoint='/'} / node_filesystem_size_bytes{mountpoint='/'} * 100)"
        }],
        "thresholds": [
          {"value": 80, "color": "yellow"},
          {"value": 90, "color": "red"}
        ]
      },
      {
        "title": "Disk Usage Trend",
        "targets": [{
          "expr": "predict_linear(node_filesystem_avail_bytes{mountpoint='/'}[1h], 24*3600)"
        }]
      },
      {
        "title": "Top Directories by Size",
        "targets": [{
          "expr": "node_directory_size_bytes"
        }]
      }
    ]
  }
}

3. Elasticsearch Improvements

Implement Index Lifecycle Management (ILM):

// Elasticsearch ILM policy
PUT _ilm/policy/logs_policy
{
  "policy": {
    "phases": {
      "hot": {
        "min_age": "0ms",
        "actions": {
          "rollover": {
            "max_size": "50gb",
            "max_age": "1d"
          },
          "set_priority": {
            "priority": 100
          }
        }
      },
      "warm": {
        "min_age": "3d",
        "actions": {
          "shrink": {
            "number_of_shards": 1
          },
          "forcemerge": {
            "max_num_segments": 1
          },
          "set_priority": {
            "priority": 50
          }
        }
      },
      "cold": {
        "min_age": "7d",
        "actions": {
          "freeze": {},
          "set_priority": {
            "priority": 0
          }
        }
      },
      "delete": {
        "min_age": "30d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}

// Apply to index template
PUT _index_template/logs_template
{
  "index_patterns": ["logs-*"],
  "template": {
    "settings": {
      "index.lifecycle.name": "logs_policy",
      "index.lifecycle.rollover_alias": "logs",
      "number_of_shards": 3,
      "number_of_replicas": 1
    }
  }
}

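After applying the policy and template, ILM progress can be spot-checked per index; a couple of read-only queries (index pattern as above):

# Confirm the policy is attached and see each index's current ILM phase
curl -s "elasticsearch:9200/logs-*/_ilm/explain?pretty"

# Verify rollover is producing new indices at the expected cadence
curl -s "elasticsearch:9200/_cat/indices/logs-*?v&h=index,creation.date.string,pri,store.size&s=creation.date"
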
Elasticsearch monitoring:

- alert: ElasticsearchIndexingBacklog
  expr: elasticsearch_indices_indexing_index_current > 1000
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "Elasticsearch has {{ $value }} pending indexing operations"

- alert: ElasticsearchDiskUsage
  expr: |
    100 - (elasticsearch_filesystem_data_available_bytes /
           elasticsearch_filesystem_data_size_bytes * 100) > 85
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Elasticsearch disk usage >85%"

4. Kubernetes Best Practices

Set resource limits on DaemonSets:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluentd
spec:
  template:
    spec:
      containers:
      - name: fluentd
        resources:
          requests:
            memory: "256Mi"
            cpu: "200m"
            ephemeral-storage: "2Gi"  # Request storage
          limits:
            memory: "512Mi"
            cpu: "500m"
            ephemeral-storage: "5Gi"  # Limit storage

Kubelet eviction thresholds:

# kubelet configuration
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionHard:
  memory.available: "100Mi"
  nodefs.available: "10%"      # Evict at 90% disk usage
  nodefs.inodesFree: "5%"
  imagefs.available: "15%"
evictionSoft:
  memory.available: "200Mi"
  nodefs.available: "15%"      # Warn at 85% disk usage
  nodefs.inodesFree: "10%"
  imagefs.available: "20%"
evictionSoftGracePeriod:
  memory.available: "1m30s"
  nodefs.available: "2m"
  nodefs.inodesFree: "2m"
  imagefs.available: "2m"

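The thresholds a node is actually running with can be confirmed via the kubelet's configz endpoint through the API server (node name as above):

# Read the kubelet's effective configuration, including eviction thresholds
kubectl get --raw "/api/v1/nodes/node-worker-3/proxy/configz" | \
  jq '.kubeletconfig | {evictionHard, evictionSoft}'
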
5. Log Sampling and Filtering

Reduce log volume with sampling:

# fluentd configuration with sampling
<filter kubernetes.**>
  @type sampling
  sample_rate 10  # Keep only 10% of debug logs
  tag_key level
  tag_pattern /DEBUG/
</filter>

<filter kubernetes.**>
  @type grep
  <exclude>
    key log
    pattern /healthcheck|heartbeat/  # Exclude health checks
  </exclude>
</filter>

<filter kubernetes.**>
  @type record_transformer
  enable_ruby true
  <record>
    # Add metadata but keep messages short
    log ${record["log"][0..1000]}  # Truncate to 1000 chars
  </record>
</filter>

Lessons Learned

What Went Well ✓

  1. Alert fired early - Caught at 85% before critical failure
  2. Manual cleanup effective - Emergency response freed space quickly
  3. No data loss - All critical logs already in Elasticsearch
  4. Good pod distribution - Only 3/8 nodes affected
  5. Kubernetes resilience - Automatically rescheduled evicted pods
  6. Team response - Quick identification of root cause

What Went Wrong ✗

  1. No buffer limits - Fluentd allowed unlimited disk usage
  2. Using hostPath - Direct host disk access without limits
  3. No log rotation - Buffer files accumulated indefinitely
  4. ES backpressure ignored - No alerting on indexing backlog
  5. No disk trend monitoring - Problem built up over 7 days
  6. Testing gap - Didn’t test behavior when ES was slow
  7. No cleanup automation - Manual process required

Surprises 😮

  1. How fast nodes filled - 85% to 98% in 20 minutes
  2. Buffer files size - Single node accumulated 82GB
  3. Cascading effect - Node issues caused more logs, making it worse
  4. Pod eviction impact - Users noticed immediately
  5. Elasticsearch was the bottleneck - It presented as a logging-agent issue, but ES backpressure caused the buffer buildup

Action Items

Completed ✅

Action                                    | Owner         | Completed
Emergency disk cleanup on affected nodes  | SRE Team      | 2025-07-22
Add buffer size limits to Fluentd         | SRE Team      | 2025-07-22
Change to emptyDir with size limits       | SRE Team      | 2025-07-22
Deploy log cleanup CronJob                | Platform Team | 2025-07-22
Add disk space trend monitoring           | SRE Team      | 2025-07-23

In Progress 🔄

Action                                    | Owner         | Target Date
Implement Elasticsearch ILM policies      | Platform Team | 2025-07-30
Add log sampling for debug logs           | Dev Team      | 2025-08-05
Review all DaemonSets for resource limits | SRE Team      | 2025-08-10

Planned ⏳

Action                                      | Owner               | Target Date
Increase node disk size from 100GB to 200GB | Infrastructure Team | 2025-08-15
Implement dedicated logging nodes           | Platform Team       | 2025-09-01
Add log volume budgets per namespace        | Platform Team       | 2025-09-15
Chaos testing: Simulate slow Elasticsearch  | SRE Team            | 2025-10-01

Technical Deep Dive

Kubernetes Disk Pressure Eviction

How Kubernetes handles disk pressure:

1. Kubelet monitors disk usage every 10 seconds
2. When threshold exceeded:
   ├─ Set node condition: DiskPressure=True
   ├─ Stop scheduling new pods on node
   └─ Start evicting pods

3. Pod eviction priority (lowest to highest):
   ├─ BestEffort pods (no resources set)
   ├─ Burstable pods (requests < limits)
   └─ Guaranteed pods (requests = limits)

4. Within same QoS class:
   ├─ Pods exceeding resource requests
   └─ Pods under resource requests (by creation time)

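The QoS class Kubernetes assigned to a pod, and therefore its eviction order, can be checked directly (pod name taken from the events below, node name as above):

# Show a single pod's QoS class
kubectl get pod api-worker-abc-123 -o jsonpath='{.status.qosClass}'

# List pods on the affected node with their QoS classes (eviction candidates)
kubectl get pods --all-namespaces --field-selector spec.nodeName=node-worker-3 \
  -o custom-columns=NAME:.metadata.name,QOS:.status.qosClass
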
During our incident:

# Node status showed disk pressure
kubectl describe node node-worker-3

Conditions:
  Type             Status  LastHeartbeatTime   Reason
  ----             ------  -----------------   ------
  DiskPressure     True    Jul 22 11:40 UTC    KubeletHasDiskPressure

Events:
  Type     Reason            Message
  ----     ------            -------
  Warning  EvictionStarted   Evicting pod nginx-7d8b49c8-xyz (BestEffort)
  Warning  EvictionStarted   Evicting pod api-worker-abc-123 (Burstable)

Log Volume Calculations

Our log volume:

Cluster setup:
- 8 nodes
- 150 pods average per node
- 1200 total pods

Log generation per pod:
- Average: 10 KB/sec per pod
- Total: 1200 × 10 KB/sec = 12 MB/sec
- Per day: 12 MB/sec × 86400 = 1036 GB/day ≈ 1 TB/day!

Elasticsearch storage:
- 30-day retention
- Needed: 30 TB
- Had: 10 TB (problem!)
- With ILM (compression + warm tier): ~12 TB needed

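The same estimate as a quick shell calculation (per-pod rate is the average above; treat it as an assumption for other clusters):

# Back-of-the-envelope log volume and retention sizing
awk -v pods=1200 -v kb_per_sec=10 'BEGIN {
  gb_day = pods * kb_per_sec * 86400 / 1e6     # KB/s -> GB/day (decimal units)
  printf "Daily volume: %.0f GB (~%.1f TB)\n", gb_day, gb_day / 1000
  printf "30-day retention: %.1f TB\n", gb_day * 30 / 1000
}'
# Daily volume: 1037 GB (~1.0 TB)
# 30-day retention: 31.1 TB
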
Fluentd Buffer Mechanics

How Fluentd buffering works:

1. Read log → 2. Buffer → 3. Send to destination

Buffer types:
├─ Memory buffer (fast, but limited)
└─ File buffer (slower, but persistent)

File buffer process:
1. Chunk created (default 8MB)
2. Chunk filled with log lines
3. Chunk "staged" when full or flush_interval
4. Chunk sent to destination
5. Chunk deleted on success

Problem scenario:
1. Chunk staged → 2. Send fails → 3. Retry → 4. Keep failing
   → Chunks accumulate on disk!

Our issue:

Normal operation:
- Chunk created: 8MB every 30 seconds
- Send time: 2 seconds
- Chunks on disk: 1-2 (16MB)

During ES slowdown:
- Chunk created: 8MB every 30 seconds
- Send time: 120 seconds (slow!)
- Chunks on disk: 60+ (480MB)

After 7 days:
- Failed chunks accumulated: 82GB
- Retry queue: 10,000+ files
- Disk: 98% full

Appendix

Useful Commands

Check disk usage on nodes:

# Via kubectl
kubectl top nodes

# Detailed on specific node
kubectl get --raw /api/v1/nodes/node-worker-3/proxy/stats/summary | jq '.node.fs'

# SSH to node
ssh node-worker-3
df -h
du -sh /var/log/* | sort -h
du -sh /mnt/fluentd-buffer/*

Check pod evictions:

# See evicted pods
kubectl get pods --all-namespaces -o json | \
  jq -r '.items[] | select(.status.reason == "Evicted") |
    "\(.metadata.namespace)/\(.metadata.name)"'

# Count evictions
kubectl get events --all-namespaces | grep -i evicted | wc -l

Monitor Fluentd buffer:

# Check buffer size
kubectl exec -it fluentd-xyz -n logging -- \
  du -sh /var/fluentd/buffer

# Check buffer file count
kubectl exec -it fluentd-xyz -n logging -- \
  find /var/fluentd/buffer -name "buffer.*" | wc -l

# Check Fluentd metrics
kubectl exec -it fluentd-xyz -n logging -- \
  curl localhost:24220/api/plugins.json | jq '.plugins[] | select(.type == "output")'

Emergency cleanup script:

#!/bin/bash
# emergency-disk-cleanup.sh

NODE=$1
THRESHOLD=90

current_usage=$(ssh $NODE "df / | tail -1 | awk '{print \$5}' | sed 's/%//'")

if [ "$current_usage" -gt "$THRESHOLD" ]; then
  echo "Node $NODE at ${current_usage}%, cleaning..."

  ssh $NODE << 'EOF'
    # Stop log collection temporarily
    systemctl stop fluentd

    # Clean old logs
    find /var/log/containers -name "*.log" -mtime +1 -delete
    find /mnt/fluentd-buffer -type f -mmin +60 -delete

    # Clear journal logs
    journalctl --vacuum-time=3d

    # Restart fluentd
    systemctl start fluentd

    # Report results
    df -h /
EOF
fi

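Example invocation against one of the affected nodes:

# Clean only if the node is above the 90% threshold
./emergency-disk-cleanup.sh node-worker-3
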
Incident Commander: Rachel Foster
Contributors: Tom Brady (On-call), Sam Mitchell (Platform), Diana Prince (Dev)
Postmortem Completed: 2025-07-23
Next Review: 2025-08-23 (1 month follow-up)