Incident Summary

Date: 2025-09-28
Time: 14:30 UTC
Duration: 2 hours 15 minutes
Severity: SEV-2 (High)
Impact: Intermittent service degradation and elevated error rates

Quick Facts

  • Users Affected: ~30% of users experiencing slow responses
  • Services Affected: User API service
  • Error Rate: Spiked from 0.5% to 8%
  • SLO Impact: 25% of monthly error budget consumed (rough math in the sketch below)
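The budget figure can be roughly reproduced from the numbers above. A back-of-envelope sketch, assuming a 99.9% monthly availability SLO and roughly uniform traffic (the SLO target itself is an assumption, not stated elsewhere in this report):

// Rough error-budget burn estimate.
// Assumption (not from the incident data): 99.9% monthly availability SLO.
const errorBudget = 1 - 0.999;            // 0.1% of monthly requests may fail
const incidentHours = 2.25;               // 14:30 - 16:45 UTC
const monthHours = 30 * 24;               // ~720 hours in a month
const incidentErrorRate = 0.08;           // 8% error rate during the incident

const failedFraction = incidentErrorRate * (incidentHours / monthHours); // ~0.025% of monthly requests
console.log(`${((failedFraction / errorBudget) * 100).toFixed(0)}% of monthly error budget`); // ~25%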

Timeline

  • 14:30 - Prometheus alert: High pod restart rate detected
  • 14:31 - On-call engineer (Dave) acknowledged, investigating
  • 14:33 - Observed pattern: Pods restarting every 15-20 minutes
  • 14:35 - Checked pod status: OOMKilled (exit code 137)
  • 14:37 - Senior SRE (Emma) joined investigation
  • 14:40 - Checked resource limits: 512MB memory limit per pod
  • 14:42 - Reviewed recent deployments: New caching feature deployed yesterday
  • 14:45 - Examined memory metrics: Linear growth from 100MB → 512MB over 15 min
  • 14:50 - Hypothesis: Memory leak in new caching code
  • 14:52 - Decision: Increase memory limit to 1GB as temporary mitigation
  • 14:55 - Memory limit increased, pods restarted with new limits
  • 15:00 - Pod restart frequency decreased (now every ~30 minutes)
  • 15:05 - Confirmed leak still present, just slower with more memory
  • 15:10 - Development team engaged to investigate caching code
  • 15:25 - Memory leak identified: Event listeners not being removed
  • 15:35 - Fix developed and tested locally
  • 15:45 - Hotfix deployed to production
  • 16:00 - Memory usage stabilized at ~180MB
  • 16:15 - Monitoring shows no growth, pods stable
  • 16:30 - Error rate returned to baseline
  • 16:45 - Incident marked as resolved

Root Cause Analysis

What Happened

On September 27th, a new feature was deployed that implemented an in-memory cache with event-driven invalidation. The cache listened to database change events to invalidate cached entries.

The bug:

// Problematic code
class CacheManager {
  constructor() {
    this.cache = new Map();
  }

  async getData(key) {
    if (this.cache.has(key)) {
      return this.cache.get(key);
    }

    const data = await database.query(key);
    this.cache.set(key, data);

    // BUG: A new 'change' listener is registered on every cache miss and never removed!
    database.on('change', () => {
      this.cache.delete(key);
    });

    return data;
  }
}

The problem: Every cache miss in getData() registered a new 'change' listener without removing the old ones, and every change event evicted the key, forcing the next request to miss the cache and register yet another listener. With thousands of requests per minute:

After 1 minute:  ~1,000 event listeners
After 5 minutes: ~5,000 event listeners
After 15 minutes: ~15,000 event listeners → 512MB limit reached → OOMKilled
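The accumulation is easy to reproduce in isolation. A minimal sketch of the same pattern, where a plain Node.js EventEmitter stands in for the real database client:

// Minimal repro of the leak pattern: one new 'change' listener per cache miss.
const { EventEmitter } = require('events');

const database = new EventEmitter(); // stand-in for the real database client

function leakyGet(key, cache) {
  if (!cache.has(key)) {
    cache.set(key, `value-for-${key}`);
    database.on('change', () => cache.delete(key)); // registered again on every miss, never removed
  }
  return cache.get(key);
}

const cache = new Map();
for (let i = 0; i < 50; i++) {
  leakyGet('user:42', cache);
  database.emit('change'); // each invalidation forces the next call to miss again
}

console.log(database.listenerCount('change')); // 50 — grows without bound
// Node also prints a MaxListenersExceededWarning once the count passes the default limit of 10.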

Memory Leak Visualization

Memory Usage Over Time:
512MB ├─────────────────────X (OOMKilled, pod restarts)
      │                   /
      │                 /
256MB │               /
      │             /
      │           /
100MB │─────────/
      └────────────────────────
      0min    5min    10min   15min

Why It Wasn’t Caught

  1. Load testing insufficient - Tests ran for only 5 minutes; the leak only becomes visible after 10+
  2. Code review miss - Reviewers focused on the caching logic, not on event handling
  3. No memory profiling - Memory usage was never profiled during testing
  4. Staging environment - Traffic was too low for the leak to become noticeable before release

Immediate Fix

Step 1: Temporary Mitigation

# Increase memory limit from 512MB to 1GB
kubectl set resources deployment/user-api \
  --limits=memory=1Gi \
  --requests=memory=512Mi \
  -n production

# This buys time but doesn't fix the leak

Result: Pods crashed every ~30 minutes instead of every ~15 minutes

Step 2: Hotfix Deployment

Fixed code:

class CacheManager {
  constructor() {
    this.cache = new Map();
    this.listeners = new Map();
  }

  async getData(key) {
    if (this.cache.has(key)) {
      return this.cache.get(key);
    }

    const data = await database.query(key);
    this.cache.set(key, data);

    // FIX: Only add listener once per key
    if (!this.listeners.has(key)) {
      const listener = () => {
        this.cache.delete(key);
      };
      database.on('change', listener);
      this.listeners.set(key, listener);
    }

    return data;
  }

  // Also added cleanup method
  clearKey(key) {
    this.cache.delete(key);
    const listener = this.listeners.get(key);
    if (listener) {
      database.off('change', listener);
      this.listeners.delete(key);
    }
  }
}
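A quick local sanity check for the fix (a sketch; it assumes the database object is a standard Node.js EventEmitter and that the CacheManager above is wired to it): repeated evictions and re-fetches of the same key should leave exactly one listener per distinct key.

// Illustrative check: listener count stays bounded by distinct keys, not by call count.
const assert = require('assert');

async function verifyFix(cacheManager, database) {
  for (let i = 0; i < 1000; i++) {
    await cacheManager.getData('user:42');
    database.emit('change'); // evict the entry and force the next call to miss
  }
  // With the fix, the 'change' listener for this key is registered only once.
  assert.strictEqual(database.listenerCount('change'), 1);
  console.log('listener count bounded: OK');
}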

Deployed:

# Deploy hotfix
kubectl set image deployment/user-api \
  user-api=user-api:v2.3.1-hotfix \
  -n production

# Monitor rollout
kubectl rollout status deployment/user-api -n production

Result: Memory usage stabilized, no more crashes

Long-term Prevention

Code Improvements

1. Better implementation (adopted as the long-term fix):

class CacheManager {
  constructor() {
    this.cache = new Map();

    // Single global listener instead of per-key listeners
    this.changeHandler = (event) => {
      const affectedKey = event.key;
      this.cache.delete(affectedKey);
    };

    database.on('change', this.changeHandler);
  }

  async getData(key) {
    if (this.cache.has(key)) {
      return this.cache.get(key);
    }

    const data = await database.query(key);
    this.cache.set(key, data);
    return data;
  }

  // Cleanup when cache manager destroyed
  destroy() {
    database.off('change', this.changeHandler);
    this.cache.clear();
  }
}
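With this design the process holds exactly one 'change' listener per CacheManager instance for its entire lifetime, regardless of how many keys are cached or how much traffic arrives, so the listener count can no longer grow with request volume. The trade-off is that every change event reaches the single handler, which relies on event.key to decide which entry to evict.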

2. Added memory leak detection:

// Warn if listener count grows too large
setInterval(() => {
  const listenerCount = database.listenerCount('change');
  if (listenerCount > 100) {
    logger.warn(`High listener count: ${listenerCount}`);
    metrics.record('listener_count_high', listenerCount);
  }
}, 60000); // Check every minute
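To make that signal visible outside the process, the same count can be exported as a metric. A sketch assuming the prom-client library and an existing /metrics endpoint (neither is confirmed by this report):

// Export the 'change' listener count as a Prometheus gauge (prom-client assumed).
const client = require('prom-client');

const listenerGauge = new client.Gauge({
  name: 'db_change_listener_count',
  help: 'Number of listeners registered on the database change event',
});

setInterval(() => {
  listenerGauge.set(database.listenerCount('change'));
}, 60000); // refresh every minute, matching the warning check above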

Monitoring Enhancements

1. Memory growth alerts:

# Prometheus alert
- alert: MemoryGrowthAnomalous
  expr: |
    (container_memory_usage_bytes{pod=~"user-api.*"} -
     container_memory_usage_bytes{pod=~"user-api.*"} offset 10m) > 100000000
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Memory growing >100MB in 10 minutes"

2. OOMKill alerts:

- alert: PodOOMKilled
  expr: kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} == 1
  labels:
    severity: critical
  annotations:
    summary: "Pod killed due to OOM"
    description: "Investigate memory limits and potential memory leaks"

Testing Improvements

1. Long-running load tests:

// Added to CI/CD pipeline (module path and setup below are illustrative)
const CacheManager = require('../src/cache-manager');

jest.setTimeout(45 * 60 * 1000); // long-running test needs a generous timeout

describe('Memory leak tests', () => {
  it('should not leak memory over 30 minutes', async () => {
    const cache = new CacheManager();
    const initialMemory = process.memoryUsage().heapUsed;

    // Simulate 30 minutes of traffic
    for (let i = 0; i < 100000; i++) {
      await cache.getData(`key-${i % 1000}`);

      // Sample memory every 1000 iterations
      if (i % 1000 === 0) {
        const currentMemory = process.memoryUsage().heapUsed;
        const growth = currentMemory - initialMemory;

        // Memory shouldn't grow more than 50MB
        expect(growth).toBeLessThan(50 * 1024 * 1024);
      }
    }
  });
});

2. Memory profiling:

# Added to staging deployment process
# Run with --inspect flag for memory profiling
node --inspect=0.0.0.0:9229 server.js

# Take heap snapshots periodically
node --inspect --heapsnapshot-signal=SIGUSR2 server.js

Resource Limit Adjustments

Final configuration:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: user-api
spec:
  template:
    spec:
      containers:
      - name: user-api
        resources:
          requests:
            memory: "256Mi"  # Baseline
            cpu: "200m"
          limits:
            memory: "512Mi"  # Reduced from 1GB after leak fixed
            cpu: "500m"

Lessons Learned

What Went Well ✓

  1. Quick detection - Alert fired within minutes of elevated restarts
  2. Fast diagnosis - Identified OOMKilled in 5 minutes
  3. Team collaboration - Dev and SRE worked together effectively
  4. Good hypothesis - Correctly suspected recent deployment
  5. Effective hotfix - Fix developed and deployed in under 2 hours
  6. Minimal user impact - Degradation but no complete outage

What Went Wrong ✗

  1. Memory leak shipped to production - Should have been caught in testing
  2. Load test too short - 5 minutes insufficient to reveal 15-minute leak
  3. No memory profiling - Never profiled memory usage during development
  4. Resource limits too tight - 512MB left no buffer for issues
  5. Temporary fix attempted first - Increased memory instead of fixing leak
  6. No automated memory testing - Should be part of CI/CD

Surprises 😮

  1. How fast it leaked - 400MB in 15 minutes with normal traffic
  2. Linear growth - Very consistent, predictable leak pattern
  3. Node.js event emitters - Easy to create memory leaks with listeners
  4. OOMKilled cascading - Pod restarts caused brief service blips
  5. Monitoring saved us - Without metrics, would have been much harder to diagnose

Action Items

Completed ✅

Action | Owner | Completed
Deploy hotfix removing event listener leak | Dev Team | 2025-09-28
Add memory growth alerts | SRE Team | 2025-09-28
Add OOMKill alerts | SRE Team | 2025-09-28
Revert memory limit to 512MB | SRE Team | 2025-09-29

In Progress 🔄

Action | Owner | Target Date
Implement long-running memory tests | QA Team | 2025-10-05
Add memory profiling to staging | Platform Team | 2025-10-10
Document event listener best practices | Tech Lead | 2025-10-12

Planned ⏳

Action | Owner | Target Date
Audit all event listeners across codebase | Dev Team | 2025-10-20
Implement automated memory leak detection | Platform Team | 2025-11-01
Add heap snapshot analysis to CI/CD | DevOps Team | 2025-11-15

Technical Deep Dive

Node.js Memory Management

How V8 heap works:

Total Memory Limit: 512MB
├─ Young Generation: ~100MB (new objects)
│  └─ Garbage collected frequently
├─ Old Generation: ~400MB (long-lived objects)
│  └─ Garbage collected less frequently
└─ Code, Stack, etc: ~12MB
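Node can report these heap spaces directly, which is a quick way to confirm whether growth is happening in the old generation (where long-lived leaked objects such as accumulated listeners end up). A small sketch using the built-in v8 module:

// Log V8 heap space usage (built-in v8 module).
const v8 = require('v8');

for (const space of v8.getHeapSpaceStatistics()) {
  const usedMB = (space.space_used_size / 1024 / 1024).toFixed(1);
  const sizeMB = (space.space_size / 1024 / 1024).toFixed(1);
  console.log(`${space.space_name}: ${usedMB}MB used of ${sizeMB}MB`);
}
// Typical space names include new_space, old_space, code_space, and large_object_space.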

Event listener leak impact:

// Each listener ~500 bytes
15,000 listeners × 500 bytes = 7.5MB for listeners alone

// But each listener references:
// - Callback function
// - Closure variables
// - Context

// Actual memory per listener: ~20KB
15,000 listeners × 20KB = 300MB of leaked memory!
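The ~20KB figure depends on what each closure captures, so it is easiest to check empirically. A rough measurement sketch (run with node --expose-gc so the heap samples are comparable; the captured context object is purely illustrative):

// Rough per-listener cost: add N listeners that each capture some context, then diff the heap.
const { EventEmitter } = require('events');

const emitter = new EventEmitter();
emitter.setMaxListeners(0); // silence the >10 listener warning for this experiment

global.gc(); // requires: node --expose-gc
const before = process.memoryUsage().heapUsed;

const N = 15000;
for (let i = 0; i < N; i++) {
  const context = { key: `key-${i}`, payload: new Array(100).fill(i) }; // stand-in for captured state
  emitter.on('change', () => context.key); // the closure keeps context alive
}

global.gc();
const after = process.memoryUsage().heapUsed;
console.log(`~${((after - before) / N / 1024).toFixed(1)}KB per listener`);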

Detecting Memory Leaks

1. Memory usage trending up:

# Verify the inspector endpoint is reachable (process must be started with --inspect)
curl http://localhost:9229/json/version
# Then connect Chrome DevTools (chrome://inspect) to watch heap usage live

2. Event listener count:

// In Node.js (undocumented internals, useful for a quick look)
process._getActiveHandles()
process._getActiveRequests()

// Check EventEmitter listener count
emitter.listenerCount('event')

3. Heap snapshots:

# Take heap snapshot
kill -SIGUSR2 <pid>

# Analyze with Chrome DevTools
# Look for:
# - Objects growing over time
# - Unexpected object retention
# - Large arrays or maps
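Snapshots can also be taken from inside the process without signals, which is convenient in containers. A sketch using the built-in v8.writeHeapSnapshot:

// Write a heap snapshot from inside the process (built-in v8 module).
const v8 = require('v8');

// Writes a Heap-*.heapsnapshot file in the working directory and returns its name.
const file = v8.writeHeapSnapshot();
console.log(`Heap snapshot written to ${file}`);
// Open it in Chrome DevTools (Memory tab) and compare snapshots taken over time.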

Appendix

Useful Commands

Check pod memory:

kubectl top pod user-api-xyz -n production

Get OOMKilled pods:

kubectl get pods -n production -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[0].lastState.terminated.reason}{"\n"}{end}' | grep OOMKilled

Describe pod to see last termination:

kubectl describe pod user-api-xyz -n production | grep -A5 "Last State"

Node.js memory debugging:

// Log memory usage
console.log(process.memoryUsage());
// {
//   rss: 123456789,        // Resident set size: total memory held by the process
//   heapTotal: 87654321,   // Total V8 heap allocated
//   heapUsed: 56789012,    // V8 heap actually in use
//   external: 1234567      // Memory used by C++ objects bound to JS objects
// }

// Force garbage collection (--expose-gc flag required)
global.gc();


Incident Commander: Dave Thompson
Contributors: Emma Rodriguez (SRE), Frank Lee (Dev), Grace Kim (QA)
Postmortem Completed: 2025-09-29
Next Review: 2025-10-29 (1 month follow-up)