Incident Summary

Date: 2025-09-28
Time: 14:30 UTC
Duration: 2 hours 15 minutes
Severity: SEV-2 (High)
Impact: Intermittent service degradation and elevated error rates

Quick Facts

  • Users Affected: ~30% of users experiencing slow responses
  • Services Affected: User API service
  • Error Rate: Spiked from 0.5% to 8%
  • SLO Impact: 25% of monthly error budget consumed (rough math in the sketch below)
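The budget figure can be roughly reproduced from the numbers above. A back-of-envelope sketch, assuming a 99.9% monthly availability SLO and roughly uniform traffic (the SLO target itself is an assumption, not stated elsewhere in this report):

// Rough error-budget burn estimate.
// Assumption (not from the incident data): 99.9% monthly availability SLO.
const errorBudget = 1 - 0.999;            // 0.1% of monthly requests may fail
const incidentHours = 2.25;               // 14:30 - 16:45 UTC
const monthHours = 30 * 24;               // ~720 hours in a month
const incidentErrorRate = 0.08;           // 8% error rate during the incident

const failedFraction = incidentErrorRate * (incidentHours / monthHours); // ~0.025% of monthly requests
console.log(`${((failedFraction / errorBudget) * 100).toFixed(0)}% of monthly error budget`); // ~25%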

Timeline

  • 14:30 - Prometheus alert: High pod restart rate detected
  • 14:31 - On-call engineer (Dave) acknowledged, investigating
  • 14:33 - Observed pattern: Pods restarting every 15-20 minutes
  • 14:35 - Checked pod status: OOMKilled (exit code 137)
  • 14:37 - Senior SRE (Emma) joined investigation
  • 14:40 - Checked resource limits: 512MB memory limit per pod
  • 14:42 - Reviewed recent deployments: New caching feature deployed yesterday
  • 14:45 - Examined memory metrics: Linear growth from 100MB → 512MB over 15 min
  • 14:50 - Hypothesis: Memory leak in new caching code
  • 14:52 - Decision: Increase memory limit to 1GB as temporary mitigation
  • 14:55 - Memory limit increased, pods restarted with new limits
  • 15:00 - Pod restart frequency decreased (now every ~30 minutes)
  • 15:05 - Confirmed leak still present, just slower with more memory
  • 15:10 - Development team engaged to investigate caching code
  • 15:25 - Memory leak identified: Event listeners not being removed
  • 15:35 - Fix developed and tested locally
  • 15:45 - Hotfix deployed to production
  • 16:00 - Memory usage stabilized at ~180MB
  • 16:15 - Monitoring shows no growth, pods stable
  • 16:30 - Error rate returned to baseline
  • 16:45 - Incident marked as resolved

Root Cause Analysis

What Happened

On September 27th, a new feature was deployed that implemented an in-memory cache with event-driven invalidation. The cache listened to database change events to invalidate cached entries.

The bug:

// Problematic code
class CacheManager {
  constructor() {
    this.cache = new Map();
  }

  async getData(key) {
    if (this.cache.has(key)) {
      return this.cache.get(key);
    }

    const data = await database.query(key);
    this.cache.set(key, data);

    // BUG: A new 'change' listener is registered on every cache miss and never removed!
    database.on('change', () => {
      this.cache.delete(key);
    });

    return data;
  }
}

The problem: Every cache miss in getData() registered a new 'change' listener without removing the old ones, and every change event evicted the key, forcing the next request to miss the cache and register yet another listener. With thousands of requests per minute:

After 1 minute:  ~1,000 event listeners
After 5 minutes: ~5,000 event listeners
After 15 minutes: ~15,000 event listeners → 512MB limit reached → OOMKilled
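The accumulation is easy to reproduce in isolation. A minimal sketch of the same pattern, where a plain Node.js EventEmitter stands in for the real database client:

// Minimal repro of the leak pattern: one new 'change' listener per cache miss.
const { EventEmitter } = require('events');

const database = new EventEmitter(); // stand-in for the real database client

function leakyGet(key, cache) {
  if (!cache.has(key)) {
    cache.set(key, `value-for-${key}`);
    database.on('change', () => cache.delete(key)); // registered again on every miss, never removed
  }
  return cache.get(key);
}

const cache = new Map();
for (let i = 0; i < 50; i++) {
  leakyGet('user:42', cache);
  database.emit('change'); // each invalidation forces the next call to miss again
}

console.log(database.listenerCount('change')); // 50 — grows without bound
// Node also prints a MaxListenersExceededWarning once the count passes the default limit of 10.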

Memory Leak Visualization

Memory Usage Over Time:
512MB ├─────────────────────X (OOMKilled, pod restarts)
      │                   /
      │                 /
256MB │               /
      │             /
      │           /
100MB │─────────/
      └────────────────────────
      0min    5min    10min   15min

Why It Wasn’t Caught

  1. Load testing insufficient - Tests ran for only 5 minutes; the leak only becomes visible after 10+
  2. Code review miss - Reviewers focused on the caching logic, not on event handling
  3. No memory profiling - Memory usage was never profiled during testing
  4. Staging environment - Traffic was too low for the leak to become noticeable before release

Immediate Fix

Step 1: Temporary Mitigation

# Increase memory limit from 512MB to 1GB
kubectl set resources deployment/user-api \
  --limits=memory=1Gi \
  --requests=memory=512Mi \
  -n production

# This buys time but doesn't fix the leak

Result: Pods crashed every ~30 minutes instead of every ~15 minutes

Step 2: Hotfix Deployment

Fixed code:

class CacheManager {
  constructor() {
    this.cache = new Map();
    this.listeners = new Map();
  }

  async getData(key) {
    if (this.cache.has(key)) {
      return this.cache.get(key);
    }

    const data = await database.query(key);
    this.cache.set(key, data);

    // FIX: Only add listener once per key
    if (!this.listeners.has(key)) {
      const listener = () => {
        this.cache.delete(key);
      };
      database.on('change', listener);
      this.listeners.set(key, listener);
    }

    return data;
  }

  // Also added cleanup method
  clearKey(key) {
    this.cache.delete(key);
    const listener = this.listeners.get(key);
    if (listener) {
      database.off('change', listener);
      this.listeners.delete(key);
    }
  }
}
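A quick local sanity check for the fix (a sketch; it assumes the database object is a standard Node.js EventEmitter and that the CacheManager above is wired to it): repeated evictions and re-fetches of the same key should leave exactly one listener per distinct key.

// Illustrative check: listener count stays bounded by distinct keys, not by call count.
const assert = require('assert');

async function verifyFix(cacheManager, database) {
  for (let i = 0; i < 1000; i++) {
    await cacheManager.getData('user:42');
    database.emit('change'); // evict the entry and force the next call to miss
  }
  // With the fix, the 'change' listener for this key is registered only once.
  assert.strictEqual(database.listenerCount('change'), 1);
  console.log('listener count bounded: OK');
}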

Deployed:

# Deploy hotfix
kubectl set image deployment/user-api \
  user-api=user-api:v2.3.1-hotfix \
  -n production

# Monitor rollout
kubectl rollout status deployment/user-api -n production

Result: Memory usage stabilized, no more crashes

Long-term Prevention

Code Improvements

1. Better implementation (adopted as the long-term fix):

class CacheManager {
  constructor() {
    this.cache = new Map();

    // Single global listener instead of per-key listeners
    this.changeHandler = (event) => {
      const affectedKey = event.key;
      this.cache.delete(affectedKey);
    };

    database.on('change', this.changeHandler);
  }

  async getData(key) {
    if (this.cache.has(key)) {
      return this.cache.get(key);
    }

    const data = await database.query(key);
    this.cache.set(key, data);
    return data;
  }

  // Cleanup when cache manager destroyed
  destroy() {
    database.off('change', this.changeHandler);
    this.cache.clear();
  }
}
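With this design the process holds exactly one 'change' listener per CacheManager instance for its entire lifetime, regardless of how many keys are cached or how much traffic arrives, so the listener count can no longer grow with request volume. The trade-off is that every change event reaches the single handler, which relies on event.key to decide which entry to evict.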

2. Added memory leak detection:

// Warn if listener count grows too large
setInterval(() => {
  const listenerCount = database.listenerCount('change');
  if (listenerCount > 100) {
    logger.warn(`High listener count: ${listenerCount}`);
    metrics.record('listener_count_high', listenerCount);
  }
}, 60000); // Check every minute
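To make that signal visible outside the process, the same count can be exported as a metric. A sketch assuming the prom-client library and an existing /metrics endpoint (neither is confirmed by this report):

// Export the 'change' listener count as a Prometheus gauge (prom-client assumed).
const client = require('prom-client');

const listenerGauge = new client.Gauge({
  name: 'db_change_listener_count',
  help: 'Number of listeners registered on the database change event',
});

setInterval(() => {
  listenerGauge.set(database.listenerCount('change'));
}, 60000); // refresh every minute, matching the warning check above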

Monitoring Enhancements

1. Memory growth alerts:

# Prometheus alert
- alert: MemoryGrowthAnomalous
  expr: |
    (container_memory_usage_bytes{pod=~"user-api.*"} -
     container_memory_usage_bytes{pod=~"user-api.*"} offset 10m) > 100000000
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Memory growing >100MB in 10 minutes"

2. OOMKill alerts:

- alert: PodOOMKilled
  expr: kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} == 1
  labels:
    severity: critical
  annotations:
    summary: "Pod killed due to OOM"
    description: "Investigate memory limits and potential memory leaks"

Testing Improvements

1. Long-running load tests:

// Added to CI/CD pipeline (module path and setup below are illustrative)
const CacheManager = require('../src/cache-manager');

jest.setTimeout(45 * 60 * 1000); // long-running test needs a generous timeout

describe('Memory leak tests', () => {
  it('should not leak memory over 30 minutes', async () => {
    const cache = new CacheManager();
    const initialMemory = process.memoryUsage().heapUsed;

    // Simulate 30 minutes of traffic
    for (let i = 0; i < 100000; i++) {
      await cache.getData(`key-${i % 1000}`);

      // Sample memory every 1000 iterations
      if (i % 1000 === 0) {
        const currentMemory = process.memoryUsage().heapUsed;
        const growth = currentMemory - initialMemory;

        // Memory shouldn't grow more than 50MB
        expect(growth).toBeLessThan(50 * 1024 * 1024);
      }
    }
  });
});

2. Memory profiling:

# Added to staging deployment process
# Run with --inspect flag for memory profiling
node --inspect=0.0.0.0:9229 server.js

# Take heap snapshots periodically
node --inspect --heapsnapshot-signal=SIGUSR2 server.js

Resource Limit Adjustments

Final configuration:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: user-api
spec:
  template:
    spec:
      containers:
      - name: user-api
        resources:
          requests:
            memory: "256Mi"  # Baseline
            cpu: "200m"
          limits:
            memory: "512Mi"  # Reduced from 1GB after leak fixed
            cpu: "500m"

Lessons Learned

What Went Well ✓

  1. Quick detection - Alert fired within minutes of elevated restarts
  2. Fast diagnosis - Identified OOMKilled in 5 minutes
  3. Team collaboration - Dev and SRE worked together effectively
  4. Good hypothesis - Correctly suspected recent deployment
  5. Effective hotfix - Fix developed and deployed in under 2 hours
  6. Minimal user impact - Degradation but no complete outage

What Went Wrong ✗

  1. Memory leak shipped to production - Should have been caught in testing
  2. Load test too short - 5 minutes insufficient to reveal 15-minute leak
  3. No memory profiling - Never profiled memory usage during development
  4. Resource limits too tight - 512MB left no buffer for issues
  5. Temporary fix attempted first - Increased memory instead of fixing leak
  6. No automated memory testing - Should be part of CI/CD

Surprises 😮

  1. How fast it leaked - 400MB in 15 minutes with normal traffic
  2. Linear growth - Very consistent, predictable leak pattern
  3. Node.js event emitters - Easy to create memory leaks with listeners
  4. OOMKilled cascading - Pod restarts caused brief service blips
  5. Monitoring saved us - Without metrics, would have been much harder to diagnose

Action Items

Completed ✅

Action | Owner | Completed
Deploy hotfix removing event listener leak | Dev Team | 2025-09-28
Add memory growth alerts | SRE Team | 2025-09-28
Add OOMKill alerts | SRE Team | 2025-09-28
Revert memory limit to 512MB | SRE Team | 2025-09-29

In Progress 🔄

Action | Owner | Target Date
Implement long-running memory tests | QA Team | 2025-10-05
Add memory profiling to staging | Platform Team | 2025-10-10
Document event listener best practices | Tech Lead | 2025-10-12

Planned ⏳

Action | Owner | Target Date
Audit all event listeners across codebase | Dev Team | 2025-10-20
Implement automated memory leak detection | Platform Team | 2025-11-01
Add heap snapshot analysis to CI/CD | DevOps Team | 2025-11-15

Technical Deep Dive

Node.js Memory Management

How V8 heap works:

Total Memory Limit: 512MB
├─ Young Generation: ~100MB (new objects)
│  └─ Garbage collected frequently
├─ Old Generation: ~400MB (long-lived objects)
│  └─ Garbage collected less frequently
└─ Code, Stack, etc: ~12MB
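Node can report these heap spaces directly, which is a quick way to confirm whether growth is happening in the old generation (where long-lived leaked objects such as accumulated listeners end up). A small sketch using the built-in v8 module:

// Log V8 heap space usage (built-in v8 module).
const v8 = require('v8');

for (const space of v8.getHeapSpaceStatistics()) {
  const usedMB = (space.space_used_size / 1024 / 1024).toFixed(1);
  const sizeMB = (space.space_size / 1024 / 1024).toFixed(1);
  console.log(`${space.space_name}: ${usedMB}MB used of ${sizeMB}MB`);
}
// Typical space names include new_space, old_space, code_space, and large_object_space.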

Event listener leak impact:

// Each listener ~500 bytes
15,000 listeners × 500 bytes = 7.5MB for listeners alone

// But each listener references:
// - Callback function
// - Closure variables
// - Context

// Actual memory per listener: ~20KB
15,000 listeners × 20KB = 300MB of leaked memory!
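The ~20KB figure depends on what each closure captures, so it is easiest to check empirically. A rough measurement sketch (run with node --expose-gc so the heap samples are comparable; the captured context object is purely illustrative):

// Rough per-listener cost: add N listeners that each capture some context, then diff the heap.
const { EventEmitter } = require('events');

const emitter = new EventEmitter();
emitter.setMaxListeners(0); // silence the >10 listener warning for this experiment

global.gc(); // requires: node --expose-gc
const before = process.memoryUsage().heapUsed;

const N = 15000;
for (let i = 0; i < N; i++) {
  const context = { key: `key-${i}`, payload: new Array(100).fill(i) }; // stand-in for captured state
  emitter.on('change', () => context.key); // the closure keeps context alive
}

global.gc();
const after = process.memoryUsage().heapUsed;
console.log(`~${((after - before) / N / 1024).toFixed(1)}KB per listener`);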

Detecting Memory Leaks

1. Memory usage trending up:

# Verify the inspector endpoint is reachable (process must be started with --inspect)
curl http://localhost:9229/json/version
# Then connect Chrome DevTools (chrome://inspect) to watch heap usage live

2. Event listener count:

// In Node.js (undocumented internals, useful for a quick look)
process._getActiveHandles()
process._getActiveRequests()

// Check EventEmitter listener count
emitter.listenerCount('event')

3. Heap snapshots:

# Take heap snapshot
kill -SIGUSR2 <pid>

# Analyze with Chrome DevTools
# Look for:
# - Objects growing over time
# - Unexpected object retention
# - Large arrays or maps
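Snapshots can also be taken from inside the process without signals, which is convenient in containers. A sketch using the built-in v8.writeHeapSnapshot:

// Write a heap snapshot from inside the process (built-in v8 module).
const v8 = require('v8');

// Writes a Heap-*.heapsnapshot file in the working directory and returns its name.
const file = v8.writeHeapSnapshot();
console.log(`Heap snapshot written to ${file}`);
// Open it in Chrome DevTools (Memory tab) and compare snapshots taken over time.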

Appendix

Useful Commands

Check pod memory:

kubectl top pod user-api-xyz -n production

Get OOMKilled pods:

kubectl get pods -n production -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[0].lastState.terminated.reason}{"\n"}{end}' | grep OOMKilled

Describe pod to see last termination:

kubectl describe pod user-api-xyz -n production | grep -A5 "Last State"

Node.js memory debugging:

// Log memory usage
console.log(process.memoryUsage());
// {
//   rss: 123456789,        // Resident set size: total memory held by the process
//   heapTotal: 87654321,   // Total V8 heap allocated
//   heapUsed: 56789012,    // V8 heap actually in use
//   external: 1234567      // Memory used by C++ objects bound to JS objects
// }

// Force garbage collection (--expose-gc flag required)
global.gc();


Incident Commander: Dave Thompson
Contributors: Emma Rodriguez (SRE), Frank Lee (Dev), Grace Kim (QA)
Postmortem Completed: 2025-09-29
Next Review: 2025-10-29 (1 month follow-up)