Incident Summary
Date: 2025-09-28
Time: 14:30 UTC
Duration: 2 hours 15 minutes
Severity: SEV-2 (High)
Impact: Intermittent service degradation and elevated error rates
Quick Facts
- Users Affected: ~30% of users experiencing slow responses
- Services Affected: User API service
- Error Rate: Spiked from 0.5% to 8%
- SLO Impact: 25% of monthly error budget consumed (a rough burn estimate follows below)
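As a rough cross-check of that burn figure, the consumed budget can be estimated from the error rates and duration above. This is a back-of-the-envelope sketch only: it assumes a 99.9% monthly availability SLO over a 30-day window, neither of which is stated in this report.

// Rough error-budget burn estimate (illustrative; the SLO target is an assumption)
const sloTarget = 0.999;                 // ASSUMED 99.9% monthly SLO
const errorBudget = 1 - sloTarget;       // 0.1% of requests may fail
const windowHours = 30 * 24;             // ~720h in a 30-day window

const incidentErrorRate = 0.08;          // 8% during the incident
const baselineErrorRate = 0.005;         // 0.5% baseline
const incidentHours = 2.25;              // 2h 15m

const burned = ((incidentErrorRate - baselineErrorRate) * incidentHours) /
               (errorBudget * windowHours);

console.log(`~${(burned * 100).toFixed(0)}% of monthly budget`); // ~23%, in line with the ~25% reported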
Timeline
- 14:30 - Prometheus alert: High pod restart rate detected
- 14:31 - On-call engineer (Dave) acknowledged, investigating
- 14:33 - Observed pattern: Pods restarting every 15-20 minutes
- 14:35 - Checked pod status: OOMKilled (exit code 137)
- 14:37 - Senior SRE (Emma) joined investigation
- 14:40 - Checked resource limits: 512MB memory limit per pod
- 14:42 - Reviewed recent deployments: New caching feature deployed yesterday
- 14:45 - Examined memory metrics: Linear growth from 100MB → 512MB over 15 min
- 14:50 - Hypothesis: Memory leak in new caching code
- 14:52 - Decision: Increase memory limit to 1GB as temporary mitigation
- 14:55 - Memory limit increased, pods restarted with new limits
- 15:00 - Pod restart frequency decreased (now every ~30 minutes)
- 15:05 - Confirmed leak still present, just slower with more memory
- 15:10 - Development team engaged to investigate caching code
- 15:25 - Memory leak identified: Event listeners not being removed
- 15:35 - Fix developed and tested locally
- 15:45 - Hotfix deployed to production
- 16:00 - Memory usage stabilized at ~180MB
- 16:15 - Monitoring shows no growth, pods stable
- 16:30 - Error rate returned to baseline
- 16:45 - Incident marked as resolved
Root Cause Analysis
What Happened
On September 27th, a new feature was deployed that implemented an in-memory cache with event-driven invalidation. The cache listened to database change events to invalidate cached entries.
The bug:
// Problematic code
class CacheManager {
  constructor() {
    this.cache = new Map();
  }

  async getData(key) {
    if (this.cache.has(key)) {
      return this.cache.get(key);
    }
    const data = await database.query(key);
    this.cache.set(key, data);
    // BUG: Event listener added every time getData is called!
    database.on('change', () => {
      this.cache.delete(key);
    });
    return data;
  }
}
The problem: Every call to getData() added a new event listener without removing old ones. With thousands of requests per minute (see the listener-count sketch below):
- After 1 minute: ~1,000 event listeners
- After 5 minutes: ~5,000 event listeners
- After 15 minutes: ~15,000 event listeners → 512MB limit reached → OOMKilled
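A minimal reproduction of the pile-up, assuming the database client is a standard Node.js EventEmitter (which the .on/.off calls above imply); every repeated getData() for the same key registers one more listener:

const { EventEmitter } = require('node:events');

const database = new EventEmitter();

// Simulate 50 getData() calls for the same key, as the buggy CacheManager did
for (let i = 0; i < 50; i++) {
  database.on('change', () => { /* invalidate cache entry */ });
}

console.log(database.listenerCount('change')); // 50 -- and growing without bound

// Node.js also prints "MaxListenersExceededWarning" once a single event name
// exceeds 10 listeners (the default), an early signal of this class of bug.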
Memory Leak Visualization
Memory Usage Over Time:
512MB |----------------------X  (OOMKilled, pod restarts)
      |                     /
      |                   /
256MB |               /
      |            /
      |         /
100MB |-------/
      +------------------------
       0min     5min     10min     15min
Why It Wasn’t Caught
- Load testing insufficient - Ran for only 5 minutes (leak appears after 10+); a longer soak loop is sketched after this list
- Code review miss - Reviewers focused on caching logic, not event handling
- No memory profiling - Didn’t profile memory usage during testing
- Staging environment - Low traffic, leak manifested slowly
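To close the load-testing gap, a time-bounded soak loop is more reliable than a fixed iteration count. A sketch only (not the team's actual load-test tooling), reusing the cache.getData() driver shown in the CI test later in this document:

// Drive the cache for a configurable wall-clock duration and report heap growth,
// so a leak that only shows up after 10+ minutes has time to appear.
async function soak(cache, minutes = 20) {
  const deadline = Date.now() + minutes * 60 * 1000;
  const startHeap = process.memoryUsage().heapUsed;
  let i = 0;
  while (Date.now() < deadline) {
    await cache.getData(`key-${i % 1000}`);
    if (++i % 10000 === 0) {
      const growthMb = (process.memoryUsage().heapUsed - startHeap) / 1024 / 1024;
      console.log(`iterations=${i} heap growth=${growthMb.toFixed(1)}MB`);
    }
  }
}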
Immediate Fix
Step 1: Temporary Mitigation
# Increase memory limit from 512MB to 1GB
kubectl set resources deployment/user-api \
--limits=memory=1Gi \
--requests=memory=512Mi \
-n production
# This buys time but doesn't fix the leak
Result: Pods crashed every ~30 minutes instead of every ~15 minutes
Step 2: Hotfix Deployment
Fixed code:
class CacheManager {
  constructor() {
    this.cache = new Map();
    this.listeners = new Map();
  }

  async getData(key) {
    if (this.cache.has(key)) {
      return this.cache.get(key);
    }
    const data = await database.query(key);
    this.cache.set(key, data);
    // FIX: Only add listener once per key
    if (!this.listeners.has(key)) {
      const listener = () => {
        this.cache.delete(key);
      };
      database.on('change', listener);
      this.listeners.set(key, listener);
    }
    return data;
  }

  // Also added cleanup method
  clearKey(key) {
    this.cache.delete(key);
    const listener = this.listeners.get(key);
    if (listener) {
      database.off('change', listener);
      this.listeners.delete(key);
    }
  }
}
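A quick way to verify the fix locally (a sketch, again assuming database is a Node.js EventEmitter; the key name is arbitrary): repeated reads of the same key should leave the 'change' listener count flat.

(async () => {
  const cache = new CacheManager();

  for (let i = 0; i < 1000; i++) {
    await cache.getData('user:42');
  }

  // With the fix there is one listener per distinct key -- here, exactly 1.
  console.log(database.listenerCount('change')); // 1
})();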
Deployed:
# Deploy hotfix
kubectl set image deployment/user-api \
user-api=user-api:v2.3.1-hotfix \
-n production
# Monitor rollout
kubectl rollout status deployment/user-api -n production
Result: Memory usage stabilized, no more crashes
Long-term Prevention
Code Improvements
1. Better implementation (adopted as the long-term fix):
class CacheManager {
  constructor() {
    this.cache = new Map();
    // Single global listener instead of per-key listeners
    this.changeHandler = (event) => {
      const affectedKey = event.key;
      this.cache.delete(affectedKey);
    };
    database.on('change', this.changeHandler);
  }

  async getData(key) {
    if (this.cache.has(key)) {
      return this.cache.get(key);
    }
    const data = await database.query(key);
    this.cache.set(key, data);
    return data;
  }

  // Cleanup when cache manager destroyed
  destroy() {
    database.off('change', this.changeHandler);
    this.cache.clear();
  }
}
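This design bounds the listener count at one per process regardless of how many keys are cached. A hypothetical wiring sketch (not taken from the actual service) showing the intended single-instance lifecycle:

// Create the manager once at startup; exactly one 'change' listener exists.
const cacheManager = new CacheManager();

// Remove that listener (and drop cached data) on shutdown.
process.on('SIGTERM', () => {
  cacheManager.destroy();
  // ...then close the server and database connections before exiting
});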
2. Added memory leak detection:
// Warn if listener count grows too large
setInterval(() => {
  const listenerCount = database.listenerCount('change');
  if (listenerCount > 100) {
    logger.warn(`High listener count: ${listenerCount}`);
    metrics.record('listener_count_high', listenerCount);
  }
}, 60000); // Check every minute
Monitoring Enhancements
1. Memory growth alerts:
# Prometheus alert
- alert: MemoryGrowthAnomalous
  expr: |
    (container_memory_usage_bytes{pod=~"user-api.*"} -
     container_memory_usage_bytes{pod=~"user-api.*"} offset 10m) > 100000000
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Memory growing >100MB in 10 minutes"
2. OOMKill alerts:
- alert: PodOOMKilled
  expr: kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} == 1
  labels:
    severity: critical
  annotations:
    summary: "Pod killed due to OOM"
    description: "Investigate memory limits and potential memory leaks"
Testing Improvements
1. Long-running load tests:
// Added to CI/CD pipeline
describe('Memory leak tests', () => {
  it('should not leak memory over 30 minutes', async () => {
    const initialMemory = process.memoryUsage().heapUsed;
    // Simulate 30 minutes of traffic
    for (let i = 0; i < 100000; i++) {
      await cache.getData(`key-${i % 1000}`);
      // Sample memory every 1000 iterations
      if (i % 1000 === 0) {
        const currentMemory = process.memoryUsage().heapUsed;
        const growth = currentMemory - initialMemory;
        // Memory shouldn't grow more than 50MB
        expect(growth).toBeLessThan(50 * 1024 * 1024);
      }
    }
  });
});
2. Memory profiling:
# Added to staging deployment process
# Run with --inspect flag for memory profiling
node --inspect=0.0.0.0:9229 server.js
# Write a heap snapshot on demand by sending SIGUSR2 to the process
node --inspect --heapsnapshot-signal=SIGUSR2 server.js
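Where attaching Chrome DevTools to the inspector port is not practical, Node's built-in v8 module can also write a snapshot from inside the process. A minimal sketch (an alternative approach, not part of the actual staging setup):

// Write a heap snapshot when the process receives SIGUSR2.
const v8 = require('node:v8');

process.on('SIGUSR2', () => {
  // Writes a .heapsnapshot file to the working directory and returns its name;
  // open it in Chrome DevTools (Memory tab) for analysis.
  const file = v8.writeHeapSnapshot();
  console.log(`Heap snapshot written to ${file}`);
});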
Resource Limit Adjustments
Final configuration:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: user-api
spec:
  template:
    spec:
      containers:
        - name: user-api
          resources:
            requests:
              memory: "256Mi"   # Baseline
              cpu: "200m"
            limits:
              memory: "512Mi"   # Reduced from 1GB after leak fixed
              cpu: "500m"
Lessons Learned
What Went Well ✅
- Quick detection - Alert fired within minutes of elevated restarts
- Fast diagnosis - Identified OOMKilled in 5 minutes
- Team collaboration - Dev and SRE worked together effectively
- Good hypothesis - Correctly suspected recent deployment
- Effective hotfix - Fix developed and deployed in under 2 hours
- Minimal user impact - Degradation but no complete outage
What Went Wrong ❌
- Memory leak shipped to production - Should have been caught in testing
- Load test too short - 5 minutes insufficient to reveal 15-minute leak
- No memory profiling - Never profiled memory usage during development
- Resource limits too tight - 512MB left no buffer for issues
- Temporary fix attempted first - Increased memory instead of fixing leak
- No automated memory testing - Should be part of CI/CD
Surprises 😮
- How fast it leaked - 400MB in 15 minutes with normal traffic
- Linear growth - Very consistent, predictable leak pattern
- Node.js event emitters - Easy to create memory leaks with listeners
- OOMKilled cascading - Pod restarts caused brief service blips
- Monitoring saved us - Without metrics, would have been much harder to diagnose
Action Items
Completed ✅
Action | Owner | Completed |
---|---|---|
Deploy hotfix removing event listener leak | Dev Team | 2025-09-28 |
Add memory growth alerts | SRE Team | 2025-09-28 |
Add OOMKill alerts | SRE Team | 2025-09-28 |
Revert memory limit to 512MB | SRE Team | 2025-09-29 |
In Progress 🔄
Action | Owner | Target Date |
---|---|---|
Implement long-running memory tests | QA Team | 2025-10-05 |
Add memory profiling to staging | Platform Team | 2025-10-10 |
Document event listener best practices | Tech Lead | 2025-10-12 |
Planned ⏳
Action | Owner | Target Date |
---|---|---|
Audit all event listeners across codebase | Dev Team | 2025-10-20 |
Implement automated memory leak detection | Platform Team | 2025-11-01 |
Add heap snapshot analysis to CI/CD | DevOps Team | 2025-11-15 |
Technical Deep Dive
Node.js Memory Management
How V8 heap works:
Total Memory Limit: 512MB
├── Young Generation: ~100MB (new objects)
│     └── Garbage collected frequently
├── Old Generation: ~400MB (long-lived objects)
│     └── Garbage collected less frequently
└── Code, Stack, etc.: ~12MB
Event listener leak impact:
// Each listener ~500 bytes
15,000 listeners × 500 bytes = 7.5MB for listeners alone

// But each listener references:
// - Callback function
// - Closure variables
// - Context
// Actual memory per listener: ~20KB
15,000 listeners × 20KB = 300MB of leaked memory!
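A small experiment that makes the closure cost visible (illustrative only; absolute numbers vary by Node version, and the ~10KB payload is an assumption, not the service's actual per-request data):

const { EventEmitter } = require('node:events');

const emitter = new EventEmitter();
emitter.setMaxListeners(0); // silence the (useful!) default warning for this demo

const before = process.memoryUsage().heapUsed;

for (let i = 0; i < 15000; i++) {
  const data = { key: `key-${i}`, payload: 'x'.repeat(10000) }; // ~10KB kept alive by the closure
  emitter.on('change', () => data.key);
}

const after = process.memoryUsage().heapUsed;
console.log(`Heap growth: ${((after - before) / 1024 / 1024).toFixed(0)}MB`);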
Detecting Memory Leaks
1. Memory usage trending up:
# Check that the inspector (started with --inspect) is reachable
curl http://localhost:9229/json/version
# Then connect Chrome DevTools (chrome://inspect) and watch heap usage in the Memory tab
2. Event listener count:
// In Node.js (undocumented internals, useful for ad-hoc debugging)
process._getActiveHandles()
process._getActiveRequests()
// Check EventEmitter listener count
emitter.listenerCount('event')
3. Heap snapshots:
# Take a heap snapshot (process must be started with --heapsnapshot-signal=SIGUSR2)
kill -SIGUSR2 <pid>
# Analyze with Chrome DevTools
# Look for:
# - Objects growing over time
# - Unexpected object retention
# - Large arrays or maps
Appendix
Useful Commands
Check pod memory:
kubectl top pod user-api-xyz -n production
Get OOMKilled pods:
kubectl get pods -n production -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[0].lastState.terminated.reason}{"\n"}{end}' | grep OOMKilled
Describe pod to see last termination:
kubectl describe pod user-api-xyz -n production | grep -A5 "Last State"
Node.js memory debugging:
// Log memory usage
console.log(process.memoryUsage());
// {
//   rss: 123456789,        // Resident set size (total process memory)
//   heapTotal: 87654321,   // Total V8 heap allocated
//   heapUsed: 56789012,    // Heap currently in use
//   external: 1234567      // Memory used by C++ objects bound to JS
// }

// Force garbage collection (requires the --expose-gc flag)
global.gc();
Incident Commander: Dave Thompson
Contributors: Emma Rodriguez (SRE), Frank Lee (Dev), Grace Kim (QA)
Postmortem Completed: 2025-09-29
Next Review: 2025-10-29 (1 month follow-up)