Incident: Kubernetes OOMKilled - Memory Leak in Production

Incident Summary

Date: 2025-09-28
Time: 14:30 UTC
Duration: 2 hours 15 minutes
Severity: SEV-2 (High)
Impact: Intermittent service degradation and elevated error rates

Quick Facts

Users Affected: ~30% of users experiencing slow responses
Services Affected: User API service
Error Rate: Spiked from 0.5% to 8%
SLO Impact: 25% of monthly error budget consumed

Timeline

14:30 - Prometheus alert: High pod restart rate detected
14:31 - On-call engineer (Dave) acknowledged, investigating
14:33 - Observed pattern: Pods restarting every 15-20 minutes
14:35 - Checked pod status: OOMKilled (exit code 137)
14:37 - Senior SRE (Emma) joined investigation
14:40 - Checked resource limits: 512MB memory limit per pod
14:42 - Reviewed recent deployments: New caching feature deployed yesterday
14:45 - Examined memory metrics: Linear growth from 100MB → 512MB over 15 min
14:50 - Hypothesis: Memory leak in new caching code
14:52 - Decision: Increase memory limit to 1GB as temporary mitigation
14:55 - Memory limit increased, pods restarted with new limits
15:00 - Pod restart frequency decreased (now every ~30 minutes)
15:05 - Confirmed leak still present, just slower with more memory
15:10 - Development team engaged to investigate caching code
15:25 - Memory leak identified: Event listeners not being removed
15:35 - Fix developed and tested locally
15:45 - Hotfix deployed to production
16:00 - Memory usage stabilized at ~180MB
16:15 - Monitoring shows no growth, pods stable
16:30 - Error rate returned to baseline
16:45 - Incident marked as resolved

Root Cause Analysis

What Happened

On September 27th, a new feature was deployed that implemented an in-memory cache with event-driven invalidation. The cache listened to database change events to invalidate cached entries.
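
To illustrate the class of bug described here, the following is a minimal sketch of how an event-driven cache can leak listeners. It is not the production code: `ChangeFeed`, `LeakyCache`, `FixedCache`, and `listeners_after_churn` are hypothetical names, and registering one listener per cached entry is an assumption about how such a leak could arise, not a confirmed detail of the incident.

```python
from typing import Callable, Dict, List

Listener = Callable[[str], None]


class ChangeFeed:
    """Hypothetical stand-in for the database change-event source."""

    def __init__(self) -> None:
        self._listeners: List[Listener] = []

    def subscribe(self, listener: Listener) -> None:
        self._listeners.append(listener)

    def unsubscribe(self, listener: Listener) -> None:
        self._listeners.remove(listener)

    def publish(self, key: str) -> None:
        # Iterate over a copy so listeners may unsubscribe while being notified.
        for listener in list(self._listeners):
            listener(key)

    def listener_count(self) -> int:
        return len(self._listeners)


class LeakyCache:
    """Assumed leak pattern: one listener per put(), never removed."""

    def __init__(self, feed: ChangeFeed) -> None:
        self._feed = feed
        self._entries: Dict[str, bytes] = {}

    def put(self, key: str, value: bytes) -> None:
        self._entries[key] = value

        def on_change(changed_key: str) -> None:
            if changed_key == key:
                self._entries.pop(key, None)

        # Leak: the feed keeps referencing this closure after invalidation.
        self._feed.subscribe(on_change)


class FixedCache:
    """Same cache, but each listener is removed once it is no longer needed."""

    def __init__(self, feed: ChangeFeed) -> None:
        self._feed = feed
        self._entries: Dict[str, bytes] = {}
        self._listeners: Dict[str, Listener] = {}

    def put(self, key: str, value: bytes) -> None:
        self._drop_listener(key)  # avoid stacking listeners when a key is re-cached
        self._entries[key] = value

        def on_change(changed_key: str) -> None:
            if changed_key == key:
                self._entries.pop(key, None)
                self._drop_listener(key)

        self._listeners[key] = on_change
        self._feed.subscribe(on_change)

    def _drop_listener(self, key: str) -> None:
        listener = self._listeners.pop(key, None)
        if listener is not None:
            self._feed.unsubscribe(listener)


def listeners_after_churn(cache_cls, rounds: int) -> int:
    """Cache and invalidate the same key repeatedly; return listeners still registered."""
    feed = ChangeFeed()
    cache = cache_cls(feed)
    for _ in range(rounds):
        cache.put("user:1", b"payload")
        feed.publish("user:1")
    return feed.listener_count()
```

In this sketch, calling `listeners_after_churn(LeakyCache, rounds)` leaves one listener registered per round, while `FixedCache` leaves none. Growth that scales with request volume rather than with the number of live cache entries would be consistent with the steady climb seen in the memory metrics, though the actual growth rate depends on details the report does not show.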
…