Incident: Database Connection Pool Exhaustion
Incident Summary

- Date: 2025-10-14
- Time: 03:15 UTC
- Duration: 23 minutes
- Severity: SEV-1 (Critical)
- Impact: Complete API unavailability affecting 100% of users

Quick Facts

- Users Affected: ~2,000 active users
- Services Affected: API, Admin Dashboard, Mobile App
- Revenue Impact: ~$4,500 in lost transactions
- SLO Impact: Consumed 45% of monthly error budget

Timeline

- 03:15:00 - PagerDuty alert fired: API health check failures
- 03:15:30 - On-call engineer (Alice) acknowledged alert
- 03:16:00 - Initial investigation: all API pods showing healthy status
- 03:17:00 - Checked application logs: “connection timeout” errors appearing
- 03:18:00 - Senior engineer (Bob) joined incident response
- 03:19:00 - Identified pattern: all database connection attempts timing out
- 03:20:00 - Checked database status: PostgreSQL running normally
- 03:22:00 - Checked connection pool metrics: 100/100 connections in use
- 03:23:00 - Root cause identified: background job leaking connections
- 03:25:00 - Decision made to restart API pods to release connections
- 03:27:00 - Rolling restart initiated for API deployment
- 03:30:00 - First pods restarted, connection pool draining
- 03:33:00 - 50% of pods restarted, API partially operational
- 03:35:00 - All pods restarted, connection pool normalized
- 03:36:00 - Smoke tests passed, API fully operational
- 03:38:00 - Incident marked as resolved
- 03:45:00 - Post-incident monitoring confirmed stability

Root Cause Analysis

What Happened

The API service uses a PostgreSQL connection pool configured with a maximum of 100 connections. A background job for data synchronization had been deployed on October 12th, two days before the incident.
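The excerpt does not show the offending job's code, but the failure mode it describes (a worker that borrows connections from a bounded pool and never returns them) is easy to illustrate. Below is a minimal sketch, assuming a Python worker using psycopg2 connection pooling; the pool size, DSN, table, and function names are illustrative, not taken from the incident.

```python
# Hypothetical sketch of the leak pattern described above (not the actual job).
# Assumes Python + psycopg2; DSN, pool size, and table names are illustrative.
from psycopg2.pool import ThreadedConnectionPool

pool = ThreadedConnectionPool(minconn=1, maxconn=100,
                              dsn="postgresql://app@db.internal/app")

def sync_batch_leaky(rows):
    conn = pool.getconn()          # borrows one of the 100 shared connections
    cur = conn.cursor()
    cur.executemany("INSERT INTO sync_target VALUES (%s, %s)", rows)
    conn.commit()
    # BUG: pool.putconn(conn) is never called, so every run permanently
    # removes one connection from the pool until all 100 are exhausted.

def sync_batch_fixed(rows):
    conn = pool.getconn()
    try:
        with conn.cursor() as cur:
            cur.executemany("INSERT INTO sync_target VALUES (%s, %s)", rows)
        conn.commit()
    finally:
        pool.putconn(conn)         # always return the connection, even on error
```

With the leaky variant running every few minutes after the October 12th deploy, a 100-connection pool drains in roughly two days, which matches the timeline above.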
…
October 14, 2025 · 7 min · DevOps Engineer
Incident: Redis Cache Failure Causes Cascading Database Load
Incident Summary

- Date: 2025-09-05
- Time: 09:45 UTC
- Duration: 1 hour 32 minutes
- Severity: SEV-1 (Critical)
- Impact: Severe performance degradation affecting 85% of users

Quick Facts

- Users Affected: ~8,500 active users (85%)
- Services Affected: Web Application, Mobile API, Admin Dashboard
- Response Time: P95 latency increased from 200ms to 45 seconds
- Revenue Impact: ~$18,000 in lost sales and abandoned carts
- SLO Impact: 70% of monthly error budget consumed

Timeline

- 09:45:00 - Redis cluster health check alert: node down
- 09:45:15 - Application latency spiked dramatically
- 09:45:30 - PagerDuty alert: P95 latency > 10 seconds
- 09:46:00 - On-call engineer (Sarah) acknowledged alert
- 09:47:00 - Database CPU spiked to 95% utilization
- 09:48:00 - Database connection pool approaching limits (180/200)
- 09:49:00 - User complaints started flooding support channels
- 09:50:00 - Senior SRE (Marcus) joined incident response
- 09:52:00 - Checked Redis status: master node unresponsive
- 09:54:00 - Identified: Redis master failure, failover not working
- 09:56:00 - Incident escalated to SEV-1, incident commander assigned
- 09:58:00 - Attempted automatic failover: failed
- 10:00:00 - Decision: manual promotion of Redis replica to master
- 10:03:00 - Promoted replica-1 to master manually
- 10:05:00 - Updated application config to point to new master
- 10:08:00 - Rolling restart of application pods initiated
- 10:15:00 - 50% of pods restarted with new Redis endpoint
- 10:18:00 - Cache warming started for critical keys
- 10:22:00 - Database load starting to decrease (CPU: 65%)
- 10:25:00 - P95 latency improved to 3 seconds
- 10:30:00 - All pods restarted, cache rebuild in progress
- 10:40:00 - P95 latency down to 800ms
- 10:50:00 - Cache fully populated, metrics returning to normal
- 11:05:00 - P95 latency at 220ms (near baseline)
- 11:17:00 - Incident marked as resolved
- 11:30:00 - Post-incident monitoring confirmed stability

Root Cause Analysis

What Happened

The production Redis cluster consisted of 1 master and 2 replicas running Redis Sentinel for high availability. On September 5th at 09:45 UTC, the Redis master node experienced a kernel panic due to an underlying infrastructure issue.
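The manual failover performed between 10:00 and 10:05 can be sketched with redis-py. This is an illustrative sketch only: the hostnames and ports are assumptions, and it bypasses Sentinel directly because, per the timeline, Sentinel was not completing the failover on its own.

```python
# Hypothetical sketch of the manual promotion described in the timeline.
# Assumes Python + redis-py; hostnames and ports are illustrative.
import redis

REPLICA_TO_PROMOTE = "replica-1.internal"
OTHER_REPLICA = "replica-2.internal"

# 1. Detach the chosen replica from the dead master so it starts accepting writes.
new_master = redis.Redis(host=REPLICA_TO_PROMOTE, port=6379)
new_master.execute_command("REPLICAOF", "NO", "ONE")

# 2. Point the remaining replica at the newly promoted master.
other = redis.Redis(host=OTHER_REPLICA, port=6379)
other.execute_command("REPLICAOF", REPLICA_TO_PROMOTE, "6379")

# 3. Sanity-check replication roles before repointing the application.
assert new_master.info("replication")["role"] == "master"
assert other.info("replication")["role"] == "slave"
```

Only after these roles are confirmed does it make sense to update the application configuration and roll the pods, which is what the 10:05-10:30 timeline entries describe.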
…
September 5, 2025 · 11 min · DevOps Engineer