Incident: Database Connection Pool Exhaustion
Incident Summary

- Date: 2025-10-14
- Time: 03:15 UTC
- Duration: 23 minutes
- Severity: SEV-1 (Critical)
- Impact: Complete API unavailability affecting 100% of users
Quick Facts

- Users Affected: ~2,000 active users
- Services Affected: API, Admin Dashboard, Mobile App
- Revenue Impact: ~$4,500 in lost transactions
- SLO Impact: Consumed 45% of monthly error budget

Timeline

- 03:15:00 - PagerDuty alert fired: API health check failures
- 03:15:30 - On-call engineer (Alice) acknowledged alert
- 03:16:00 - Initial investigation: all API pods showing healthy status
- 03:17:00 - Checked application logs: "connection timeout" errors appearing
- 03:18:00 - Senior engineer (Bob) joined incident response
- 03:19:00 - Identified pattern: all database connection attempts timing out
- 03:20:00 - Checked database status: PostgreSQL running normally
- 03:22:00 - Checked connection pool metrics: 100/100 connections in use
- 03:23:00 - Root cause identified: background job leaking connections
- 03:25:00 - Decision made to restart API pods to release connections
- 03:27:00 - Rolling restart initiated for API deployment
- 03:30:00 - First pods restarted, connection pool draining
- 03:33:00 - 50% of pods restarted, API partially operational
- 03:35:00 - All pods restarted, connection pool normalized
- 03:36:00 - Smoke tests passed, API fully operational
- 03:38:00 - Incident marked as resolved
- 03:45:00 - Post-incident monitoring confirmed stability

Root Cause Analysis

What Happened

The API service uses a PostgreSQL connection pool configured with a maximum of 100 connections. A background job for data synchronization was deployed on October 12th, two days prior to the incident.
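The report does not include the job's code, but the failure mode it describes (a fixed 100-connection pool, a background job checking connections out and never returning them) corresponds to a well-known leak pattern. Below is a minimal sketch, assuming Python with psycopg2's ThreadedConnectionPool; the actual driver, pool library, and names such as sync_record and sync_queue are illustrative assumptions, not taken from the incident.

```python
# Illustrative sketch only; the incident report does not name the driver or pool library.
from psycopg2.pool import ThreadedConnectionPool

# Pool capped at 100 connections, matching the limit described in the report.
pool = ThreadedConnectionPool(minconn=5, maxconn=100, dsn="postgresql://app@db/appdb")

def sync_record_leaky(record_id: int) -> None:
    """Leak pattern: if execute() or processing raises, putconn() is never reached."""
    conn = pool.getconn()
    cur = conn.cursor()
    cur.execute("SELECT payload FROM sync_queue WHERE id = %s", (record_id,))
    row = cur.fetchone()
    # ... process row ...
    conn.commit()
    pool.putconn(conn)  # skipped on any exception above, so the connection leaks

def sync_record_safe(record_id: int) -> None:
    """Fixed pattern: try/finally guarantees the connection returns to the pool."""
    conn = pool.getconn()
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT payload FROM sync_queue WHERE id = %s", (record_id,))
            row = cur.fetchone()
            # ... process row ...
        conn.commit()
    finally:
        pool.putconn(conn)
```

In the leaky variant, each call that raises or returns early strands one connection; once all 100 are stranded, every API request blocks waiting for a connection and times out, which is consistent with the 100/100 pool saturation observed at 03:22.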
…