Incident Summary
- Date: 2025-10-14
- Time: 03:15 UTC
- Duration: 23 minutes
- Severity: SEV-1 (Critical)
- Impact: Complete API unavailability affecting 100% of users
Quick Facts
- Users Affected: ~2,000 active users
- Services Affected: API, Admin Dashboard, Mobile App
- Revenue Impact: ~$4,500 in lost transactions
- SLO Impact: Consumed 45% of monthly error budget (see the sanity check below)
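As a sanity check on that error-budget figure (assuming a 30-day calendar-month budget and that this outage was the only burn against it, which is an assumption rather than a stated fact):

```python
OUTAGE_MINUTES = 23
BUDGET_CONSUMED = 0.45
MINUTES_PER_MONTH = 30 * 24 * 60

monthly_error_budget = OUTAGE_MINUTES / BUDGET_CONSUMED       # ~51 minutes of allowed downtime
implied_availability_target = 1 - monthly_error_budget / MINUTES_PER_MONTH
print(f"budget ≈ {monthly_error_budget:.0f} min/month, implied SLO ≈ {implied_availability_target:.2%}")
```

That works out to roughly a 99.88% monthly availability target, which is consistent with the 45% figure above.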
Timeline
- 03:15:00 - PagerDuty alert fired: API health check failures
- 03:15:30 - On-call engineer (Alice) acknowledged alert
- 03:16:00 - Initial investigation: All API pods showing healthy status
- 03:17:00 - Checked application logs: “connection timeout” errors appearing
- 03:18:00 - Senior engineer (Bob) joined incident response
- 03:19:00 - Identified pattern: All database connection attempts timing out
- 03:20:00 - Checked database status: PostgreSQL running normally
- 03:22:00 - Checked connection pool metrics: 100/100 connections in use
- 03:23:00 - Root cause identified: Background job leaking connections
- 03:25:00 - Decision made to restart API pods to release connections
- 03:27:00 - Rolling restart initiated for API deployment
- 03:30:00 - First pods restarted, connection pool draining
- 03:33:00 - 50% of pods restarted, API partially operational
- 03:35:00 - All pods restarted, connection pool normalized
- 03:36:00 - Smoke tests passed, API fully operational
- 03:38:00 - Incident marked as resolved
- 03:45:00 - Post-incident monitoring confirmed stability
Root Cause Analysis
What Happened
The API service uses a PostgreSQL connection pool configured with a maximum of 100 connections. A background job for data synchronization had been deployed on October 12th, two days before the incident.
The background job’s error handling code had a critical flaw:
```python
# Problematic code
def sync_user_data(user_id):
    try:
        conn = connection_pool.get_connection()
        cursor = conn.cursor()
        cursor.execute("SELECT * FROM users WHERE id = %s", (user_id,))
        # Process data...
    except DataValidationError as e:
        logger.error(f"Validation failed: {e}")
        return  # Connection never released!
```
The flaw: when a `DataValidationError` occurred (malformed data), the exception was caught but the database connection was never returned to the pool.
Chain of Events
- Day 1-2: Background job ran successfully for valid data
- Day 2 evening: Corrupted user data introduced to database
- Day 2-3: Background job encountered validation errors
- Over 48 hours: Leaked connections accumulated (1-2 per hour; see the back-of-envelope check below)
- 03:15 UTC: Last available connection consumed
- 03:15-03:38: All new API requests timing out waiting for connections
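A back-of-envelope check of that timeline (the ~12-connection baseline is taken from the post-restart metrics later in this report; the leak rate is the observed range, so treat this as an estimate):

```python
POOL_SIZE = 100
BASELINE_ACTIVE = 12            # steady-state API usage observed after the restart
LEAK_RATE_PER_HOUR = 2          # upper end of the observed 1-2 connections/hour

headroom = POOL_SIZE - BASELINE_ACTIVE                  # ~88 connections available to leak
hours_to_exhaustion = headroom / LEAK_RATE_PER_HOUR     # ~44 hours
print(f"{headroom} spare connections / {LEAK_RATE_PER_HOUR} per hour ≈ {hours_to_exhaustion:.0f} hours")
```

At the upper end of the leak rate the pool runs out in roughly 44 hours, in line with the ~48 hours it took to exhaust.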
Contributing Factors
- Missing `finally` block - No guarantee connection release
- No connection pool monitoring - No alerts before exhaustion
- Code review miss - Resource cleanup not verified
- Gradual degradation - Took 48 hours to manifest, hard to correlate
- Test gap - Integration tests didn’t cover error paths with real connections
Immediate Fix
Actions Taken
1. Emergency mitigation (03:25 UTC)
```bash
# Rolling restart to release connections
kubectl rollout restart deployment/api-service -n production

# Monitor restart progress
kubectl rollout status deployment/api-service -n production
```
2. Verify recovery (03:36 UTC)
```bash
# Check connection pool metrics
kubectl exec -it api-service-pod-xyz -- \
  curl localhost:8080/metrics | grep db_connections_active

# Result: 12/100 connections (healthy)
```
3. Stop background job (03:40 UTC)
```bash
# Prevent recurrence until fixed
kubectl scale deployment/sync-job -n production --replicas=0
```
Why This Worked
- Restarting pods killed all connections (including leaked ones)
- Fresh pods started with empty connection pools
- Normal API traffic uses connections correctly
- Stopping background job prevented new leaks
Long-term Prevention
Code Fix
Fixed background job (deployed 2025-10-14 10:00 UTC):
```python
def sync_user_data(user_id):
    conn = None
    try:
        conn = connection_pool.get_connection()
        cursor = conn.cursor()
        cursor.execute("SELECT * FROM users WHERE id = %s", (user_id,))
        # Process data...
    except DataValidationError as e:
        logger.error(f"Validation failed for user {user_id}: {e}")
        return
    finally:
        # ALWAYS release connection
        if conn:
            connection_pool.release_connection(conn)
```
Better: Use context manager (Python best practice):
```python
def sync_user_data(user_id):
    try:
        with connection_pool.get_connection() as conn:
            cursor = conn.cursor()
            cursor.execute("SELECT * FROM users WHERE id = %s", (user_id,))
            # Process data...
        # Connection auto-released when exiting 'with' block
    except DataValidationError as e:
        logger.error(f"Validation failed for user {user_id}: {e}")
        return
```
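The `with` form only releases automatically if `get_connection()` hands back a context manager. If the pool in use does not provide one, a thin wrapper is enough; the sketch below is illustrative (the `pooled_connection` helper is hypothetical, not part of the existing codebase):

```python
import contextlib

@contextlib.contextmanager
def pooled_connection(pool):
    """Borrow a connection from `pool` and guarantee it is returned."""
    conn = pool.get_connection()
    try:
        yield conn                        # caller works with the connection here
    finally:
        pool.release_connection(conn)     # runs on success, exception, or early return
```

Call sites then read `with pooled_connection(connection_pool) as conn:` and cannot leak the connection on the `DataValidationError` path.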
Monitoring Improvements
1. Connection pool alerts (deployed 2025-10-14 12:00 UTC):
```yaml
# Prometheus alert rules
groups:
  - name: database
    rules:
      - alert: DatabaseConnectionPoolHigh
        expr: db_connection_pool_active / db_connection_pool_max > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Connection pool usage >80%"
      - alert: DatabaseConnectionPoolCritical
        expr: db_connection_pool_active / db_connection_pool_max > 0.95
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Connection pool nearly exhausted"
```
2. Grafana dashboard panel:
```json
{
  "title": "Database Connection Pool",
  "targets": [{
    "expr": "db_connection_pool_active",
    "legendFormat": "Active Connections"
  }],
  "thresholds": [
    {"value": 80, "color": "yellow"},
    {"value": 95, "color": "red"}
  ]
}
```
Process Improvements
1. Code review checklist updated:
- All database connections have `finally` blocks or context managers
- All file handles properly closed
- All HTTP connections released
- Resource cleanup verified in error paths
2. Static analysis rule:
Added to the pylint configuration: warn when a database connection is acquired outside a `try`/`finally` block or a context manager.
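The exact lint rule is not reproduced here; as a rough illustration of the idea (a standalone script, not the team's actual pylint plugin), an AST pass can flag `get_connection()` calls that are neither inside a `with` block nor inside a `try` with a `finally` clause:

```python
import ast

def find_unguarded_connections(source, filename="<module>"):
    """Report get_connection() calls not protected by `with` or try/finally."""
    tree = ast.parse(source, filename)
    # Build child -> parent links so each call site can be walked upward.
    parents = {child: parent
               for parent in ast.walk(tree)
               for child in ast.iter_child_nodes(parent)}
    findings = []
    for node in ast.walk(tree):
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Attribute)
                and node.func.attr == "get_connection"):
            cursor, guarded = node, False
            while cursor in parents:
                cursor = parents[cursor]
                if isinstance(cursor, (ast.With, ast.AsyncWith)):
                    guarded = True      # acquired via a context manager
                    break
                if isinstance(cursor, ast.Try) and cursor.finalbody:
                    guarded = True      # release can happen in the finally block
                    break
            if not guarded:
                findings.append(f"{filename}:{node.lineno}: unguarded get_connection()")
    return findings
```

Run against the original sync job this flags the acquisition inside the bare `try`/`except`, while both fixed versions pass.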
3. Integration test enhancement:
```python
def test_sync_job_handles_validation_errors():
    """Ensure connections released even on validation errors"""
    initial_pool_size = connection_pool.available_connections()

    # Trigger validation error
    sync_user_data(user_id_with_bad_data)

    final_pool_size = connection_pool.available_connections()
    # Connection should be returned to pool
    assert initial_pool_size == final_pool_size
```
Lessons Learned
What Went Well ✓
- Fast detection - Alert fired within seconds of issue
- Quick escalation - Senior engineer joined within 3 minutes
- Effective diagnosis - Found root cause in 8 minutes using metrics
- Decisive action - Made restart decision quickly
- Good communication - Incident channel kept stakeholders informed
- Monitoring tools - Prometheus metrics crucial for diagnosis
What Went Wrong ✗
- Resource leak - Basic programming error (missing `finally` block)
- Code review miss - Reviewers didn’t catch resource cleanup issue
- No preventive monitoring - Connection pool usage not alerted until too late
- Gradual failure - Silent degradation over 48 hours hard to detect
- Test gap - Integration tests only covered happy path
- Documentation - Connection pool limits not documented
Surprises 😮
- How slowly it manifested - Took 48 hours to exhaust 100 connections
- Complete failure - No graceful degradation, went from 99% to 0% instantly
- Database was fine - Initially suspected database issue, but DB was healthy
- Restart fixed it - Simple restart cleared the problem immediately
Action Items
Completed ✅
| Action | Owner | Completed |
|---|---|---|
| Fix connection leak in sync job | Dev Team | 2025-10-14 |
| Add connection pool monitoring | SRE Team | 2025-10-14 |
| Deploy Prometheus alerts | SRE Team | 2025-10-14 |
| Update code review checklist | Tech Lead | 2025-10-14 |
| Add integration tests for error paths | QA Team | 2025-10-15 |
In Progress 🔄
| Action | Owner | Target Date |
|---|---|---|
| Static analysis for resource leaks | Platform Team | 2025-10-20 |
| Document connection pool architecture | Tech Writers | 2025-10-25 |
| Review all background jobs for similar issues | Dev Team | 2025-10-30 |
Planned ⏳
| Action | Owner | Target Date |
|---|---|---|
| Increase connection pool size (100 → 200) | DBA Team | 2025-11-01 |
| Implement connection pool per service | Platform Team | 2025-11-15 |
| Add automated connection leak testing | QA Team | 2025-11-30 |
Technical Deep Dive
Connection Pool Mechanics
| State | Pool Size | Active | Available | Wait Queue |
|---|---|---|---|---|
| Normal operation | 100 | 12 | 88 | 0 |
| During incident | 100 | 100 (all leaked from background job) | 0 | 247 (API requests waiting) |
| After restart | 100 | 8 | 92 | 0 |
Why Connection Pools Matter
```python
# Without pool: Each request creates new connection (expensive)
def handle_request():
    conn = create_database_connection()  # ~50ms
    query_data(conn)
    conn.close()

# With pool: Reuse existing connections (fast)
def handle_request():
    conn = pool.get_connection()  # ~1ms (if available)
    query_data(conn)
    pool.release_connection(conn)
```
Connection Lifecycle
1. Application starts → Pool creates N connections
2. Request arrives → get_connection() (if available)
3. Execute query → use connection
4. Request completes → release_connection()
5. Connection returned to pool → available for reuse
If `release_connection()` is never called → connection never returned to the pool (the slot is lost until the process restarts)
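To make the lifecycle concrete, here is a toy pool (an illustration, not the production implementation) built on a bounded queue; exhausting it reproduces the blocking behavior API requests hit during the incident:

```python
import queue

class SimpleConnectionPool:
    """Toy pool illustrating the lifecycle above; not the production implementation."""

    def __init__(self, create_conn, size):
        self._available = queue.Queue(maxsize=size)
        for _ in range(size):                     # step 1: pre-create N connections
            self._available.put(create_conn())

    def get_connection(self, timeout=None):
        # Step 2: blocks until a connection is free; raises queue.Empty on timeout,
        # which surfaces as the "connection timeout" errors seen in the logs.
        return self._available.get(timeout=timeout)

    def release_connection(self, conn):
        # Steps 4-5: only an explicit release makes the slot reusable.
        # Returning early without this call leaks the slot, as the sync job did.
        self._available.put(conn)
```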
Related Incidents
- 2024-09-12: Similar issue with Redis connection pool (different team)
- 2024-06-03: HTTP client connection leak in payment service
- 2024-03-15: File handle leak causing “too many open files” error
Appendix
Useful Commands
Check connection pool status:
```bash
# Via application metrics endpoint
curl http://localhost:8080/metrics | grep db_connection

# Direct PostgreSQL query
psql -h db.example.com -U admin -c \
  "SELECT count(*) FROM pg_stat_activity WHERE datname='production';"
```
Find long-running connections:
```sql
SELECT
    pid,
    now() - pg_stat_activity.query_start AS duration,
    query,
    state
FROM pg_stat_activity
WHERE state != 'idle'
  AND now() - pg_stat_activity.query_start > interval '5 minutes'
ORDER BY duration DESC;
```
Kill stuck connection:
```sql
SELECT pg_terminate_backend(12345); -- Replace with actual PID
```
References
- Incident Commander: Alice Chen
- Contributors: Bob Martinez, Carol Davis (DBA), Dan Wilson (QA)
- Postmortem Completed: 2025-10-15
- Next Review: 2025-11-15 (1 month follow-up)