Incident Summary

Date: 2025-10-14
Time: 03:15 UTC
Duration: 23 minutes
Severity: SEV-1 (Critical)
Impact: Complete API unavailability affecting 100% of users

Quick Facts

  • Users Affected: ~2,000 active users
  • Services Affected: API, Admin Dashboard, Mobile App
  • Revenue Impact: ~$4,500 in lost transactions
  • SLO Impact: Consumed 45% of monthly error budget

Timeline

  • 03:15:00 - PagerDuty alert fired: API health check failures
  • 03:15:30 - On-call engineer (Alice) acknowledged alert
  • 03:16:00 - Initial investigation: All API pods showing healthy status
  • 03:17:00 - Checked application logs: “connection timeout” errors appearing
  • 03:18:00 - Senior engineer (Bob) joined incident response
  • 03:19:00 - Identified pattern: All database connection attempts timing out
  • 03:20:00 - Checked database status: PostgreSQL running normally
  • 03:22:00 - Checked connection pool metrics: 100/100 connections in use
  • 03:23:00 - Root cause identified: Background job leaking connections
  • 03:25:00 - Decision made to restart API pods to release connections
  • 03:27:00 - Rolling restart initiated for API deployment
  • 03:30:00 - First pods restarted, connection pool draining
  • 03:33:00 - 50% of pods restarted, API partially operational
  • 03:35:00 - All pods restarted, connection pool normalized
  • 03:36:00 - Smoke tests passed, API fully operational
  • 03:38:00 - Incident marked as resolved
  • 03:45:00 - Post-incident monitoring confirmed stability

Root Cause Analysis

What Happened

The API service uses a PostgreSQL connection pool configured with a maximum of 100 connections. A background job for data synchronization was deployed on October 12th, two days before the incident.
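
For context, here is a minimal sketch of how a pool of that size might be configured. This is illustrative only: the service's real wrapper (which exposes get_connection()/release_connection()) is assumed, and the example uses psycopg2's ThreadedConnectionPool with placeholder credentials.

# Illustrative sketch only; the actual pool wrapper and settings are assumed
from psycopg2.pool import ThreadedConnectionPool

connection_pool = ThreadedConnectionPool(
    minconn=10,             # connections opened at startup
    maxconn=100,            # hard ceiling; the limit exhausted in this incident
    host="db.example.com",  # host taken from the appendix examples
    dbname="production",
    user="api_service",     # placeholder service account
    password="***",         # in practice, injected from a secret store
)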

The background job’s error handling code had a critical flaw:

# Problematic code
def sync_user_data(user_id):
    try:
        conn = connection_pool.get_connection()
        cursor = conn.cursor()
        cursor.execute("SELECT * FROM users WHERE id = %s", (user_id,))
        # Process data...
    except DataValidationError as e:
        logger.error(f"Validation failed: {e}")
        return  # Connection never released!

The flaw: When a DataValidationError occurred (malformed data), the exception was caught but the database connection was never returned to the pool.

Chain of Events

  1. Day 1-2: Background job ran successfully for valid data
  2. Day 2 evening: Corrupted user data introduced to database
  3. Day 2-3: Background job encountered validation errors
  4. Over 48 hours: Leaked connections accumulated (1-2 per hour)
  5. 03:15 UTC: Last available connection consumed
  6. 03:15-03:38: All new API requests timing out waiting for connections

Contributing Factors

  1. Missing finally block - Nothing guaranteed the connection would be released
  2. No connection pool monitoring - No alerts fired before the pool was exhausted
  3. Code review miss - Resource cleanup was not verified in review
  4. Gradual degradation - Took 48 hours to manifest, making it hard to correlate with the deploy
  5. Test gap - Integration tests didn’t cover error paths with real connections

Immediate Fix

Actions Taken

1. Emergency mitigation (03:25 UTC)

# Rolling restart to release connections
kubectl rollout restart deployment/api-service -n production

# Monitor restart progress
kubectl rollout status deployment/api-service -n production

2. Verify recovery (03:36 UTC)

# Check connection pool metrics
kubectl exec -it api-service-pod-xyz -- \
  curl localhost:8080/metrics | grep db_connections_active

# Result: 12/100 connections (healthy)

3. Stop background job (03:40 UTC)

# Prevent recurrence until fixed
kubectl scale deployment/sync-job -n production --replicas=0

Why This Worked

  • Restarting pods killed all connections (including leaked ones)
  • Fresh pods started with empty connection pools
  • Normal API traffic uses connections correctly
  • Stopping background job prevented new leaks

Long-term Prevention

Code Fix

Fixed background job (deployed 2025-10-14 10:00 UTC):

def sync_user_data(user_id):
    conn = None
    try:
        conn = connection_pool.get_connection()
        cursor = conn.cursor()
        cursor.execute("SELECT * FROM users WHERE id = %s", (user_id,))
        # Process data...
    except DataValidationError as e:
        logger.error(f"Validation failed for user {user_id}: {e}")
        return
    finally:
        # ALWAYS release connection
        if conn:
            connection_pool.release_connection(conn)

Better: Use context manager (Python best practice):

def sync_user_data(user_id):
    try:
        with connection_pool.get_connection() as conn:
            cursor = conn.cursor()
            cursor.execute("SELECT * FROM users WHERE id = %s", (user_id,))
            # Process data...
            # Connection auto-released when exiting 'with' block
    except DataValidationError as e:
        logger.error(f"Validation failed for user {user_id}: {e}")
        return
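
The with form above assumes the pool hands back an object that releases itself on exit. If the in-house pool does not already provide that, a thin helper along these lines would (a sketch built on the pool API shown in this document; pooled_connection is a hypothetical name):

from contextlib import contextmanager

@contextmanager
def pooled_connection():
    """Check a connection out of the pool and guarantee it is returned."""
    conn = connection_pool.get_connection()
    try:
        yield conn
    finally:
        connection_pool.release_connection(conn)  # runs on success and on error

Note that with some drivers (psycopg2, for example), using a raw connection as a context manager only manages the transaction and does not return the connection to the pool, which is why the release logic belongs in the pool wrapper rather than on the connection object.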

Monitoring Improvements

1. Connection pool alerts (deployed 2025-10-14 12:00 UTC):

# Prometheus alert rules
groups:
  - name: database
    rules:
      - alert: DatabaseConnectionPoolHigh
        expr: db_connection_pool_active / db_connection_pool_max > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Connection pool usage >80%"

      - alert: DatabaseConnectionPoolCritical
        expr: db_connection_pool_active / db_connection_pool_max > 0.95
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Connection pool nearly exhausted"

2. Grafana dashboard panel:

{
  "title": "Database Connection Pool",
  "targets": [{
    "expr": "db_connection_pool_active",
    "legendFormat": "Active Connections"
  }],
  "thresholds": [
    {"value": 80, "color": "yellow"},
    {"value": 95, "color": "red"}
  ]
}

Process Improvements

1. Code review checklist updated:

  • All database connections have finally blocks or context managers
  • All file handles properly closed
  • All HTTP connections released
  • Resource cleanup verified in error paths

2. Static analysis rule:

# Added to pylint configuration
# Warns if database connection not in try/finally or context manager
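
To illustrate what such a rule checks for, here is a rough standalone approximation using Python's ast module (the actual rule is assumed to be a custom pylint plugin; this script only flags get_connection() calls outside a with block or try/finally):

import ast
import sys

def find_unguarded_connections(source, filename="<stdin>"):
    """Flag get_connection() calls not protected by a with block or try/finally."""
    findings = []

    def visit(node, guarded):
        # A with statement or a try with a finally clause guards its children
        if isinstance(node, ast.With) or (isinstance(node, ast.Try) and node.finalbody):
            guarded = True
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Attribute)
                and node.func.attr == "get_connection"
                and not guarded):
            findings.append((filename, node.lineno))
        for child in ast.iter_child_nodes(node):
            visit(child, guarded)

    visit(ast.parse(source, filename), guarded=False)
    return findings

if __name__ == "__main__":
    for path in sys.argv[1:]:
        with open(path) as f:
            for fname, lineno in find_unguarded_connections(f.read(), path):
                print(f"{fname}:{lineno}: get_connection() outside with/try-finally")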

3. Integration test enhancement:

def test_sync_job_handles_validation_errors():
    """Ensure connections released even on validation errors"""
    initial_pool_size = connection_pool.available_connections()

    # Trigger validation error
    sync_user_data(user_id_with_bad_data)

    final_pool_size = connection_pool.available_connections()

    # Connection should be returned to pool
    assert initial_pool_size == final_pool_size

Lessons Learned

What Went Well ✓

  1. Fast detection - Alert fired within seconds of issue
  2. Quick escalation - Senior engineer joined within 3 minutes
  3. Effective diagnosis - Found root cause in 8 minutes using metrics
  4. Decisive action - Made restart decision quickly
  5. Good communication - Incident channel kept stakeholders informed
  6. Monitoring tools - Prometheus metrics crucial for diagnosis

What Went Wrong ✗

  1. Resource leak - Basic programming error (missing finally block)
  2. Code review miss - Reviewers didn’t catch resource cleanup issue
  3. No preventive monitoring - Connection pool usage had no alerting, so exhaustion went unnoticed until the API failed
  4. Gradual failure - Silent degradation over 48 hours hard to detect
  5. Test gap - Integration tests only covered happy path
  6. Documentation - Connection pool limits not documented

Surprises 😮

  1. How slowly it manifested - Took 48 hours to exhaust 100 connections
  2. Complete failure - No graceful degradation; the API went from fully working to fully down the moment the last connection was consumed
  3. Database was fine - Initially suspected database issue, but DB was healthy
  4. Restart fixed it - Simple restart cleared the problem immediately

Action Items

Completed ✅

| Action | Owner | Completed |
|--------|-------|-----------|
| Fix connection leak in sync job | Dev Team | 2025-10-14 |
| Add connection pool monitoring | SRE Team | 2025-10-14 |
| Deploy Prometheus alerts | SRE Team | 2025-10-14 |
| Update code review checklist | Tech Lead | 2025-10-14 |
| Add integration tests for error paths | QA Team | 2025-10-15 |

In Progress 🔄

| Action | Owner | Target Date |
|--------|-------|-------------|
| Static analysis for resource leaks | Platform Team | 2025-10-20 |
| Document connection pool architecture | Tech Writers | 2025-10-25 |
| Review all background jobs for similar issues | Dev Team | 2025-10-30 |

Planned ⏳

| Action | Owner | Target Date |
|--------|-------|-------------|
| Increase connection pool size (100 → 200) | DBA Team | 2025-11-01 |
| Implement connection pool per service | Platform Team | 2025-11-15 |
| Add automated connection leak testing | QA Team | 2025-11-30 |

Technical Deep Dive

Connection Pool Mechanics

Normal operation:

Pool Size: 100
Active: 12
Available: 88
Wait Queue: 0

During incident:

Pool Size: 100
Active: 100 (all leaked from background job)
Available: 0
Wait Queue: 247 (API requests waiting)

After restart:

Pool Size: 100
Active: 8
Available: 92
Wait Queue: 0

Why Connection Pools Matter

# Without pool: Each request creates new connection (expensive)
def handle_request():
    conn = create_database_connection()  # ~50ms
    query_data(conn)
    conn.close()

# With pool: Reuse existing connections (fast)
def handle_request():
    conn = pool.get_connection()  # ~1ms (if available)
    query_data(conn)
    pool.release_connection(conn)

Connection Lifecycle

1. Application starts → Pool creates N connections
2. Request arrives → get_connection() (if available)
3. Execute query → use connection
4. Request completes → release_connection()
5. Connection returned to pool → available for reuse

If release_connection() never called → Connection lost forever
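
A toy illustration of why a single leaked checkout is permanent for a fixed-size pool (ToyPool is purely illustrative, not the real pool implementation):

import queue

class ToyPool:
    """Toy stand-in for the real pool, just to show the arithmetic of a leak."""
    def __init__(self, size):
        self._free = queue.Queue()
        for i in range(size):
            self._free.put(f"conn-{i}")

    def get_connection(self, timeout=1):
        return self._free.get(timeout=timeout)  # raises queue.Empty when exhausted

    def release_connection(self, conn):
        self._free.put(conn)

pool = ToyPool(size=100)
for _ in range(100):
    pool.get_connection()  # checked out but never released: the leak

try:
    pool.get_connection()
except queue.Empty:
    print("Pool exhausted - this is what every API request hit at 03:15 UTC")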

Related Past Incidents

  • 2024-09-12: Similar issue with Redis connection pool (different team)
  • 2024-06-03: HTTP client connection leak in payment service
  • 2024-03-15: File handle leak causing “too many open files” error

Appendix

Useful Commands

Check connection pool status:

# Via application metrics endpoint
curl http://localhost:8080/metrics | grep db_connection

# Direct PostgreSQL query
psql -h db.example.com -U admin -c \
  "SELECT count(*) FROM pg_stat_activity WHERE datname='production';"

Find long-running connections:

SELECT
  pid,
  now() - pg_stat_activity.query_start AS duration,
  query,
  state
FROM pg_stat_activity
WHERE state != 'idle'
  AND now() - pg_stat_activity.query_start > interval '5 minutes'
ORDER BY duration DESC;

Kill stuck connection:

SELECT pg_terminate_backend(12345);  -- Replace with actual PID


Incident Commander: Alice Chen
Contributors: Bob Martinez, Carol Davis (DBA), Dan Wilson (QA)
Postmortem Completed: 2025-10-15
Next Review: 2025-11-15 (1 month follow-up)