Incident Summary

Date: 2025-10-14
Time: 03:15 UTC
Duration: 23 minutes
Severity: SEV-1 (Critical)
Impact: Complete API unavailability affecting 100% of users

Quick Facts

  • Users Affected: ~2,000 active users
  • Services Affected: API, Admin Dashboard, Mobile App
  • Revenue Impact: ~$4,500 in lost transactions
  • SLO Impact: Consumed 45% of monthly error budget

Timeline

  • 03:15:00 - PagerDuty alert fired: API health check failures
  • 03:15:30 - On-call engineer (Alice) acknowledged alert
  • 03:16:00 - Initial investigation: All API pods showing healthy status
  • 03:17:00 - Checked application logs: “connection timeout” errors appearing
  • 03:18:00 - Senior engineer (Bob) joined incident response
  • 03:19:00 - Identified pattern: All database connection attempts timing out
  • 03:20:00 - Checked database status: PostgreSQL running normally
  • 03:22:00 - Checked connection pool metrics: 100/100 connections in use
  • 03:23:00 - Root cause identified: Background job leaking connections
  • 03:25:00 - Decision made to restart API pods to release connections
  • 03:27:00 - Rolling restart initiated for API deployment
  • 03:30:00 - First pods restarted, connection pool draining
  • 03:33:00 - 50% of pods restarted, API partially operational
  • 03:35:00 - All pods restarted, connection pool normalized
  • 03:36:00 - Smoke tests passed, API fully operational
  • 03:38:00 - Incident marked as resolved
  • 03:45:00 - Post-incident monitoring confirmed stability

Root Cause Analysis

What Happened

The API service uses a PostgreSQL connection pool configured with a maximum of 100 connections. A background job for data synchronization was deployed on October 12th, two days before the incident.
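
For context, here is a minimal sketch of how a pool of that size might be configured. This is illustrative only: the service's real wrapper (which exposes get_connection()/release_connection()) is assumed, and the example uses psycopg2's ThreadedConnectionPool with placeholder credentials.

# Illustrative sketch only; the actual pool wrapper and settings are assumed
from psycopg2.pool import ThreadedConnectionPool

connection_pool = ThreadedConnectionPool(
    minconn=10,             # connections opened at startup
    maxconn=100,            # hard ceiling; the limit exhausted in this incident
    host="db.example.com",  # host taken from the appendix examples
    dbname="production",
    user="api_service",     # placeholder service account
    password="***",         # in practice, injected from a secret store
)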

The background job’s error handling code had a critical flaw:

# Problematic code
def sync_user_data(user_id):
    try:
        conn = connection_pool.get_connection()
        cursor = conn.cursor()
        cursor.execute("SELECT * FROM users WHERE id = %s", (user_id,))
        # Process data...
    except DataValidationError as e:
        logger.error(f"Validation failed: {e}")
        return  # Connection never released!

The flaw: When a DataValidationError occurred (malformed data), the exception was caught but the database connection was never returned to the pool.

Chain of Events

  1. Day 1-2: Background job ran successfully for valid data
  2. Day 2 evening: Corrupted user data introduced to database
  3. Day 2-3: Background job encountered validation errors
  4. Over 48 hours: Leaked connections accumulated (1-2 per hour)
  5. 03:15 UTC: Last available connection consumed
  6. 03:15-03:38: All new API requests timing out waiting for connections

Contributing Factors

  1. Missing finally block - Nothing guaranteed the connection would be released
  2. No connection pool monitoring - No alerts fired before the pool was exhausted
  3. Code review miss - Resource cleanup was not verified in review
  4. Gradual degradation - Took 48 hours to manifest, making it hard to correlate with the deploy
  5. Test gap - Integration tests didn’t cover error paths with real connections

Immediate Fix

Actions Taken

1. Emergency mitigation (03:25 UTC)

# Rolling restart to release connections
kubectl rollout restart deployment/api-service -n production

# Monitor restart progress
kubectl rollout status deployment/api-service -n production

2. Verify recovery (03:36 UTC)

# Check connection pool metrics
kubectl exec -it api-service-pod-xyz -- \
  curl localhost:8080/metrics | grep db_connections_active

# Result: 12/100 connections (healthy)

3. Stop background job (03:40 UTC)

# Prevent recurrence until fixed
kubectl scale deployment/sync-job -n production --replicas=0

Why This Worked

  • Restarting pods killed all connections (including leaked ones)
  • Fresh pods started with empty connection pools
  • Normal API traffic uses connections correctly
  • Stopping background job prevented new leaks

Long-term Prevention

Code Fix

Fixed background job (deployed 2025-10-14 10:00 UTC):

def sync_user_data(user_id):
    conn = None
    try:
        conn = connection_pool.get_connection()
        cursor = conn.cursor()
        cursor.execute("SELECT * FROM users WHERE id = %s", (user_id,))
        # Process data...
    except DataValidationError as e:
        logger.error(f"Validation failed for user {user_id}: {e}")
        return
    finally:
        # ALWAYS release connection
        if conn:
            connection_pool.release_connection(conn)

Better: Use context manager (Python best practice):

def sync_user_data(user_id):
    try:
        with connection_pool.get_connection() as conn:
            cursor = conn.cursor()
            cursor.execute("SELECT * FROM users WHERE id = %s", (user_id,))
            # Process data...
            # Connection auto-released when exiting 'with' block
    except DataValidationError as e:
        logger.error(f"Validation failed for user {user_id}: {e}")
        return
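
The with form above assumes the pool hands back an object that releases itself on exit. If the in-house pool does not already provide that, a thin helper along these lines would (a sketch built on the pool API shown in this document; pooled_connection is a hypothetical name):

from contextlib import contextmanager

@contextmanager
def pooled_connection():
    """Check a connection out of the pool and guarantee it is returned."""
    conn = connection_pool.get_connection()
    try:
        yield conn
    finally:
        connection_pool.release_connection(conn)  # runs on success and on error

Note that with some drivers (psycopg2, for example), using a raw connection as a context manager only manages the transaction and does not return the connection to the pool, which is why the release logic belongs in the pool wrapper rather than on the connection object.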

Monitoring Improvements

1. Connection pool alerts (deployed 2025-10-14 12:00 UTC):

# Prometheus alert rules
groups:
  - name: database
    rules:
      - alert: DatabaseConnectionPoolHigh
        expr: db_connection_pool_active / db_connection_pool_max > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Connection pool usage >80%"

      - alert: DatabaseConnectionPoolCritical
        expr: db_connection_pool_active / db_connection_pool_max > 0.95
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Connection pool nearly exhausted"

2. Grafana dashboard panel:

{
  "title": "Database Connection Pool",
  "targets": [{
    "expr": "db_connection_pool_active",
    "legendFormat": "Active Connections"
  }],
  "thresholds": [
    {"value": 80, "color": "yellow"},
    {"value": 95, "color": "red"}
  ]
}

Process Improvements

1. Code review checklist updated:

  • All database connections have finally blocks or context managers
  • All file handles properly closed
  • All HTTP connections released
  • Resource cleanup verified in error paths

2. Static analysis rule:

# Added to pylint configuration
# Warns if database connection not in try/finally or context manager
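
To illustrate what such a rule checks for, here is a rough standalone approximation using Python's ast module (the actual rule is assumed to be a custom pylint plugin; this script only flags get_connection() calls outside a with block or try/finally):

import ast
import sys

def find_unguarded_connections(source, filename="<stdin>"):
    """Flag get_connection() calls not protected by a with block or try/finally."""
    findings = []

    def visit(node, guarded):
        # A with statement or a try with a finally clause guards its children
        if isinstance(node, ast.With) or (isinstance(node, ast.Try) and node.finalbody):
            guarded = True
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Attribute)
                and node.func.attr == "get_connection"
                and not guarded):
            findings.append((filename, node.lineno))
        for child in ast.iter_child_nodes(node):
            visit(child, guarded)

    visit(ast.parse(source, filename), guarded=False)
    return findings

if __name__ == "__main__":
    for path in sys.argv[1:]:
        with open(path) as f:
            for fname, lineno in find_unguarded_connections(f.read(), path):
                print(f"{fname}:{lineno}: get_connection() outside with/try-finally")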

3. Integration test enhancement:

def test_sync_job_handles_validation_errors():
    """Ensure connections released even on validation errors"""
    initial_pool_size = connection_pool.available_connections()

    # Trigger validation error
    sync_user_data(user_id_with_bad_data)

    final_pool_size = connection_pool.available_connections()

    # Connection should be returned to pool
    assert initial_pool_size == final_pool_size

Lessons Learned

What Went Well ✓

  1. Fast detection - Alert fired within seconds of issue
  2. Quick escalation - Senior engineer joined within 3 minutes
  3. Effective diagnosis - Found root cause in 8 minutes using metrics
  4. Decisive action - Made restart decision quickly
  5. Good communication - Incident channel kept stakeholders informed
  6. Monitoring tools - Prometheus metrics crucial for diagnosis

What Went Wrong ✗

  1. Resource leak - Basic programming error (missing finally block)
  2. Code review miss - Reviewers didn’t catch resource cleanup issue
  3. No preventive monitoring - Connection pool usage had no alerting, so exhaustion went unnoticed until the API failed
  4. Gradual failure - Silent degradation over 48 hours hard to detect
  5. Test gap - Integration tests only covered happy path
  6. Documentation - Connection pool limits not documented

Surprises 😮

  1. How slowly it manifested - Took 48 hours to exhaust 100 connections
  2. Complete failure - No graceful degradation; the API went from fully working to fully down the moment the last connection was consumed
  3. Database was fine - Initially suspected database issue, but DB was healthy
  4. Restart fixed it - Simple restart cleared the problem immediately

Action Items

Completed ✅

| Action | Owner | Completed |
|--------|-------|-----------|
| Fix connection leak in sync job | Dev Team | 2025-10-14 |
| Add connection pool monitoring | SRE Team | 2025-10-14 |
| Deploy Prometheus alerts | SRE Team | 2025-10-14 |
| Update code review checklist | Tech Lead | 2025-10-14 |
| Add integration tests for error paths | QA Team | 2025-10-15 |

In Progress 🔄

| Action | Owner | Target Date |
|--------|-------|-------------|
| Static analysis for resource leaks | Platform Team | 2025-10-20 |
| Document connection pool architecture | Tech Writers | 2025-10-25 |
| Review all background jobs for similar issues | Dev Team | 2025-10-30 |

Planned ⏳

| Action | Owner | Target Date |
|--------|-------|-------------|
| Increase connection pool size (100 → 200) | DBA Team | 2025-11-01 |
| Implement connection pool per service | Platform Team | 2025-11-15 |
| Add automated connection leak testing | QA Team | 2025-11-30 |

Technical Deep Dive

Connection Pool Mechanics

Normal operation:

Pool Size: 100
Active: 12
Available: 88
Wait Queue: 0

During incident:

Pool Size: 100
Active: 100 (all leaked from background job)
Available: 0
Wait Queue: 247 (API requests waiting)

After restart:

Pool Size: 100
Active: 8
Available: 92
Wait Queue: 0

Why Connection Pools Matter

# Without pool: Each request creates new connection (expensive)
def handle_request():
    conn = create_database_connection()  # ~50ms
    query_data(conn)
    conn.close()

# With pool: Reuse existing connections (fast)
def handle_request():
    conn = pool.get_connection()  # ~1ms (if available)
    query_data(conn)
    pool.release_connection(conn)

Connection Lifecycle

1. Application starts → Pool creates N connections
2. Request arrives → get_connection() (if available)
3. Execute query → use connection
4. Request completes → release_connection()
5. Connection returned to pool → available for reuse

If release_connection() never called → Connection lost forever
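
A toy illustration of why a single leaked checkout is permanent for a fixed-size pool (ToyPool is purely illustrative, not the real pool implementation):

import queue

class ToyPool:
    """Toy stand-in for the real pool, just to show the arithmetic of a leak."""
    def __init__(self, size):
        self._free = queue.Queue()
        for i in range(size):
            self._free.put(f"conn-{i}")

    def get_connection(self, timeout=1):
        return self._free.get(timeout=timeout)  # raises queue.Empty when exhausted

    def release_connection(self, conn):
        self._free.put(conn)

pool = ToyPool(size=100)
for _ in range(100):
    pool.get_connection()  # checked out but never released: the leak

try:
    pool.get_connection()
except queue.Empty:
    print("Pool exhausted - this is what every API request hit at 03:15 UTC")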

Related Past Incidents

  • 2024-09-12: Similar issue with Redis connection pool (different team)
  • 2024-06-03: HTTP client connection leak in payment service
  • 2024-03-15: File handle leak causing “too many open files” error

Appendix

Useful Commands

Check connection pool status:

# Via application metrics endpoint
curl http://localhost:8080/metrics | grep db_connection

# Direct PostgreSQL query
psql -h db.example.com -U admin -c \
  "SELECT count(*) FROM pg_stat_activity WHERE datname='production';"

Find long-running connections:

SELECT
  pid,
  now() - pg_stat_activity.query_start AS duration,
  query,
  state
FROM pg_stat_activity
WHERE state != 'idle'
  AND now() - pg_stat_activity.query_start > interval '5 minutes'
ORDER BY duration DESC;

Kill stuck connection:

SELECT pg_terminate_backend(12345);  -- Replace with actual PID


Incident Commander: Alice Chen
Contributors: Bob Martinez, Carol Davis (DBA), Dan Wilson (QA)
Postmortem Completed: 2025-10-15
Next Review: 2025-11-15 (1 month follow-up)