Incident Summary

Date: 2025-09-05
Time: 09:45 UTC
Duration: 1 hour 32 minutes
Severity: SEV-1 (Critical)
Impact: Severe performance degradation affecting 85% of users

Quick Facts

  • Users Affected: ~8,500 active users (85%)
  • Services Affected: Web Application, Mobile API, Admin Dashboard
  • Response Time: P95 latency increased from 200ms to 45 seconds
  • Revenue Impact: ~$18,000 in lost sales and abandoned carts
  • SLO Impact: 70% of monthly error budget consumed

Timeline

  • 09:45:00 - Redis cluster health check alert: Node down
  • 09:45:15 - Application latency spiked dramatically
  • 09:45:30 - PagerDuty alert: P95 latency > 10 seconds
  • 09:46:00 - On-call engineer (Sarah) acknowledged alert
  • 09:47:00 - Database CPU spiked to 95% utilization
  • 09:48:00 - Database connection pool approaching limits (180/200)
  • 09:49:00 - User complaints started flooding support channels
  • 09:50:00 - Senior SRE (Marcus) joined incident response
  • 09:52:00 - Checked Redis status: Master node unresponsive
  • 09:54:00 - Identified: Redis master failure, failover not working
  • 09:56:00 - Incident escalated to SEV-1, incident commander assigned
  • 09:58:00 - Attempted automatic failover: Failed
  • 10:00:00 - Decision: Manual promotion of Redis replica to master
  • 10:03:00 - Promoted replica-1 to master manually
  • 10:05:00 - Updated application config to point to new master
  • 10:08:00 - Rolling restart of application pods initiated
  • 10:15:00 - 50% of pods restarted with new Redis endpoint
  • 10:18:00 - Cache warming started for critical keys
  • 10:22:00 - Database load starting to decrease (CPU: 65%)
  • 10:25:00 - P95 latency improved to 3 seconds
  • 10:30:00 - All pods restarted, cache rebuild in progress
  • 10:40:00 - P95 latency down to 800ms
  • 10:50:00 - Cache fully populated, metrics returning to normal
  • 11:05:00 - P95 latency at 220ms (near baseline)
  • 11:17:00 - Incident marked as resolved
  • 11:30:00 - Post-incident monitoring confirmed stability

Root Cause Analysis

What Happened

The production Redis cluster consisted of 1 master and 2 replicas, monitored by 3 Redis Sentinel processes for high availability. On September 5th at 09:45 UTC, the Redis master node experienced a kernel panic due to an underlying infrastructure issue.

The failure cascade:

  1. 09:45:00 - Redis master node crashed (kernel panic)
  2. 09:45:05 - Redis Sentinel detected master failure
  3. 09:45:10 - Sentinel attempted automatic failover
  4. 09:45:15 - Failover failed due to misconfigured quorum
  5. 09:45:15 - Applications lost Redis connection
  6. 09:45:20 - All cache misses → direct database queries
  7. 09:45:30 - Database overwhelmed with 50x normal query load

Redis Sentinel Misconfiguration

The problem:

# Redis Sentinel configuration (INCORRECT)
sentinel monitor mymaster redis-master 6379 3
sentinel down-after-milliseconds mymaster 5000
sentinel parallel-syncs mymaster 1
sentinel failover-timeout mymaster 10000

# Number of Sentinel nodes: 3
# Quorum configured: 3
# Sentinels needed for failover: 3
# Problem: If any Sentinel node has issues, failover impossible!

Why failover failed:

Sentinel Cluster State at 09:45:
├─ sentinel-1: ✓ Detected master down, voted for failover
├─ sentinel-2: ✓ Detected master down, voted for failover
└─ sentinel-3: ✗ Network partition, could not vote

Quorum required: 3/3
Quorum achieved: 2/3
Result: FAILOVER BLOCKED
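
Sentinel needs the configured quorum just to declare the master objectively down (ODOWN), and a majority of all Sentinels to elect a failover leader. A toy model of that decision (a simplification, not Sentinel's actual implementation) shows why quorum = 3 was fatal here:

def can_fail_over(votes_down, quorum, reachable, total_sentinels):
    # ODOWN requires `quorum` Sentinels agreeing the master is down;
    # electing a failover leader also needs a majority of ALL Sentinels
    return votes_down >= quorum and reachable > total_sentinels // 2

# Our incident: quorum misconfigured to 3, sentinel-3 partitioned
print(can_fail_over(votes_down=2, quorum=3, reachable=2, total_sentinels=3))  # False
# With quorum = 2, the same outage would have failed over
print(can_fail_over(votes_down=2, quorum=2, reachable=2, total_sentinels=3))  # True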

Cascade Effect

Normal operation (with cache):

User Request → App Server → Redis Cache (hit) → Return in 5ms
Database queries: ~100/sec

During incident (without cache):

User Request → App Server → Redis (failed) → Database → Return in 8000ms
Database queries: ~5000/sec (50x increase)

Why It Got So Bad

  1. No cache fallback - Application had no graceful degradation
  2. Cache-aside pattern - Every miss resulted in a database query (see the sketch after this list)
  3. No circuit breaker - Kept hammering failed Redis
  4. Thundering herd - All requests simultaneously hitting database
  5. No rate limiting - Database overwhelmed instantly
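
What the vulnerable read path looked like, as a simplified sketch of items 1-3 (not the actual application code; db.fetch_product is a hypothetical helper):

import redis

redis_client = redis.Redis(host='redis-master', port=6379)

def get_product(product_id):
    key = f"product:{product_id}"
    try:
        cached = redis_client.get(key)  # every request retries Redis...
        if cached:
            return cached
    except redis.ConnectionError:
        pass  # ...no breaker, no backoff, no local fallback
    # Miss OR Redis failure: fall straight through to the database.
    # With Redis down, every in-flight request lands here at once.
    return db.fetch_product(product_id)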

Immediate Fix

Step 1: Identify Failed Master

# Check Redis Sentinel status
redis-cli -p 26379 sentinel masters

# Output showed master as "s_down" (subjectively down)
# But no failover had occurred

# Check Sentinel logs
kubectl logs redis-sentinel-0 -n cache | tail -50

# Output:
# "Could not reach quorum for failover"
# "3/3 Sentinels required, only 2/3 available"

Step 2: Manual Failover

# Manually promote replica-1 to master
redis-cli -p 26379 SENTINEL FAILOVER mymaster

# Verify promotion
redis-cli -h redis-replica-1 -p 6379 INFO replication

# Output:
# role:master
# connected_slaves:1

Result: New master elected, but applications were still pointing to the old endpoint

Step 3: Update Application Configuration

# Option 1: Update ConfigMap with new Redis endpoint
kubectl patch configmap redis-config -n production \
  -p '{"data":{"REDIS_HOST":"redis-replica-1"}}'

# Option 2: Use Redis Sentinel in application
# (Better long-term solution, requires code change)

# Rolling restart to pick up new config
kubectl rollout restart deployment/web-app -n production
kubectl rollout restart deployment/mobile-api -n production

Step 4: Cache Warming

# Emergency cache warming script
import hashlib
import json

import psycopg2
import redis

redis_client = redis.Redis(host='redis-replica-1', port=6379)
db_conn = psycopg2.connect("dbname=production")

# Warm critical keys
critical_queries = [
    "SELECT * FROM products WHERE featured = true",
    "SELECT * FROM categories WHERE active = true",
    "SELECT * FROM config WHERE key = 'site_settings'",
]

for query in critical_queries:
    cursor = db_conn.cursor()
    cursor.execute(query)
    results = cursor.fetchall()

    # Stable cache key: built-in hash() is randomized per process,
    # so derive the key from a digest of the query text instead
    key = f"cache:{hashlib.sha1(query.encode()).hexdigest()}"

    # Cache for 1 hour (default=str handles dates/decimals in rows)
    redis_client.setex(key, 3600, json.dumps(results, default=str))
    print(f"Warmed cache key: {key}")

Result: Critical data cached, reducing database load

Long-term Prevention

1. Fix Sentinel Configuration

Corrected configuration:

# Redis Sentinel configuration (CORRECT)
sentinel monitor mymaster redis-master 6379 2  # Changed from 3 to 2
sentinel down-after-milliseconds mymaster 5000
sentinel parallel-syncs mymaster 1
sentinel failover-timeout mymaster 10000

# Why quorum = 2 works better:
# - 3 Sentinels deployed
# - Quorum: 2/3 (majority)
# - If 1 Sentinel fails, failover still works
# - If 2 Sentinels fail, cluster has bigger problems

Applied to all Sentinel instances:

# Update Sentinel ConfigMap
kubectl create configmap redis-sentinel-config \
  --from-file=sentinel.conf \
  --dry-run=client -o yaml | kubectl apply -f -

# Restart Sentinels to apply new config
kubectl rollout restart statefulset/redis-sentinel -n cache

2. Application-Level Improvements

Implement circuit breaker pattern:

import logging

import redis
from pybreaker import CircuitBreaker
from redis.sentinel import Sentinel

logger = logging.getLogger(__name__)

# Circuit breaker for Redis
redis_breaker = CircuitBreaker(
    fail_max=5,        # Open the circuit after 5 consecutive failures
    reset_timeout=30,  # Attempt to close again after 30 seconds
    name='redis_cache'
)

class CacheService:
    def __init__(self):
        # Discover the current master via Sentinel rather than
        # hard-coding a node address
        sentinel = Sentinel([('redis-sentinel', 26379)], socket_timeout=0.5)
        self.redis_client = sentinel.master_for('mymaster', socket_timeout=0.5)

    @redis_breaker
    def get(self, key):
        """Get from cache with circuit breaker"""
        try:
            return self.redis_client.get(key)
        except redis.ConnectionError as e:
            logger.warning(f"Redis connection failed: {e}")
            raise

    def get_with_fallback(self, key, fallback_fn):
        """Get from cache with automatic fallback"""
        try:
            # Try cache first
            cached = self.get(key)
            if cached:
                return cached
        except Exception as e:
            logger.error(f"Cache error: {e}, using fallback")

        # Cache miss or error - use fallback
        return fallback_fn()

# Usage
cache = CacheService()

def get_user(user_id):
    def fetch_from_db():
        return db.query("SELECT * FROM users WHERE id = %s", user_id)

    return cache.get_with_fallback(f"user:{user_id}", fetch_from_db)

Use Redis Sentinel in application:

import redis
from redis.sentinel import Sentinel

# Connect via Sentinel (automatic failover)
sentinel = Sentinel([
    ('sentinel-0', 26379),
    ('sentinel-1', 26379),
    ('sentinel-2', 26379)
], socket_timeout=0.5)

# Get master connection (auto-updates on failover)
master = sentinel.master_for(
    'mymaster',
    socket_timeout=0.5,
    db=0
)

# Get replica connection (for read-only operations)
replica = sentinel.slave_for(
    'mymaster',
    socket_timeout=0.5,
    db=0
)

# Use in application
def get_cached_data(key):
    try:
        # Try master first
        return master.get(key)
    except redis.ConnectionError:
        # Fallback to replica for reads
        return replica.get(key)

3. Monitoring and Alerting

Redis health monitoring:

# Prometheus alert rules
groups:
  - name: redis_alerts
    rules:
      - alert: RedisMasterDown
        expr: redis_up{role="master"} == 0
        for: 30s
        labels:
          severity: critical
        annotations:
          summary: "Redis master is down"
          description: "Redis master {{ $labels.instance }} has been down for 30s"

      - alert: RedisSentinelQuorumLost
        expr: redis_sentinel_masters_sentinels < 2
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Redis Sentinel quorum lost"
          description: "Only {{ $value }} Sentinels available for master monitoring"

      - alert: RedisReplicationBroken
        expr: redis_connected_slaves == 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Redis replication broken"
          description: "Master has no connected replicas for 5 minutes"

      - alert: RedisHighMemoryUsage
        expr: redis_memory_used_bytes / redis_memory_max_bytes > 0.9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Redis memory usage >90%"

      - alert: CacheHitRateLow
        expr: |
          rate(redis_keyspace_hits_total[5m]) /
          (rate(redis_keyspace_hits_total[5m]) + rate(redis_keyspace_misses_total[5m]))
          < 0.8
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Cache hit rate below 80%"
          description: "Current hit rate: {{ $value | humanizePercentage }}"

Database load monitoring:

- alert: DatabaseOverloaded
  expr: pg_stat_activity_count > 150
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "Database connection count very high"
    description: "{{ $value }} active connections (normal: ~50)"

4. Infrastructure Improvements

Deploy Redis Cluster instead of Sentinel:

# Redis Cluster provides better HA and automatic sharding
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: redis-cluster
spec:
  serviceName: redis-cluster
  replicas: 6  # 3 masters + 3 replicas
  template:
    spec:
      containers:
      - name: redis
        image: redis:7-alpine
        command:
          - redis-server
          - /conf/redis.conf
          - --cluster-enabled yes
          - --cluster-config-file /data/nodes.conf
          - --cluster-node-timeout 5000
          - --appendonly yes
        ports:
        - containerPort: 6379
          name: client
        - containerPort: 16379
          name: gossip
        volumeMounts:
        - name: conf
          mountPath: /conf
        - name: data
          mountPath: /data

Benefits of Redis Cluster:

  • Automatic failover (no Sentinel needed)
  • Horizontal scaling via sharding
  • Better split-brain protection
  • No single point of failure
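
On the client side, cluster topology is discovered automatically. A minimal sketch using redis-py's cluster client (assuming redis-py >= 4.x and a redis-cluster Service that resolves to any cluster node):

from redis.cluster import RedisCluster

# Any reachable node bootstraps the client, which then learns the
# full slot-to-node map and routes each command to the right shard
rc = RedisCluster(host='redis-cluster', port=6379, decode_responses=True)

rc.set('user:123', 'cached-profile')
print(rc.get('user:123'))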

5. Caching Strategy Improvements

Multi-level caching:

import logging

import redis

logger = logging.getLogger(__name__)

class MultiLevelCache:
    def __init__(self):
        self.redis = redis.Redis()
        self.local_cache = {}
        self.max_local_cache_size = 1000

    def get(self, key):
        # Level 1: In-memory cache (fastest)
        if key in self.local_cache:
            return self.local_cache[key]

        # Level 2: Redis cache
        try:
            value = self.redis.get(key)
            if value:
                # Populate local cache
                if len(self.local_cache) < self.max_local_cache_size:
                    self.local_cache[key] = value
                return value
        except redis.ConnectionError:
            logger.warning("Redis unavailable, using local cache only")

        # Level 3: Database (slowest)
        return None

    def set(self, key, value, ttl=3600):
        # Set in both caches
        self.local_cache[key] = value

        try:
            self.redis.setex(key, ttl, value)
        except redis.ConnectionError:
            logger.warning("Redis unavailable, only local cache updated")

Lessons Learned

What Went Well ✓

  1. Fast detection - Alert fired within 15 seconds of failure
  2. Good escalation - SEV-1 declared appropriately
  3. Manual failover worked - Successfully promoted replica
  4. Cache warming - Proactive cache population reduced recovery time
  5. Team coordination - Clear communication during incident
  6. Database held up - PostgreSQL handled spike without crashing

What Went Wrong ✗

  1. Sentinel misconfigured - Quorum set too high for 3-node cluster
  2. No automatic failover - HA system failed when needed most
  3. No circuit breaker - Application kept hammering failed Redis
  4. Thundering herd - All requests hit database simultaneously
  5. Hard Redis dependency - No graceful degradation
  6. Cache warming not automated - Manual process during incident
  7. Monitoring gap - No alert for Sentinel quorum issues

Surprises 😮

  1. How fast it cascaded - Redis failure → database overload in 15 seconds
  2. Database resilience - PostgreSQL handled 50x load without crashing
  3. Sentinel failure - HA solution became single point of failure
  4. User impact - 45-second response times = 60% cart abandonment
  5. Cache hit rate matters - 95% hit rate kept database load manageable normally

Action Items

Completed ✅

  Action                               Owner      Completed
  Manual Redis failover                SRE Team   2025-09-05
  Fix Sentinel quorum configuration    SRE Team   2025-09-05
  Add Redis circuit breaker            Dev Team   2025-09-06
  Implement cache warming automation   Dev Team   2025-09-06
  Add Sentinel quorum monitoring       SRE Team   2025-09-06

In Progress 🔄

  Action                                      Owner           Target Date
  Migrate to Redis Sentinel client library    Dev Team        2025-09-15
  Implement multi-level caching               Dev Team        2025-09-20
  Deploy Redis Cluster (replace Sentinel)     Platform Team   2025-10-01

Planned ⏳

  Action                                      Owner         Target Date
  Add database query rate limiting            Dev Team      2025-09-30
  Implement cache preloading on deployment    DevOps Team   2025-10-15
  Chaos testing: random Redis failures        SRE Team      2025-11-01

Technical Deep Dive

Redis Sentinel vs Redis Cluster

Redis Sentinel (what we had):

Architecture:
├─ 1 Master (read/write)
├─ 2 Replicas (read-only)
└─ 3 Sentinels (monitoring)

Pros:
- Simple setup
- Good for small deployments
- Automatic failover (when configured correctly!)

Cons:
- Single point of write contention
- Sentinels add complexity
- Quorum configuration tricky

Redis Cluster (migrating to):

Architecture:
├─ 3 Masters (sharded data)
├─ 3 Replicas (1 per master)
└─ Built-in cluster management (no Sentinel)

Pros:
- Automatic sharding
- Better scalability
- No separate Sentinel nodes
- Automatic failover

Cons:
- More complex client library
- Multi-key operations limited (hash-tag workaround sketched below)
- Requires at least 6 nodes
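
The multi-key limitation has a standard workaround: keys sharing a {hash tag} are guaranteed to map to the same slot, so multi-key commands on them still work. A brief sketch:

from redis.cluster import RedisCluster

rc = RedisCluster(host='redis-cluster', port=6379, decode_responses=True)

# Both keys hash on the "user:123" tag only, so they land in the
# same slot and cross-key commands like MGET succeed
rc.mset({'{user:123}:profile': '...', '{user:123}:settings': '...'})
print(rc.mget('{user:123}:profile', '{user:123}:settings'))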

Cache Invalidation Strategies

Time-based expiration (TTL):

# Simple but can serve stale data
redis_client.setex("user:123", 3600, user_data)  # 1 hour TTL

Event-driven invalidation:

# Invalidate on database write
def update_user(user_id, new_data):
    db.update_user(user_id, new_data)
    redis.delete(f"user:{user_id}")  # Invalidate cache

Cache stampede prevention:

import random

def set_with_jitter(key, value, ttl=3600):
    """Add random jitter to the TTL so keys don't all expire (and refill) at once"""
    jitter = random.randint(0, int(ttl * 0.1))  # up to +10%
    redis_client.setex(key, ttl + jitter, value)

Calculating Cache Hit Rate

Hit Rate = Hits / (Hits + Misses)

Our metrics:
- Normal: 95% hit rate (19 hits per 1 miss)
- During incident: 0% hit rate (all misses)

Impact:
- Normal: 100 req/sec × 5% miss = 5 DB queries/sec
- Incident: 100 req/sec × 100% miss = 100 DB queries/sec (20x)

With 50 app servers:
- Normal: 250 DB queries/sec
- Incident: 5,000 DB queries/sec (20x increase)
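
The same arithmetic as a quick sanity check, using the fleet-wide numbers above (50 app servers × 100 req/sec):

def db_qps(requests_per_sec, hit_rate):
    # Cache-aside: only misses reach the database
    return requests_per_sec * (1 - hit_rate)

print(db_qps(5000, 0.95))  # normal: 250 DB queries/sec
print(db_qps(5000, 0.00))  # incident: 5,000 DB queries/sec (20x)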

Appendix

Useful Commands

Check Redis Sentinel status:

# Connect to Sentinel
redis-cli -h sentinel-0 -p 26379

# Check master status
SENTINEL masters

# Check replicas
SENTINEL replicas mymaster

# Check Sentinels
SENTINEL sentinels mymaster

# Manual failover
SENTINEL FAILOVER mymaster

Check Redis replication:

# On master
redis-cli INFO replication

# Output:
# role:master
# connected_slaves:2
# slave0:ip=10.0.1.5,port=6379,state=online
# slave1:ip=10.0.1.6,port=6379,state=online

Monitor Redis in real-time:

# Watch all commands
redis-cli MONITOR

# Check slow queries
redis-cli SLOWLOG GET 10

# Memory usage
redis-cli INFO memory

Test cache performance:

# Benchmark
redis-benchmark -h redis-master -p 6379 -c 50 -n 100000

# Results show:
# SET: ~50,000 requests/sec
# GET: ~80,000 requests/sec

Incident Commander: Marcus Johnson
Contributors: Sarah Williams (On-call), Kevin Park (DBA), Lisa Zhang (Dev Lead)
Postmortem Completed: 2025-09-06
Next Review: 2025-10-06 (1 month follow-up)