Incident Summary
- Date: 2025-09-05
- Time: 09:45 UTC
- Duration: 1 hour 32 minutes
- Severity: SEV-1 (Critical)
- Impact: Severe performance degradation affecting 85% of users
Quick Facts
- Users Affected: ~8,500 active users (85%)
- Services Affected: Web Application, Mobile API, Admin Dashboard
- Response Time: P95 latency increased from 200ms to 45 seconds
- Revenue Impact: ~$18,000 in lost sales and abandoned carts
- SLO Impact: 70% of monthly error budget consumed
Timeline
- 09:45:00 - Redis cluster health check alert: Node down
- 09:45:15 - Application latency spiked dramatically
- 09:45:30 - PagerDuty alert: P95 latency > 10 seconds
- 09:46:00 - On-call engineer (Sarah) acknowledged alert
- 09:47:00 - Database CPU spiked to 95% utilization
- 09:48:00 - Database connection pool approaching limits (180/200)
- 09:49:00 - User complaints started flooding support channels
- 09:50:00 - Senior SRE (Marcus) joined incident response
- 09:52:00 - Checked Redis status: Master node unresponsive
- 09:54:00 - Identified: Redis master failure, failover not working
- 09:56:00 - Incident escalated to SEV-1, incident commander assigned
- 09:58:00 - Attempted automatic failover: Failed
- 10:00:00 - Decision: Manual promotion of Redis replica to master
- 10:03:00 - Promoted replica-1 to master manually
- 10:05:00 - Updated application config to point to new master
- 10:08:00 - Rolling restart of application pods initiated
- 10:15:00 - 50% of pods restarted with new Redis endpoint
- 10:18:00 - Cache warming started for critical keys
- 10:22:00 - Database load starting to decrease (CPU: 65%)
- 10:25:00 - P95 latency improved to 3 seconds
- 10:30:00 - All pods restarted, cache rebuild in progress
- 10:40:00 - P95 latency down to 800ms
- 10:50:00 - Cache fully populated, metrics returning to normal
- 11:05:00 - P95 latency at 220ms (near baseline)
- 11:17:00 - Incident marked as resolved
- 11:30:00 - Post-incident monitoring confirmed stability
Root Cause Analysis
What Happened
The production Redis cluster consisted of 1 master and 2 replicas running Redis Sentinel for high availability. On September 5th at 09:45 UTC, the Redis master node experienced a kernel panic due to an underlying infrastructure issue.
The failure cascade:
- 09:45:00 - Redis master node crashed (kernel panic)
- 09:45:05 - Redis Sentinel detected master failure
- 09:45:10 - Sentinel attempted automatic failover
- 09:45:15 - Failover failed due to misconfigured quorum
- 09:45:15 - Applications lost Redis connection
- 09:45:20 - All cache misses → direct database queries
- 09:45:30 - Database overwhelmed with 50x normal query load
Redis Sentinel Misconfiguration
The problem:
# Redis Sentinel configuration (INCORRECT)
sentinel monitor mymaster redis-master 6379 3
sentinel down-after-milliseconds mymaster 5000
sentinel parallel-syncs mymaster 1
sentinel failover-timeout mymaster 10000
# Number of Sentinel nodes: 3
# Quorum configured: 3
# Sentinels needed for failover: 3
# Problem: If any Sentinel node has issues, failover impossible!
Why failover failed:
Sentinel Cluster State at 09:45:
├─ sentinel-1: ✓ Detected master down, voted for failover
├─ sentinel-2: ✓ Detected master down, voted for failover
└─ sentinel-3: ✗ Network partition, couldn't vote
Quorum required: 3/3
Quorum achieved: 2/3
Result: FAILOVER BLOCKED
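To make the quorum arithmetic concrete, here is a minimal sketch (a simplified model for illustration, not our actual tooling) of the check Sentinel effectively performs before failing over:
def failover_can_proceed(total_sentinels: int, quorum: int, reachable: int) -> bool:
    """Simplified model: Sentinel needs `quorum` agreeing votes to declare the
    master objectively down, and a majority of all Sentinels to authorize the
    failover itself."""
    majority = total_sentinels // 2 + 1
    return reachable >= quorum and reachable >= majority

# Our incident: 3 Sentinels, quorum misconfigured to 3, only 2 reachable
print(failover_can_proceed(3, 3, 2))  # False -> failover blocked
# With the corrected quorum of 2, the same partial outage still fails over
print(failover_can_proceed(3, 2, 2))  # True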
Cascade Effect
Normal operation (with cache):
User Request → App Server → Redis Cache (hit) → Return in 5ms
Database queries: ~100/sec
During incident (without cache):
User Request → App Server → Redis (failed) → Database → Return in 8000ms
Database queries: ~5000/sec (50x increase)
Why It Got So Bad
- No cache fallback - Application had no graceful degradation
- Cache-aside pattern - Every miss resulted in a database query (see the sketch after this list)
- No circuit breaker - Kept hammering failed Redis
- Thundering herd - All requests simultaneously hitting database
- No rate limiting - Database overwhelmed instantly
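For reference, a minimal sketch of the cache-aside pattern described above (table, key names, and connection string are illustrative, not our actual schema); with no circuit breaker or fallback budget, a Redis outage silently turns every request into a connection timeout plus a database query:
import redis
import psycopg2

r = redis.Redis(host="redis-master", port=6379, socket_timeout=0.5)
db = psycopg2.connect("dbname=production")  # illustrative DSN

def get_product(product_id):
    """Plain cache-aside: try Redis, fall through to the database on any miss."""
    key = f"product:{product_id}"
    try:
        cached = r.get(key)
        if cached is not None:
            return cached
    except redis.ConnectionError:
        # No circuit breaker: every request keeps retrying the dead Redis,
        # pays the connection timeout, then hits the database anyway.
        pass

    cursor = db.cursor()
    cursor.execute("SELECT * FROM products WHERE id = %s", (product_id,))
    row = cursor.fetchone()
    try:
        r.setex(key, 3600, str(row))  # the write-back also fails while Redis is down
    except redis.ConnectionError:
        pass
    return row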
Immediate Fix
Step 1: Identify Failed Master
# Check Redis Sentinel status
redis-cli -p 26379 sentinel masters
# Output showed master as "s_down" (subjectively down)
# But no failover had occurred
# Check Sentinel logs
kubectl logs redis-sentinel-0 -n cache | tail -50
# Output:
# "Could not reach quorum for failover"
# "3/3 Sentinels required, only 2/3 available"
Step 2: Manual Failover
# Manually promote replica-1 to master
redis-cli -p 26379 SENTINEL FAILOVER mymaster
# Verify promotion
redis-cli -h redis-replica-1 -p 6379 INFO replication
# Output:
# role:master
# connected_slaves:1
Result: New master elected, but applications still pointing to old endpoint
Step 3: Update Application Configuration
# Option 1: Update ConfigMap with new Redis endpoint
kubectl patch configmap redis-config -n production \
-p '{"data":{"REDIS_HOST":"redis-replica-1"}}'
# Option 2: Use Redis Sentinel in application
# (Better long-term solution, requires code change)
# Rolling restart to pick up new config
kubectl rollout restart deployment/web-app -n production
kubectl rollout restart deployment/mobile-api -n production
Step 4: Cache Warming
# Emergency cache warming script
import json
import redis
import psycopg2

redis_client = redis.Redis(host='redis-replica-1', port=6379)
db_conn = psycopg2.connect("dbname=production")

# Warm critical keys
critical_queries = [
    "SELECT * FROM products WHERE featured = true",
    "SELECT * FROM categories WHERE active = true",
    "SELECT * FROM config WHERE key = 'site_settings'"
]

for query in critical_queries:
    cursor = db_conn.cursor()
    cursor.execute(query)
    results = cursor.fetchall()

    # Cache for 1 hour (default=str handles dates/decimals in result rows)
    key = f"cache:{hash(query)}"
    redis_client.setex(key, 3600, json.dumps(results, default=str))
    print(f"Warmed cache key: {key}")
Result: Critical data cached, reducing database load
Long-term Prevention
1. Fix Sentinel Configuration
Corrected configuration:
# Redis Sentinel configuration (CORRECT)
sentinel monitor mymaster redis-master 6379 2 # Changed from 3 to 2
sentinel down-after-milliseconds mymaster 5000
sentinel parallel-syncs mymaster 1
sentinel failover-timeout mymaster 10000
# Why quorum = 2 works better:
# - 3 Sentinels deployed
# - Quorum: 2/3 (majority)
# - If 1 Sentinel fails, failover still works
# - If 2 Sentinels fail, cluster has bigger problems
Applied to all Sentinel instances:
# Update Sentinel ConfigMap
kubectl create configmap redis-sentinel-config \
--from-file=sentinel.conf \
--dry-run=client -o yaml | kubectl apply -f -
# Restart Sentinels to apply new config
kubectl rollout restart statefulset/redis-sentinel -n cache
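As a sanity check after the rollout, Sentinel's CKQUORUM command reports whether enough Sentinels are currently reachable to authorize a failover; a minimal sketch using redis-py (host name is illustrative):
import redis

# Connect to any one Sentinel instance (26379 is the Sentinel default port)
sentinel_conn = redis.Redis(host='redis-sentinel-0', port=26379, decode_responses=True)

# CKQUORUM checks both the configured quorum and the failover authorization majority
reply = sentinel_conn.execute_command('SENTINEL', 'CKQUORUM', 'mymaster')
print(reply)  # e.g. "OK 3 usable Sentinels. Quorum and failover authorization can be reached"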
2. Application-Level Improvements
Implement circuit breaker pattern:
from pybreaker import CircuitBreaker
import redis
import logging

logger = logging.getLogger(__name__)

# Circuit breaker for Redis
redis_breaker = CircuitBreaker(
    fail_max=5,        # Open circuit after 5 consecutive failures
    reset_timeout=30,  # Try again after 30 seconds
    name='redis_cache'
)

class CacheService:
    def __init__(self):
        # Plain connection shown for brevity; the Sentinel-aware client
        # below is the preferred long-term approach
        self.redis_client = redis.Redis(host='redis-master', port=6379)

    @redis_breaker
    def get(self, key):
        """Get from cache with circuit breaker"""
        try:
            return self.redis_client.get(key)
        except redis.ConnectionError as e:
            logger.warning(f"Redis connection failed: {e}")
            raise

    def get_with_fallback(self, key, fallback_fn):
        """Get from cache with automatic fallback"""
        try:
            # Try cache first
            cached = self.get(key)
            if cached:
                return cached
        except Exception as e:
            logger.error(f"Cache error: {e}, using fallback")

        # Cache miss or error - use fallback
        return fallback_fn()

# Usage
cache = CacheService()

def get_user(user_id):
    def fetch_from_db():
        return db.query("SELECT * FROM users WHERE id = %s", user_id)

    return cache.get_with_fallback(f"user:{user_id}", fetch_from_db)
Use Redis Sentinel in application:
import redis
from redis.sentinel import Sentinel

# Connect via Sentinel (automatic failover)
sentinel = Sentinel([
    ('sentinel-0', 26379),
    ('sentinel-1', 26379),
    ('sentinel-2', 26379)
], socket_timeout=0.5)

# Get master connection (auto-updates on failover)
master = sentinel.master_for(
    'mymaster',
    socket_timeout=0.5,
    db=0
)

# Get replica connection (for read-only operations)
replica = sentinel.slave_for(
    'mymaster',
    socket_timeout=0.5,
    db=0
)

# Use in application
def get_cached_data(key):
    try:
        # Try master first
        return master.get(key)
    except redis.ConnectionError:
        # Fall back to a replica for reads
        return replica.get(key)
3. Monitoring and Alerting
Redis health monitoring:
# Prometheus alert rules
groups:
- name: redis_alerts
rules:
- alert: RedisMasterDown
expr: redis_up{role="master"} == 0
for: 30s
labels:
severity: critical
annotations:
summary: "Redis master is down"
description: "Redis master {{ $labels.instance }} has been down for 30s"
- alert: RedisSentinelQuorumLost
expr: redis_sentinel_masters_sentinels < 2
for: 1m
labels:
severity: critical
annotations:
summary: "Redis Sentinel quorum lost"
description: "Only {{ $value }} Sentinels available for master monitoring"
- alert: RedisReplicationBroken
expr: redis_connected_slaves == 0
for: 5m
labels:
severity: warning
annotations:
summary: "Redis replication broken"
description: "Master has no connected replicas for 5 minutes"
- alert: RedisHighMemoryUsage
expr: redis_memory_used_bytes / redis_memory_max_bytes > 0.9
for: 5m
labels:
severity: warning
annotations:
summary: "Redis memory usage >90%"
- alert: CacheHitRateLow
expr: |
rate(redis_keyspace_hits_total[5m]) /
(rate(redis_keyspace_hits_total[5m]) + rate(redis_keyspace_misses_total[5m]))
< 0.8
for: 10m
labels:
severity: warning
annotations:
summary: "Cache hit rate below 80%"
description: "Current hit rate: {{ $value | humanizePercentage }}"
Database load monitoring:
- alert: DatabaseOverloaded
expr: pg_stat_activity_count > 150
for: 2m
labels:
severity: critical
annotations:
summary: "Database connection count very high"
description: "{{ $value }} active connections (normal: ~50)"
4. Infrastructure Improvements
Deploy Redis Cluster instead of Sentinel:
# Redis Cluster provides better HA and automatic sharding
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: redis-cluster
spec:
serviceName: redis-cluster
replicas: 6 # 3 masters + 3 replicas
template:
spec:
containers:
- name: redis
image: redis:7-alpine
command:
- redis-server
- /conf/redis.conf
- --cluster-enabled yes
- --cluster-config-file /data/nodes.conf
- --cluster-node-timeout 5000
- --appendonly yes
ports:
- containerPort: 6379
name: client
- containerPort: 16379
name: gossip
volumeMounts:
- name: conf
mountPath: /conf
- name: data
mountPath: /data
Benefits of Redis Cluster (see the client sketch after this list):
- Automatic failover (no Sentinel needed)
- Horizontal scaling via sharding
- Better split-brain protection
- No single point of failure
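On the application side, a minimal sketch of connecting with redis-py's cluster-aware client (host name and keys are illustrative); the client discovers the slot map and routes each command to the owning shard, which is the client-library complexity noted in the comparison later on:
from redis.cluster import RedisCluster

# Any reachable node can bootstrap the client; it discovers the full topology
rc = RedisCluster(host='redis-cluster', port=6379, decode_responses=True)

# Single-key commands work as usual; the client routes them to the owning shard
rc.set('user:123', 'cached-profile', ex=3600)
print(rc.get('user:123'))

# Multi-key operations only work when all keys hash to the same slot, which is
# why keys that must live together use hash tags like {user:123}
rc.set('{user:123}:orders', 'cached-orders', ex=3600)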
5. Caching Strategy Improvements
Multi-level caching:
import logging
import redis

logger = logging.getLogger(__name__)

class MultiLevelCache:
    def __init__(self):
        self.redis = redis.Redis()
        self.local_cache = {}
        self.max_local_cache_size = 1000

    def get(self, key):
        # Level 1: In-memory cache (fastest)
        if key in self.local_cache:
            return self.local_cache[key]

        # Level 2: Redis cache
        try:
            value = self.redis.get(key)
            if value:
                # Populate local cache
                if len(self.local_cache) < self.max_local_cache_size:
                    self.local_cache[key] = value
                return value
        except redis.ConnectionError:
            logger.warning("Redis unavailable, using local cache only")

        # Level 3: Database (slowest) - caller falls back when this returns None
        return None

    def set(self, key, value, ttl=3600):
        # Set in both caches
        self.local_cache[key] = value
        try:
            self.redis.setex(key, ttl, value)
        except redis.ConnectionError:
            logger.warning("Redis unavailable, only local cache updated")
Lessons Learned
What Went Well ✅
- Fast detection - Alert fired within 15 seconds of failure
- Good escalation - SEV-1 declared appropriately
- Manual failover worked - Successfully promoted replica
- Cache warming - Proactive cache population reduced recovery time
- Team coordination - Clear communication during incident
- Database held up - PostgreSQL handled spike without crashing
What Went Wrong ❌
- Sentinel misconfigured - Quorum set too high for 3-node cluster
- No automatic failover - HA system failed when needed most
- No circuit breaker - Application kept hammering failed Redis
- Thundering herd - All requests hit database simultaneously
- Hard Redis dependency - No graceful degradation
- Cache warming not automated - Manual process during incident
- Monitoring gap - No alert for Sentinel quorum issues
Surprises 😮
- How fast it cascaded - Redis failure → database overload in 15 seconds
- Database resilience - PostgreSQL handled 50x load without crashing
- Sentinel failure - HA solution became single point of failure
- User impact - 45-second response times = 60% cart abandonment
- Cache hit rate matters - 95% hit rate kept database load manageable normally
Action Items
Completed ✅
Action | Owner | Completed |
---|---|---|
Manual Redis failover | SRE Team | 2025-09-05 |
Fix Sentinel quorum configuration | SRE Team | 2025-09-05 |
Add Redis circuit breaker | Dev Team | 2025-09-06 |
Implement cache warming automation | Dev Team | 2025-09-06 |
Add Sentinel quorum monitoring | SRE Team | 2025-09-06 |
In Progress 🔄
Action | Owner | Target Date |
---|---|---|
Migrate to Redis Sentinel client library | Dev Team | 2025-09-15 |
Implement multi-level caching | Dev Team | 2025-09-20 |
Deploy Redis Cluster (replace Sentinel) | Platform Team | 2025-10-01 |
Planned ⏳
Action | Owner | Target Date |
---|---|---|
Add database query rate limiting | Dev Team | 2025-09-30 |
Implement cache preloading on deployment | DevOps Team | 2025-10-15 |
Chaos testing: Random Redis failures | SRE Team | 2025-11-01 |
Technical Deep Dive
Redis Sentinel vs Redis Cluster
Redis Sentinel (what we had):
Architecture:
├─ 1 Master (read/write)
├─ 2 Replicas (read-only)
└─ 3 Sentinels (monitoring)
Pros:
- Simple setup
- Good for small deployments
- Automatic failover (when configured correctly!)
Cons:
- Single point of write contention
- Sentinels add complexity
- Quorum configuration tricky
Redis Cluster (migrating to):
Architecture:
├─ 3 Masters (sharded data)
├─ 3 Replicas (1 per master)
└─ Built-in cluster management (no Sentinel)
Pros:
- Automatic sharding
- Better scalability
- No separate Sentinel nodes
- Automatic failover
Cons:
- More complex client library
- Multi-key operations limited
- Requires at least 6 nodes
Cache Invalidation Strategies
Time-based expiration (TTL):
import redis

r = redis.Redis()

# Simple but can serve stale data
r.setex("user:123", 3600, user_data)  # 1 hour TTL
Event-driven invalidation:
# Invalidate on database write
def update_user(user_id, new_data):
    db.update_user(user_id, new_data)
    r.delete(f"user:{user_id}")  # Invalidate cache
Cache stampede prevention:
import random

def set_with_jitter(key, value, ttl=3600):
    """Add random jitter to the TTL so cached keys don't all expire at once"""
    jitter = random.randint(0, int(ttl * 0.1))  # up to +10% of the TTL
    r.setex(key, ttl + jitter, value)
Calculating Cache Hit Rate
Hit Rate = Hits / (Hits + Misses)
Our metrics:
- Normal: 95% hit rate (19 hits per 1 miss)
- During incident: 0% hit rate (all misses)
Impact:
- Normal: 100 req/sec × 5% miss = 5 DB queries/sec
- Incident: 100 req/sec × 100% miss = 100 DB queries/sec (20x)
With 50 app servers:
- Normal: 250 DB queries/sec
- Incident: 5,000 DB queries/sec (20x increase)
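A small sketch (illustrative, not one of our dashboards) showing how this hit rate can be read straight from Redis's cumulative INFO counters; for a rate over a time window, use the Prometheus expression from the CacheHitRateLow alert above:
import redis

r = redis.Redis(host='redis-master', port=6379)

# keyspace_hits / keyspace_misses are cumulative counters in the INFO "stats" section
stats = r.info('stats')
hits = stats['keyspace_hits']
misses = stats['keyspace_misses']

hit_rate = hits / (hits + misses) if (hits + misses) else 0.0
print(f"Cache hit rate: {hit_rate:.1%}")  # ~95% in normal operation, 0% during the incident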
Appendix
Useful Commands
Check Redis Sentinel status:
# Connect to Sentinel
redis-cli -h sentinel-0 -p 26379
# Check master status
SENTINEL masters
# Check replicas
SENTINEL replicas mymaster
# Check Sentinels
SENTINEL sentinels mymaster
# Manual failover
SENTINEL FAILOVER mymaster
Check Redis replication:
# On master
redis-cli INFO replication
# Output:
# role:master
# connected_slaves:2
# slave0:ip=10.0.1.5,port=6379,state=online
# slave1:ip=10.0.1.6,port=6379,state=online
Monitor Redis in real-time:
# Watch all commands
redis-cli MONITOR
# Check slow queries
redis-cli SLOWLOG GET 10
# Memory usage
redis-cli INFO memory
Test cache performance:
# Benchmark
redis-benchmark -h redis-master -p 6379 -c 50 -n 100000
# Results show:
# SET: ~50,000 requests/sec
# GET: ~80,000 requests/sec
References
Incident Commander: Marcus Johnson
Contributors: Sarah Williams (On-call), Kevin Park (DBA), Lisa Zhang (Dev Lead)
Postmortem Completed: 2025-09-06
Next Review: 2025-10-06 (1 month follow-up)