Incident Summary
Date: 2025-08-15 Time: 08:00 UTC Duration: 1 hour 45 minutes Severity: SEV-1 (Critical) Impact: Complete service unavailability for all users
Quick Facts
- Users Affected: 100% - all external traffic
- Services Affected: All public-facing services
- Revenue Impact: ~$12,000 in lost sales
- SLO Impact: 80% of monthly error budget consumed in single incident
Timeline
- 08:00:00 - SSL certificate expired (not detected)
- 08:00:30 - User reports started coming in: “Your connection is not private”
- 08:02:00 - PagerDuty alert: Health check failures from external monitoring
- 08:02:30 - On-call engineer (Sarah) acknowledged alert
- 08:03:00 - Opened website, saw SSL certificate error
- 08:03:30 - Checked certificate expiry: Expired at 08:00 UTC
- 08:04:00 - Root cause identified: SSL certificate expired
- 08:04:30 - Incident escalated to SEV-1, incident commander assigned
- 08:05:00 - Senior SRE (Mike) joined as incident commander
- 08:06:00 - Attempted automatic renewal with certbot: Failed - rate limit exceeded
- 08:08:00 - Checked Let’s Encrypt rate limits: Hit weekly renewal limit
- 08:10:00 - Decision: Use backup certificate from 6 months ago (still valid)
- 08:12:00 - Located backup certificate in secure storage
- 08:15:00 - Deployed backup certificate to load balancer
- 08:18:00 - Certificate updated, but services still showing errors
- 08:20:00 - Discovered cached certificate in CDN (Cloudflare)
- 08:22:00 - Purged Cloudflare cache
- 08:25:00 - Still seeing errors from some users
- 08:27:00 - Realized nginx not reloaded after certificate update
- 08:30:00 - Reloaded nginx on all load balancers
- 08:33:00 - Service partially restored, some users still affected
- 08:35:00 - Identified browser certificate caching
- 08:38:00 - Communicated workaround to users (clear browser cache)
- 08:45:00 - Traffic gradually recovering
- 09:00:00 - 90% of users able to access site
- 09:30:00 - 98% recovery, remaining issues browser caching
- 09:45:00 - Incident marked as resolved
Root Cause Analysis
What Happened
Primary cause: SSL certificate for *.example.com
expired at 08:00 UTC on August 15th, 2025.
Why auto-renewal failed:
- Certbot cron job was configured to run at 02:00 UTC daily
- Last successful renewal: July 1st, 2025
- Renewal attempts after July 1st: All failed silently
- Failure reason: Hit Let’s Encrypt rate limit (5 renewals per week per domain)
Why rate limit was hit:
# Certificate renewal attempts in July:
July 2: Failed (testing new certbot version)
July 3: Failed (testing new certbot version)
July 4: Failed (testing new certbot version)
July 5: Failed (testing new certbot version)
July 6: Failed (testing new certbot version)
July 7: RATE LIMIT REACHED
# All subsequent attempts failed with rate limit error
Why failures went unnoticed:
- No alerting on certbot failures
- Logs not monitored - Failure logs went to
/var/log/letsencrypt
- No expiry monitoring - No alert for certificates <30 days to expiry
- Silent failure - Cron job returned exit code 0 even on failure
Certificate Lifecycle
July 1: Certificate renewed successfully (expires Aug 15)
July 2-6: Testing new certbot โ 5 failed renewals
July 7+: Rate limited, all renewals fail silently
Aug 1: 14 days to expiry - No alert
Aug 8: 7 days to expiry - No alert
Aug 14: 1 day to expiry - No alert
Aug 15 08:00: CERTIFICATE EXPIRED โ Complete outage
Immediate Fix
Step 1: Attempted Automatic Renewal (Failed)
# Tried automatic renewal
sudo certbot renew --force-renewal
# Output:
# Error: too many certificates already issued for exact set of domains
# See https://letsencrypt.org/docs/rate-limits/
# Renewal failed
Why this failed: Let’s Encrypt rate limit (5 duplicate certificates per week)
Step 2: Deploy Backup Certificate
# Retrieved backup certificate from secure storage
aws s3 cp s3://cert-backups/wildcard-example-com-20250201.pem .
# Updated nginx configuration
sudo cp wildcard-example-com-20250201.pem /etc/nginx/ssl/example.com.crt
sudo cp wildcard-example-com-20250201-key.pem /etc/nginx/ssl/example.com.key
# Reload nginx
sudo nginx -t # Test configuration
sudo systemctl reload nginx
Step 3: Clear CDN Cache
# Purge Cloudflare cache (API call)
curl -X POST "https://api.cloudflare.com/client/v4/zones/${ZONE_ID}/purge_cache" \
-H "Authorization: Bearer ${CF_TOKEN}" \
-H "Content-Type: application/json" \
--data '{"purge_everything":true}'
Step 4: Verify and Monitor
# Check certificate expiry
echo | openssl s_client -servername example.com -connect example.com:443 2>/dev/null | \
openssl x509 -noout -dates
# Output:
# notBefore=Feb 1 00:00:00 2025 GMT
# notAfter=May 1 23:59:59 2025 GMT # Valid for 3 more months
Long-term Prevention
Automated Certificate Management with cert-manager
Deployed to Kubernetes (2025-08-16):
# Install cert-manager
apiVersion: v1
kind: Namespace
metadata:
name: cert-manager
---
# ClusterIssuer for Let's Encrypt
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
name: letsencrypt-prod
spec:
acme:
server: https://acme-v02.api.letsencrypt.org/directory
email: [email protected]
privateKeySecretRef:
name: letsencrypt-prod-key
solvers:
- http01:
ingress:
class: nginx
---
# Certificate resource
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
name: example-com-tls
namespace: production
spec:
secretName: example-com-tls
issuerRef:
name: letsencrypt-prod
kind: ClusterIssuer
dnsNames:
- example.com
- "*.example.com"
renewBefore: 720h # Renew 30 days before expiry
Benefits:
- Automatic renewal 30 days before expiry
- No rate limiting issues (renewals spread over time)
- Certificate stored in Kubernetes secrets
- Automatic deployment to ingress controllers
Monitoring and Alerting
1. Certificate expiry monitoring:
# Prometheus alert rules
groups:
- name: ssl_certificates
rules:
- alert: SSLCertificateExpiringSoon
expr: |
probe_ssl_earliest_cert_expiry - time() < 86400 * 30
for: 1h
labels:
severity: warning
annotations:
summary: "SSL certificate expires in <30 days"
description: "Certificate for {{ $labels.instance }} expires in {{ $value | humanizeDuration }}"
- alert: SSLCertificateExpiringCritical
expr: |
probe_ssl_earliest_cert_expiry - time() < 86400 * 7
labels:
severity: critical
annotations:
summary: "SSL certificate expires in <7 days"
- alert: SSLCertificateExpired
expr: |
probe_ssl_earliest_cert_expiry - time() < 0
labels:
severity: critical
annotations:
summary: "SSL certificate has EXPIRED"
2. Blackbox exporter for cert monitoring:
# Prometheus blackbox exporter config
modules:
http_2xx:
prober: http
timeout: 5s
http:
method: GET
fail_if_not_ssl: true
preferred_ip_protocol: "ip4"
# Scrape config
scrape_configs:
- job_name: 'ssl-expiry'
metrics_path: /probe
params:
module: [http_2xx]
static_configs:
- targets:
- https://example.com
- https://api.example.com
relabel_configs:
- source_labels: [__address__]
target_label: __param_target
- target_label: __address__
replacement: blackbox-exporter:9115
3. Alert for renewal failures:
# Cron job with proper error handling
#!/bin/bash
# /etc/cron.daily/certbot-renew
LOGFILE="/var/log/certbot-renew.log"
certbot renew --quiet >> "$LOGFILE" 2>&1
if [ $? -ne 0 ]; then
# Send alert to PagerDuty
curl -X POST https://events.pagerduty.com/v2/enqueue \
-H 'Content-Type: application/json' \
-d '{
"routing_key": "'"$PAGERDUTY_KEY"'",
"event_action": "trigger",
"payload": {
"summary": "Certbot renewal failed",
"severity": "error",
"source": "certbot-cron"
}
}'
exit 1
fi
# Check certificate expiry
EXPIRY=$(echo | openssl s_client -servername example.com \
-connect example.com:443 2>/dev/null | \
openssl x509 -noout -enddate | cut -d= -f2)
echo "Certificate expires: $EXPIRY" >> "$LOGFILE"
Process Improvements
1. Certificate inventory:
# Created script to inventory all certificates
#!/bin/bash
# check-all-certs.sh
DOMAINS=(
"example.com"
"api.example.com"
"admin.example.com"
"cdn.example.com"
)
for domain in "${DOMAINS[@]}"; do
echo "Checking $domain..."
echo | openssl s_client -servername "$domain" \
-connect "$domain":443 2>/dev/null | \
openssl x509 -noout -subject -dates -issuer
echo "---"
done
2. Documentation:
Created runbook: Certificate Renewal Procedures
- How to check certificate status
- Manual renewal procedure
- Emergency backup certificate deployment
- Rate limit troubleshooting
3. Backup certificates:
# Automated backup to S3
#!/bin/bash
# Runs daily, backs up all certificates
DATE=$(date +%Y%m%d)
# Copy current certificates
cp /etc/letsencrypt/archive/example.com/fullchain*.pem \
/tmp/cert-backup-$DATE.pem
# Upload to S3
aws s3 cp /tmp/cert-backup-$DATE.pem \
s3://cert-backups/example-com-$DATE.pem \
--storage-class STANDARD_IA
# Verify backup
aws s3 ls s3://cert-backups/ | tail -5
Lessons Learned
What Went Well โ
- Quick identification - Found cause in 3 minutes
- Had backup certificate - Backup saved us from longer outage
- Good incident command - Clear leadership and communication
- User communication - Updated status page promptly
- Fast resolution - Restored service in under 2 hours
What Went Wrong โ
- No expiry monitoring - Certificate expired without warning
- Silent renewal failures - Certbot failures went unnoticed for 45 days
- Rate limiting - Hit Let’s Encrypt rate limit during testing
- Manual process - Certificate renewal depended on cron job
- No testing - Never tested backup certificate deployment procedure
- Cache issues - Didn’t anticipate CDN and browser caching
- Documentation gap - No runbook for certificate emergencies
Surprises ๐ฎ
- How fast users noticed - Reports within 30 seconds
- Cache complications - CDN and browser caching prolonged incident
- Rate limits bite hard - Testing in production hit rate limits
- Backup saved us - 6-month-old backup cert was still valid
- nginx reload required - Expected hot reload, but manual reload needed
Action Items
Completed โ
Action | Owner | Completed |
---|---|---|
Deploy backup certificate | SRE Team | 2025-08-15 |
Install cert-manager in Kubernetes | SRE Team | 2025-08-16 |
Migrate to cert-manager managed certificates | SRE Team | 2025-08-17 |
Add Prometheus certificate monitoring | SRE Team | 2025-08-17 |
Create certificate renewal runbook | Tech Writers | 2025-08-18 |
In Progress ๐
Action | Owner | Target Date |
---|---|---|
Audit all certificates across infrastructure | Security Team | 2025-08-25 |
Implement certificate inventory dashboard | SRE Team | 2025-08-30 |
Set up automated backup process | Platform Team | 2025-09-01 |
Planned โณ
Action | Owner | Target Date |
---|---|---|
Move all services to automated cert management | SRE Team | 2025-09-15 |
Quarterly certificate expiry drill | SRE Team | 2025-11-15 |
Implement ACME DNS-01 challenge for wildcard certs | Platform Team | 2025-10-01 |
Technical Deep Dive
SSL/TLS Certificate Lifecycle
Certificate Creation:
โโ Generate CSR (Certificate Signing Request)
โโ Submit to CA (Certificate Authority)
โโ Validate domain ownership
โโ Receive signed certificate
โโ Install certificate on servers
Certificate Validity:
โโ Not Before: Start date
โโ Not After: Expiry date (typically 90 days for Let's Encrypt)
โโ Renewal Window: Usually 30 days before expiry
Certificate Renewal:
โโ Automated (cert-manager, certbot)
โโ Manual (emergency only)
โโ Backup certificates (for emergencies)
Let’s Encrypt Rate Limits
Certificates per Registered Domain: 50 per week
โโ example.com can issue 50 certs per week
โโ Includes all subdomains
Duplicate Certificate Limit: 5 per week
โโ Same exact set of domains
โโ THIS is what we hit during testing
Failed Validation Limit: 5 per account per hostname per hour
Certificate Validation Methods
HTTP-01 Challenge:
1. Let's Encrypt provides token
2. Place token at http://example.com/.well-known/acme-challenge/{token}
3. Let's Encrypt validates by fetching the URL
4. Certificate issued
DNS-01 Challenge:
1. Let's Encrypt provides token
2. Create TXT record at _acme-challenge.example.com
3. Let's Encrypt validates DNS record
4. Certificate issued
5. Advantage: Works for wildcard certificates
Appendix
Useful Commands
Check certificate expiry:
# Quick check
curl -vI https://example.com 2>&1 | grep -i expire
# Detailed check
echo | openssl s_client -servername example.com \
-connect example.com:443 2>/dev/null | \
openssl x509 -noout -dates -subject -issuer
Test certificate:
# Test locally before deploying
openssl s_client -connect example.com:443 -servername example.com < /dev/null
# Verify certificate chain
openssl verify -CAfile chain.pem certificate.pem
Check all certificates on system:
# Find all certificate files
find /etc -name "*.pem" -o -name "*.crt" 2>/dev/null
# Check each one
for cert in /etc/ssl/certs/*.pem; do
echo "=== $cert ==="
openssl x509 -in "$cert" -noout -enddate 2>/dev/null
done
External Tools
- SSL Labs: https://www.ssllabs.com/ssltest/ - Comprehensive SSL testing
- Certificate Transparency Logs: https://crt.sh/ - See all issued certificates
- Let’s Encrypt Status: https://letsencrypt.status.io/ - Check for outages
References
Incident Commander: Mike Johnson Contributors: Sarah Williams (On-call), Tom Anderson (Security), Lisa Chen (SRE) Postmortem Completed: 2025-08-16 Next Review: 2025-09-16 (1 month follow-up)