Incident Summary

Date: 2025-08-15
Time: 08:00 UTC
Duration: 1 hour 45 minutes
Severity: SEV-1 (Critical)
Impact: Complete service unavailability for all users

Quick Facts

  • Users Affected: 100% - all external traffic
  • Services Affected: All public-facing services
  • Revenue Impact: ~$12,000 in lost sales
  • SLO Impact: 80% of monthly error budget consumed in single incident
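
The error-budget figure can be sanity-checked with simple arithmetic: the budget is the window length times (1 - SLO), and the share consumed is the downtime divided by that budget. The sketch below is illustrative only. The function name is hypothetical, and the 99.7% target in the usage line is an assumed value, not an SLO stated in this postmortem.

```shell
#!/bin/bash
# Percentage of an availability error budget consumed by one outage.
# Usage: budget_consumed_pct <downtime_minutes> <window_minutes> <slo_percent>
budget_consumed_pct() {
  awk -v down="$1" -v window="$2" -v slo="$3" 'BEGIN {
    budget = window * (100 - slo) / 100   # allowed downtime in minutes
    printf "%.1f\n", down / budget * 100
  }'
}

# 105-minute outage, 31-day month (44640 minutes), assumed 99.7% SLO
budget_consumed_pct 105 44640 99.7
```

With those assumed inputs the result is a little under 80%, which is in line with the figure reported above.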

Timeline

  • 08:00:00 - SSL certificate expired (not detected)
  • 08:00:30 - User reports started coming in: “Your connection is not private”
  • 08:02:00 - PagerDuty alert: Health check failures from external monitoring
  • 08:02:30 - On-call engineer (Sarah) acknowledged alert
  • 08:03:00 - Opened website, saw SSL certificate error
  • 08:03:30 - Checked certificate expiry: Expired at 08:00 UTC
  • 08:04:00 - Root cause identified: SSL certificate expired
  • 08:04:30 - Incident escalated to SEV-1, incident commander assigned
  • 08:05:00 - Senior SRE (Mike) joined as incident commander
  • 08:06:00 - Attempted automatic renewal with certbot: Failed - rate limit exceeded
  • 08:08:00 - Checked Let’s Encrypt rate limits: Hit weekly renewal limit
  • 08:10:00 - Decision: Use backup certificate from 6 months ago (still valid)
  • 08:12:00 - Located backup certificate in secure storage
  • 08:15:00 - Deployed backup certificate to load balancer
  • 08:18:00 - Certificate updated, but services still showing errors
  • 08:20:00 - Discovered cached certificate in CDN (Cloudflare)
  • 08:22:00 - Purged Cloudflare cache
  • 08:25:00 - Still seeing errors from some users
  • 08:27:00 - Realized nginx not reloaded after certificate update
  • 08:30:00 - Reloaded nginx on all load balancers
  • 08:33:00 - Service partially restored, some users still affected
  • 08:35:00 - Identified browser certificate caching
  • 08:38:00 - Communicated workaround to users (clear browser cache)
  • 08:45:00 - Traffic gradually recovering
  • 09:00:00 - 90% of users able to access site
  • 09:30:00 - 98% recovery, remaining issues browser caching
  • 09:45:00 - Incident marked as resolved

Root Cause Analysis

What Happened

Primary cause: SSL certificate for *.example.com expired at 08:00 UTC on August 15th, 2025.

Why auto-renewal failed:

  1. Certbot cron job was configured to run at 02:00 UTC daily
  2. Last successful renewal: July 1st, 2025
  3. Renewal attempts after July 1st: All failed silently
  4. Failure reason: Hit the Let’s Encrypt duplicate certificate limit (5 per week for the same exact set of domains)

Why rate limit was hit:

# Certificate renewal attempts in July:
July 2:  Failed (testing new certbot version)
July 3:  Failed (testing new certbot version)
July 4:  Failed (testing new certbot version)
July 5:  Failed (testing new certbot version)
July 6:  Failed (testing new certbot version)
July 7:  RATE LIMIT REACHED
# All subsequent attempts failed with rate limit error

Why failures went unnoticed:

  1. No alerting on certbot failures
  2. Logs not monitored - Failure logs went to /var/log/letsencrypt
  3. No expiry monitoring - No alert for certificates <30 days to expiry
  4. Silent failure - Cron job returned exit code 0 even on failure
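
Because the failures were only visible in the log directory, even a small log check would have surfaced them. The following is a minimal sketch, assuming certbot's usual log file location and matching the error text quoted in this postmortem; the function name is hypothetical.

```shell
#!/bin/bash
# Count rate-limit errors in a certbot log.
# The default path is certbot's usual log location and may differ per host.
count_rate_limit_errors() {
  local logfile="${1:-/var/log/letsencrypt/letsencrypt.log}"
  if [ ! -r "$logfile" ]; then
    echo "cannot read $logfile" >&2
    return 2
  fi
  # grep -c exits 1 when the count is 0, so mask that exit status
  grep -c -i "too many certificates" "$logfile" || true
}
```

A monitoring job could call this daily and alert whenever the count is non-zero or the log cannot be read.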

Certificate Lifecycle

July 1:  Certificate renewed successfully (expires Aug 15)
July 2-6: Testing new certbot → 5 failed renewals
July 7+: Rate limited, all renewals fail silently
Aug 1:   14 days to expiry - No alert
Aug 8:   7 days to expiry - No alert
Aug 14:  1 day to expiry - No alert
Aug 15 08:00: CERTIFICATE EXPIRED → Complete outage
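
The missing checkpoints in this lifecycle amount to comparing the certificate's notAfter time with the current time. A minimal sketch of that comparison follows, assuming GNU date and using the 30-day and 7-day thresholds discussed in this postmortem; both function names are hypothetical.

```shell
#!/bin/bash
# Seconds between "now" and a certificate's notAfter timestamp.
# Requires GNU date (-d). The optional second argument pins "now" (epoch seconds).
seconds_until_expiry() {
  local not_after="$1"
  local now="${2:-$(date -u +%s)}"
  local expiry
  expiry=$(date -u -d "$not_after" +%s) || return 1
  echo $(( expiry - now ))
}

# Map remaining seconds onto the 30-day and 7-day thresholds.
expiry_status() {
  local secs="$1"
  if   [ "$secs" -lt 0 ];               then echo "EXPIRED"
  elif [ "$secs" -lt $(( 7 * 86400 )) ];  then echo "CRITICAL"
  elif [ "$secs" -lt $(( 30 * 86400 )) ]; then echo "WARNING"
  else                                       echo "OK"
  fi
}
```

For example, `expiry_status "$(seconds_until_expiry "Aug 15 08:00:00 2025 GMT")"` run on Aug 1 would have reported WARNING two weeks before the outage.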

Immediate Fix

Step 1: Attempted Automatic Renewal (Failed)

# Tried automatic renewal
sudo certbot renew --force-renewal

# Output:
# Error: too many certificates already issued for exact set of domains
# See https://letsencrypt.org/docs/rate-limits/
# Renewal failed

Why this failed: Let’s Encrypt rate limit (5 duplicate certificates per week)

Step 2: Deploy Backup Certificate

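# Retrieved backup certificate and private key from secure storage
# (the key's object name is assumed to mirror the local filename used below)
aws s3 cp s3://cert-backups/wildcard-example-com-20250201.pem .
aws s3 cp s3://cert-backups/wildcard-example-com-20250201-key.pem .

# Installed both files at the paths nginx is configured to read
sudo cp wildcard-example-com-20250201.pem /etc/nginx/ssl/example.com.crt
sudo cp wildcard-example-com-20250201-key.pem /etc/nginx/ssl/example.com.key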

# Reload nginx
sudo nginx -t  # Test configuration
sudo systemctl reload nginx
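
Since the backup deployment procedure had never been tested, a pre-deployment check on the certificate and key is a cheap safeguard. The sketch below is one possible check, not the procedure used during the incident. It uses `openssl x509 -checkend` to reject a certificate that is expired or about to expire, and compares public-key hashes to confirm the key matches. The function name is hypothetical.

```shell
#!/bin/bash
# Pre-deployment checks for a certificate/key pair.
# Returns 0 only if the certificate stays valid for at least min_seconds
# and its public key matches the private key.
check_cert_pair() {
  local cert="$1" key="$2" min_seconds="${3:-0}"

  # -checkend exits non-zero if the certificate expires within min_seconds
  if ! openssl x509 -in "$cert" -noout -checkend "$min_seconds" >/dev/null; then
    echo "certificate expired or expiring within ${min_seconds}s: $cert" >&2
    return 1
  fi

  # Compare a hash of the public key embedded in the cert with the key's own
  local cert_pub key_pub
  cert_pub=$(openssl x509 -in "$cert" -noout -pubkey | openssl sha256)
  key_pub=$(openssl pkey -in "$key" -pubout | openssl sha256)
  if [ "$cert_pub" != "$key_pub" ]; then
    echo "certificate and key do not match" >&2
    return 1
  fi
}
```

Running `check_cert_pair example.com.crt example.com.key 86400` before the copy step would refuse a pair that is mismatched or has less than a day of validity left.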

Step 3: Clear CDN Cache

# Purge Cloudflare cache (API call)
curl -X POST "https://api.cloudflare.com/client/v4/zones/${ZONE_ID}/purge_cache" \
  -H "Authorization: Bearer ${CF_TOKEN}" \
  -H "Content-Type: application/json" \
  --data '{"purge_everything":true}'

Step 4: Verify and Monitor

# Check certificate expiry
echo | openssl s_client -servername example.com -connect example.com:443 2>/dev/null | \
  openssl x509 -noout -dates

# Output:
# notBefore=Feb  1 00:00:00 2025 GMT
# notAfter=May  1 23:59:59 2025 GMT  # Valid for 3 more months

Long-term Prevention

Automated Certificate Management with cert-manager

Deployed to Kubernetes (2025-08-16):

# Namespace for cert-manager (the cert-manager components themselves are installed separately)
apiVersion: v1
kind: Namespace
metadata:
  name: cert-manager

---
# ClusterIssuer for Let's Encrypt
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: [email protected]
    privateKeySecretRef:
      name: letsencrypt-prod-key
    solvers:
    - http01:
        ingress:
          class: nginx

---
# Certificate resource
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: example-com-tls
  namespace: production
spec:
  secretName: example-com-tls
  issuerRef:
    name: letsencrypt-prod
    kind: ClusterIssuer
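  dnsNames:
  - example.com
  # Wildcard names can only be validated with a DNS-01 solver; the HTTP-01
  # solver in the ClusterIssuer above cannot issue this name.
  - "*.example.com"
  renewBefore: 720h  # Renew 30 days before expiry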

Benefits:

  • Automatic renewal 30 days before expiry
  • Far less exposure to rate limits (renewals are spread over time)
  • Certificate stored in Kubernetes secrets
  • Automatic deployment to ingress controllers
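
The renewal point implied by `renewBefore` is the certificate's notAfter time minus that duration. A small sketch of the arithmetic, assuming GNU date; the function name and the example date are illustrative.

```shell
#!/bin/bash
# When renewal would begin: notAfter minus renewBefore (given in hours).
# Requires GNU date (-d).
renewal_date() {
  local not_after="$1" renew_before_hours="$2"
  local expiry
  expiry=$(date -u -d "$not_after" +%s) || return 1
  date -u -d "@$(( expiry - renew_before_hours * 3600 ))" +%Y-%m-%dT%H:%M:%SZ
}

# A certificate expiring 2025-11-14 with renewBefore 720h
renewal_date "2025-11-14 00:00:00 UTC" 720
```

For a 90-day certificate this places the first renewal attempt around day 60, which leaves a month to notice and fix a failing renewal.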

Monitoring and Alerting

1. Certificate expiry monitoring:

# Prometheus alert rules
groups:
  - name: ssl_certificates
    rules:
      - alert: SSLCertificateExpiringSoon
        expr: |
          probe_ssl_earliest_cert_expiry - time() < 86400 * 30
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "SSL certificate expires in <30 days"
          description: "Certificate for {{ $labels.instance }} expires in {{ $value | humanizeDuration }}"

      - alert: SSLCertificateExpiringCritical
        expr: |
          probe_ssl_earliest_cert_expiry - time() < 86400 * 7
        labels:
          severity: critical
        annotations:
          summary: "SSL certificate expires in <7 days"

      - alert: SSLCertificateExpired
        expr: |
          probe_ssl_earliest_cert_expiry - time() < 0
        labels:
          severity: critical
        annotations:
          summary: "SSL certificate has EXPIRED"

2. Blackbox exporter for cert monitoring:

# Prometheus blackbox exporter config
modules:
  http_2xx:
    prober: http
    timeout: 5s
    http:
      method: GET
      fail_if_not_ssl: true
      preferred_ip_protocol: "ip4"

# Scrape config
scrape_configs:
  - job_name: 'ssl-expiry'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
        - https://example.com
        - https://api.example.com
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - target_label: __address__
        replacement: blackbox-exporter:9115
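
To confirm the probe works before relying on the alerts, the exporter's `/probe` endpoint can be queried directly and the expiry metric read from its output. The helper below is a sketch with a hypothetical name; the exporter address in the comment is taken from the scrape config above and may differ in other environments.

```shell
#!/bin/bash
# Extract the certificate expiry (epoch seconds) from blackbox exporter
# probe output on stdin. Exits 1 if the metric is absent.
extract_cert_expiry() {
  awk '$1 == "probe_ssl_earliest_cert_expiry" { printf "%.0f\n", $2; found = 1 }
       END { exit found ? 0 : 1 }'
}

# Example:
#   curl -s 'http://blackbox-exporter:9115/probe?module=http_2xx&target=https://example.com' \
#     | extract_cert_expiry
```

A missing metric usually means the TLS handshake itself failed, which is worth alerting on in its own right.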

3. Alert for renewal failures:

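#!/bin/bash
# /etc/cron.daily/certbot-renew
# Cron job with proper error handling.
# Expects PAGERDUTY_KEY to be set in the environment.

LOGFILE="/var/log/certbot-renew.log"

# Test the command directly so a failed renewal cannot be masked
if ! certbot renew --quiet >> "$LOGFILE" 2>&1; then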
  # Send alert to PagerDuty
  curl -X POST https://events.pagerduty.com/v2/enqueue \
    -H 'Content-Type: application/json' \
    -d '{
      "routing_key": "'"$PAGERDUTY_KEY"'",
      "event_action": "trigger",
      "payload": {
        "summary": "Certbot renewal failed",
        "severity": "error",
        "source": "certbot-cron"
      }
    }'

  exit 1
fi

# Check certificate expiry
EXPIRY=$(echo | openssl s_client -servername example.com \
  -connect example.com:443 2>/dev/null | \
  openssl x509 -noout -enddate | cut -d= -f2)

echo "Certificate expires: $EXPIRY" >> "$LOGFILE"

Process Improvements

1. Certificate inventory:

#!/bin/bash
# check-all-certs.sh
# Inventories the certificate served by each listed domain

DOMAINS=(
  "example.com"
  "api.example.com"
  "admin.example.com"
  "cdn.example.com"
)

for domain in "${DOMAINS[@]}"; do
  echo "Checking $domain..."
  echo | openssl s_client -servername "$domain" \
    -connect "$domain":443 2>/dev/null | \
    openssl x509 -noout -subject -dates -issuer
  echo "---"
done

2. Documentation:

Created runbook: Certificate Renewal Procedures

  • How to check certificate status
  • Manual renewal procedure
  • Emergency backup certificate deployment
  • Rate limit troubleshooting

3. Backup certificates:

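#!/bin/bash
# Automated backup to S3
# Runs daily, backs up the current certificate chain and private key
set -euo pipefail
umask 077

DATE=$(date +%Y%m%d)
LIVE_DIR=/etc/letsencrypt/live/example.com

# Copy the current certificate and key. live/ holds symlinks to the newest
# files in archive/, so -L copies the targets rather than the links.
cp -L "$LIVE_DIR/fullchain.pem" "/tmp/cert-backup-$DATE.pem"
cp -L "$LIVE_DIR/privkey.pem" "/tmp/cert-backup-$DATE-key.pem"

# Upload to S3 (server-side encryption, since the private key is included)
aws s3 cp "/tmp/cert-backup-$DATE.pem" \
  "s3://cert-backups/example-com-$DATE.pem" \
  --storage-class STANDARD_IA --sse AES256
aws s3 cp "/tmp/cert-backup-$DATE-key.pem" \
  "s3://cert-backups/example-com-$DATE-key.pem" \
  --storage-class STANDARD_IA --sse AES256

# Remove local copies
rm -f "/tmp/cert-backup-$DATE.pem" "/tmp/cert-backup-$DATE-key.pem"

# Verify backup
aws s3 ls s3://cert-backups/ | tail -5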

Lessons Learned

What Went Well ✓

  1. Quick identification - Found cause in 3 minutes
  2. Had backup certificate - Backup saved us from longer outage
  3. Good incident command - Clear leadership and communication
  4. User communication - Updated status page promptly
  5. Fast resolution - Restored service in under 2 hours

What Went Wrong ✗

  1. No expiry monitoring - Certificate expired without warning
  2. Silent renewal failures - Certbot failures went unnoticed for 45 days
  3. Rate limiting - Hit Let’s Encrypt rate limit during testing
  4. Fragile automation - Certificate renewal depended on a single unmonitored cron job
  5. No testing - Never tested backup certificate deployment procedure
  6. Cache issues - Didn’t anticipate CDN and browser caching
  7. Documentation gap - No runbook for certificate emergencies

Surprises 😮

  1. How fast users noticed - Reports within 30 seconds
  2. Cache complications - CDN and browser caching prolonged incident
  3. Rate limits bite hard - Testing in production hit rate limits
  4. Backup saved us - 6-month-old backup cert was still valid
  5. nginx reload required - Expected hot reload, but manual reload needed

Action Items

Completed ✅

Action                                         Owner          Completed
Deploy backup certificate                      SRE Team       2025-08-15
Install cert-manager in Kubernetes             SRE Team       2025-08-16
Migrate to cert-manager managed certificates   SRE Team       2025-08-17
Add Prometheus certificate monitoring          SRE Team       2025-08-17
Create certificate renewal runbook             Tech Writers   2025-08-18

In Progress 🔄

Action                                         Owner           Target Date
Audit all certificates across infrastructure   Security Team   2025-08-25
Implement certificate inventory dashboard      SRE Team        2025-08-30
Set up automated backup process                Platform Team   2025-09-01

Planned ⏳

Action                                               Owner           Target Date
Move all services to automated cert management       SRE Team        2025-09-15
Quarterly certificate expiry drill                   SRE Team        2025-11-15
Implement ACME DNS-01 challenge for wildcard certs   Platform Team   2025-10-01

Technical Deep Dive

SSL/TLS Certificate Lifecycle

Certificate Creation:
├─ Generate CSR (Certificate Signing Request)
├─ Submit to CA (Certificate Authority)
├─ Validate domain ownership
├─ Receive signed certificate
└─ Install certificate on servers

Certificate Validity:
├─ Not Before: Start date
├─ Not After: Expiry date (typically 90 days for Let's Encrypt)
└─ Renewal Window: Usually 30 days before expiry

Certificate Renewal:
├─ Automated (cert-manager, certbot)
├─ Manual (emergency only)
└─ Backup certificates (for emergencies)

Let’s Encrypt Rate Limits

Certificates per Registered Domain: 50 per week
├─ example.com can issue 50 certs per week
└─ Includes all subdomains

Duplicate Certificate Limit: 5 per week
├─ Same exact set of domains
└─ THIS is what we hit during testing

Failed Validation Limit: 5 per account per hostname per hour
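
A pre-flight check against the duplicate certificate limit can be approximated with a sliding 7-day window over known issuance times. The sketch below is a simplified model of the limit as described here, not Let's Encrypt's exact accounting. It assumes issuance timestamps are available as epoch seconds, one per line, and both function names are hypothetical.

```shell
#!/bin/bash
# Count issuance timestamps (epoch seconds, one per line on stdin) that fall
# inside the trailing 7-day window ending at "now" (first argument).
issuances_in_last_week() {
  local now="${1:-$(date -u +%s)}"
  awk -v now="$now" '$1 > now - 7 * 86400 && $1 <= now { n++ } END { print n + 0 }'
}

# Succeeds if one more duplicate certificate fits under the limit of 5 per week.
can_issue_duplicate() {
  local count
  count=$(issuances_in_last_week "$1")
  [ "$count" -lt 5 ]
}
```

A test harness for a new certbot version could call `can_issue_duplicate` before each production request and fall back to a dry run when it fails.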

Certificate Validation Methods

HTTP-01 Challenge:

1. Let's Encrypt provides token
2. Place token at http://example.com/.well-known/acme-challenge/{token}
3. Let's Encrypt validates by fetching the URL
4. Certificate issued

DNS-01 Challenge:

1. Let's Encrypt provides token
2. Create TXT record at _acme-challenge.example.com
3. Let's Encrypt validates DNS record
4. Certificate issued
5. Advantage: Works for wildcard certificates
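
Steps 1 and 2 of the DNS-01 flow can be made concrete. Per RFC 8555, the TXT record value is the base64url-encoded SHA-256 digest of the key authorization, which is the token joined to the account key thumbprint with a dot. A wildcard name is validated at its base domain. The helpers below are an illustrative sketch with hypothetical names; ACME clients such as certbot and cert-manager perform this internally.

```shell
#!/bin/bash
# Name of the TXT record for a DNS-01 challenge. A wildcard name is validated
# at the base domain, so a leading "*." is dropped.
acme_txt_record_name() {
  local domain="${1#\*.}"
  echo "_acme-challenge.${domain}"
}

# Value of the TXT record: base64url(SHA-256("<token>.<thumbprint>")),
# with the trailing "=" padding removed.
acme_txt_record_value() {
  local token="$1" thumbprint="$2"
  printf '%s.%s' "$token" "$thumbprint" \
    | openssl dgst -sha256 -binary \
    | openssl base64 -A \
    | tr '+/' '-_' | tr -d '='
}
```

For `*.example.com` the record name is `_acme-challenge.example.com`, matching step 2 above.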

Appendix

Useful Commands

Check certificate expiry:

# Quick check
curl -vI https://example.com 2>&1 | grep -i expire

# Detailed check
echo | openssl s_client -servername example.com \
  -connect example.com:443 2>/dev/null | \
  openssl x509 -noout -dates -subject -issuer

Test certificate:

# Test locally before deploying
openssl s_client -connect example.com:443 -servername example.com < /dev/null

# Verify certificate chain
openssl verify -CAfile chain.pem certificate.pem

Check all certificates on system:

# Find all certificate files
find /etc -name "*.pem" -o -name "*.crt" 2>/dev/null

# Check each one
for cert in /etc/ssl/certs/*.pem; do
  echo "=== $cert ==="
  openssl x509 -in "$cert" -noout -enddate 2>/dev/null
done

Incident Commander: Mike Johnson
Contributors: Sarah Williams (On-call), Tom Anderson (Security), Lisa Chen (SRE)
Postmortem Completed: 2025-08-16
Next Review: 2025-09-16 (1 month follow-up)