Tls | DevOps & SRE Blog

📊 SRE Practice 5 min

SSL Certificate Renewal Runbook

SSL Certificate Renewal Runbook Quick Reference: Duration: 5-15 minutes Impact: Minimal (usually zero downtime) Requires: Sudo access to load balancer or cert-manager admin Severity: HIGH (expired certs = full outage) Prerequisites What You Need SSH access to certificate servers or kubectl access to cluster Let’s Encrypt API credentials (if manual renewal) Backup certificate location documented 30+ days lead time before expiry (not 5 minutes!) Check Current Status # View certificate expiry openssl s_client -connect example.com:443 -showcerts | grep -A5 "Verify return code" # Or via Kubernetes kubectl get certificate -n ingress-nginx kubectl describe certificate prod-cert -n ingress-nginx # Check expiry date specifically openssl x509 -in /etc/ssl/certs/server.crt -noout -enddate Automatic Renewal (Preferred) Using cert-manager (Kubernetes) apiVersion: cert-manager.io/v1 kind: Certificate metadata: name: prod-cert namespace: ingress-nginx spec: secretName: prod-tls-secret issuerRef: name: letsencrypt-prod kind: ClusterIssuer dnsNames: - example.com - "*.example.com" Cert-manager automatically: …

August 15, 2025 · 5 min · DevOps Engineer

🚨 Incident 9 min

Incident: SSL Certificate Expiry Causes Complete Outage

Incident Summary Date: 2025-08-15 Time: 08:00 UTC Duration: 1 hour 45 minutes Severity: SEV-1 (Critical) Impact: Complete service unavailability for all users Quick Facts Users Affected: 100% - all external traffic Services Affected: All public-facing services Revenue Impact: ~$12,000 in lost sales SLO Impact: 80% of monthly error budget consumed in single incident Timeline 08:00:00 - SSL certificate expired (not detected) 08:00:30 - User reports started coming in: “Your connection is not private” 08:02:00 - PagerDuty alert: Health check failures from external monitoring 08:02:30 - On-call engineer (Sarah) acknowledged alert 08:03:00 - Opened website, saw SSL certificate error 08:03:30 - Checked certificate expiry: Expired at 08:00 UTC 08:04:00 - Root cause identified: SSL certificate expired 08:04:30 - Incident escalated to SEV-1, incident commander assigned 08:05:00 - Senior SRE (Mike) joined as incident commander 08:06:00 - Attempted automatic renewal with certbot: Failed - rate limit exceeded 08:08:00 - Checked Let’s Encrypt rate limits: Hit weekly renewal limit 08:10:00 - Decision: Use backup certificate from 6 months ago (still valid) 08:12:00 - Located backup certificate in secure storage 08:15:00 - Deployed backup certificate to load balancer 08:18:00 - Certificate updated, but services still showing errors 08:20:00 - Discovered cached certificate in CDN (Cloudflare) 08:22:00 - Purged Cloudflare cache 08:25:00 - Still seeing errors from some users 08:27:00 - Realized nginx not reloaded after certificate update 08:30:00 - Reloaded nginx on all load balancers 08:33:00 - Service partially restored, some users still affected 08:35:00 - Identified browser certificate caching 08:38:00 - Communicated workaround to users (clear browser cache) 08:45:00 - Traffic gradually recovering 09:00:00 - 90% of users able to access site 09:30:00 - 98% recovery, remaining issues browser caching 09:45:00 - Incident marked as resolved Root Cause Analysis What Happened Primary cause: SSL certificate for *.example.com expired at 08:00 UTC on August 15th, 2025. …

August 15, 2025 · 9 min · DevOps Engineer

ssl certificate expiry tls incident