12 min
Linux Security Baseline for Production Servers
Executive Summary A security baseline is the foundation: OS-hardened, patched, with restricted access and audit trails. This guide covers minimal-install servers with hardened SSH, firewall (default-deny), LSM enforcement, least-privilege sudo, audit logging, and systemd hardening.
Goal: Reduce attack surface, detect breaches, and enforce privilege boundaries.
1. Minimal Install & Patching Minimal Install What it is:
Install only required packages (base + SSH + monitoring agent) No GUI, X11, unnecessary daemons Reduces vulnerabilities (fewer packages = fewer CVEs) Install steps (Ubuntu/Debian):
…
October 16, 2025 · 12 min · DevOps Engineer
🛠️ Guide
10 min
Docker Best Practices: Multi-stage Builds, Optimization, and Security
Introduction Building efficient and secure Docker images requires following best practices that reduce image size, improve build times, and minimize security vulnerabilities. This guide covers essential techniques for production-ready containers.
Multi-Stage Builds The Problem: Bloated Images Before (single-stage build):
FROM node:18 WORKDIR /app # Install dependencies COPY package*.json ./ RUN npm install # Includes devDependencies # Copy source COPY . . # Build RUN npm run build # Runtime includes build tools and dependencies CMD ["node", "dist/index.js"] Result: 1.2GB image with unnecessary build tools and dependencies.
…
October 15, 2025 · 10 min · DevOps Engineer
📊 SRE Practice
5 min
SSL Certificate Renewal Runbook
SSL Certificate Renewal Runbook Quick Reference:
Duration: 5-15 minutes Impact: Minimal (usually zero downtime) Requires: Sudo access to load balancer or cert-manager admin Severity: HIGH (expired certs = full outage) Prerequisites What You Need SSH access to certificate servers or kubectl access to cluster Let’s Encrypt API credentials (if manual renewal) Backup certificate location documented 30+ days lead time before expiry (not 5 minutes!) Check Current Status # View certificate expiry openssl s_client -connect example.com:443 -showcerts | grep -A5 "Verify return code" # Or via Kubernetes kubectl get certificate -n ingress-nginx kubectl describe certificate prod-cert -n ingress-nginx # Check expiry date specifically openssl x509 -in /etc/ssl/certs/server.crt -noout -enddate Automatic Renewal (Preferred) Using cert-manager (Kubernetes) apiVersion: cert-manager.io/v1 kind: Certificate metadata: name: prod-cert namespace: ingress-nginx spec: secretName: prod-tls-secret issuerRef: name: letsencrypt-prod kind: ClusterIssuer dnsNames: - example.com - "*.example.com" Cert-manager automatically:
…
August 15, 2025 · 5 min · DevOps Engineer
🚨 Incident
9 min
Incident: SSL Certificate Expiry Causes Complete Outage
Incident Summary Date: 2025-08-15 Time: 08:00 UTC Duration: 1 hour 45 minutes Severity: SEV-1 (Critical) Impact: Complete service unavailability for all users
Quick Facts Users Affected: 100% - all external traffic Services Affected: All public-facing services Revenue Impact: ~$12,000 in lost sales SLO Impact: 80% of monthly error budget consumed in single incident Timeline 08:00:00 - SSL certificate expired (not detected) 08:00:30 - User reports started coming in: “Your connection is not private” 08:02:00 - PagerDuty alert: Health check failures from external monitoring 08:02:30 - On-call engineer (Sarah) acknowledged alert 08:03:00 - Opened website, saw SSL certificate error 08:03:30 - Checked certificate expiry: Expired at 08:00 UTC 08:04:00 - Root cause identified: SSL certificate expired 08:04:30 - Incident escalated to SEV-1, incident commander assigned 08:05:00 - Senior SRE (Mike) joined as incident commander 08:06:00 - Attempted automatic renewal with certbot: Failed - rate limit exceeded 08:08:00 - Checked Let’s Encrypt rate limits: Hit weekly renewal limit 08:10:00 - Decision: Use backup certificate from 6 months ago (still valid) 08:12:00 - Located backup certificate in secure storage 08:15:00 - Deployed backup certificate to load balancer 08:18:00 - Certificate updated, but services still showing errors 08:20:00 - Discovered cached certificate in CDN (Cloudflare) 08:22:00 - Purged Cloudflare cache 08:25:00 - Still seeing errors from some users 08:27:00 - Realized nginx not reloaded after certificate update 08:30:00 - Reloaded nginx on all load balancers 08:33:00 - Service partially restored, some users still affected 08:35:00 - Identified browser certificate caching 08:38:00 - Communicated workaround to users (clear browser cache) 08:45:00 - Traffic gradually recovering 09:00:00 - 90% of users able to access site 09:30:00 - 98% recovery, remaining issues browser caching 09:45:00 - Incident marked as resolved Root Cause Analysis What Happened Primary cause: SSL certificate for *.example.com expired at 08:00 UTC on August 15th, 2025.
…
August 15, 2025 · 9 min · DevOps Engineer