🛠️ Guide
10 min
Docker Best Practices: Multi-stage Builds, Optimization, and Security
Introduction
Building efficient and secure Docker images requires following best practices that reduce image size, improve build times, and minimize security vulnerabilities. This guide covers essential techniques for production-ready containers.
Multi-Stage Builds
The Problem: Bloated Images
Before (single-stage build):

FROM node:18
WORKDIR /app

# Install dependencies
COPY package*.json ./
RUN npm install  # Includes devDependencies

# Copy source
COPY . .

# Build
RUN npm run build

# Runtime includes build tools and dependencies
CMD ["node", "dist/index.js"]

Result: 1.2GB image with unnecessary build tools and dependencies.
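A multi-stage build keeps the toolchain out of the runtime image by compiling in one stage and copying only the build output into a slim final stage. The sketch below is a minimal illustration, not necessarily the exact Dockerfile the post goes on to present; the stage name, the node:18-slim base, and the dist/ output path are assumptions carried over from the single-stage example above.

# Build stage: full toolchain, devDependencies included
FROM node:18 AS build
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build

# Runtime stage: production dependencies and build output only
FROM node:18-slim
WORKDIR /app
COPY package*.json ./
RUN npm ci --omit=dev
COPY --from=build /app/dist ./dist
CMD ["node", "dist/index.js"]

The size reduction comes from the final image never containing build tools, devDependencies, or intermediate artifacts.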
…
October 15, 2025 · 10 min · DevOps Engineer
🛠️ Guide
12 min
Kubernetes Troubleshooting: Pod Crashes, Networking, and Resources
Introduction
Kubernetes troubleshooting can be challenging due to its distributed nature and multiple abstraction layers. This guide covers the most common issues and systematic approaches to diagnosing and fixing them.
Pod Crash Loops
Understanding CrashLoopBackOff
What it means: The pod starts, crashes, restarts, and repeats in an exponential backoff pattern.
Diagnostic Process
Step 1: Check pod status

kubectl get pods -n production

# Output:
# NAME                    READY   STATUS             RESTARTS   AGE
# myapp-7d8f9c6b5-xyz12   0/1     CrashLoopBackOff   5          10m

Step 2: Describe the pod
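As a sketch of what this step looks like in practice (the pod name and namespace are taken from the output above; the exact flags you need depend on your setup):

# Check events, last state, and exit codes — look for OOMKilled, failed probes, or image pull errors
kubectl describe pod myapp-7d8f9c6b5-xyz12 -n production

# Logs from the previous, crashed container instance are usually the most useful
kubectl logs myapp-7d8f9c6b5-xyz12 -n production --previous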
…
October 15, 2025 · 12 min · DevOps Engineer
🚨 Incident
8 min
Incident: Kubernetes OOMKilled - Memory Leak in Production
Incident Summary
Date: 2025-09-28
Time: 14:30 UTC
Duration: 2 hours 15 minutes
Severity: SEV-2 (High)
Impact: Intermittent service degradation and elevated error rates
Quick Facts
Users Affected: ~30% of users experiencing slow responses
Services Affected: User API service
Error Rate: Spiked from 0.5% to 8%
SLO Impact: 25% of monthly error budget consumed
Timeline
14:30 - Prometheus alert: High pod restart rate detected
14:31 - On-call engineer (Dave) acknowledged, investigating
14:33 - Observed pattern: Pods restarting every 15-20 minutes
14:35 - Checked pod status: OOMKilled (exit code 137)
14:37 - Senior SRE (Emma) joined investigation
14:40 - Checked resource limits: 512MB memory limit per pod
14:42 - Reviewed recent deployments: New caching feature deployed yesterday
14:45 - Examined memory metrics: Linear growth from 100MB → 512MB over 15 min
14:50 - Hypothesis: Memory leak in new caching code
14:52 - Decision: Increase memory limit to 1GB as temporary mitigation
14:55 - Memory limit increased, pods restarted with new limits
15:00 - Pod restart frequency decreased (now every ~30 minutes)
15:05 - Confirmed leak still present, just slower with more memory
15:10 - Development team engaged to investigate caching code
15:25 - Memory leak identified: Event listeners not being removed
15:35 - Fix developed and tested locally
15:45 - Hotfix deployed to production
16:00 - Memory usage stabilized at ~180MB
16:15 - Monitoring shows no growth, pods stable
16:30 - Error rate returned to baseline
16:45 - Incident marked as resolved
Root Cause Analysis
What Happened
On September 27th, a new feature was deployed that implemented an in-memory cache with event-driven invalidation. The cache listened to database change events to invalidate cached entries.
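The listener-accumulation pattern behind this leak is worth illustrating. The sketch below is a hypothetical reconstruction, not the incident's actual code: it assumes a Node.js service where database change events arrive on an EventEmitter, and contrasts a cache that registers a new listener on every write with the fix of using a single shared listener.

import { EventEmitter } from "node:events";

// Stand-in for the stream of database change events (illustrative).
const dbEvents = new EventEmitter();

const cache = new Map<string, unknown>();

// Leaky pattern: every write registers a fresh listener that is never removed,
// so listeners (and the closures they capture) accumulate until the pod is OOMKilled.
function setLeaky(key: string, value: unknown): void {
  cache.set(key, value);
  dbEvents.on("change", (changedKey: string) => {
    if (changedKey === key) cache.delete(key);
  });
}

// Fix: one long-lived listener invalidates any changed key, so the listener
// count stays constant regardless of how many entries are cached.
dbEvents.on("change", (changedKey: string) => {
  cache.delete(changedKey);
});

function set(key: string, value: unknown): void {
  cache.set(key, value);
}

An alternative fix with per-entry listeners is to call dbEvents.removeListener when an entry is evicted; either way, the invariant to restore is a bounded listener count.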
…
September 28, 2025 · 8 min · DevOps Engineer