SRE Practice
18 min
Disaster Recovery Planning: RTO, RPO, and Building Resilient Systems
Introduction

Disaster Recovery (DR) is the set of processes, policies, and procedures for recovering and continuing technology infrastructure after a disaster. A disaster can be natural (earthquake, flood), technical (data center failure, ransomware), or human-caused (accidental deletion, security breach).
Core Principle: "Hope is not a strategy. Plan for failure before it happens."
Key Concepts

RTO vs RPO

Time ──────────────────────────────────────────────────────────►
      Disaster        Detection        Recovery         Normal
       Occurs           Time            Begins        Operations
          │◄──────────── Recovery Time Objective (RTO) ────────►│
 │◄──────►│
  Data Loss
 (Recovery Point Objective - RPO)

Recovery Time Objective (RTO)

Definition: Maximum acceptable time that a system can be down after a disaster.
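To make the RPO idea concrete, here is a minimal, hedged check of how much data would be lost if disaster struck right now, assuming a hypothetical nightly dump directory (the path and target value are illustrative, not from the article):

```bash
# How far behind is the newest recovery point? (path and target are hypothetical)
last_backup=$(ls -1t /backups/db/*.dump 2>/dev/null | head -n1)
if [ -n "$last_backup" ]; then
  age_min=$(( ( $(date +%s) - $(stat -c %Y "$last_backup") ) / 60 ))
  echo "Achieved RPO right now: ${age_min} minutes (compare against a target such as 240)"
else
  echo "No backups found - achieved RPO is effectively unbounded"
fi
```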
…
October 16, 2025 · 18 min · DevOps Engineer
SRE Practice
23 min
Chaos Engineering: Building Resilient Systems Through Controlled Experiments
Introduction

Chaos Engineering is the discipline of experimenting on a system to build confidence in the system's capability to withstand turbulent conditions in production. Rather than waiting for failures to happen, chaos engineering proactively injects failures to identify weaknesses before they impact users.
Why does this matter? In modern distributed systems (microservices, cloud infrastructure, containers), failures are inevitable. A network can partition, a server can crash, a database can slow down. Traditional testing can't predict all the ways these components interact when things go wrong. Chaos engineering fills this gap by deliberately causing failures in a controlled way.
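As a rough sketch of such a controlled failure injection (not from the article; the interface name and observation window are illustrative, and tc/netem must be available):

```bash
# Add 100ms +/- 20ms of latency on eth0, observe for 5 minutes, then roll back.
sudo tc qdisc add dev eth0 root netem delay 100ms 20ms
sleep 300                                 # watch dashboards and SLOs during the window
sudo tc qdisc del dev eth0 root netem     # always remove the injected fault
```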
…
October 16, 2025 · 23 min · DevOps Engineer
17 min
Linux Boot Flow & Debugging: From Firmware to systemd
Executive Summary

Linux boot is a multi-stage handoff: UEFI → Bootloader → Kernel → systemd → Targets → Units. Each stage has failure points. This guide shows the sequence, where failures occur, and how to capture logs.
Why understanding boot flow matters:
When a Linux server won't boot, you need to know WHICH stage failed to fix it effectively. A black screen could mean anything from bad hardware to a typo in /etc/fstab.
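A few commands that help pin down the failing stage on systemd-based distros (a hedged sketch, not the article's full procedure):

```bash
journalctl -b -1 -p err..alert    # errors and worse from the previous boot
systemctl --failed                # units that failed during the current boot
systemd-analyze blame | head      # slowest units, useful when boot hangs
findmnt --verify                  # validate /etc/fstab before the next reboot
```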
…
October 16, 2025 · 17 min · DevOps Engineer
32 min
Linux Core Subsystems: One-Page Reference Map
Overview

This is a one-page cheat sheet for Linux kernel subsystems. Each subsystem controls a critical resource; understanding them is essential for troubleshooting, optimization, and security.
Why understanding subsystems matters:
Imagine your server is slow. Without subsystem knowledge, you're guessing:
"Maybe add more RAM?" (might be CPU scheduler issue)
"Maybe faster disk?" (might be memory cache problem)
"Maybe more CPU?" (might be I/O scheduler misconfiguration)

With subsystem knowledge, you diagnose systematically:
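The article's own flow is elided here; as a rough illustration of the kind of first-pass readings it refers to (not the article's flow; iostat assumes the sysstat package is installed):

```bash
uptime          # load average vs. CPU count (scheduler pressure)
vmstat 1 5      # run/blocked queues, swap-in/out, I/O wait
free -h         # available memory vs. page cache
iostat -x 1 3   # per-device utilization and await (from sysstat)
ss -s           # socket summary for the network subsystem
```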
…
October 16, 2025 · 32 min · DevOps Engineer
15 min
Linux Networking: systemd-networkd, IPv6, nftables, and Load Balancer Configuration
Executive Summary

Networking baseline = reliable, secure, predictable connectivity with proper tuning for your infrastructure.
Why networking configuration matters:
Most production outages trace back to network issues: misconfigured firewall blocking traffic, exhausted connection tables, or timeouts set too aggressively. Proper networking prevents these disasters.
Real-world disasters prevented by good networking:
1. Firewall accidentally blocks production traffic:
Problem: Engineer adds an SSH rule and accidentally sets the default policy to "drop all"
Result: Website goes down, and SSH is blocked too (can't fix it remotely)
Prevention: Test firewall rules with policy "accept" first, then switch to "drop" (see the sketch below)

2. Connection tracking table exhausted:
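Returning to the firewall example above, a minimal nftables sketch of the "accept first, verify, then drop" approach (the ruleset and port are illustrative, not the article's configuration):

```bash
sudo tee /etc/nftables.conf >/dev/null <<'EOF'
flush ruleset
table inet filter {
  chain input {
    type filter hook input priority 0; policy accept;   # permissive while testing
    ct state established,related accept
    iif "lo" accept
    tcp dport 22 accept                                  # keep SSH reachable
    # after verifying SSH from a second session, change the policy above to "drop"
  }
}
EOF
sudo nft -c -f /etc/nftables.conf   # syntax check only
sudo nft -f /etc/nftables.conf      # apply
```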
…
October 16, 2025 · 15 min · DevOps Engineer
11 min
Linux Observability: Metrics, Logs, eBPF Tools, and 5-Minute Triage
Executive Summary

Observability = see inside your systems: metrics (CPU, memory, I/O), logs (audit trail), traces (syscalls, latency).
This guide covers:
Metrics: node_exporter → Prometheus (system-level health)
Logs: journald → rsyslog/Vector/Fluent Bit (aggregation)
eBPF tools: 5 quick wins (trace syscalls, network, I/O)
Triage: 5-minute flowchart to diagnose CPU, memory, I/O, network issues

1. Metrics: node_exporter & Prometheus

What It Is

node_exporter: Exposes OS metrics (CPU, memory, disk, network) as a Prometheus scrape target
Prometheus: Time-series database; collects metrics, queries, alerts
Dashboard: Grafana visualizes Prometheus data

Install node_exporter

Ubuntu/Debian:
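The article's install steps are elided; one common route on Ubuntu/Debian is the distribution package (a sketch only, the article may install the upstream release tarball instead):

```bash
sudo apt install -y prometheus-node-exporter
curl -s http://localhost:9100/metrics | grep '^node_load1'   # confirm metrics are exposed
```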
…
October 16, 2025 · 11 min · DevOps Engineer
13 min
Linux Performance Baseline: sysctl, ulimits, CPU Governor, and NUMA
Executive Summary

Performance baseline = safe defaults that work for most workloads, with clear tuning for specific scenarios.
This guide covers:
sysctl: Kernel parameters (network, filesystem, VM) with production-safe values
ulimits: Resource limits (open files, processes, memory locks)
CPU Governor: Frequency scaling & power management on servers
NUMA: Awareness for multi-socket systems (big apps, databases)
I/O Scheduler: NVMe/SSD vs. spinning disk tuning

1. sysctl Kernel Parameters

Why sysctl Matters

Problem: Default kernel parameters are conservative (fit laptops, embedded systems)
Solution: Tune for your workload (databases, web servers, HPC)
Trade-off: More throughput vs. latency / memory vs. stability
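A hedged illustration of such tuning via a sysctl drop-in (the keys are commonly tuned ones, but the values here are examples rather than the article's recommendations):

```bash
sudo tee /etc/sysctl.d/99-baseline.conf >/dev/null <<'EOF'
# larger accept backlog for busy servers
net.core.somaxconn = 1024
net.ipv4.tcp_max_syn_backlog = 4096
# prefer reclaiming page cache over swapping anonymous memory
vm.swappiness = 10
fs.file-max = 2097152
EOF
sudo sysctl --system    # apply every drop-in under /etc/sysctl.d
```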
…
October 16, 2025 · 13 min · DevOps Engineer
9 min
Linux Production Guide: Kernel Subsystems, Systemd, and Best Practices
Executive Summary

Linux is a layered system: from firmware through kernel subsystems to containerized applications. Understanding these layers, and their interdependencies, is critical for reliable, secure, performant infrastructure.
This guide covers:
Layered architecture (firmware → kernel → userspace → containers)
Core subsystems: process scheduling, memory, filesystems, networking
systemd: unit management and service lifecycle
Production best practices: security, reliability, performance, observability

Note: For detailed boot flow and debugging, see the Linux Boot Flow & Debugging guide.
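As a small illustration of the systemd and security items above, a common hardening drop-in for a service (myapp.service is a hypothetical unit; the directives are widely used baseline options, not the article's list):

```bash
sudo mkdir -p /etc/systemd/system/myapp.service.d
sudo tee /etc/systemd/system/myapp.service.d/hardening.conf >/dev/null <<'EOF'
[Service]
NoNewPrivileges=yes
ProtectSystem=strict
ProtectHome=yes
PrivateTmp=yes
EOF
sudo systemctl daemon-reload
sudo systemctl restart myapp.service
```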
…
October 16, 2025 · 9 min · DevOps Engineer
12 min
Linux Reliability & Lifecycle: Time Sync, Logging, Shutdown, and Patching
Executive Summary

Reliability means predictable, auditable behavior. This guide covers:
Time sync: Chrony for clock accuracy (critical for logging, security)
Networking: Stable interface names & hostnames (infrastructure consistency)
Logging: Persistent journald + logrotate (audit trail + disk management)
Shutdown: Clean hooks to prevent data loss
Patching: Kernel updates with livepatch (zero-downtime), tested rollback

1. Time Synchronization (chrony)

Why Time Matters

Critical for:
Logging: Accurate timestamps for debugging, compliance audits
Security: TLS cert validation, Kerberos, API token expiry
Distributed systems: Causality ordering (happens-before relationships)
Monitoring: Alert timing, metric correlation

Cost of poor time sync:
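Before weighing those costs, a quick, hedged way to confirm a host is actually in sync (assumes chronyd is the active time daemon):

```bash
chronyc tracking      # current offset, stratum, and frequency skew
chronyc sources -v    # reachability and selection state of each source
timedatectl           # "System clock synchronized: yes" is the goal
```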
…
October 16, 2025 · 12 min · DevOps Engineer
12 min
Linux Security Baseline for Production Servers
Executive Summary

A security baseline is the foundation: OS-hardened, patched, with restricted access and audit trails. This guide covers minimal-install servers with hardened SSH, firewall (default-deny), LSM enforcement, least-privilege sudo, audit logging, and systemd hardening.
Goal: Reduce attack surface, detect breaches, and enforce privilege boundaries.
1. Minimal Install & Patching

Minimal Install

What it is:
Install only required packages (base + SSH + monitoring agent)
No GUI, X11, or unnecessary daemons
Reduces vulnerabilities (fewer packages = fewer CVEs)

Install steps (Ubuntu/Debian):
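The distribution-specific steps are elided here; as a hedged complement, a quick audit of what a host already exposes:

```bash
ss -tulnp                                   # every listening TCP/UDP socket
systemctl list-unit-files --state=enabled   # daemons that start on boot
apt list --installed 2>/dev/null | wc -l    # rough package count to trim
```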
…
October 16, 2025 · 12 min · DevOps Engineer
18 min
Linux Storage: Partitions, LVM, ZFS/Btrfs, Filesystems, Snapshots, and Backups
Executive Summary

Storage strategy = reliable data access with recovery guarantees. Choose based on workload (traditional vs. modern, single vs. multi-server).
Why storage matters: Most production disasters involve storage: a disk fills up and crashes your application, database corruption loses customer data, or a backup fails during a restore attempt. Proper storage management prevents these scenarios.
Real-world disasters prevented by good storage:
Disk full at 3 AM: Application logs fill /var → server crashes → customers can't access the site. Solution: Separate /var/log partition with monitoring.
Failed database restore: Backup runs nightly for 2 years, never tested. Server dies, restore fails (corrupt backup). Solution: Monthly test restores.
Database grows unexpectedly: 50GB database grows to 500GB in 3 months. Traditional partitions = downtime to resize. Solution: LVM allows online growth (see the sketch below).
Ransomware encrypts production data: No snapshots available, last backup is 24 hours old. Solution: ZFS/Btrfs snapshots provide instant point-in-time recovery.

This guide covers:
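For the LVM item above, a minimal sketch of growing a logical volume online (the volume group, size, and mount point are hypothetical):

```bash
sudo lvextend -r -L +50G /dev/vg0/data   # -r also grows the filesystem on the LV
df -h /srv/data                          # confirm the mount now has headroom
```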
…
October 16, 2025 · 18 min · DevOps Engineer
SRE Practice
13 min
On-Call Runbook: Template and Best Practices
Introduction

An on-call runbook is a documented set of procedures and information that helps engineers respond to incidents effectively. A good runbook reduces Mean Time To Resolution (MTTR), decreases stress, and enables any team member to handle incidents confidently.
Why Runbooks Matter

Without Runbooks
Every incident requires figuring things out from scratch
Tribal knowledge lost when team members leave
New team members struggle with on-call
Inconsistent incident response
Higher MTTR and more user impact

With Runbooks
Standardized, tested response procedures
Knowledge preserved and shared
Faster onboarding for new team members
Consistent, reliable incident response
Lower MTTR and reduced stress

Runbook Structure

Essential Sections
Service Overview - What the service does
Architecture - Key components and dependencies
Common Alerts - What triggers pages and how to respond
Troubleshooting Guide - Diagnostic steps and solutions
Escalation Procedures - When and how to escalate
Emergency Contacts - Who to reach for help
Rollback Procedures - How to revert changes
Useful Commands - Quick reference for common tasks

Complete Runbook Template

# On-Call Runbook: [Service Name]

**Last Updated:** YYYY-MM-DD
**Maintained By:** [Team Name]
**On-Call Schedule:** [Link to PagerDuty/Opsgenie]

---

## Service Overview

### Purpose

[What does this service do? Why does it matter?]

Example: "The Payment Service processes all customer payments including credit cards, PayPal, and gift cards. It handles ~500 transactions/minute and is critical for revenue generation. Any downtime directly impacts sales."

### Key Metrics

- **Traffic:** [Requests per minute/hour]
- **Latency:** [P50, P95, P99 response times]
- **Error Rate:** [Typical error percentage]
- **SLO:** [Availability and performance targets]

### Business Impact

- **High Impact Hours:** [Peak times when incidents matter most]
- **Estimated Revenue Impact:** [Cost per minute of downtime]
- **Affected Users:** [Number/type of users impacted by outage]

---

## Architecture

### System Components

┌─────────────┐      ┌─────────────────┐      ┌─────────────┐
│   API GW    │─────►│ Payment Service │─────►│  Database   │
└─────────────┘      └─────────────────┘      └─────────────┘
                              │
                              ├─────► Stripe API
                              ├─────► PayPal API
                              └─────► Fraud Detection
…
October 15, 2025 · 13 min · DevOps Engineer