📋 SRE Practice
13 min
Observability: The Three Pillars of Metrics, Logs, and Traces
Introduction
Observability is the ability to understand the internal state of a system based on its external outputs. Unlike traditional monitoring, which tells you what is broken, observability helps you understand why it's broken, even for issues you've never encountered before.
Core Principle: "You can't fix what you can't see. You can't see what you don't measure."
The Three Pillars Overview

┌───────────────────────────────────────────┐
│               OBSERVABILITY               │
├──────────────┬───────────────┬────────────┤
│   METRICS    │     LOGS      │   TRACES   │
├──────────────┼───────────────┼────────────┤
│  What/When   │  Why/Details  │   Where    │
│  Aggregated  │  Individual   │   Causal   │
│  Time-series │  Events       │   Flows    │
│  Dashboards  │  Search       │  Waterfall │
└──────────────┴───────────────┴────────────┘

When to Use Each:
…
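A minimal sketch of what emitting all three signals from a single request handler might look like, assuming a Python service using prometheus_client for metrics, the standard logging module for logs, and the OpenTelemetry API for traces (the post does not name these libraries; they stand in here as common choices):

```python
import logging
import time

from prometheus_client import Counter, Histogram  # metrics pillar
from opentelemetry import trace                    # traces pillar

logger = logging.getLogger("checkout")             # logs pillar
tracer = trace.get_tracer(__name__)

REQUESTS = Counter("checkout_requests_total", "Checkout requests", ["status"])
LATENCY = Histogram("checkout_latency_seconds", "Checkout latency")

def handle_checkout(order_id: str) -> None:
    # Trace: records *where* time is spent across the call graph.
    with tracer.start_as_current_span("handle_checkout"):
        start = time.monotonic()
        try:
            # ... business logic would go here ...
            REQUESTS.labels(status="ok").inc()  # metric: aggregated what/when
            logger.info("checkout succeeded", extra={"order_id": order_id})  # log: individual why/details
        except Exception:
            REQUESTS.labels(status="error").inc()
            logger.exception("checkout failed", extra={"order_id": order_id})
            raise
        finally:
            LATENCY.observe(time.monotonic() - start)
```

One request thus produces an aggregated counter for dashboards, a searchable log event with context, and a span that slots into a waterfall view, matching the table above.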
October 16, 2025 · 13 min · DevOps Engineer
🛠️ Guide
21 min
Neo4j End-to-End Guide: Deployment, Operations & Best Practices
Executive Summary
Neo4j is a native graph database that stores data as nodes (entities) connected by relationships (edges). Unlike relational databases that normalize data into tables, Neo4j excels at traversing relationships.
Quick decision:
Use Neo4j for: Knowledge graphs, authorization/identity, recommendations, fraud detection, network topology, impact analysis
Don't use for: Heavy OLAP analytics, simple key-value workloads, document storage
Production deployment: Kubernetes + Helm (managed) or Docker Compose + Causal Cluster (self-managed)
…
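As a rough illustration of the node-and-relationship model the summary describes, here is a sketch using the official Python driver (neo4j); the URI, credentials, and the Person/KNOWS schema are placeholders, not taken from the guide:

```python
from neo4j import GraphDatabase

# Placeholder connection details; a real deployment would pull these from config/secrets.
driver = GraphDatabase.driver("neo4j://localhost:7687", auth=("neo4j", "password"))

def friends_of_friends(name: str) -> list[str]:
    # Relationship traversal is where a graph database shines: the equivalent SQL
    # needs self-joins that get slower as the traversal depth grows.
    query = (
        "MATCH (p:Person {name: $name})-[:KNOWS]->()-[:KNOWS]->(fof) "
        "WHERE fof.name <> $name "
        "RETURN DISTINCT fof.name AS name"
    )
    with driver.session() as session:
        return [record["name"] for record in session.run(query, name=name)]

print(friends_of_friends("Alice"))
driver.close()
```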
October 16, 2025 · 21 min · DevOps Engineer
🛠️ Guide
15 min
Prometheus Query Optimization: PromQL Tips, Recording Rules, and Performance
Introduction
Prometheus queries can become slow and resource-intensive as your metrics scale. This guide covers PromQL optimization techniques, recording rules, and performance best practices to keep your monitoring fast and efficient.
PromQL Optimization
Understanding Query Performance
Factors affecting query performance:

Number of time series matched
Time range queried
Query complexity
Cardinality of labels
Rate of data ingestion

Check query stats:

# Grafana: Enable query inspector
# Shows: query time, series count, samples processed

1. Limit Time Series Selection
Bad (matches too many series):
…
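To make the "limit time series selection" point concrete, a small sketch that compares a broad selector against a narrowly labelled one through the Prometheus HTTP API; the server URL, metric name, and labels are illustrative, and the requests library is an assumption:

```python
import requests

PROM_URL = "http://localhost:9090"  # placeholder Prometheus address

# Broad: matches every series of the metric, so Prometheus scans far more samples.
broad = 'rate(http_requests_total[5m])'

# Narrower: label matchers cut down the set of series the engine has to touch.
narrow = 'rate(http_requests_total{job="api", env="prod"}[5m])'

def instant_query(expr: str) -> dict:
    resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": expr}, timeout=10)
    resp.raise_for_status()
    return resp.json()["data"]

for expr in (broad, narrow):
    data = instant_query(expr)
    print(expr, "->", len(data["result"]), "matched series")
```

Fewer matched series means fewer samples scanned per evaluation, which is the same effect recording rules exploit by precomputing the expensive part ahead of time.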
October 15, 2025 · 15 min · DevOps Engineer
📋 SRE Practice
7 min
Understanding SLOs, SLIs, and SLAs: A Practical Guide
Introduction
Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Service Level Agreements (SLAs) are fundamental concepts in Site Reliability Engineering. Understanding and implementing them correctly is crucial for maintaining reliable services.
Core Concepts
SLI (Service Level Indicator)
Definition: A quantitative measure of service reliability from the user's perspective.
Common SLIs:
Availability: Percentage of successful requests
Latency: Proportion of requests served faster than threshold
Throughput: Requests processed per second
Error Rate: Percentage of failed requests

Example SLI Definitions:
…
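A small worked sketch of the availability SLI and the error-budget arithmetic it feeds into; the request counts and the 99.9% target are made-up numbers, not figures from the guide:

```python
# Availability SLI = successful requests / total requests, over a rolling window.
total_requests = 1_000_000
failed_requests = 800

sli = (total_requests - failed_requests) / total_requests  # 0.9992
slo_target = 0.999                                         # 99.9% objective

# Error budget: the fraction of requests the SLO allows to fail.
error_budget = 1 - slo_target                              # 0.001 -> 1,000 requests
budget_consumed = failed_requests / (total_requests * error_budget)

print(f"SLI: {sli:.4%}")                            # SLI: 99.9200%
print(f"Error budget used: {budget_consumed:.0%}")  # Error budget used: 80%
```

This is the same accounting behind the "SLO Impact: X% of monthly error budget consumed" lines in the incident reports below.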
October 15, 2025 · 7 min · DevOps Engineer
🚨 Incident
8 min
Incident: Kubernetes OOMKilled - Memory Leak in Production
Incident Summary
Date: 2025-09-28
Time: 14:30 UTC
Duration: 2 hours 15 minutes
Severity: SEV-2 (High)
Impact: Intermittent service degradation and elevated error rates
Quick Facts
Users Affected: ~30% of users experiencing slow responses
Services Affected: User API service
Error Rate: Spiked from 0.5% to 8%
SLO Impact: 25% of monthly error budget consumed

Timeline
14:30 - Prometheus alert: High pod restart rate detected
14:31 - On-call engineer (Dave) acknowledged, investigating
14:33 - Observed pattern: Pods restarting every 15-20 minutes
14:35 - Checked pod status: OOMKilled (exit code 137)
14:37 - Senior SRE (Emma) joined investigation
14:40 - Checked resource limits: 512MB memory limit per pod
14:42 - Reviewed recent deployments: New caching feature deployed yesterday
14:45 - Examined memory metrics: Linear growth from 100MB → 512MB over 15 min
14:50 - Hypothesis: Memory leak in new caching code
14:52 - Decision: Increase memory limit to 1GB as temporary mitigation
14:55 - Memory limit increased, pods restarted with new limits
15:00 - Pod restart frequency decreased (now every ~30 minutes)
15:05 - Confirmed leak still present, just slower with more memory
15:10 - Development team engaged to investigate caching code
15:25 - Memory leak identified: Event listeners not being removed
15:35 - Fix developed and tested locally
15:45 - Hotfix deployed to production
16:00 - Memory usage stabilized at ~180MB
16:15 - Monitoring shows no growth, pods stable
16:30 - Error rate returned to baseline
16:45 - Incident marked as resolved

Root Cause Analysis
What Happened
On September 27th, a new feature was deployed that implemented an in-memory cache with event-driven invalidation. The cache listened to database change events to invalidate cached entries.
…
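The root cause (event listeners registered by the cache but never removed) is easy to reproduce. A hypothetical Python sketch of the leak pattern and of the kind of fix the hotfix applied, not the actual service code:

```python
class EventBus:
    def __init__(self):
        self._listeners = []

    def subscribe(self, callback):
        self._listeners.append(callback)

    def unsubscribe(self, callback):
        self._listeners.remove(callback)

    def publish(self, event):
        for callback in list(self._listeners):
            callback(event)

bus = EventBus()

class LeakyCache:
    """Subscribes on creation but never unsubscribes: the bus keeps the bound
    method (and therefore the whole cache and its data) alive forever, so
    memory grows linearly, exactly the 100MB -> 512MB pattern in the timeline."""
    def __init__(self):
        self._data = {}
        bus.subscribe(self._invalidate)

    def _invalidate(self, event):
        self._data.pop(event.get("key"), None)

class FixedCache(LeakyCache):
    """Mirrors the fix: explicitly detach the listener when the cache is torn down."""
    def close(self):
        bus.unsubscribe(self._invalidate)
```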
September 28, 2025 · 8 min · DevOps Engineer
🚨 Incident
12 min
Incident: Disk Space Exhaustion Causes Node Failures
Incident Summary
Date: 2025-07-22
Time: 11:20 UTC
Duration: 3 hours 45 minutes
Severity: SEV-2 (High)
Impact: Progressive service degradation with intermittent failures
Quick Facts
Users Affected: ~40% experiencing intermittent errors
Services Affected: Multiple microservices across 3 Kubernetes nodes
Nodes Failed: 3 out of 8 worker nodes
Pods Evicted: 47 pods due to disk pressure
SLO Impact: 35% of monthly error budget consumed

Timeline
11:20:00 - Prometheus alert: Node disk usage >85% on node-worker-3
11:22:00 - On-call engineer (Tom) acknowledged alert
11:25:00 - Checked node: 92% disk usage, mostly logs
11:28:00 - Second alert: node-worker-5 also >85%
11:30:00 - Third alert: node-worker-7 >85%
11:32:00 - Senior SRE (Rachel) joined investigation
11:35:00 - Pattern identified: All nodes running logging-agent pod
11:38:00 - First node reached 98% disk usage
11:40:00 - Kubelet started evicting pods due to disk pressure
11:42:00 - 12 pods evicted from node-worker-3
11:45:00 - User reports: Intermittent 503 errors
11:47:00 - Incident escalated to SEV-2
11:50:00 - Identified root cause: Log rotation not working for logging-agent
11:52:00 - Emergency: Manual log cleanup on affected nodes
11:58:00 - First node cleaned: 92% → 45% disk usage
12:05:00 - Second node cleaned: 88% → 40% disk usage
12:10:00 - Third node cleaned: 95% → 42% disk usage
12:15:00 - All evicted pods rescheduled and running
12:30:00 - Deployed fix for log rotation issue
12:45:00 - Monitoring shows disk usage stabilizing
13:00:00 - Implemented automated log cleanup job
13:30:00 - Added improved monitoring and alerts
14:15:00 - Verified all nodes healthy, services normal
15:05:00 - Incident marked as resolved

Root Cause Analysis
What Happened
A logging agent (Fluentd) was deployed on all Kubernetes nodes to collect and forward logs to Elasticsearch. Due to a configuration error, log rotation was not working properly, causing log files to grow indefinitely.
…
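The "automated log cleanup job" from the timeline could look roughly like the sketch below, using only the Python standard library; the log path, age threshold, and disk-usage trigger are assumed values, not the team's actual configuration:

```python
import shutil
import time
from pathlib import Path

LOG_DIR = Path("/var/log/containers")  # assumed log location
MAX_AGE_SECONDS = 3 * 24 * 3600        # delete rotated logs older than 3 days
DISK_USAGE_TRIGGER = 0.80              # only clean when the disk is more than 80% full

def disk_usage_fraction(path: Path) -> float:
    usage = shutil.disk_usage(path)
    return usage.used / usage.total

def cleanup_old_logs() -> None:
    if disk_usage_fraction(LOG_DIR) < DISK_USAGE_TRIGGER:
        return  # plenty of headroom; nothing to do
    cutoff = time.time() - MAX_AGE_SECONDS
    for log_file in LOG_DIR.glob("*.log.*"):  # rotated files only; keep the live *.log
        if log_file.stat().st_mtime < cutoff:
            print(f"removing {log_file} ({log_file.stat().st_size} bytes)")
            log_file.unlink()

if __name__ == "__main__":
    cleanup_old_logs()
```

As the timeline notes, the durable fix was correcting log rotation in the agent itself; a cleanup job like this acts only as a safety net against the same failure recurring.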
July 22, 2025 · 12 min · DevOps Engineer