SRE Practice
13 min
Observability: The Three Pillars of Metrics, Logs, and Traces
Introduction
Observability is the ability to understand the internal state of a system based on its external outputs. Unlike traditional monitoring, which tells you what is broken, observability helps you understand why it's broken, even for issues you've never encountered before.
Core Principle: "You can't fix what you can't see. You can't see what you don't measure."
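Each of the three pillars summarized in the table below maps to a different instrumentation call in application code. A minimal sketch, assuming prometheus_client and opentelemetry-api are installed and a tracer exporter is configured elsewhere; the handler, metric, and label names are illustrative, not taken from the article:

```python
import logging
import time

from prometheus_client import Counter, Histogram
from opentelemetry import trace

# Metrics: aggregated time-series for dashboards and alerts
REQUESTS = Counter("http_requests_total", "Total HTTP requests", ["path", "status"])
LATENCY = Histogram("http_request_duration_seconds", "Request latency in seconds", ["path"])

# Logs: individual events with enough context to answer "why"
logger = logging.getLogger("checkout-service")

# Traces: causal flow of a single request across components
tracer = trace.get_tracer(__name__)

def handle_checkout(order_id: str) -> None:
    start = time.perf_counter()
    with tracer.start_as_current_span("handle_checkout") as span:
        span.set_attribute("order.id", order_id)
        try:
            # ... business logic would run here ...
            REQUESTS.labels(path="/checkout", status="200").inc()
        except Exception:
            REQUESTS.labels(path="/checkout", status="500").inc()
            logger.exception("checkout failed", extra={"order_id": order_id})
            raise
        finally:
            LATENCY.labels(path="/checkout").observe(time.perf_counter() - start)
```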
The Three Pillars Overview

┌─────────────────────────────────────────┐
│              OBSERVABILITY              │
├─────────────┬───────────────┬───────────┤
│   METRICS   │     LOGS      │  TRACES   │
├─────────────┼───────────────┼───────────┤
│ What/When   │ Why/Details   │ Where     │
│ Aggregated  │ Individual    │ Causal    │
│ Time-series │ Events        │ Flows     │
│ Dashboards  │ Search        │ Waterfall │
└─────────────┴───────────────┴───────────┘

When to Use Each:
…
October 16, 2025 · 13 min · DevOps Engineer
Guide
15 min
Prometheus Query Optimization: PromQL Tips, Recording Rules, and Performance
Introduction
Prometheus queries can become slow and resource-intensive as your metrics scale. This guide covers PromQL optimization techniques, recording rules, and performance best practices to keep your monitoring fast and efficient.
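A quick way to gauge how expensive a query is before optimizing it is to run it against the standard /api/v1/query endpoint and compare series counts and wall-clock time. A minimal sketch, assuming the requests library and a Prometheus server at a placeholder URL; the example metric and labels are illustrative:

```python
import time
import requests

PROMETHEUS_URL = "http://localhost:9090"  # placeholder; point at your server

def query_cost(promql: str) -> None:
    """Run an instant query and print rough cost indicators."""
    start = time.perf_counter()
    resp = requests.get(
        f"{PROMETHEUS_URL}/api/v1/query",
        params={"query": promql},
        timeout=30,
    )
    resp.raise_for_status()
    elapsed = time.perf_counter() - start

    result = resp.json()["data"]["result"]
    print(f"{promql!r}: {len(result)} series returned in {elapsed:.3f}s")

# Broad selector vs. one narrowed by labels (metric/label names are illustrative)
query_cost('rate(http_requests_total[5m])')
query_cost('rate(http_requests_total{job="api", status="500"}[5m])')
```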
PromQL Optimization
Understanding Query Performance
Factors affecting query performance:
- Number of time series matched
- Time range queried
- Query complexity
- Cardinality of labels
- Rate of data ingestion

Check query stats:
# Grafana: Enable query inspector
# Shows: Query time, series count, samples processed

1. Limit Time Series Selection
Bad (matches too many series):
…
October 15, 2025 · 15 min · DevOps Engineer
Incident
12 min
Incident: Disk Space Exhaustion Causes Node Failures
Incident Summary
Date: 2025-07-22
Time: 11:20 UTC
Duration: 3 hours 45 minutes
Severity: SEV-2 (High)
Impact: Progressive service degradation with intermittent failures
Quick Facts
- Users Affected: ~40% experiencing intermittent errors
- Services Affected: Multiple microservices across 3 Kubernetes nodes
- Nodes Failed: 3 out of 8 worker nodes
- Pods Evicted: 47 pods due to disk pressure
- SLO Impact: 35% of monthly error budget consumed

Timeline
11:20:00 - Prometheus alert: Node disk usage >85% on node-worker-3
11:22:00 - On-call engineer (Tom) acknowledged alert
11:25:00 - Checked node: 92% disk usage, mostly logs
11:28:00 - Second alert: node-worker-5 also >85%
11:30:00 - Third alert: node-worker-7 >85%
11:32:00 - Senior SRE (Rachel) joined investigation
11:35:00 - Pattern identified: All nodes running logging-agent pod
11:38:00 - First node reached 98% disk usage
11:40:00 - Kubelet started evicting pods due to disk pressure
11:42:00 - 12 pods evicted from node-worker-3
11:45:00 - User reports: Intermittent 503 errors
11:47:00 - Incident escalated to SEV-2
11:50:00 - Identified root cause: Log rotation not working for logging-agent
11:52:00 - Emergency: Manual log cleanup on affected nodes
11:58:00 - First node cleaned: 92% → 45% disk usage
12:05:00 - Second node cleaned: 88% → 40% disk usage
12:10:00 - Third node cleaned: 95% → 42% disk usage
12:15:00 - All evicted pods rescheduled and running
12:30:00 - Deployed fix for log rotation issue
12:45:00 - Monitoring shows disk usage stabilizing
13:00:00 - Implemented automated log cleanup job
13:30:00 - Added improved monitoring and alerts
14:15:00 - Verified all nodes healthy, services normal
15:05:00 - Incident marked as resolved

Root Cause Analysis
What Happened
A logging agent (Fluentd) was deployed on all Kubernetes nodes to collect and forward logs to Elasticsearch. Due to a configuration error, log rotation was not working properly, causing log files to grow indefinitely.
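The timeline mentions an automated log cleanup job as one of the remediations. A minimal sketch of what such a job could look like, assuming rotated logs live under a directory like /var/log/containers and that a 90% usage threshold with 3-day retention is acceptable; the paths and thresholds are illustrative, not taken from the incident report:

```python
import shutil
import time
from pathlib import Path

LOG_DIR = Path("/var/log/containers")   # illustrative path
USAGE_THRESHOLD = 0.90                  # start cleaning above 90% disk usage
MAX_AGE_SECONDS = 3 * 24 * 3600         # drop rotated logs older than 3 days

def disk_usage_ratio(path: Path) -> float:
    """Fraction of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total

def cleanup_old_logs() -> None:
    if disk_usage_ratio(LOG_DIR) < USAGE_THRESHOLD:
        return  # plenty of headroom, nothing to do

    now = time.time()
    # Only touch rotated/compressed files, never the active log
    for log_file in LOG_DIR.glob("*.log.*"):
        if now - log_file.stat().st_mtime > MAX_AGE_SECONDS:
            print(f"removing {log_file}")
            log_file.unlink(missing_ok=True)

if __name__ == "__main__":
    cleanup_old_logs()
```

Run as a CronJob or systemd timer on each node, this acts as a safety net; it does not replace fixing the logging agent's own rotation config.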
…
July 22, 2025 · 12 min · DevOps Engineer