SRE Practice
13 min
Observability: The Three Pillars of Metrics, Logs, and Traces
Introduction
Observability is the ability to understand the internal state of a system from its external outputs. Traditional monitoring tells you what is broken; observability helps you understand why it’s broken, even for issues you’ve never encountered before.
Core Principle: “You can’t fix what you can’t see. You can’t see what you don’t measure.”
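As a rough illustration of how metrics, logs, and traces differ, here is a minimal sketch that emits all three signals for one simulated request using only the Python standard library. The service name, the `handle_request` function, and the in-memory `request_count` and `spans` stores are hypothetical stand-ins; a real system would export these to a metrics backend, a log pipeline, and a tracing backend.

```python
import json
import logging
import time
import uuid
from collections import Counter

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("demo-service")  # hypothetical service name

request_count = Counter()  # metric: aggregated counts per (route, status)
spans = []                 # trace: timed spans that share a trace_id


def handle_request(route, fail=False):
    """Handle one simulated request and emit a metric, a log line, and a span."""
    trace_id = uuid.uuid4().hex
    start = time.perf_counter()
    status = 500 if fail else 200  # stand-in for real work
    duration_ms = (time.perf_counter() - start) * 1000

    # Metric: what happened and how often (aggregated, cheap to store).
    request_count[(route, status)] += 1

    # Log: the details of this one event, searchable by trace_id.
    log.info(json.dumps({"trace_id": trace_id, "route": route, "status": status}))

    # Trace: where the time went for this request.
    spans.append({"trace_id": trace_id, "name": route, "duration_ms": duration_ms})
    return trace_id
```

The shared `trace_id` is what lets you jump from a spike in the counter to the matching log lines and spans.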
The Three Pillars Overview
┌─────────────────────────────────────────┐
│              OBSERVABILITY              │
├─────────────┬──────────────┬────────────┤
│   METRICS   │     LOGS     │   TRACES   │
├─────────────┼──────────────┼────────────┤
│ What/When   │ Why/Details  │ Where      │
│ Aggregated  │ Individual   │ Causal     │
│ Time-series │ Events       │ Flows      │
│ Dashboards  │ Search       │ Waterfall  │
└─────────────┴──────────────┴────────────┘
When to Use Each:
…
October 16, 2025 · 13 min · DevOps Engineer
11 min
Linux Observability: Metrics, Logs, eBPF Tools, and 5-Minute Triage
Executive Summary
Observability = see inside your systems: metrics (CPU, memory, I/O), logs (audit trail), traces (syscalls, latency).
This guide covers:
Metrics: node_exporter → Prometheus (system-level health)
Logs: journald → rsyslog/Vector/Fluent Bit (aggregation)
eBPF tools: 5 quick wins (trace syscalls, network, I/O)
Triage: 5-minute flowchart to diagnose CPU, memory, I/O, network issues

1. Metrics: node_exporter & Prometheus
What It Is
node_exporter: Exposes OS metrics (CPU, memory, disk, network) as Prometheus scrape target
Prometheus: Time-series database; collects metrics, queries, alerts
Dashboard: Grafana visualizes Prometheus data

Install node_exporter
Ubuntu/Debian:
…
October 16, 2025 · 11 min · DevOps Engineer
Guide
15 min
Prometheus Query Optimization: PromQL Tips, Recording Rules, and Performance
Introduction
Prometheus queries can become slow and resource-intensive as your metrics scale. This guide covers PromQL optimization techniques, recording rules, and performance best practices to keep your monitoring fast and efficient.
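To make "slow as your metrics scale" concrete, here is a back-of-envelope sketch of how many raw samples a range selector has to read. It assumes evenly scraped series and a 15-second scrape interval; both are illustrative assumptions, and `estimate_samples` is a hypothetical helper, not part of Prometheus.

```python
def estimate_samples(series: int, range_seconds: int, scrape_interval_seconds: int = 15) -> int:
    """Rough number of raw samples read: series x (range / scrape interval).

    Assumes every matched series was scraped at a fixed interval for the
    whole range. Real counts vary with gaps, staleness, and churn.
    """
    if series < 0 or range_seconds < 0:
        raise ValueError("series and range_seconds must be non-negative")
    if scrape_interval_seconds <= 0:
        raise ValueError("scrape_interval_seconds must be positive")
    return series * (range_seconds // scrape_interval_seconds)
```

Under these assumptions, a 1-hour range over 10 series reads about 2,400 samples, while the same range over 1,000 series reads about 240,000, so narrowing the selector pays off linearly.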
PromQL Optimization
Understanding Query Performance
Factors affecting query performance:

Number of time series matched
Time range queried
Query complexity
Cardinality of labels
Rate of data ingestion

Check query stats:

# Grafana: Enable query inspector
# Shows: Query time, series count, samples processed

1. Limit Time Series Selection
Bad (matches too many series):
…
October 15, 2025 · 15 min · DevOps Engineer