Introduction

Observability is the ability to understand the internal state of a system based on its external outputs. Unlike traditional monitoring, which tells you what is broken, observability helps you understand why it’s broken, even for issues you’ve never encountered before.

Core Principle: “You can’t fix what you can’t see. You can’t see what you don’t measure.”

The Three Pillars

Overview

┌─────────────────────────────────────────┐
│              OBSERVABILITY              │
├─────────────┬──────────────┬────────────┤
│   METRICS   │     LOGS     │   TRACES   │
├─────────────┼──────────────┼────────────┤
│ What/When   │  Why/Details │   Where    │
│ Aggregated  │  Individual  │  Causal    │
│ Time-series │  Events      │  Flows     │
│ Dashboards  │  Search      │  Waterfall │
└─────────────┴──────────────┴────────────┘

When to Use Each:

Question                 | Pillar  | Example
Is my API slow?          | Metrics | “P95 latency is 500ms”
Why is it slow?          | Logs    | “Database query timeout in order service”
Where is the bottleneck? | Traces  | “80% of time spent in payment API call”

Pillar 1: Metrics

What Are Metrics?

Definition: Numerical measurements captured at regular intervals (time-series data).

Characteristics:

  • Aggregated over time
  • Constant storage size (fixed cardinality)
  • Efficient for alerting
  • Great for dashboards
  • Shows trends and patterns

Types of Metrics

1. Counters

Monotonically increasing; resets to zero only when the process restarts.

# Example: HTTP requests counter
from prometheus_client import Counter

http_requests_total = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']
)

# Increment counter
http_requests_total.labels(
    method='GET',
    endpoint='/api/orders',
    status='200'
).inc()

Use cases:

  • Total requests
  • Total errors
  • Bytes sent/received

PromQL queries:

# Request rate (per second)
rate(http_requests_total[5m])

# Errors per second
rate(http_requests_total{status=~"5.."}[5m])

# Total requests in last hour
increase(http_requests_total[1h])

2. Gauges

Value that can go up or down.

# Example: Active connections
from prometheus_client import Gauge

active_connections = Gauge(
    'active_connections',
    'Number of active database connections'
)

# Set gauge value
active_connections.set(42)

# Increment/decrement
active_connections.inc(5)  # Add 5
active_connections.dec(2)  # Remove 2

Use cases:

  • Current memory usage
  • Active connections
  • Queue depth
  • Temperature

PromQL queries:

# Current memory usage
process_resident_memory_bytes

# Average queue depth over 5 minutes
avg_over_time(queue_depth[5m])

# CPU usage percentage
rate(process_cpu_seconds_total[5m]) * 100

3. Histograms

Distribution of measurements (latency, size, etc.).

# Example: Request duration histogram
from prometheus_client import Histogram

request_duration = Histogram(
    'http_request_duration_seconds',
    'HTTP request duration',
    ['method', 'endpoint'],
    buckets=[0.1, 0.5, 1.0, 2.0, 5.0]  # Bucket boundaries
)

# Observe a value
with request_duration.labels(method='GET', endpoint='/api/users').time():
    # Your code here
    process_request()

Automatically generates:

  • _sum: Total sum of all observations
  • _count: Total count of observations
  • _bucket: Count per bucket
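
For illustration, a scrape of the histogram above would expose series like the following (hypothetical values, labels omitted for brevity). Note that bucket counts are cumulative, which is what histogram_quantile expects:

http_request_duration_seconds_bucket{le="0.1"} 240
http_request_duration_seconds_bucket{le="0.5"} 310
http_request_duration_seconds_bucket{le="1.0"} 325
http_request_duration_seconds_bucket{le="2.0"} 329
http_request_duration_seconds_bucket{le="5.0"} 330
http_request_duration_seconds_bucket{le="+Inf"} 330
http_request_duration_seconds_sum 95.3
http_request_duration_seconds_count 330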

PromQL queries:

# P95 latency
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
)

# P50 (median) latency
histogram_quantile(0.50,
  rate(http_request_duration_seconds_bucket[5m])
)

# Average latency
rate(http_request_duration_seconds_sum[5m])
/
rate(http_request_duration_seconds_count[5m])

4. Summaries

Client-side pre-computed quantiles (an alternative to histograms).

# Example: Request size summary
from prometheus_client import Summary

request_size = Summary(
    'http_request_size_bytes',
    'HTTP request size',
    ['method']
)

request_size.labels(method='POST').observe(1024)

When to use:

  • Histogram: Better for aggregation across instances
  • Summary: Better for client-side percentiles

Metric Design Best Practices

Good Metric Names

# ✅ Good (clear, consistent naming)
http_requests_total
http_request_duration_seconds
database_connection_pool_active
cache_hits_total

# ❌ Bad (unclear, inconsistent)
requests
latency_ms
db_conns
hits

Naming convention:

<namespace>_<subsystem>_<name>_<unit>

Examples:
- http_request_duration_seconds
- database_queries_total
- cache_size_bytes

Label Guidelines

# ✅ Good (low cardinality)
http_requests_total{
    method="GET",           # Limited values: GET, POST, etc.
    endpoint="/api/users",  # Limited values: known endpoints
    status="200"            # Limited values: HTTP status codes
}

# ❌ Bad (high cardinality - will explode storage)
http_requests_total{
    user_id="12345",        # Unlimited values
    session_id="abc...",    # Unlimited values
    timestamp="2025..."     # Unlimited values
}

Label cardinality limits:

  • Total unique label combinations: <10,000 per metric
  • Values per label: <100 ideally
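
As a rough check, the number of time series a metric can create is approximately the product of the distinct values of each label. A small sketch with assumed label counts:

# Rough cardinality estimate: series ≈ product of distinct values per label
from math import prod

label_value_counts = {
    "method": 5,      # GET, POST, PUT, DELETE, PATCH
    "endpoint": 40,   # known routes only
    "status": 10,     # common HTTP status codes
}

estimated_series = prod(label_value_counts.values())
print(estimated_series)  # 2000 -- comfortably under the ~10,000 guideline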

Implementing Metrics

Prometheus + Grafana Stack

1. Install Prometheus

# prometheus-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus
  template:
    metadata:
      labels:
        app: prometheus
    spec:
      containers:
      - name: prometheus
        image: prom/prometheus:latest
        ports:
        - containerPort: 9090
        volumeMounts:
        - name: config
          mountPath: /etc/prometheus
      volumes:
      - name: config
        configMap:
          name: prometheus-config

2. Configure Scraping

# prometheus-config.yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
    - role: pod

    relabel_configs:
    # Scrape pods with prometheus.io/scrape annotation
    - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
      action: keep
      regex: true

    # Use custom port if specified
    - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
      action: replace
      target_label: __address__
      regex: ([^:]+)(?::\d+)?;(\d+)
      replacement: $1:$2

3. Instrument Application

# Flask application with Prometheus metrics
from flask import Flask, Response
from prometheus_client import Counter, Histogram, generate_latest, CONTENT_TYPE_LATEST
import time

app = Flask(__name__)

# Metrics
REQUEST_COUNT = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']
)

REQUEST_DURATION = Histogram(
    'http_request_duration_seconds',
    'HTTP request duration',
    ['method', 'endpoint']
)

@app.route('/api/users')
def get_users():
    start = time.time()

    # Your logic here
    users = fetch_users()

    # Record metrics
    duration = time.time() - start
    REQUEST_DURATION.labels(method='GET', endpoint='/api/users').observe(duration)
    REQUEST_COUNT.labels(method='GET', endpoint='/api/users', status='200').inc()

    return users

@app.route('/metrics')
def metrics():
    # Expose metrics in the Prometheus text exposition format
    return Response(generate_latest(), mimetype=CONTENT_TYPE_LATEST)

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)

4. Create Grafana Dashboard

{
  "dashboard": {
    "title": "Application Metrics",
    "panels": [
      {
        "title": "Request Rate",
        "targets": [{
          "expr": "sum(rate(http_requests_total[5m])) by (endpoint)"
        }],
        "type": "graph"
      },
      {
        "title": "P95 Latency",
        "targets": [{
          "expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, endpoint))"
        }],
        "type": "graph"
      },
      {
        "title": "Error Rate",
        "targets": [{
          "expr": "sum(rate(http_requests_total{status=~\"5..\"}[5m])) / sum(rate(http_requests_total[5m]))"
        }],
        "type": "stat"
      }
    ]
  }
}

Key Metrics to Track

RED Method (for request-driven services)

# Rate: Requests per second
sum(rate(http_requests_total[5m]))

# Errors: Error rate
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))

# Duration: Latency percentiles
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
)

USE Method (for resources)

# Utilization: CPU usage
rate(process_cpu_seconds_total[5m]) * 100

# Saturation: Queue depth
queue_depth

# Errors: Error count
rate(errors_total[5m])

Pillar 2: Logs

What Are Logs?

Definition: Immutable, timestamped records of discrete events.

Characteristics:

  • Event-driven (not sampled)
  • Rich context (full details)
  • High volume
  • Searchable
  • Debugging-focused

Log Levels

# Standard log levels (Python example)
import logging

logging.debug("Detailed diagnostic info")     # Development only
logging.info("Informational messages")        # Normal operations
logging.warning("Warning: potential issue")   # Things to watch
logging.error("Error: something failed")      # Errors
logging.critical("Critical: system unstable") # Severe issues

When to use each:

  • DEBUG: Function entry/exit, variable values
  • INFO: User login, job started, configuration loaded
  • WARNING: Deprecated API used, retry attempted
  • ERROR: Request failed, database unavailable
  • CRITICAL: Service crash, data corruption
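
A minimal sketch of wiring the level to an environment variable so DEBUG stays off outside development (LOG_LEVEL here is just a conventional name, not something the standard library defines):

import logging
import os

# Default to INFO; override with LOG_LEVEL=DEBUG in development
logging.basicConfig(
    level=os.environ.get("LOG_LEVEL", "INFO").upper(),
    format="%(asctime)s %(levelname)s %(name)s %(message)s",
)

logging.getLogger(__name__).info("logging configured")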

Structured Logging

Unstructured (Bad)

# ❌ Hard to parse, search, analyze
logging.info("User user@example.com logged in from IP 192.168.1.100 at 2025-10-16 14:30:00")

Structured (Good)

# ✅ Easy to parse, search, analyze
import structlog

log = structlog.get_logger()

log.info(
    "user_login",
    user_email="user@example.com",
    ip_address="192.168.1.100",
    timestamp="2025-10-16T14:30:00Z",
    user_agent="Mozilla/5.0..."
)

JSON output:

{
  "event": "user_login",
  "user_email": "user@example.com",
  "ip_address": "192.168.1.100",
  "timestamp": "2025-10-16T14:30:00Z",
  "user_agent": "Mozilla/5.0...",
  "level": "info",
  "logger": "auth_service"
}

Log Aggregation Architecture

┌──────────────┐
│ Application  │ ──> Write logs to stdout/stderr
└──────────────┘
       │
       ▼
┌──────────────┐
│  Log Agent   │ ──> Collect logs (Filebeat, Fluentd)
│ (per node)   │
└──────────────┘
       │
       ▼
┌──────────────┐
│  Aggregator  │ ──> Process & enrich (Logstash)
└──────────────┘
       │
       ▼
┌──────────────┐
│   Storage    │ ──> Store & index (Elasticsearch)
└──────────────┘
       │
       ▼
┌──────────────┐
│ Visualization│ ──> Search & analyze (Kibana)
└──────────────┘

ELK Stack Implementation

1. Elasticsearch Deployment

# elasticsearch-statefulset.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: elasticsearch
spec:
  serviceName: elasticsearch
  replicas: 3
  selector:
    matchLabels:
      app: elasticsearch
  template:
    metadata:
      labels:
        app: elasticsearch
    spec:
      containers:
      - name: elasticsearch
        image: docker.elastic.co/elasticsearch/elasticsearch:8.10.0
        env:
        - name: cluster.name
          value: "prod-logs"
        - name: node.name
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: discovery.seed_hosts
          value: "elasticsearch-0,elasticsearch-1,elasticsearch-2"
        - name: cluster.initial_master_nodes
          value: "elasticsearch-0,elasticsearch-1,elasticsearch-2"
        ports:
        - containerPort: 9200
          name: http
        - containerPort: 9300
          name: transport
        volumeMounts:
        - name: data
          mountPath: /usr/share/elasticsearch/data
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 100Gi

2. Filebeat Configuration

# filebeat-daemonset.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: filebeat
spec:
  selector:
    matchLabels:
      app: filebeat
  template:
    metadata:
      labels:
        app: filebeat
    spec:
      containers:
      - name: filebeat
        image: docker.elastic.co/beats/filebeat:8.10.0
        env:
        - name: NODE_NAME            # referenced as ${NODE_NAME} in filebeat.yml
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName
        volumeMounts:
        - name: config
          mountPath: /usr/share/filebeat/filebeat.yml
          subPath: filebeat.yml
        - name: varlog
          mountPath: /var/log
        - name: varlibdockercontainers
          mountPath: /var/lib/docker/containers
          readOnly: true
      volumes:
      - name: config
        configMap:
          name: filebeat-config
      - name: varlog
        hostPath:
          path: /var/log
      - name: varlibdockercontainers
        hostPath:
          path: /var/lib/docker/containers

# filebeat-config.yaml
filebeat.inputs:
- type: container
  paths:
    - '/var/log/containers/*.log'

  # Kubernetes enrichment
  processors:
  - add_kubernetes_metadata:
      host: ${NODE_NAME}
      matchers:
      - logs_path:
          logs_path: "/var/log/containers/"

  # Parse JSON logs
  - decode_json_fields:
      fields: ["message"]
      target: ""
      overwrite_keys: true

output.elasticsearch:
  hosts: ["elasticsearch:9200"]
  index: "logs-%{[agent.version]}-%{+yyyy.MM.dd}"

setup.template.name: "logs"
setup.template.pattern: "logs-*"

3. Application Logging

# Python application with structured logging
import structlog
import sys

# Configure structlog
structlog.configure(
    processors=[
        structlog.stdlib.filter_by_level,
        structlog.stdlib.add_logger_name,
        structlog.stdlib.add_log_level,
        structlog.stdlib.PositionalArgumentsFormatter(),
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.StackInfoRenderer(),
        structlog.processors.format_exc_info,
        structlog.processors.UnicodeDecoder(),
        structlog.processors.JSONRenderer()
    ],
    wrapper_class=structlog.stdlib.BoundLogger,
    context_class=dict,
    logger_factory=structlog.stdlib.LoggerFactory(),
    cache_logger_on_first_use=True,
)

log = structlog.get_logger()

# Usage
def process_order(order_id, user_id, amount):
    log.info(
        "order_processing_started",
        order_id=order_id,
        user_id=user_id,
        amount=amount
    )

    try:
        # Process order
        payment_result = charge_payment(user_id, amount)

        log.info(
            "order_completed",
            order_id=order_id,
            payment_id=payment_result.id,
            duration_ms=payment_result.duration
        )

    except PaymentFailedException as e:
        log.error(
            "order_payment_failed",
            order_id=order_id,
            user_id=user_id,
            amount=amount,
            error=str(e),
            error_code=e.code
        )
        raise

Log Searching and Analysis

Kibana Query Examples

# Find all errors in last hour
level:error AND @timestamp:[now-1h TO now]

# Find failed orders
event:order_payment_failed AND amount:>100

# Find slow requests
http.request.duration_ms:>1000

# Find by correlation ID
correlation_id:"abc-123-def"

# Aggregate error counts per service (assumes "level" and "service.name" are keyword fields)
{
  "query": {
    "term": { "level": "error" }
  },
  "aggs": {
    "errors_by_service": {
      "terms": {
        "field": "service.name"
      }
    }
  }
}

Log Retention Strategy

# Index lifecycle management
log_retention_policy:
  hot_phase:
    duration: "7 days"
    actions:
      - rollover:
          max_size: "50GB"
          max_age: "1d"

  warm_phase:
    duration: "30 days"
    actions:
      - shrink:
          number_of_shards: 1
      - force_merge:
          max_num_segments: 1

  cold_phase:
    duration: "90 days"
    actions:
      - freeze

  delete_phase:
    min_age: "180 days"
    actions:
      - delete

Pillar 3: Distributed Tracing

What Are Traces?

Definition: End-to-end request paths through distributed systems, showing the causal relationship between operations.

Characteristics:

  • Shows request flow across services
  • Identifies bottlenecks
  • Visualizes dependencies
  • Usually sampled at high traffic volumes
  • Essential for microservices

Trace Anatomy

Trace ID: abc-123-def
├─ Span 1: HTTP GET /api/orders (200ms)
│  ├─ Span 2: Auth validation (10ms)
│  ├─ Span 3: Database query (50ms)
│  └─ Span 4: HTTP POST /payment-service (140ms)
│     ├─ Span 5: Validate card (20ms)
│     ├─ Span 6: Charge card (100ms) ← BOTTLENECK
│     └─ Span 7: Update ledger (20ms)

Trace waterfall visualization:

0ms        50ms       100ms      150ms      200ms
│          │          │          │          │
├───────────────────────────────────────────┤ Span 1: GET /api/orders (200ms)
├─┤                                           Span 2: Auth (10ms)
  ├──────────┤                                Span 3: DB query (50ms)
             ├──────────────────────────────┤ Span 4: Payment (140ms)
             ├────┤                           Span 5: Validate (20ms)
                  ├─────────────────────┤     Span 6: Charge (100ms) ⚠️
                                        ├───┤ Span 7: Ledger (20ms)

OpenTelemetry Implementation

1. Install OpenTelemetry

# Python
pip install opentelemetry-api opentelemetry-sdk opentelemetry-instrumentation-flask opentelemetry-exporter-jaeger

# Node.js
npm install @opentelemetry/api @opentelemetry/sdk-node @opentelemetry/auto-instrumentations-node

2. Instrument Application

# Python Flask application with OpenTelemetry
from flask import Flask
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.jaeger.thrift import JaegerExporter
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor
import requests

# Initialize tracing
trace.set_tracer_provider(TracerProvider())
tracer = trace.get_tracer(__name__)

# Configure Jaeger exporter
jaeger_exporter = JaegerExporter(
    agent_host_name="jaeger",
    agent_port=6831,
)
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(jaeger_exporter)
)

app = Flask(__name__)

# Auto-instrument Flask
FlaskInstrumentor().instrument_app(app)
# Auto-instrument HTTP requests
RequestsInstrumentor().instrument()

@app.route('/api/orders/<order_id>')
def get_order(order_id):
    # Automatically creates span for this request

    # Create custom span for specific operation
    with tracer.start_as_current_span("fetch_order_from_db") as span:
        span.set_attribute("order.id", order_id)

        order = db.query("SELECT * FROM orders WHERE id = %s", (order_id,))  # parameterized query avoids SQL injection

        span.set_attribute("order.status", order.status)
        span.set_attribute("order.amount", order.amount)

    # Make downstream request (automatically traced)
    payment_status = requests.get(f"http://payment-service/status/{order.payment_id}")

    return {
        "order": order,
        "payment": payment_status.json()
    }

3. Context Propagation

# Propagate trace context across services
from opentelemetry.propagate import inject

def call_downstream_service():
    headers = {}
    # Inject current trace context into headers
    inject(headers)

    # Make request with trace context
    response = requests.get(
        "http://downstream-service/api/endpoint",
        headers=headers
    )
    return response
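
On the receiving side, the counterpart is to extract the context from the incoming headers before starting a span. A minimal sketch using the OpenTelemetry propagation API (auto-instrumentation such as FlaskInstrumentor does this for you; handle_request and the attribute are illustrative):

# Continue the caller's trace instead of starting a new one
from flask import request
from opentelemetry import trace
from opentelemetry.propagate import extract

tracer = trace.get_tracer(__name__)

def handle_request():
    # Extract trace context from the incoming HTTP headers
    ctx = extract(request.headers)
    with tracer.start_as_current_span("handle_downstream_work", context=ctx) as span:
        span.set_attribute("component", "downstream-service")
        # ... your logic here ...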

4. Deploy Jaeger

# jaeger-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: jaeger
spec:
  replicas: 1
  selector:
    matchLabels:
      app: jaeger
  template:
    metadata:
      labels:
        app: jaeger
    spec:
      containers:
      - name: jaeger
        image: jaegertracing/all-in-one:latest
        ports:
        - containerPort: 16686  # UI
        - containerPort: 6831   # UDP agent
        - containerPort: 14268  # HTTP collector
        env:
        - name: COLLECTOR_ZIPKIN_HOST_PORT
          value: ":9411"
        - name: SPAN_STORAGE_TYPE
          value: "elasticsearch"
        - name: ES_SERVER_URLS
          value: "http://elasticsearch:9200"

Trace Analysis

Finding Slow Requests

# Jaeger UI query
Service: order-service
Operation: GET /api/orders
Min Duration: 500ms
Limit: 20

Results:
├─ Trace abc-123: 1.2s (SLOW)
│  └─ Bottleneck: payment-service charge_card (900ms)
├─ Trace def-456: 800ms
│  └─ Bottleneck: database query (600ms)
└─ Trace ghi-789: 650ms
   └─ Bottleneck: auth-service validate (400ms)

Identifying Error Chains

# Find traces with errors
has:error

# Trace shows error propagation:
Span 1: API Gateway [OK]
├─ Span 2: Order Service [OK]
   ├─ Span 3: Payment Service [ERROR: card_declined]
   └─ Span 4: Order Service [ERROR: payment_failed] ← Propagated
      └─ Span 5: Notification Service [OK: sent_failure_email]

Sampling Strategies

Head-Based Sampling (at root span)

# Sample 1% of all traces
from opentelemetry.sdk.trace.sampling import TraceIdRatioBased

sampler = TraceIdRatioBased(0.01)  # 1%
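
The sampler only takes effect once it is passed to the TracerProvider; wrapping it in ParentBased honors the caller's sampling decision for child spans. A minimal sketch:

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Respect the parent's decision; otherwise sample 1% of new traces
sampler = ParentBased(TraceIdRatioBased(0.01))
trace.set_tracer_provider(TracerProvider(sampler=sampler))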

Tail-Based Sampling (after trace completes)

# Sample based on trace characteristics
sampling_rules:
  - name: "Always sample errors"
    condition: "error == true"
    sample_rate: 1.0

  - name: "Always sample slow requests"
    condition: "duration > 1000ms"
    sample_rate: 1.0

  - name: "Sample 10% of normal requests"
    condition: "default"
    sample_rate: 0.1

Correlating the Three Pillars

Unified Correlation ID

import uuid
from contextvars import ContextVar

from flask import request
from opentelemetry import trace
import structlog

# Correlation ID shared across all three pillars
# (app, log, and the tracer are configured as in the earlier examples)
correlation_id_var = ContextVar('correlation_id', default=None)

def generate_correlation_id():
    return str(uuid.uuid4())

@app.before_request
def before_request():
    # Get or create correlation ID
    correlation_id = request.headers.get('X-Correlation-ID', generate_correlation_id())
    correlation_id_var.set(correlation_id)

    # Add to trace
    span = trace.get_current_span()
    span.set_attribute("correlation.id", correlation_id)

    # Add to logs
    structlog.contextvars.bind_contextvars(correlation_id=correlation_id)

    # Add to metrics labels (use sparingly - causes cardinality)
    # Better: use exemplars in Prometheus

@app.route('/api/orders')
def create_order():
    correlation_id = correlation_id_var.get()

    # Logs will include correlation_id
    log.info("order_created", order_id=123, amount=99.99)

    # Traces will include correlation_id attribute
    # Metrics can link to traces via exemplars

    return {"order_id": 123, "correlation_id": correlation_id}

# Prometheus exemplar support (Python)
from prometheus_client import Histogram
from opentelemetry import trace

request_duration = Histogram(
    'http_request_duration_seconds',
    'Request duration',
    ['endpoint']
)

# Record metric with an exemplar linking this observation to the current trace
# (exemplars are only exposed via the OpenMetrics exposition format)
span_context = trace.get_current_span().get_span_context()
request_duration.labels(endpoint='/api/orders').observe(
    0.5,
    exemplar={'trace_id': format(span_context.trace_id, '032x')}  # exemplar label values must be strings
)

In Grafana:

Query: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

When you click on a data point:
├─ Metric value: 450ms
└─ Exemplar link: "View trace abc-123-def" ← Click to see full trace in Jaeger

Debugging Workflow

1. Start with Metrics (Dashboard alert)

Alert: P95 latency > 1000ms for /api/orders endpoint

2. Check Logs (Find errors)

Kibana query: endpoint:"/api/orders" AND level:error AND @timestamp:[now-15m TO now]

Result: "payment_service_timeout" errors

3. View Traces (Find bottleneck)

Jaeger query: service=order-service operation=/api/orders errors=true

Result: 90% of time spent in payment-service HTTP call
└─ Root cause: payment-service database query taking 2+ seconds

4. Fix and Verify

Fix: Add database index

Verify:
├─ Metrics: P95 latency back to 200ms ✅
├─ Logs: No more timeout errors ✅
└─ Traces: Payment service now <100ms ✅

Best Practices

1. Cardinality Management

# ✅ Good: Low cardinality
metrics.labels(
    method='GET',
    endpoint='/api/users',  # Known endpoints only
    status_code='200'
)

# ❌ Bad: High cardinality (will explode storage)
metrics.labels(
    user_id='12345',     # Millions of users
    trace_id='abc...'    # Unique per request
)

2. Sampling Strategy

observability_sampling:
  metrics:
    sample_rate: 100%  # Always collect (cheap)

  logs:
    debug: 0%          # Only in development
    info: 100%         # Always
    warning: 100%      # Always
    error: 100%        # Always

  traces:
    normal_requests: 1%    # Sample 1%
    slow_requests: 100%    # Always trace slow requests
    errors: 100%           # Always trace errors

3. Cost Optimization

cost_optimization:
  retention:
    metrics:
      raw: "15 days"
      5m_aggregates: "90 days"
      1h_aggregates: "2 years"

    logs:
      debug: "1 day"
      info: "30 days"
      warning: "90 days"
      error: "1 year"

    traces:
      sampled: "7 days"
      errors: "30 days"

4. Alert Design

# Alert on symptoms (user-facing), not causes (internal)

# ✅ Good: User-facing metric
alert: HighErrorRate
expr: |
  sum(rate(http_requests_total{status=~"5.."}[5m]))
  /
  sum(rate(http_requests_total[5m]))
  > 0.01
for: 5m
annotations:
  summary: "5xx error rate above 1%"

# ❌ Bad: Internal metric (might not affect users)
alert: HighCPU
expr: cpu_usage > 80
for: 5m

Tools Comparison

Metrics

Tool       | Pros                                        | Cons                      | Best For
Prometheus | Open source, powerful queries, integrations | Scaling complexity        | Kubernetes, cloud-native
Datadog    | Easy setup, great UX                        | Expensive                 | Enterprises with budget
CloudWatch | Native AWS integration                      | AWS-only, limited queries | AWS-heavy environments

Logs

Tool            | Pros                         | Cons                          | Best For
ELK Stack       | Powerful search, flexible    | Complex setup, resource-heavy | Large scale, complex queries
Loki            | Lightweight, Prometheus-like | Less feature-rich             | Cost-sensitive, Grafana users
CloudWatch Logs | AWS native                   | Expensive at scale            | AWS environments

Traces

Tool        | Pros                         | Cons                       | Best For
Jaeger      | Open source, mature          | Self-hosted complexity     | Kubernetes, microservices
Zipkin      | Simple, widely supported     | Fewer features than Jaeger | Simple setups
Datadog APM | Integrated with metrics/logs | Expensive                  | All-in-one solution

Conclusion

The three pillars of observability work together to provide complete system understanding:

  • Metrics: Detect problems (what and when)
  • Logs: Investigate problems (why and context)
  • Traces: Locate problems (where in the flow)

Key Takeaways:

  1. Start with metrics: Cheapest, easiest to alert on
  2. Add structured logging: Essential for debugging
  3. Implement tracing: Critical for distributed systems
  4. Correlate everything: Use correlation IDs across all three
  5. Sample intelligently: Balance cost and coverage
  6. Optimize for your scale: Different tools for different sizes

Remember: “Observability is not a tool, it’s a property of your system. Build it in from day one.”

“If you can’t measure it, you can’t improve it. If you can’t debug it, you can’t fix it. If you can’t trace it, you can’t optimize it.”