Introduction
Observability is the ability to understand the internal state of a system based on its external outputs. Unlike traditional monitoring, which tells you what is broken, observability helps you understand why it’s broken, even for issues you’ve never encountered before.
Core Principle: “You can’t fix what you can’t see. You can’t see what you don’t measure.”
The Three Pillars
Overview
┌─────────────────────────────────────────────┐
│                OBSERVABILITY                │
├───────────────┬───────────────┬─────────────┤
│    METRICS    │     LOGS      │   TRACES    │
├───────────────┼───────────────┼─────────────┤
│ What/When     │ Why/Details   │ Where       │
│ Aggregated    │ Individual    │ Causal      │
│ Time-series   │ Events        │ Flows       │
│ Dashboards    │ Search        │ Waterfall   │
└───────────────┴───────────────┴─────────────┘
When to Use Each:
| Question | Pillar | Example |
|---|---|---|
| Is my API slow? | Metrics | “P95 latency is 500ms” |
| Why is it slow? | Logs | “Database query timeout in order service” |
| Where is the bottleneck? | Traces | “80% of time spent in payment API call” |
Pillar 1: Metrics
What Are Metrics?
Definition: Numerical measurements captured at regular intervals (time-series data).
Characteristics:
- Aggregated over time
- Predictable storage footprint (as long as label cardinality stays bounded)
- Efficient for alerting
- Great for dashboards
- Shows trends and patterns
Types of Metrics
1. Counters
Always increasing (or reset to zero).
# Example: HTTP requests counter
from prometheus_client import Counter
http_requests_total = Counter(
'http_requests_total',
'Total HTTP requests',
['method', 'endpoint', 'status']
)
# Increment counter
http_requests_total.labels(
method='GET',
endpoint='/api/orders',
status='200'
).inc()
Use cases:
- Total requests
- Total errors
- Bytes sent/received
PromQL queries:
# Request rate (per second)
rate(http_requests_total[5m])
# Errors per second
rate(http_requests_total{status=~"5.."}[5m])
# Total requests in last hour
increase(http_requests_total[1h])
2. Gauges
Value that can go up or down.
# Example: Active connections
from prometheus_client import Gauge
active_connections = Gauge(
'active_connections',
'Number of active database connections'
)
# Set gauge value
active_connections.set(42)
# Increment/decrement
active_connections.inc(5) # Add 5
active_connections.dec(2) # Remove 2
Use cases:
- Current memory usage
- Active connections
- Queue depth
- Temperature
PromQL queries:
# Current memory usage
process_resident_memory_bytes
# Average queue depth over 5 minutes
avg_over_time(queue_depth[5m])
# CPU usage percentage
rate(process_cpu_seconds_total[5m]) * 100
3. Histograms
Distribution of measurements (latency, size, etc.).
# Example: Request duration histogram
from prometheus_client import Histogram
request_duration = Histogram(
'http_request_duration_seconds',
'HTTP request duration',
['method', 'endpoint'],
buckets=[0.1, 0.5, 1.0, 2.0, 5.0] # Bucket boundaries
)
# Observe a value
with request_duration.labels(method='GET', endpoint='/api/users').time():
# Your code here
process_request()
Automatically generates:
- _sum: total sum of all observations
- _count: total count of observations
- _bucket: count of observations per bucket
PromQL queries:
# P95 latency
histogram_quantile(0.95,
sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
)
# P50 (median) latency
histogram_quantile(0.50,
rate(http_request_duration_seconds_bucket[5m])
)
# Average latency
rate(http_request_duration_seconds_sum[5m])
/
rate(http_request_duration_seconds_count[5m])
4. Summaries
Pre-calculated percentiles (alternative to histograms).
# Example: Request size summary
from prometheus_client import Summary
request_size = Summary(
'http_request_size_bytes',
'HTTP request size',
['method']
)
request_size.labels(method='POST').observe(1024)
When to use:
- Histogram: Better for aggregation across instances
- Summary: Better for client-side percentiles
Metric Design Best Practices
Good Metric Names
# ✅ Good (clear, consistent naming)
http_requests_total
http_request_duration_seconds
database_connection_pool_active
cache_hits_total
# ❌ Bad (unclear, inconsistent)
requests
latency_ms
db_conns
hits
Naming convention:
<namespace>_<subsystem>_<name>_<unit>
Examples:
- http_request_duration_seconds
- database_queries_total
- cache_size_bytes
Label Guidelines
# ✅ Good (low cardinality)
http_requests_total{
method="GET", # Limited values: GET, POST, etc.
endpoint="/api/users", # Limited values: known endpoints
status="200" # Limited values: HTTP status codes
}
# ❌ Bad (high cardinality - will explode storage)
http_requests_total{
user_id="12345", # Unlimited values
session_id="abc...", # Unlimited values
timestamp="2025..." # Unlimited values
}
Label cardinality limits:
- Total unique label combinations: <10,000 per metric
- Values per label: <100 ideally
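A common way to stay within these limits is to normalize raw request paths into a bounded set of endpoint templates before using them as label values. A minimal sketch (the regex patterns and helper below are illustrative, not from any particular library):
# Sketch: collapse dynamic path segments so labels stay low-cardinality
import re
from prometheus_client import Counter

http_requests_total = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']
)

# Illustrative templates: /api/users/12345 -> /api/users/{id}
ENDPOINT_TEMPLATES = [
    (re.compile(r'^/api/users/\d+$'), '/api/users/{id}'),
    (re.compile(r'^/api/orders/\d+$'), '/api/orders/{id}'),
]

def normalize_endpoint(path):
    """Map a raw request path to a bounded endpoint label."""
    for pattern, template in ENDPOINT_TEMPLATES:
        if pattern.match(path):
            return template
    return path  # known static paths pass through unchanged

http_requests_total.labels(
    method='GET',
    endpoint=normalize_endpoint('/api/users/12345'),  # -> /api/users/{id}
    status='200'
).inc()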
Implementing Metrics
Prometheus + Grafana Stack
1. Install Prometheus
# prometheus-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: prometheus
spec:
replicas: 1
selector:
matchLabels:
app: prometheus
template:
metadata:
labels:
app: prometheus
spec:
containers:
- name: prometheus
image: prom/prometheus:latest
ports:
- containerPort: 9090
volumeMounts:
- name: config
mountPath: /etc/prometheus
volumes:
- name: config
configMap:
name: prometheus-config
2. Configure Scraping
# prometheus-config.yaml
global:
scrape_interval: 15s
evaluation_interval: 15s
scrape_configs:
- job_name: 'kubernetes-pods'
kubernetes_sd_configs:
- role: pod
relabel_configs:
# Scrape pods with prometheus.io/scrape annotation
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
action: keep
regex: true
# Use custom port if specified
- source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
action: replace
target_label: __address__
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $1:$2
3. Instrument Application
# Flask application with Prometheus metrics
from flask import Flask
from prometheus_client import Counter, Histogram, generate_latest
import time
app = Flask(__name__)
# Metrics
REQUEST_COUNT = Counter(
'http_requests_total',
'Total HTTP requests',
['method', 'endpoint', 'status']
)
REQUEST_DURATION = Histogram(
'http_request_duration_seconds',
'HTTP request duration',
['method', 'endpoint']
)
@app.route('/api/users')
def get_users():
start = time.time()
# Your logic here
users = fetch_users()
# Record metrics
duration = time.time() - start
REQUEST_DURATION.labels(method='GET', endpoint='/api/users').observe(duration)
REQUEST_COUNT.labels(method='GET', endpoint='/api/users', status='200').inc()
return users
@app.route('/metrics')
def metrics():
return generate_latest()
if __name__ == '__main__':
app.run(host='0.0.0.0', port=5000)
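Instrumenting each route by hand gets repetitive. One alternative, sketched below, records the same two metrics for every endpoint via Flask's before_request/after_request hooks; it assumes the REQUEST_COUNT and REQUEST_DURATION metrics defined above, and uses the matched URL rule (not the raw path) to keep label cardinality bounded.
# Sketch: app-wide instrumentation with Flask hooks instead of per-route code
import time
from flask import request, g

@app.before_request
def start_timer():
    g.start_time = time.time()

@app.after_request
def record_request_metrics(response):
    # Use the route template (e.g. '/api/users/<id>') rather than the raw path
    endpoint = request.url_rule.rule if request.url_rule else 'unmatched'
    duration = time.time() - getattr(g, 'start_time', time.time())
    REQUEST_DURATION.labels(method=request.method, endpoint=endpoint).observe(duration)
    REQUEST_COUNT.labels(method=request.method, endpoint=endpoint,
                         status=str(response.status_code)).inc()
    return response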
4. Create Grafana Dashboard
{
"dashboard": {
"title": "Application Metrics",
"panels": [
{
"title": "Request Rate",
"targets": [{
"expr": "sum(rate(http_requests_total[5m])) by (endpoint)"
}],
"type": "graph"
},
{
"title": "P95 Latency",
"targets": [{
"expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, endpoint))"
}],
"type": "graph"
},
{
"title": "Error Rate",
"targets": [{
"expr": "sum(rate(http_requests_total{status=~\"5..\"}[5m])) / sum(rate(http_requests_total[5m]))"
}],
"type": "stat"
}
]
}
}
Key Metrics to Track
RED Method (for request-driven services)
# Rate: Requests per second
sum(rate(http_requests_total[5m]))
# Errors: Error rate
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
# Duration: Latency percentiles
histogram_quantile(0.95,
sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
)
USE Method (for resources)
# Utilization: CPU usage
rate(process_cpu_seconds_total[5m]) * 100
# Saturation: Queue depth
queue_depth
# Errors: Error count
rate(errors_total[5m])
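Utilization and errors usually fall out of counters you already export, but saturation signals such as queue depth have to be exposed explicitly. A minimal sketch with a prometheus_client gauge that samples an in-process queue at scrape time (the queue itself is only an illustration):
# Sketch: export queue depth as a saturation gauge
import queue
from prometheus_client import Gauge

job_queue = queue.Queue()

queue_depth = Gauge('queue_depth', 'Jobs waiting in the worker queue')
# set_function() re-evaluates the callback on every scrape, so the gauge
# always reports the current queue size without manual inc()/dec() calls
queue_depth.set_function(lambda: job_queue.qsize())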
Pillar 2: Logs
What Are Logs?
Definition: Immutable, timestamped records of discrete events.
Characteristics:
- Event-driven (not sampled)
- Rich context (full details)
- High volume
- Searchable
- Debugging-focused
Log Levels
# Standard log levels (Python example)
import logging
logging.debug("Detailed diagnostic info") # Development only
logging.info("Informational messages") # Normal operations
logging.warning("Warning: potential issue") # Things to watch
logging.error("Error: something failed") # Errors
logging.critical("Critical: system unstable") # Severe issues
When to use each:
- DEBUG: Function entry/exit, variable values
- INFO: User login, job started, configuration loaded
- WARNING: Deprecated API used, retry attempted
- ERROR: Request failed, database unavailable
- CRITICAL: Service crash, data corruption
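In practice the active level is usually driven by configuration rather than hard-coded, so the same build can emit DEBUG locally and INFO in production. A minimal sketch using the standard library (the LOG_LEVEL environment variable is an assumption):
# Sketch: set the log level from the environment
import logging
import os

logging.basicConfig(
    level=os.getenv("LOG_LEVEL", "INFO"),  # e.g. DEBUG in dev, INFO in prod
    format="%(asctime)s %(levelname)s %(name)s %(message)s",
)

logging.getLogger(__name__).info("service started")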
Structured Logging
Unstructured (Bad)
# ❌ Hard to parse, search, analyze
logging.info("User user@example.com logged in from IP 192.168.1.100 at 2025-10-16 14:30:00")
Structured (Good)
# ✅ Easy to parse, search, analyze
import structlog
log = structlog.get_logger()
log.info(
"user_login",
user_email="[email protected]",
ip_address="192.168.1.100",
timestamp="2025-10-16T14:30:00Z",
user_agent="Mozilla/5.0..."
)
JSON output:
{
"event": "user_login",
"user_email": "[email protected]",
"ip_address": "192.168.1.100",
"timestamp": "2025-10-16T14:30:00Z",
"user_agent": "Mozilla/5.0...",
"level": "info",
"logger": "auth_service"
}
Log Aggregation Architecture
┌───────────────┐
│  Application  │ ──> Write logs to stdout/stderr
└───────────────┘
        │
        ▼
┌───────────────┐
│   Log Agent   │ ──> Collect logs (Filebeat, Fluentd)
│   (per node)  │
└───────────────┘
        │
        ▼
┌───────────────┐
│  Aggregator   │ ──> Process & enrich (Logstash)
└───────────────┘
        │
        ▼
┌───────────────┐
│    Storage    │ ──> Store & index (Elasticsearch)
└───────────────┘
        │
        ▼
┌───────────────┐
│ Visualization │ ──> Search & analyze (Kibana)
└───────────────┘
ELK Stack Implementation
1. Elasticsearch Deployment
# elasticsearch-statefulset.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: elasticsearch
spec:
serviceName: elasticsearch
replicas: 3
selector:
matchLabels:
app: elasticsearch
template:
metadata:
labels:
app: elasticsearch
spec:
containers:
- name: elasticsearch
image: docker.elastic.co/elasticsearch/elasticsearch:8.10.0
env:
- name: cluster.name
value: "prod-logs"
- name: node.name
valueFrom:
fieldRef:
fieldPath: metadata.name
- name: discovery.seed_hosts
value: "elasticsearch-0,elasticsearch-1,elasticsearch-2"
- name: cluster.initial_master_nodes
value: "elasticsearch-0,elasticsearch-1,elasticsearch-2"
ports:
- containerPort: 9200
name: http
- containerPort: 9300
name: transport
volumeMounts:
- name: data
mountPath: /usr/share/elasticsearch/data
volumeClaimTemplates:
- metadata:
name: data
spec:
accessModes: ["ReadWriteOnce"]
resources:
requests:
storage: 100Gi
2. Filebeat Configuration
# filebeat-daemonset.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: filebeat
spec:
selector:
matchLabels:
app: filebeat
template:
metadata:
labels:
app: filebeat
spec:
containers:
- name: filebeat
        image: docker.elastic.co/beats/filebeat:8.10.0
        env:
        - name: NODE_NAME  # referenced as ${NODE_NAME} by add_kubernetes_metadata in filebeat.yml
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName
volumeMounts:
- name: config
mountPath: /usr/share/filebeat/filebeat.yml
subPath: filebeat.yml
- name: varlog
mountPath: /var/log
- name: varlibdockercontainers
mountPath: /var/lib/docker/containers
readOnly: true
volumes:
- name: config
configMap:
name: filebeat-config
- name: varlog
hostPath:
path: /var/log
- name: varlibdockercontainers
hostPath:
path: /var/lib/docker/containers
# filebeat-config.yaml
filebeat.inputs:
- type: container
paths:
- '/var/log/containers/*.log'
# Kubernetes enrichment
processors:
- add_kubernetes_metadata:
host: ${NODE_NAME}
matchers:
- logs_path:
logs_path: "/var/log/containers/"
# Parse JSON logs
- decode_json_fields:
fields: ["message"]
target: ""
overwrite_keys: true
output.elasticsearch:
hosts: ["elasticsearch:9200"]
index: "logs-%{[agent.version]}-%{+yyyy.MM.dd}"
setup.template.name: "logs"
setup.template.pattern: "logs-*"
3. Application Logging
# Python application with structured logging
import structlog
import sys
# Configure structlog
structlog.configure(
processors=[
structlog.stdlib.filter_by_level,
structlog.stdlib.add_logger_name,
structlog.stdlib.add_log_level,
structlog.stdlib.PositionalArgumentsFormatter(),
structlog.processors.TimeStamper(fmt="iso"),
structlog.processors.StackInfoRenderer(),
structlog.processors.format_exc_info,
structlog.processors.UnicodeDecoder(),
structlog.processors.JSONRenderer()
],
wrapper_class=structlog.stdlib.BoundLogger,
context_class=dict,
logger_factory=structlog.stdlib.LoggerFactory(),
cache_logger_on_first_use=True,
)
log = structlog.get_logger()
# Usage
def process_order(order_id, user_id, amount):
log.info(
"order_processing_started",
order_id=order_id,
user_id=user_id,
amount=amount
)
try:
# Process order
payment_result = charge_payment(user_id, amount)
log.info(
"order_completed",
order_id=order_id,
payment_id=payment_result.id,
duration_ms=payment_result.duration
)
except PaymentFailedException as e:
log.error(
"order_payment_failed",
order_id=order_id,
user_id=user_id,
amount=amount,
error=str(e),
error_code=e.code
)
raise
Log Searching and Analysis
Kibana Query Examples
# Find all errors in last hour
level:error AND @timestamp:[now-1h TO now]
# Find failed orders
event:order_payment_failed AND amount:>100
# Find slow requests
http.request.duration_ms:>1000
# Find by correlation ID
correlation_id:"abc-123-def"
# Aggregate error counts
{
"aggs": {
"errors_by_service": {
"terms": {
"field": "service.name"
},
"aggs": {
"error_count": {
"value_count": {
"field": "level"
}
}
}
}
}
}
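The same searches can be run programmatically against Elasticsearch's _search API, which is useful for runbooks and automation. A hedged sketch using the REST endpoint directly (the cluster URL and logs-* index pattern are assumptions):
# Sketch: query error-level events from the last hour via the search API
import requests

query = {
    "query": {
        "bool": {
            "must": [
                {"match": {"level": "error"}},
                {"range": {"@timestamp": {"gte": "now-1h"}}},
            ]
        }
    },
    "sort": [{"@timestamp": {"order": "desc"}}],
    "size": 50,
}

resp = requests.post(
    "http://elasticsearch:9200/logs-*/_search",  # cluster URL and index pattern assumed
    json=query,
    timeout=10,
)
resp.raise_for_status()
for hit in resp.json()["hits"]["hits"]:
    print(hit["_source"].get("event"), hit["_source"].get("@timestamp"))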
Log Retention Strategy
# Index lifecycle management
log_retention_policy:
hot_phase:
duration: "7 days"
actions:
- rollover:
max_size: "50GB"
max_age: "1d"
warm_phase:
duration: "30 days"
actions:
- shrink:
number_of_shards: 1
- force_merge:
max_num_segments: 1
cold_phase:
duration: "90 days"
actions:
- freeze
delete_phase:
min_age: "180 days"
actions:
- delete
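A retention policy like this maps onto Elasticsearch's index lifecycle management (ILM) API. A hedged sketch that registers roughly the same phases over the REST endpoint (the cluster URL, policy name, and thresholds are assumptions; align them with your cluster and the Filebeat index template above):
# Sketch: register an ILM policy matching the retention tiers above
import requests

ilm_policy = {
    "policy": {
        "phases": {
            "hot": {"actions": {"rollover": {"max_size": "50gb", "max_age": "1d"}}},
            "warm": {
                "min_age": "7d",
                "actions": {
                    "shrink": {"number_of_shards": 1},
                    "forcemerge": {"max_num_segments": 1},
                },
            },
            "cold": {"min_age": "30d", "actions": {}},
            "delete": {"min_age": "180d", "actions": {"delete": {}}},
        }
    }
}

resp = requests.put(
    "http://elasticsearch:9200/_ilm/policy/logs",  # cluster URL and policy name assumed
    json=ilm_policy,
    timeout=10,
)
resp.raise_for_status()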
Pillar 3: Distributed Tracing
What Are Traces?
Definition: End-to-end request paths through distributed systems, showing the causal relationship between operations.
Characteristics:
- Shows request flow across services
- Identifies bottlenecks
- Visualizes dependencies
- Samples high-volume traffic
- Essential for microservices
Trace Anatomy
Trace ID: abc-123-def
└─ Span 1: HTTP GET /api/orders (200ms)
   ├─ Span 2: Auth validation (10ms)
   ├─ Span 3: Database query (50ms)
   └─ Span 4: HTTP POST /payment-service (140ms)
      ├─ Span 5: Validate card (20ms)
      ├─ Span 6: Charge card (100ms) ← BOTTLENECK
      └─ Span 7: Update ledger (20ms)
Trace waterfall visualization:
0ms        50ms       100ms      150ms      200ms
|          |          |          |          |
[==============================================]  Span 1: GET /api/orders (200ms)
 [=]                                               Span 2: Auth (10ms)
   [==========]                                    Span 3: DB query (50ms)
              [===============================]    Span 4: Payment (140ms)
              [====]                               Span 5: Validate (20ms)
                   [=======================]       Span 6: Charge (100ms) ⚠️
                                           [====]  Span 7: Ledger (20ms)
OpenTelemetry Implementation
1. Install OpenTelemetry
# Python
pip install opentelemetry-api opentelemetry-sdk opentelemetry-instrumentation-flask opentelemetry-exporter-jaeger
# Node.js
npm install @opentelemetry/api @opentelemetry/sdk-node @opentelemetry/auto-instrumentations-node
2. Instrument Application
# Python Flask application with OpenTelemetry
from flask import Flask
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.jaeger.thrift import JaegerExporter
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor
import requests  # used below for the downstream payment-service call
# Initialize tracing
trace.set_tracer_provider(TracerProvider())
tracer = trace.get_tracer(__name__)
# Configure Jaeger exporter
jaeger_exporter = JaegerExporter(
agent_host_name="jaeger",
agent_port=6831,
)
trace.get_tracer_provider().add_span_processor(
BatchSpanProcessor(jaeger_exporter)
)
app = Flask(__name__)
# Auto-instrument Flask
FlaskInstrumentor().instrument_app(app)
# Auto-instrument HTTP requests
RequestsInstrumentor().instrument()
@app.route('/api/orders/<order_id>')
def get_order(order_id):
# Automatically creates span for this request
# Create custom span for specific operation
with tracer.start_as_current_span("fetch_order_from_db") as span:
span.set_attribute("order.id", order_id)
        order = db.query("SELECT * FROM orders WHERE id = %s", (order_id,))  # parameterized query avoids SQL injection
span.set_attribute("order.status", order.status)
span.set_attribute("order.amount", order.amount)
# Make downstream request (automatically traced)
payment_status = requests.get(f"http://payment-service/status/{order.payment_id}")
return {
"order": order,
"payment": payment_status.json()
}
3. Context Propagation
# Propagate trace context across services
from opentelemetry.propagate import inject
import requests
def call_downstream_service():
headers = {}
# Inject current trace context into headers
inject(headers)
# Make request with trace context
response = requests.get(
"http://downstream-service/api/endpoint",
headers=headers
)
return response
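On the receiving side, the downstream service continues the trace by extracting the propagated headers into a context before starting its own spans. Auto-instrumentation normally does this for you; a manual sketch of the counterpart to inject() looks like this (handle_request and do_work are placeholders):
# Sketch: extract the caller's trace context from incoming headers
from opentelemetry import trace
from opentelemetry.propagate import extract

tracer = trace.get_tracer(__name__)

def handle_request(headers):
    # Rebuild the caller's context; spans started with it become child spans
    ctx = extract(headers)
    with tracer.start_as_current_span("handle_request", context=ctx) as span:
        span.set_attribute("service.role", "downstream")
        return do_work()  # placeholder for the actual handler logic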
4. Deploy Jaeger
# jaeger-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: jaeger
spec:
replicas: 1
selector:
matchLabels:
app: jaeger
template:
metadata:
labels:
app: jaeger
spec:
containers:
- name: jaeger
image: jaegertracing/all-in-one:latest
ports:
- containerPort: 16686 # UI
- containerPort: 6831 # UDP agent
- containerPort: 14268 # HTTP collector
env:
- name: COLLECTOR_ZIPKIN_HOST_PORT
value: ":9411"
- name: SPAN_STORAGE_TYPE
value: "elasticsearch"
- name: ES_SERVER_URLS
value: "http://elasticsearch:9200"
Trace Analysis
Finding Slow Requests
# Jaeger UI query
Service: order-service
Operation: GET /api/orders
Min Duration: 500ms
Limit: 20
Results:
├─ Trace abc-123: 1.2s (SLOW)
│  └─ Bottleneck: payment-service charge_card (900ms)
├─ Trace def-456: 800ms
│  └─ Bottleneck: database query (600ms)
└─ Trace ghi-789: 650ms
   └─ Bottleneck: auth-service validate (400ms)
Identifying Error Chains
# Find traces with errors
has:error
# Trace shows error propagation:
Span 1: API Gateway [OK]
└─ Span 2: Order Service [OK]
   ├─ Span 3: Payment Service [ERROR: card_declined]
   ├─ Span 4: Order Service [ERROR: payment_failed] ← propagated
   └─ Span 5: Notification Service [OK: sent_failure_email]
Sampling Strategies
Head-Based Sampling (at root span)
# Sample 1% of all traces
from opentelemetry.sdk.trace.sampling import TraceIdRatioBased
sampler = TraceIdRatioBased(0.01) # 1%
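The sampler only takes effect once it is attached to the TracerProvider; wrapping it in ParentBased keeps decisions consistent across services, since child spans follow their parent's decision. A minimal sketch:
# Sketch: wire the sampler into the tracer provider
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Sample 1% of new root traces; children inherit the parent's decision,
# so each trace is either kept or dropped as a whole
sampler = ParentBased(root=TraceIdRatioBased(0.01))
trace.set_tracer_provider(TracerProvider(sampler=sampler))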
Tail-Based Sampling (after trace completes)
# Sample based on trace characteristics
sampling_rules:
- name: "Always sample errors"
condition: "error == true"
sample_rate: 1.0
- name: "Always sample slow requests"
condition: "duration > 1000ms"
sample_rate: 1.0
- name: "Sample 10% of normal requests"
condition: "default"
sample_rate: 0.1
Correlating the Three Pillars
Unified Correlation ID
import uuid
from contextvars import ContextVar
import structlog
from flask import request
from opentelemetry import trace
# Note: builds on the Flask `app`, tracer, and structlog `log` from the earlier examples
# Correlation ID shared across all three pillars
correlation_id_var = ContextVar('correlation_id', default=None)
def generate_correlation_id():
return str(uuid.uuid4())
@app.before_request
def before_request():
# Get or create correlation ID
correlation_id = request.headers.get('X-Correlation-ID', generate_correlation_id())
correlation_id_var.set(correlation_id)
# Add to trace
span = trace.get_current_span()
span.set_attribute("correlation.id", correlation_id)
# Add to logs
structlog.contextvars.bind_contextvars(correlation_id=correlation_id)
# Add to metrics labels (use sparingly - causes cardinality)
# Better: use exemplars in Prometheus
@app.route('/api/orders')
def create_order():
correlation_id = correlation_id_var.get()
# Logs will include correlation_id
log.info("order_created", order_id=123, amount=99.99)
# Traces will include correlation_id attribute
# Metrics can link to traces via exemplars
return {"order_id": 123, "correlation_id": correlation_id}
Exemplars (Link Metrics to Traces)
# Prometheus exemplar support (Python)
from prometheus_client import Histogram
from opentelemetry import trace
request_duration = Histogram(
'http_request_duration_seconds',
'Request duration',
['endpoint']
)
# Record the metric with an exemplar linking to the current trace.
# Exemplar label values must be strings, so format the integer trace_id as hex.
# (Exemplars are only exposed via the OpenMetrics exposition format.)
request_duration.labels(endpoint='/api/orders').observe(
    0.5,
    exemplar={'trace_id': format(trace.get_current_span().get_span_context().trace_id, '032x')}
)
In Grafana:
Query: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
When you click on a data point:
├─ Metric value: 450ms
└─ Exemplar link: "View trace abc-123-def" → click to open the full trace in Jaeger
Debugging Workflow
1. Start with Metrics (Dashboard alert)
Alert: P95 latency > 1000ms for /api/orders endpoint
2. Check Logs (Find errors)
Kibana query: endpoint:"/api/orders" AND level:error AND @timestamp:[now-15m TO now]
Result: "payment_service_timeout" errors
3. View Traces (Find bottleneck)
Jaeger query: service=order-service operation=/api/orders errors=true
Result: 90% of time spent in payment-service HTTP call
└─ Root cause: payment-service database query taking 2+ seconds
4. Fix and Verify
Fix: Add database index
Verify:
├─ Metrics: P95 latency back to 200ms ✅
├─ Logs: No more timeout errors ✅
└─ Traces: Payment service now <100ms ✅
Best Practices
1. Cardinality Management
# ✅ Good: Low cardinality
metrics.labels(
method='GET',
endpoint='/api/users', # Known endpoints only
status_code='200'
)
# ❌ Bad: High cardinality (will explode storage)
metrics.labels(
user_id='12345', # Millions of users
trace_id='abc...' # Unique per request
)
2. Sampling Strategy
observability_sampling:
metrics:
sample_rate: 100% # Always collect (cheap)
logs:
debug: 0% # Only in development
info: 100% # Always
warning: 100% # Always
error: 100% # Always
traces:
normal_requests: 1% # Sample 1%
slow_requests: 100% # Always trace slow requests
errors: 100% # Always trace errors
3. Cost Optimization
cost_optimization:
retention:
metrics:
raw: "15 days"
5m_aggregates: "90 days"
1h_aggregates: "2 years"
logs:
debug: "1 day"
info: "30 days"
warning: "90 days"
error: "1 year"
traces:
sampled: "7 days"
errors: "30 days"
4. Alert Design
# Alert on symptoms (user-facing), not causes (internal)
# ✅ Good: User-facing metric
alert: HighErrorRate
expr: |
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
> 0.01
for: 5m
annotations:
summary: "5xx error rate above 1%"
# ❌ Bad: Internal metric (might not affect users)
alert: HighCPU
expr: cpu_usage > 80
for: 5m
Tools Comparison
Metrics
| Tool | Pros | Cons | Best For |
|---|---|---|---|
| Prometheus | Open source, powerful queries, integrations | Scaling complexity | Kubernetes, cloud-native |
| Datadog | Easy setup, great UX | Expensive | Enterprises with budget |
| CloudWatch | Native AWS integration | AWS-only, limited queries | AWS-heavy environments |
Logs
| Tool | Pros | Cons | Best For |
|---|---|---|---|
| ELK Stack | Powerful search, flexible | Complex setup, resource-heavy | Large scale, complex queries |
| Loki | Lightweight, Prometheus-like | Less feature-rich | Cost-sensitive, Grafana users |
| CloudWatch Logs | AWS native | Expensive at scale | AWS environments |
Traces
| Tool | Pros | Cons | Best For |
|---|---|---|---|
| Jaeger | Open source, mature | Self-hosted complexity | Kubernetes, microservices |
| Zipkin | Simple, widely supported | Fewer features than Jaeger | Simple setups |
| Datadog APM | Integrated with metrics/logs | Expensive | All-in-one solution |
Conclusion
The three pillars of observability work together to provide complete system understanding:
- Metrics: Detect problems (what and when)
- Logs: Investigate problems (why and context)
- Traces: Locate problems (where in the flow)
Key Takeaways:
- Start with metrics: Cheapest, easiest to alert on
- Add structured logging: Essential for debugging
- Implement tracing: Critical for distributed systems
- Correlate everything: Use correlation IDs across all three
- Sample intelligently: Balance cost and coverage
- Optimize for your scale: Different tools for different sizes
Remember: “Observability is not a tool, it’s a property of your system. Build it in from day one.”
“If you can’t measure it, you can’t improve it. If you can’t debug it, you can’t fix it. If you can’t trace it, you can’t optimize it.”