Introduction

Prometheus queries can become slow and resource-intensive as your metrics scale. This guide covers PromQL optimization techniques, recording rules, and performance best practices to keep your monitoring fast and efficient.

PromQL Optimization

Understanding Query Performance

Factors affecting query performance:

  • Number of time series matched
  • Time range queried
  • Query complexity
  • Cardinality of labels
  • Rate of data ingestion

Check query stats:

# Grafana: Enable query inspector
# Shows: Query time, series count, samples processed
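
The same statistics can also be pulled from the Prometheus HTTP API, which is handy outside Grafana. A minimal sketch, assuming Prometheus is reachable at localhost:9090 and a version recent enough to support the stats parameter:

# Instant query with per-query statistics included in the response
curl -s 'http://localhost:9090/api/v1/query?query=up&stats=all'

# The response gains a "stats" section with evaluation timings
# and the number of samples processed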

1. Limit Time Series Selection

Bad (matches too many series):

# Matches ALL http_requests across all services
rate(http_requests_total[5m])

Good (specific label matching):

# Matches only specific service
rate(http_requests_total{service="api", environment="production"}[5m])

Label matching operators:

# Exact match
http_requests_total{method="GET"}

# Regex match (slower, use sparingly)
http_requests_total{method=~"GET|POST"}

# Negative match
http_requests_total{method!="OPTIONS"}

# Regex negative match
http_requests_total{method!~"OPTIONS|HEAD"}

2. Use Appropriate Time Ranges

Bad (unnecessarily long range):

# Queries 1 hour of data for 5m rate
rate(http_requests_total{service="api"}[1h])

Good (minimal necessary range):

# 5m range sufficient for rate calculation
rate(http_requests_total{service="api"}[5m])

Guidelines:

  • rate() / irate(): Use a range of at least 4x the scrape interval
  • increase(): Match your actual time window
  • avg_over_time(): Use appropriate window for smoothing

Examples:

# Scrape interval: 30s
# Good for rate: 2-5 minutes
rate(metric[2m])  # 4 scrape intervals
rate(metric[5m])  # 10 scrape intervals

# Bad: Too short (often contains only one sample, so no result)
rate(metric[30s])  # Only 1 scrape interval

# Bad: Too long (slow, wasteful, over-smoothed)
rate(metric[1h])   # 120 scrape intervals, unnecessary

3. Avoid High-Cardinality Labels

Bad (unbounded cardinality):

# user_id can be millions of values
http_requests_total{user_id="12345"}

# request_id unique per request
http_requests_total{request_id="abc-123-def"}

Good (bounded cardinality):

# Limited set of methods
http_requests_total{method="GET", path="/api/users"}

# Aggregated by status code
http_requests_total{status="200"}

Check cardinality:

# Count unique time series for a metric
count(http_requests_total)

# Count by label
count by (method) (http_requests_total)

# Total series in Prometheus
count({__name__=~".+"})

What is cardinality? It’s the number of unique time series. Each unique combination of metric name + all label values = 1 series.

Example of cardinality:

Metric: http_requests_total
Labels: method={GET, POST, PUT, DELETE}, status={200, 400, 500}

Cardinality = 4 methods × 3 statuses = 12 unique series
Names:
  - http_requests_total{method="GET", status="200"}
  - http_requests_total{method="GET", status="400"}
  - ... (10 more combinations)

Why it matters:

  • High cardinality = more memory needed
  • High cardinality labels like user_id, request_id can cause memory issues
  • Monitor cardinality to prevent system overload

4. Use Efficient Aggregations

Bad (aggregates then filters):

# Processes all series, then filters
sum(rate(http_requests_total[5m])) > 100

Good (filters then aggregates):

# Filters first, processes less data
sum(rate(http_requests_total{status=~"5.."}[5m]))

Aggregation operators:

# Fast aggregations (single pass)
sum(metric)
avg(metric)
min(metric)
max(metric)
count(metric)

# Expensive aggregations (sorting required)
topk(10, metric)      # Top 10 values
bottomk(5, metric)    # Bottom 5 values
quantile(0.95, metric)  # 95th percentile

Why the difference?

  • Fast (sum, avg, min, max, count): Process each series once, combine results. Time complexity: O(n)
  • Expensive (topk, bottomk, quantile): Need to sort all series first. Time complexity: O(n log n)

Real-world performance:

Scenario: 50,000 series

sum() operation:
- Processes each series once
- Result: ~1ms on modern hardware
- Perfect for dashboards

topk(10) operation:
- Must compare all 50,000 series
- Needs sorting
- Result: ~200-500ms
- Can be slow on dashboards

Better alternative:
- Pre-compute the topk with a recording rule (see the sketch below)
- The rule re-evaluates on its own interval (e.g. every 30s)
- Dashboard queries against the stored result are then near-instant
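
A minimal sketch of such a rule, assuming the service label exists on http_requests_total (rule and label names are illustrative):

groups:
  - name: topk_precompute
    interval: 30s
    rules:
      # Store the current top 10 services by request rate
      - record: http_requests:rate5m:topk10_by_service
        expr: topk(10, sum by (service) (rate(http_requests_total[5m])))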

Optimize with grouping:

# Without grouping (processes all series)
sum(rate(http_requests_total[5m]))

# With grouping (reduces cardinality)
sum by (service, status) (rate(http_requests_total[5m]))

# Exclude labels (keep everything except)
sum without (instance, pod) (rate(http_requests_total[5m]))

5. Avoid Expensive Operations

Slow operations to minimize:

# Regex matching (especially in aggregations)
sum(rate({__name__=~"http_.*"}[5m]))  # Slow

# Multiple joins
metric_a / on (label) metric_b / on (label) metric_c  # Slow

# Many-to-many matching
metric_a * on (label) group_left() metric_b  # Can be slow

Better alternatives:

# Use exact metric names when possible
sum(rate(http_requests_total[5m]))  # Fast

# Simplify joins
metric_a / on (label) metric_b  # Faster

# Use recording rules for complex queries

6. Use irate() vs rate() Appropriately

rate() - Average rate over time window:

# Smoothed rate, good for alerts
rate(http_requests_total[5m])

# Less sensitive to spikes
# Better for steady metrics

What does rate() do?

  • Calculates the average rate of change over the time window
  • Uses all data points in the range
  • Smooths out spikes and noise
  • Better for alerting and dashboards showing trends

Example:

Metric: http_requests_total = [100, 150, 200, 250, 300]  (every 1 minute)
Time window: 5m

rate() result:
- Total increase: 300 - 100 = 200 requests
- Time window: 5 minutes = 300 seconds
- Rate = 200 / 300 s ≈ 0.667 req/sec
- This is smooth and averaged

irate() - Instant rate (last 2 data points):

# High sensitivity, good for volatile metrics
irate(http_requests_total[5m])

# More responsive to changes
# Use for fast-changing counters

What does irate() do?

  • Uses ONLY the last 2 data points in the range
  • Ignores the rest of the time window
  • Very responsive to changes
  • Can be noisy/spiky

Example with same data:

Metric: http_requests_total = [100, 150, 200, 250, 300]

irate() result:
- Takes last 2 points: 250, 300
- Rate = (300 - 250) / 60 sec = 0.833 req/sec
- More reactive to latest trends

Visual comparison:

True traffic pattern:
  Requests/sec
  ^
  | ╱╲    ╱╲    ╱╲
  |╱  ╲  ╱  ╲  ╱  ╲
  +─────────────────→ Time

rate() result (smoothed):
  | ╱╲    ╱╲    ╱╲
  |╱  ╲  ╱  ╲  ╱  ╲  (follows the trend average)

irate() result (reactive):
  | ╱╲╱╲╱╲╱╲╱╲╱╲╱╲╱╲  (jumpy, follows every change)

7. Optimize Subqueries

Bad (nested subqueries):

# Very expensive: the inner rate() is re-evaluated at every subquery step
max_over_time(
  rate(http_requests_total[5m])[10m:]
)

Good (use recording rules):

# Pre-compute the inner query as a recording rule,
# then apply the outer function to the stored series
max_over_time(http_requests:rate5m[10m])

Recording Rules

What are Recording Rules?

Pre-computed queries that run at regular intervals and store results as new metrics.

Benefits:

  • Faster dashboard load times
  • Reduced query complexity
  • Lower resource usage
  • Consistent calculations

Basic Recording Rule

prometheus.yml:

rule_files:
  - /etc/prometheus/rules/*.yml

rules/http_requests.yml:

groups:
  - name: http_request_rules
    interval: 30s  # Evaluation interval
    rules:
      # Recording rule
      - record: http_requests:rate5m
        expr: rate(http_requests_total[5m])

      - record: http_requests:rate5m:sum_by_service
        expr: sum by (service) (rate(http_requests_total[5m]))

Using recording rules:

# Instead of:
sum by (service) (rate(http_requests_total[5m]))

# Use:
http_requests:rate5m:sum_by_service

Naming Convention

Format: level:metric:operations

http_requests:rate5m                    # rate over 5m
http_requests:rate5m:sum_by_service     # aggregated by service
http_requests:rate5m:sum_by_service_status  # multiple labels

Complex Recording Rules

CPU usage percentage:

groups:
  - name: cpu_rules
    interval: 30s
    rules:
      # Step 1: Calculate rate
      - record: node_cpu:rate1m
        expr: rate(node_cpu_seconds_total[1m])

      # Step 2: Calculate non-idle CPU
      - record: node_cpu:usage_rate1m
        expr: |
          1 - sum by (instance) (
            rate(node_cpu_seconds_total{mode="idle"}[1m])
          ) / sum by (instance) (
            rate(node_cpu_seconds_total[1m])
          )

      # Step 3: Convert to percentage
      - record: node_cpu:usage_percent
        expr: node_cpu:usage_rate1m * 100

Understanding CPU usage formula:

CPU Usage = 1 - (idle time / total time)

Why?
- node_cpu_seconds_total{mode="idle"} = CPU spent doing nothing
- All modes combined = total CPU time spent
- If idle = 50%, then usage = 100% - 50% = 50%

Example (over 1 minute):
- Total CPU: 60 seconds
- Idle: 30 seconds  
- Usage: 1 - (30/60) = 1 - 0.5 = 0.5 = 50%

Why multiply by 100?
- Rate gives decimal (0.5)
- Multiply by 100 to get percentage (50%)
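
As a usage sketch, an alert can then reference the pre-computed percentage directly (the alert name and threshold are illustrative):

groups:
  - name: cpu_alerts
    rules:
      - alert: HighCPUUsage
        expr: node_cpu:usage_percent > 90
        for: 10m
        labels:
          severity: warning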

Application SLI:

groups:
  - name: sli_rules
    interval: 30s
    rules:
      # Availability SLI
      - record: http:availability:rate5m
        expr: |
          sum(rate(http_requests_total{status=~"2..|3.."}[5m]))
          /
          sum(rate(http_requests_total[5m]))

      # Latency SLI (requests under 300ms)
      - record: http:latency_sli:rate5m
        expr: |
          sum(rate(http_request_duration_seconds_bucket{le="0.3"}[5m]))
          /
          sum(rate(http_request_duration_seconds_count[5m]))

      # Error rate
      - record: http:error_rate:rate5m
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum(rate(http_requests_total[5m]))

Understanding SLI formulas:

Availability SLI:

= Successful requests / Total requests

Why status=~"2..|3.." ?
- 2xx (200-299) = success responses
- 3xx (300-399) = redirects (still successful)
- 4xx (400-499) = client error (user's fault, not service)
- 5xx (500-599) = server error (service's fault, counts as failure)

Example:
- 1000 requests total
- 950 were 2xx/3xx
- 50 were 5xx
- Availability = 950 / 1000 = 95%

This means service was available 95% of the time

Latency SLI:

= Requests under threshold / Total requests

Why le="0.3" ?
- le = "less than or equal"
- 0.3 = 300 milliseconds threshold
- This measures % of requests that responded in <300ms

Example:
- 1000 requests total
- 950 responded in <300ms
- Latency SLI = 950 / 1000 = 95%

This means 95% of requests were fast (met SLO)

Error Rate SLI:

= Server errors / Total requests

Why status=~"5.." ?
- Only 5xx errors (500-599) count as service errors
- 4xx errors are client issues (not service fault)

Example:
- 1000 requests total
- 50 were 5xx errors
- Error rate = 50 / 1000 = 5%
- Or: 95% success rate (inverse)

Combined SLI:
If availability and latency SLIs are above 95% and the error rate stays below 5%, the service meets its SLO
If any dimension misses its target, investigate that dimension (see the alert sketch below)
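
A sketch of alerting rules built on the recorded SLIs above (alert names and thresholds are illustrative):

groups:
  - name: slo_alerts
    rules:
      - alert: AvailabilitySLOBreach
        expr: http:availability:rate5m < 0.95
        for: 5m

      - alert: LatencySLOBreach
        expr: http:latency_sli:rate5m < 0.95
        for: 5m

      - alert: HighErrorRate
        expr: http:error_rate:rate5m > 0.05
        for: 5m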

When to Use Recording Rules

Good candidates:

  • Queries used in multiple dashboards
  • Complex aggregations
  • Queries that time out
  • Frequent alert evaluations
  • SLI/SLO calculations

Example:

# This query is used in 5 dashboards and 3 alerts
# Perfect for recording rule
groups:
  - name: pod_memory_rules
    interval: 30s
    rules:
      - record: pod_memory:usage_bytes:sum_by_namespace
        expr: sum by (namespace) (container_memory_usage_bytes)

Performance Tuning

1. Optimize Scrape Configuration

prometheus.yml:

global:
  scrape_interval: 30s      # Balance between freshness and load
  scrape_timeout: 10s       # Timeout for scrape
  evaluation_interval: 30s  # How often to evaluate rules

scrape_configs:
  - job_name: 'kubernetes-pods'
    scrape_interval: 15s    # Override for important metrics
    sample_limit: 10000     # Prevent scraping too many metrics

  - job_name: 'slow-endpoints'
    scrape_interval: 60s    # Less frequent for slow targets
    scrape_timeout: 30s

2. Metric Relabeling

Drop unnecessary metrics:

scrape_configs:
  - job_name: 'kubernetes-pods'
    metric_relabel_configs:
      # Drop high-cardinality metrics
      - source_labels: [__name__]
        regex: 'grpc_io_.*'
        action: drop

      # Drop debugging metrics
      - source_labels: [__name__]
        regex: 'debug_.*'
        action: drop

      # Keep only specific metrics
      - source_labels: [__name__]
        regex: '(http_requests_total|http_request_duration_seconds).*'
        action: keep

Reduce label cardinality:

metric_relabel_configs:
  # Remove high-cardinality labels
  - regex: 'pod_id|container_id|request_id'
    action: labeldrop

  # Aggregate pod names to deployment
  - source_labels: [pod]
    target_label: deployment
    regex: '(.*)-[0-9a-f]{10}-.*'
    replacement: '${1}'

3. Retention Configuration

# Retention is configured via command-line flags, not in prometheus.yml
--storage.tsdb.path=/prometheus/data
--storage.tsdb.retention.time=15d    # Keep data for 15 days
--storage.tsdb.retention.size=50GB   # Or 50GB, whichever limit is hit first

# Query limits are also command-line flags
--query.max-samples=50000000    # Max samples a single query may load
--query.timeout=2m              # Query timeout
--query.lookback-delta=5m       # How far back to look for the latest sample

4. Resource Limits

Container resources:

# Kubernetes deployment
resources:
  requests:
    memory: "4Gi"
    cpu: "2"
  limits:
    memory: "8Gi"
    cpu: "4"

Memory formula:

Memory needed ≈ (Active series × 2KB) + (Chunks × 12KB)

Example:
- 1M active series
- Memory: 1M × 2KB = 2GB
- Add overhead: ~4GB minimum

What is this? This formula estimates how much RAM Prometheus needs to store metrics in memory.

Why 2KB per series? Each time series in Prometheus (identified by metric name + label set) requires approximately 2KB of memory just for the index and metadata. For example, http_requests_total{service="api", method="GET", status="200"} is one series.

What are chunks? Chunks are blocks of time-series data stored in memory. When data comes in, it’s first buffered in chunks (~1 hour of data each). One chunk needs about 12KB.

Real-world example:

Production cluster with:
- 5M active series (each combination of labels)
- ~10 chunks in memory (data buffering)

Calculation:
- Series memory: 5M × 2KB = 10GB
- Chunks memory: 10 × 12KB ≈ 120KB (negligible)
- Overhead/margin: +50% = 5GB
- Total needed: 15GB RAM

So allocate: 16GB as safe minimum, 20GB as comfortable

Why this matters:

  • If the memory limit is set too low → Prometheus crashes (OOM-killed)
  • If set too high → you over-provision and waste money
  • Use this formula to plan resource requests/limits, and alert on actual usage (see the sketch below)
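
A minimal sketch of such an alert, assuming the 16GB limit from the example above and that Prometheus scrapes itself under job="prometheus"; the threshold is illustrative:

groups:
  - name: prometheus_self_monitoring
    rules:
      # Fire when resident memory exceeds ~85% of the 16GB limit
      - alert: PrometheusHighMemoryUsage
        expr: process_resident_memory_bytes{job="prometheus"} > 0.85 * 16 * 1024 * 1024 * 1024
        for: 15m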

5. Query Optimization Flags

# Command-line flags (set on the Prometheus binary, not in prometheus.yml)
--query.max-concurrency=20        # Concurrent queries
--query.timeout=2m                # Query timeout
--storage.tsdb.min-block-duration=2h   # Block duration
--storage.tsdb.max-block-duration=36h  # Max block size

Query Best Practices

1. Dashboard Queries

Optimize for fast loading:

# Instead of querying raw metrics
http_requests_total

# Use recording rule
http_requests:rate5m:sum_by_service

# Limit time range
http_requests:rate5m:sum_by_service[6h]

# Use relative time ranges
http_requests:rate5m:sum_by_service[$__range]  # Grafana variable

2. Alert Queries

Keep alerts simple:

groups:
  - name: alerts
    interval: 30s
    rules:
      # Good: Simple threshold on recording rule
      - alert: HighErrorRate
        expr: http:error_rate:rate5m > 0.05
        for: 5m

      # Bad: Complex query in alert
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum(rate(http_requests_total[5m]))
          > 0.05
        for: 5m

3. API Queries

Use instant queries when possible:

# Instant query (single timestamp)
curl 'http://prometheus:9090/api/v1/query?query=up'

# Range query (multiple timestamps, slower)
curl 'http://prometheus:9090/api/v1/query_range?query=up&start=...'
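
For range queries, the step parameter also matters: a coarser step returns fewer points and is cheaper to evaluate. A sketch with illustrative timestamps:

# Range query with an explicit, coarse step
curl 'http://prometheus:9090/api/v1/query_range?query=up&start=2024-01-01T00:00:00Z&end=2024-01-01T06:00:00Z&step=60s'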

4. Avoid Common Pitfalls

Don’t:

# Query all metrics
{__name__=~".+"}

# Use unbounded regex
metric{label=~".*"}

# Aggregate high-cardinality metrics
sum(metric{user_id=~".+"})

# Query very long time ranges
metric[30d]  # For rate calculations

Do:

# Query specific metrics
up{job="api"}

# Use exact matches
metric{label="value"}

# Use recording rules for aggregations
metric:aggregated

# Use appropriate time ranges
rate(metric[5m])

Monitoring Prometheus Performance

Common PromQL Commands & Metrics Glossary

Percentile Calculations (p-values):

P95 (95th Percentile):

histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

What is it? 95% of requests responded faster than this value.

Example: P95 = 200ms

  • Meaning: 95% of users saw response time ≤ 200ms
  • 5% saw slower responses
  • Use case: SLO target “95% of requests <200ms”

P99 (99th Percentile):

histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))

What is it? 99% of requests responded faster than this value.

Example: P99 = 500ms

  • Meaning: 99% of users saw response time ≤ 500ms
  • Only 1% saw slower responses
  • Use case: Detecting performance issues for power users

Other common percentiles:

# P50 (Median) - 50% of requests faster than this
histogram_quantile(0.50, rate(http_request_duration_seconds_bucket[5m]))

# P90 - 90% of requests faster than this  
histogram_quantile(0.90, rate(http_request_duration_seconds_bucket[5m]))

# P99.9 - 99.9% of requests faster than this
histogram_quantile(0.999, rate(http_request_duration_seconds_bucket[5m]))

Why percentiles matter:

Average latency = 150ms (misleading!)

But reality:
- P50 (50%): 100ms (half are fast)
- P95 (95%): 200ms (most are ok)
- P99 (99%): 1000ms (some users suffer!)
- P99.9: 5000ms (rare but bad)

You need percentiles to see the real user experience
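
Because histogram_quantile over raw buckets is one of the heavier dashboard queries, percentiles are also good recording-rule candidates. A minimal sketch (the rule name and service label are illustrative):

groups:
  - name: latency_percentiles
    interval: 30s
    rules:
      - record: http:request_duration_seconds:p95_5m
        expr: |
          histogram_quantile(0.95,
            sum by (le, service) (rate(http_request_duration_seconds_bucket[5m]))
          )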

Rate Calculations:

rate() - Average rate of change:

rate(http_requests_total[5m])

Returns: Requests per second (averaged over 5 minutes)

irate() - Instant rate (reactive):

irate(http_requests_total[5m])

Returns: Requests per second (using last 2 data points)

increase() - Total increase:

increase(http_requests_total[5m])

Returns: Total number of requests in 5 minute window


Aggregation Functions:

sum() - Add all values:

sum(http_requests_total)

Use case: Total requests across all services

avg() - Average value:

avg(http_requests_total)

Use case: Average requests per instance

max() / min() - Highest / lowest:

max(http_request_duration_seconds)
min(http_request_duration_seconds)

Use case: Slowest/fastest response times

topk() - Top N values (expensive):

topk(10, http_requests_total)

Returns: Top 10 services by request count. Warning: slow operation (requires sorting), use sparingly

bottomk() - Bottom N values:

bottomk(5, http_requests_total)

Returns: 5 services with lowest request count

count() - Count series:

count(http_requests_total)

Returns: How many unique series match

count_values() - Count series per sample value:

count_values("value", http_requests_total)

Returns: How many series currently report each distinct sample value (written into a new "value" label)


Label Operations:

Group by labels:

sum by (service, status) (http_requests_total)

Result: Separate sum for each service+status combination

Sum without labels:

sum without (instance, pod) (rate(http_requests_total[5m]))

Result: Remove instance/pod details, aggregate everything else

Regex label matching:

http_requests_total{service=~"api|web"}      # GET or POST
http_requests_total{path!~"/health.*"}       # Exclude health checks
http_requests_total{status=~"5.."}           # All 5xx errors

Comparison Operators:

Threshold comparisons:

# Greater than
http_requests_total > 1000

# Less than or equal
http_request_duration_seconds <= 0.5

# Not equal
http_status != 200

Boolean operations:

# and - keep series present on both sides (matched on shared labels)
(rate(http_requests_total{status="500"}[5m]) > 0) and on (instance) (up == 1)

# or - union of both sides
http_requests_total{status="500"} or http_requests_total{status="503"}

# unless - drop series that also match the right-hand side
up unless on (instance) up{environment="test"}

Time Window Functions:

avg_over_time() - Average over time:

avg_over_time(cpu_usage[1h])

Returns: Average CPU for last 1 hour

max_over_time() - Maximum over time:

max_over_time(cpu_usage[1h])

Returns: Peak CPU usage in last 1 hour

min_over_time() - Minimum over time:

min_over_time(cpu_usage[1h])

Returns: Lowest CPU usage in last 1 hour

increase() - Total increase over a longer window:

increase(errors_total[24h])

Returns: How many new errors in last 24 hours


Common Patterns:

Success rate (availability):

(sum(rate(http_requests_total{status=~"2.."}[5m])) / 
 sum(rate(http_requests_total[5m]))) * 100

Returns: % of successful requests

Error rate:

(sum(rate(http_requests_total{status=~"5.."}[5m])) / 
 sum(rate(http_requests_total[5m]))) * 100

Returns: % of server errors

Request latency SLI:

histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

Returns: 95th percentile response time

CPU usage percentage:

(1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m]))) * 100

Returns: CPU usage %, where 0 = idle, 100 = full

Memory usage percentage:

(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100

Returns: Memory usage % (0 = all memory free, 100 = fully used)

Disk usage percentage:

((node_filesystem_size_bytes - node_filesystem_avail_bytes) / 
 node_filesystem_size_bytes) * 100

Returns: Disk usage %


Key Metrics

Query performance:

# Query duration
prometheus_engine_query_duration_seconds

# Slow queries (99th percentile > 1s)
# This metric is a summary, so read its pre-computed quantile label
prometheus_engine_query_duration_seconds{quantile="0.99"} > 1

# Active queries
prometheus_engine_queries

Storage metrics:

# Active time series
prometheus_tsdb_head_series

# Chunks in memory
prometheus_tsdb_head_chunks

# Sample ingestion rate
rate(prometheus_tsdb_head_samples_appended_total[5m])

Resource usage:

# Memory usage
process_resident_memory_bytes

# CPU usage
rate(process_cpu_seconds_total[5m])

# Disk usage
prometheus_tsdb_storage_blocks_bytes

Grafana Dashboard Example

{
  "dashboard": {
    "title": "Prometheus Performance",
    "panels": [
      {
        "title": "Query Duration (99th percentile)",
        "targets": [{
          "expr": "histogram_quantile(0.99, sum(rate(prometheus_engine_query_duration_seconds_bucket[5m])) by (le))"
        }]
      },
      {
        "title": "Active Time Series",
        "targets": [{
          "expr": "prometheus_tsdb_head_series"
        }]
      },
      {
        "title": "Sample Ingestion Rate",
        "targets": [{
          "expr": "rate(prometheus_tsdb_head_samples_appended_total[5m])"
        }]
      },
      {
        "title": "Slow Queries (>1s)",
        "targets": [{
          "expr": "sum(rate(prometheus_engine_query_duration_seconds_count{slice=\"inner_eval\"}[5m])) by (query)"
        }]
      }
    ]
  }
}

Advanced Techniques

1. Subquery Optimization

Use wisely:

# Subquery to get max over time
max_over_time(
  rate(http_requests_total[5m])[1h:1m]
)

# Better: Use recording rule
max_over_time(http_requests:rate5m[1h])

2. Federation

Federate metrics across Prometheus instances:

# prometheus.yml (global Prometheus)
scrape_configs:
  - job_name: 'federate'
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job="prometheus"}'
        - '{__name__=~"job:.*"}'  # Only recording rules
    static_configs:
      - targets:
        - 'prometheus-us:9090'
        - 'prometheus-eu:9090'

3. Deduplication

Handle duplicate series:

# Multiple Prometheus instances scraping same targets
# Use deduplication
avg without(instance) (up{job="api"})

4. Metric Downsampling

For long-term storage (with Thanos/Cortex):

# Keep high-resolution data for 7d
# Downsample to 5m resolution after 7d
# Downsample to 1h resolution after 30d
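
With Thanos, for example, this policy maps to the compactor's retention flags. A sketch (flag names per the Thanos docs; check the version you run):

thanos compact \
  --data-dir=/var/thanos/compact \
  --objstore.config-file=bucket.yml \
  --retention.resolution-raw=7d \
  --retention.resolution-5m=30d \
  --retention.resolution-1h=180d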

Troubleshooting Slow Queries

Debug Process

1. Identify slow queries:

# Check engine query stats (a summary metric; there is no per-query label,
# use the query log for per-query detail)
topk(10, prometheus_engine_query_duration_seconds{quantile="0.99"})

2. Check time series count:

# How many series does this match?
count(http_requests_total)

# Too many? Add label filters
count(http_requests_total{service="api"})

3. Reduce time range:

# Is range too long?
rate(http_requests_total[5m])  # Good

# vs
rate(http_requests_total[1h])  # Probably too long

4. Use recording rule:

# Convert slow query to recording rule
- record: http_requests:rate5m
  expr: rate(http_requests_total[5m])

5. Check Prometheus logs:

# Look for slow query logs
kubectl logs prometheus-0 | grep "slow query"
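
If nothing shows up, recent Prometheus versions can also write a dedicated query log; a sketch (the file path is illustrative):

# prometheus.yml - log every query with its timings
global:
  query_log_file: /prometheus/query.log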

Complete Example

Before Optimization

Slow dashboard query:

# Takes 10+ seconds
sum by (service, status) (
  rate(
    http_requests_total{
      environment="production",
      region=~"us-.*"
    }[5m]
  )
) / on (service) group_left
sum by (service) (
  rate(
    http_requests_total{
      environment="production",
      region=~"us-.*"
    }[5m]
  )
)

After Optimization

recording_rules.yml:

groups:
  - name: http_optimized
    interval: 30s
    rules:
      # Step 1: Pre-filter metrics
      - record: http_requests_prod:rate5m
        expr: |
          rate(
            http_requests_total{
              environment="production",
              region=~"us-.*"
            }[5m]
          )

      # Step 2: Aggregate
      - record: http_requests_prod:rate5m:sum_by_service_status
        expr: sum by (service, status) (http_requests_prod:rate5m)

      - record: http_requests_prod:rate5m:sum_by_service
        expr: sum by (service) (http_requests_prod:rate5m)

      # Step 3: Calculate percentage
      - record: http_requests_prod:status_percent
        expr: |
          http_requests_prod:rate5m:sum_by_service_status
          / on (service) group_left
          http_requests_prod:rate5m:sum_by_service

Optimized dashboard query:

# Now takes <1 second
http_requests_prod:status_percent

Result:

  • Query time: 10s → 0.5s (95% faster)
  • Lower Prometheus CPU usage
  • Consistent results across dashboards

Conclusion

Optimizing Prometheus queries requires:

  1. Efficient PromQL - Use specific labels, appropriate time ranges
  2. Recording rules - Pre-compute complex queries
  3. Performance tuning - Configure retention, limits, relabeling
  4. Monitoring - Track Prometheus performance metrics
  5. Best practices - Follow naming conventions, avoid pitfalls

Key takeaways:

  • Filter early with label matching
  • Use recording rules for complex/frequent queries
  • Monitor your monitoring system
  • Keep cardinality under control
  • Use appropriate aggregation functions
  • Test queries before deploying to dashboards

Well-optimized Prometheus queries ensure fast dashboards, reliable alerts, and efficient resource usage.