Introduction
Prometheus queries can become slow and resource-intensive as your metrics scale. This guide covers PromQL optimization techniques, recording rules, and performance best practices to keep your monitoring fast and efficient.
PromQL Optimization
Understanding Query Performance
Factors affecting query performance:
- Number of time series matched
- Time range queried
- Query complexity
- Cardinality of labels
- Rate of data ingestion
Check query stats:
# Grafana: Enable query inspector
# Shows: Query time, series count, samples processed
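You can also ask Prometheus itself for query statistics via the HTTP API; recent versions accept a stats parameter on the query endpoints (the URL below is an example):
# Instant query with engine statistics included in the response
curl 'http://prometheus:9090/api/v1/query?query=up&stats=all'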
1. Limit Time Series Selection
Bad (matches too many series):
# Matches ALL http_requests across all services
rate(http_requests_total[5m])
Good (specific label matching):
# Matches only specific service
rate(http_requests_total{service="api", environment="production"}[5m])
Label matching operators:
# Exact match
http_requests_total{method="GET"}
# Regex match (slower, use sparingly)
http_requests_total{method=~"GET|POST"}
# Negative match
http_requests_total{method!="OPTIONS"}
# Regex negative match
http_requests_total{method!~"OPTIONS|HEAD"}
2. Use Appropriate Time Ranges
Bad (unnecessary long range):
# Queries 1 hour of data for 5m rate
rate(http_requests_total{service="api"}[1h])
Good (minimal necessary range):
# 5m range sufficient for rate calculation
rate(http_requests_total{service="api"}[5m])
Guidelines:
- rate() / irate(): use 2-5x the scrape interval
- increase(): match your actual time window
- avg_over_time(): use an appropriate window for smoothing
Examples:
# Scrape interval: 30s
# Good for rate: 2-5 minutes
rate(metric[2m]) # 4 scrape intervals
rate(metric[5m]) # 10 scrape intervals
# Bad: Too short (noisy)
rate(metric[30s]) # Only 1 interval
# Bad: Too long (slow, wasteful)
rate(metric[1h]) # 120 intervals unnecessary
3. Avoid High-Cardinality Labels
Bad (unbounded cardinality):
# user_id can be millions of values
http_requests_total{user_id="12345"}
# request_id unique per request
http_requests_total{request_id="abc-123-def"}
Good (bounded cardinality):
# Limited set of methods
http_requests_total{method="GET", path="/api/users"}
# Aggregated by status code
http_requests_total{status="200"}
Check cardinality:
# Count unique time series for a metric
count(http_requests_total)
# Count by label
count by (method) (http_requests_total)
# Total series in Prometheus
count({__name__=~".+"})
What is cardinality? It’s the number of unique time series. Each unique combination of metric name + all label values = 1 series.
Example of cardinality:
Metric: http_requests_total
Labels: method={GET, POST, PUT, DELETE}, status={200, 400, 500}
Cardinality = 4 methods × 3 statuses = 12 unique series
Names:
- http_requests_total{method="GET", status="200"}
- http_requests_total{method="GET", status="400"}
- ... (10 more combinations)
Why it matters:
- High cardinality = more memory needed
- High-cardinality labels like user_id and request_id can cause memory issues
- Monitor cardinality to prevent system overload (see the query below)
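To see which metrics contribute the most series, a query like the following helps locate cardinality offenders (it scans every series, so run it sparingly on large instances):
# Top 10 metric names by number of series
topk(10, count by (__name__) ({__name__=~".+"}))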
4. Use Efficient Aggregations
Bad (aggregates then filters):
# Processes all series, then filters
sum(rate(http_requests_total[5m])) > 100
Good (filters then aggregates):
# Filters first, processes less data
sum(rate(http_requests_total{status=~"5.."}[5m]))
Aggregation operators:
# Fast aggregations (single pass)
sum(metric)
avg(metric)
min(metric)
max(metric)
count(metric)
# Expensive aggregations (sorting required)
topk(10, metric) # Top 10 values
bottomk(5, metric) # Bottom 5 values
quantile(0.95, metric) # 95th percentile
Why the difference?
- Fast (sum, avg, min, max, count): Process each series once, combine results. Time complexity: O(n)
- Expensive (topk, bottomk, quantile): Need to sort all series first. Time complexity: O(n log n)
Real-world performance:
Scenario: 50,000 series
sum() operation:
- Processes each series once
- Result: ~1ms on modern hardware
- Perfect for dashboards
topk(10) operation:
- Must compare all 50,000 series
- Needs sorting
- Result: ~200-500ms
- Can be slow on dashboards
Better alternative:
- Pre-compute with recording rule
- Store top 10 values daily
- Query is then instant
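A minimal sketch of such a recording rule, reusing the http_requests_total example from above (the rule name, grouping, and interval are illustrative choices, not fixed conventions):
groups:
  - name: topk_precompute
    interval: 60s
    rules:
      - record: service:http_requests:rate5m:topk10
        expr: topk(10, sum by (service) (rate(http_requests_total[5m])))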
Optimize with grouping:
# Without grouping (processes all series)
sum(rate(http_requests_total[5m]))
# With grouping (reduces cardinality)
sum by (service, status) (rate(http_requests_total[5m]))
# Exclude labels (keep everything except)
sum without (instance, pod) (rate(http_requests_total[5m]))
5. Avoid Expensive Operations
Slow operations to minimize:
# Regex matching (especially in aggregations)
sum(rate({__name__=~"http_.*"}[5m])) # Slow
# Multiple joins
metric_a / on (label) metric_b / on (label) metric_c # Slow
# Many-to-many matching
metric_a * on (label) group_left() metric_b # Can be slow
Better alternatives:
# Use exact metric names when possible
sum(rate(http_requests_total[5m])) # Fast
# Simplify joins
metric_a / on (label) metric_b # Faster
# Use recording rules for complex queries
6. Use irate() vs rate() Appropriately
rate() - Average rate over time window:
# Smoothed rate, good for alerts
rate(http_requests_total[5m])
# Less sensitive to spikes
# Better for steady metrics
What does rate() do?
- Calculates the average rate of change over the time window
- Uses all data points in the range
- Smooths out spikes and noise
- Better for alerting and dashboards showing trends
Example:
Metric: http_requests_total = [100, 150, 200, 250, 300] (every 1 minute)
Time window: 5m
rate() result:
- Total increase: 300 - 100 = 200 requests
- Time window: 5 minutes = 300 seconds
- Rate = 200 requests / 300 seconds ≈ 0.667 req/sec
- This is smooth and averaged
irate() - Instant rate (last 2 data points):
# High sensitivity, good for volatile metrics
irate(http_requests_total[5m])
# More responsive to changes
# Use for fast-changing counters
What does irate() do?
- Uses ONLY the last 2 data points in the range
- Ignores the rest of the time window
- Very responsive to changes
- Can be noisy/spiky
Example with same data:
Metric: http_requests_total = [100, 150, 200, 250, 300]
irate() result:
- Takes last 2 points: 250, 300
- Rate = (300 - 250) / 60 sec = 0.833 req/sec
- More reactive to latest trends
Visual comparison:
True traffic pattern:
Requests/sec
^
|  /\    /\    /\
| /  \  /  \  /  \
+------------------ Time
rate() result (smoothed):
|  /\    /\    /\
| /  \  /  \  /  \   (follows trend average)
irate() result (reactive):
| /\/\/\/\/\/\/\/\   (jumpy, follows every change)
7. Optimize Subqueries
Bad (nested subqueries):
# Very expensive
rate(
rate(http_requests_total[5m])[10m:]
)
Good (use recording rules):
# Pre-compute the inner rate as a recording rule, then apply a
# gauge-friendly function to it (rate() should only be used on counters)
deriv(http_requests:rate5m[10m])
Recording Rules
What are Recording Rules?
Pre-computed queries that run at regular intervals and store results as new metrics.
Benefits:
- Faster dashboard load times
- Reduced query complexity
- Lower resource usage
- Consistent calculations
Basic Recording Rule
prometheus.yml:
rule_files:
- /etc/prometheus/rules/*.yml
rules/http_requests.yml:
groups:
- name: http_request_rules
interval: 30s # Evaluation interval
rules:
# Recording rule
- record: http_requests:rate5m
expr: rate(http_requests_total[5m])
- record: http_requests:rate5m:sum_by_service
expr: sum by (service) (rate(http_requests_total[5m]))
Using recording rules:
# Instead of:
sum by (service) (rate(http_requests_total[5m]))
# Use:
http_requests:rate5m:sum_by_service
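Rule files are worth validating before Prometheus loads them; promtool ships with Prometheus for exactly this:
# Check rule syntax before reloading Prometheus
promtool check rules /etc/prometheus/rules/http_requests.yml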
Naming Convention
Format: level:metric:operations
http_requests:rate5m # rate over 5m
http_requests:rate5m:sum_by_service # aggregated by service
http_requests:rate5m:sum_by_service_status # multiple labels
Complex Recording Rules
CPU usage percentage:
groups:
- name: cpu_rules
interval: 30s
rules:
# Step 1: Calculate rate
- record: node_cpu:rate1m
expr: rate(node_cpu_seconds_total[1m])
# Step 2: Calculate non-idle CPU
- record: node_cpu:usage_rate1m
expr: |
1 - sum by (instance) (
rate(node_cpu_seconds_total{mode="idle"}[1m])
) / sum by (instance) (
rate(node_cpu_seconds_total[1m])
)
# Step 3: Convert to percentage
- record: node_cpu:usage_percent
expr: node_cpu:usage_rate1m * 100
Understanding CPU usage formula:
CPU Usage = 1 - (idle time / total time)
Why?
- node_cpu_seconds_total{mode="idle"} = CPU spent doing nothing
- All modes combined = total CPU time spent
- If idle = 50%, then usage = 100% - 50% = 50%
Example (over 1 minute):
- Total CPU: 60 seconds
- Idle: 30 seconds
- Usage: 1 - (30/60) = 1 - 0.5 = 0.5 = 50%
Why multiply by 100?
- Rate gives decimal (0.5)
- Multiply by 100 to get percentage (50%)
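For comparison, the same calculation as a single ad-hoc query (no recording rules), assembled from the expression in the rule above:
# Per-instance CPU usage in percent
100 * (
  1 - sum by (instance) (rate(node_cpu_seconds_total{mode="idle"}[1m]))
      / sum by (instance) (rate(node_cpu_seconds_total[1m]))
)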
Application SLI:
groups:
- name: sli_rules
interval: 30s
rules:
# Availability SLI
- record: http:availability:rate5m
expr: |
sum(rate(http_requests_total{status=~"2..|3.."}[5m]))
/
sum(rate(http_requests_total[5m]))
# Latency SLI (requests under 300ms)
- record: http:latency_sli:rate5m
expr: |
sum(rate(http_request_duration_seconds_bucket{le="0.3"}[5m]))
/
sum(rate(http_request_duration_seconds_count[5m]))
# Error rate
- record: http:error_rate:rate5m
expr: |
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
Understanding SLI formulas:
Availability SLI:
= Successful requests / Total requests
Why status=~"2..|3.." ?
- 2xx (200-299) = success responses
- 3xx (300-399) = redirects (still successful)
- 4xx (400-499) = client error (user's fault, not service)
- 5xx (500-599) = server error (service's fault, counts as failure)
Example:
- 1000 requests total
- 950 were 2xx/3xx
- 50 were 5xx
- Availability = 950 / 1000 = 95%
This means service was available 95% of the time
Latency SLI:
= Requests under threshold / Total requests
Why le="0.3" ?
- le = "less than or equal"
- 0.3 = 300 milliseconds threshold
- This measures % of requests that responded in <300ms
Example:
- 1000 requests total
- 950 responded in <300ms
- Latency SLI = 950 / 1000 = 95%
This means 95% of requests were fast (met SLO)
Error Rate SLI:
= Server errors / Total requests
Why status=~"5.." ?
- Only 5xx errors (500-599) count as service errors
- 4xx errors are client issues (not service fault)
Example:
- 1000 requests total
- 50 were 5xx errors
- Error rate = 50 / 1000 = 5% (error rate)
- Or: 95% success rate (inverse)
Combined SLI:
If ALL three are >95%, then service meets SLO
If ANY is <95%, investigate that dimension
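A hedged sketch of an alert built on these recording rules (the 0.95 threshold and 10m duration are example values, not prescriptions from this guide):
groups:
  - name: slo_alerts
    rules:
      - alert: AvailabilitySLOBreach
        expr: http:availability:rate5m < 0.95
        for: 10m
        labels:
          severity: critical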
When to Use Recording Rules
Good candidates:
- Queries used in multiple dashboards
- Complex aggregations
- Queries that time out
- Frequent alert evaluations
- SLI/SLO calculations
Example:
# This query is used in 5 dashboards and 3 alerts
# Perfect for recording rule
groups:
- name: pod_memory_rules
interval: 30s
rules:
- record: pod_memory:usage_bytes:sum_by_namespace
expr: sum by (namespace) (container_memory_usage_bytes)
Performance Tuning
1. Optimize Scrape Configuration
prometheus.yml:
global:
scrape_interval: 30s # Balance between freshness and load
scrape_timeout: 10s # Timeout for scrape
evaluation_interval: 30s # How often to evaluate rules
scrape_configs:
- job_name: 'kubernetes-pods'
scrape_interval: 15s # Override for important metrics
sample_limit: 10000 # Prevent scraping too many metrics
- job_name: 'slow-endpoints'
scrape_interval: 60s # Less frequent for slow targets
scrape_timeout: 30s
2. Metric Relabeling
Drop unnecessary metrics:
scrape_configs:
- job_name: 'kubernetes-pods'
metric_relabel_configs:
# Drop high-cardinality metrics
- source_labels: [__name__]
regex: 'grpc_io_.*'
action: drop
# Drop debugging metrics
- source_labels: [__name__]
regex: 'debug_.*'
action: drop
# Keep only specific metrics
- source_labels: [__name__]
regex: '(http_requests_total|http_request_duration_seconds).*'
action: keep
Reduce label cardinality:
metric_relabel_configs:
# Remove high-cardinality labels
- regex: 'pod_id|container_id|request_id'
action: labeldrop
# Aggregate pod names to deployment
- source_labels: [pod]
target_label: deployment
regex: '(.*)-[0-9a-f]{10}-.*'
replacement: '${1}'
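As with rule files, the scrape and relabel configuration can be validated before a reload:
# Validate prometheus.yml, including relabel configs
promtool check config /etc/prometheus/prometheus.yml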
3. Retention Configuration
# Retention (command-line flags; these are not prometheus.yml settings)
--storage.tsdb.path=/prometheus/data
--storage.tsdb.retention.time=15d    # Keep data for 15 days
--storage.tsdb.retention.size=50GB   # Or 50GB, whichever limit is hit first
# Query limits (command-line flags)
--query.max-samples=50000000         # Max samples a single query can load
--query.timeout=2m                   # Query timeout
--query.lookback-delta=5m            # How far back to look for the latest sample
4. Resource Limits
Container resources:
# Kubernetes deployment
resources:
requests:
memory: "4Gi"
cpu: "2"
limits:
memory: "8Gi"
cpu: "4"
Memory formula:
Memory needed ≈ (Active series × 2KB) + (Chunks × 12KB)
Example:
- 1M active series
- Memory: 1M × 2KB = 2GB
- Add overhead: ~4GB minimum
What is this? This formula estimates how much RAM Prometheus needs to store metrics in memory.
Why 2KB per series? Each time series in Prometheus (identified by metric name + label set) requires approximately 2KB of memory just for the index and metadata. For example, http_requests_total{service="api", method="GET", status="200"}
is one series.
What are chunks? Chunks are blocks of time-series data stored in memory. When data comes in, it’s first buffered in chunks (~1 hour of data each). One chunk needs about 12KB.
Real-world example:
Production cluster with:
- 5M active series (each combination of labels)
- ~10 chunks in memory (data buffering)
Calculation:
- Series memory: 5M × 2KB = 10GB
- Chunks memory: 10 × 12KB ≈ 120KB (negligible)
- Overhead/margin: +50% = 5GB
- Total needed: 15GB RAM
So allocate: 16GB as safe minimum, 20GB as comfortable
Why this matters:
- If you set the memory limit too low → Prometheus crashes
- If too high → wastes money on over-provisioning
- Use this formula to plan resource requests/limits
5. Query Optimization Flags
# prometheus.yml or command flags
--query.max-concurrency=20 # Concurrent queries
--query.timeout=2m # Query timeout
--storage.tsdb.min-block-duration=2h # Block duration
--storage.tsdb.max-block-duration=36h # Max block size
Query Best Practices
1. Dashboard Queries
Optimize for fast loading:
# Instead of querying raw metrics
http_requests_total
# Use recording rule
http_requests:rate5m:sum_by_service
# Limit time range
http_requests:rate5m:sum_by_service[6h]
# Use relative time ranges
http_requests:rate5m:sum_by_service[$__range] # Grafana variable
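Grafana also provides $__rate_interval, which automatically picks a range safely larger than the scrape interval; a sketch of how it could be used with the raw metric:
# Let Grafana choose an appropriate rate window
sum by (service) (rate(http_requests_total{environment="production"}[$__rate_interval]))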
2. Alert Queries
Keep alerts simple:
groups:
- name: alerts
interval: 30s
rules:
# Good: Simple threshold on recording rule
- alert: HighErrorRate
expr: http:error_rate:rate5m > 0.05
for: 5m
# Bad: Complex query in alert
- alert: HighErrorRate
expr: |
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
> 0.05
for: 5m
3. API Queries
Use instant queries when possible:
# Instant query (single timestamp)
curl 'http://prometheus:9090/api/v1/query?query=up'
# Range query (multiple timestamps, slower)
curl 'http://prometheus:9090/api/v1/query_range?query=up&start=...'
4. Avoid Common Pitfalls
Don’t:
# Query all metrics
{__name__=~".+"}
# Use unbounded regex
metric{label=~".*"}
# Aggregate high-cardinality metrics
sum(metric{user_id=~".+"})
# Query very long time ranges
metric[30d] # For rate calculations
Do:
# Query specific metrics
up{job="api"}
# Use exact matches
metric{label="value"}
# Use recording rules for aggregations
metric:aggregated
# Use appropriate time ranges
rate(metric[5m])
Monitoring Prometheus Performance
Common PromQL Commands & Metrics Glossary
Percentile Calculations (p-values):
P95 (95th Percentile):
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
What is it? 95% of requests responded faster than this value.
Example: P95 = 200ms
- Meaning: 95% of users saw response time ≤ 200ms
- 5% saw slower responses
- Use case: SLO target “95% of requests <200ms”
P99 (99th Percentile):
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))
What is it? 99% of requests responded faster than this value.
Example: P99 = 500ms
- Meaning: 99% of users saw response time ≤ 500ms
- Only 1% saw slower responses
- Use case: Detecting performance issues for power users
Other common percentiles:
# P50 (Median) - 50% of requests faster than this
histogram_quantile(0.50, rate(http_request_duration_seconds_bucket[5m]))
# P90 - 90% of requests faster than this
histogram_quantile(0.90, rate(http_request_duration_seconds_bucket[5m]))
# P99.9 - 99.9% of requests faster than this
histogram_quantile(0.999, rate(http_request_duration_seconds_bucket[5m]))
Why percentiles matter:
Average latency = 150ms (misleading!)
But reality:
- P50 (50%): 100ms (half are fast)
- P95 (95%): 200ms (most are ok)
- P99 (99%): 1000ms (some users suffer!)
- P99.9: 5000ms (rare but bad)
You need percentiles to see the real user experience
Rate Calculations:
rate() - Average rate of change:
rate(http_requests_total[5m])
Returns: Requests per second (averaged over 5 minutes)
irate() - Instant rate (reactive):
irate(http_requests_total[5m])
Returns: Requests per second (using last 2 data points)
increase() - Total increase:
increase(http_requests_total[5m])
Returns: Total number of requests in 5 minute window
Aggregation Functions:
sum() - Add all values:
sum(http_requests_total)
Use case: Total requests across all services
avg() - Average value:
avg(http_requests_total)
Use case: Average requests per instance
max() / min() - Highest / lowest:
max(http_request_duration_seconds)
min(http_request_duration_seconds)
Use case: Slowest/fastest response times
topk() - Top N values (expensive):
topk(10, http_requests_total)
Returns: The 10 series with the highest values (e.g., the busiest services)
Warning: Requires sorting, use sparingly
bottomk() - Bottom N values:
bottomk(5, http_requests_total)
Returns: 5 services with lowest request count
count() - Count series:
count(http_requests_total)
Returns: How many unique series match
count_values() - Count series per sample value:
count_values("value", http_requests_total)
Returns: How many series currently report each distinct sample value (the value is stored in the label named by the first argument)
Label Operations:
Group by labels:
sum by (service, status) (http_requests_total)
Result: Separate sum for each service+status combination
Sum without labels:
sum without (instance, pod) (rate(http_requests_total[5m]))
Result: Remove instance/pod details, aggregate everything else
Regex label matching:
http_requests_total{service=~"api|web"} # api or web services
http_requests_total{path!~"/health.*"} # Exclude health checks
http_requests_total{status=~"5.."} # All 5xx errors
Comparison Operators:
Threshold comparisons:
# Greater than
http_requests_total > 1000
# Less than or equal
http_request_duration_seconds <= 0.5
# Not equal
http_status != 200
Boolean operations:
# AND - keep left-hand series that also match the right-hand side
http_requests_total{status="500"} and on (instance) up == 1
# OR - union of both sides
http_requests_total{status="500"} or http_requests_total{status="503"}
# Unless - drop left-hand series that have a match on the right
up unless on (instance) up{environment="test"}
Time Window Functions:
avg_over_time() - Average over time:
avg_over_time(cpu_usage[1h])
Returns: Average CPU for last 1 hour
max_over_time() - Maximum over time:
max_over_time(cpu_usage[1h])
Returns: Peak CPU usage in last 1 hour
min_over_time() - Minimum over time:
min_over_time(cpu_usage[1h])
Returns: Lowest CPU usage in last 1 hour
increase() - Total increase over a window:
increase(errors_total[24h])
Returns: How many new errors in last 24 hours
Common Patterns:
Success rate (availability):
(sum(rate(http_requests_total{status=~"2.."}[5m])) /
sum(rate(http_requests_total[5m]))) * 100
Returns: % of successful requests
Error rate:
(sum(rate(http_requests_total{status=~"5.."}[5m])) /
sum(rate(http_requests_total[5m]))) * 100
Returns: % of server errors
Request latency SLI:
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
Returns: 95th percentile response time
CPU usage percentage:
(1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m]))) * 100
Returns: CPU usage %, where 0 = idle, 100 = full
Memory usage percentage:
(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100
Returns: Memory usage %, where 0 = all memory free, 100 = full
Disk usage percentage:
((node_filesystem_size_bytes - node_filesystem_avail_bytes) /
node_filesystem_size_bytes) * 100
Returns: Disk usage %
Key Metrics
Query performance:
# Query duration
prometheus_engine_query_duration_seconds
# Slow queries (>1s): this metric is a summary, so read the quantile label directly
prometheus_engine_query_duration_seconds{quantile="0.99"} > 1
# Active queries
prometheus_engine_queries
Storage metrics:
# Active time series
prometheus_tsdb_head_series
# Chunks in memory
prometheus_tsdb_head_chunks
# Sample ingestion rate
rate(prometheus_tsdb_head_samples_appended_total[5m])
Resource usage:
# Memory usage
process_resident_memory_bytes
# CPU usage
rate(process_cpu_seconds_total[5m])
# Disk usage
prometheus_tsdb_storage_blocks_bytes
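These metrics can also drive alerts on Prometheus itself; a hedged example (the threshold is an assumption and should match your sizing from the memory formula above):
# Alert when active series approach planned capacity
- alert: PrometheusHighSeriesCount
  expr: prometheus_tsdb_head_series > 2000000
  for: 15m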
Grafana Dashboard Example
{
"dashboard": {
"title": "Prometheus Performance",
"panels": [
{
"title": "Query Duration (99th percentile)",
"targets": [{
"expr": "histogram_quantile(0.99, sum(rate(prometheus_engine_query_duration_seconds_bucket[5m])) by (le))"
}]
},
{
"title": "Active Time Series",
"targets": [{
"expr": "prometheus_tsdb_head_series"
}]
},
{
"title": "Sample Ingestion Rate",
"targets": [{
"expr": "rate(prometheus_tsdb_head_samples_appended_total[5m])"
}]
},
{
"title": "Slow Queries (>1s)",
"targets": [{
"expr": "sum(rate(prometheus_engine_query_duration_seconds_count{slice=\"inner_eval\"}[5m])) by (query)"
}]
}
]
}
}
Advanced Techniques
1. Subquery Optimization
Use wisely:
# Subquery to get max over time
max_over_time(
rate(http_requests_total[5m])[1h:1m]
)
# Better: Use recording rule
max_over_time(http_requests:rate5m[1h])
2. Federation
Federate metrics across Prometheus instances:
# prometheus.yml (global Prometheus)
scrape_configs:
- job_name: 'federate'
honor_labels: true
metrics_path: '/federate'
params:
'match[]':
- '{job="prometheus"}'
- '{__name__=~"job:.*"}' # Only recording rules
static_configs:
- targets:
- 'prometheus-us:9090'
- 'prometheus-eu:9090'
3. Deduplication
Handle duplicate series:
# Multiple Prometheus instances scraping same targets
# Use deduplication
avg without(instance) (up{job="api"})
4. Metric Downsampling
For long-term storage (with Thanos/Cortex):
# Keep high-resolution data for 7d
# Downsample to 5m resolution after 7d
# Downsample to 1h resolution after 30d
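With Thanos, these tiers map to Compactor retention flags; a sketch with assumed values:
# Thanos Compactor retention per resolution (values are examples)
thanos compact \
  --retention.resolution-raw=7d \
  --retention.resolution-5m=30d \
  --retention.resolution-1h=180d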
Troubleshooting Slow Queries
Debug Process
1. Identify slow queries:
# Check query stats
topk(10, prometheus_engine_query_duration_seconds{quantile="0.99"})
2. Check time series count:
# How many series does this match?
count(http_requests_total)
# Too many? Add label filters
count(http_requests_total{service="api"})
3. Reduce time range:
# Is range too long?
rate(http_requests_total[5m]) # Good
# vs
rate(http_requests_total[1h]) # Probably too long
4. Use recording rule:
# Convert slow query to recording rule
- record: http_requests:rate5m
expr: rate(http_requests_total[5m])
5. Check Prometheus logs:
# Look for slow query logs
kubectl logs prometheus-0 | grep "slow query"
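If nothing shows up, enabling the query log records every query with timing information (the file path here is an example):
# prometheus.yml
global:
  query_log_file: /prometheus/query.log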
Complete Example
Before Optimization
Slow dashboard query:
# Takes 10+ seconds
sum by (service, status) (
rate(
http_requests_total{
environment="production",
region=~"us-.*"
}[5m]
)
) / ignoring (status) group_left
sum by (service) (
rate(
http_requests_total{
environment="production",
region=~"us-.*"
}[5m]
)
)
After Optimization
recording_rules.yml:
groups:
- name: http_optimized
interval: 30s
rules:
# Step 1: Pre-filter metrics
- record: http_requests_prod:rate5m
expr: |
rate(
http_requests_total{
environment="production",
region=~"us-.*"
}[5m]
)
# Step 2: Aggregate
- record: http_requests_prod:rate5m:sum_by_service_status
expr: sum by (service, status) (http_requests_prod:rate5m)
- record: http_requests_prod:rate5m:sum_by_service
expr: sum by (service) (http_requests_prod:rate5m)
# Step 3: Calculate percentage
- record: http_requests_prod:status_percent
expr: |
http_requests_prod:rate5m:sum_by_service_status
/ ignoring (status) group_left
http_requests_prod:rate5m:sum_by_service
Optimized dashboard query:
# Now takes <1 second
http_requests_prod:status_percent
Result:
- Query time: 10s → 0.5s (95% faster)
- Lower Prometheus CPU usage
- Consistent results across dashboards
Conclusion
Optimizing Prometheus queries requires:
- Efficient PromQL - Use specific labels, appropriate time ranges
- Recording rules - Pre-compute complex queries
- Performance tuning - Configure retention, limits, relabeling
- Monitoring - Track Prometheus performance metrics
- Best practices - Follow naming conventions, avoid pitfalls
Key takeaways:
- Filter early with label matching
- Use recording rules for complex/frequent queries
- Monitor your monitoring system
- Keep cardinality under control
- Use appropriate aggregation functions
- Test queries before deploying to dashboards
Well-optimized Prometheus queries ensure fast dashboards, reliable alerts, and efficient resource usage.