Introduction
The ELK stack (Elasticsearch, Logstash, Kibana) is powerful for log aggregation and analysis, but requires proper tuning for production workloads. This guide covers Elasticsearch index lifecycle management, Logstash pipeline optimization, and performance best practices.
Elasticsearch Index Lifecycle Management (ILM)
Understanding ILM
ILM automates index management through lifecycle phases:
Phases:
- Hot - Actively writing and querying
- Warm - No longer writing, still querying
- Cold - Rarely queried, compressed
- Frozen - Very rarely queried, minimal resources
- Delete - Removed from cluster
Basic ILM Policy
Create policy:
PUT _ilm/policy/logs-policy
{
"policy": {
"phases": {
"hot": {
"min_age": "0ms",
"actions": {
"rollover": {
"max_age": "7d",
"max_primary_shard_size": "50gb"
},
"set_priority": {
"priority": 100
}
}
},
"warm": {
"min_age": "7d",
"actions": {
"shrink": {
"number_of_shards": 1
},
"forcemerge": {
"max_num_segments": 1
},
"set_priority": {
"priority": 50
}
}
},
"cold": {
"min_age": "30d",
"actions": {
"searchable_snapshot": {
"snapshot_repository": "backup-repo"
},
"set_priority": {
"priority": 0
}
}
},
"delete": {
"min_age": "90d",
"actions": {
"delete": {}
}
}
}
}
}
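If you manage policies from code rather than the Kibana console, the same request can be sent with any HTTP client. A minimal sketch in Python using requests, assuming a local cluster without security enabled; the policy body is abridged to the hot and delete phases:
import requests

ES = "http://localhost:9200"  # assumption: local, unsecured cluster

policy = {
    "policy": {
        "phases": {
            "hot": {
                "min_age": "0ms",
                "actions": {
                    "rollover": {"max_age": "7d", "max_primary_shard_size": "50gb"},
                    "set_priority": {"priority": 100},
                },
            },
            "delete": {"min_age": "90d", "actions": {"delete": {}}},
        }
    }
}

resp = requests.put(f"{ES}/_ilm/policy/logs-policy", json=policy, timeout=10)
resp.raise_for_status()
print(resp.json())  # expect {"acknowledged": true}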
Production ILM Policy
High-volume logs:
PUT _ilm/policy/production-logs
{
"policy": {
"phases": {
"hot": {
"min_age": "0ms",
"actions": {
"rollover": {
"max_age": "1d",
"max_primary_shard_size": "30gb",
"max_docs": 10000000
},
"set_priority": {
"priority": 100
}
}
},
"warm": {
"min_age": "2d",
"actions": {
"readonly": {},
"allocate": {
"require": {
"data": "warm"
}
},
"shrink": {
"number_of_shards": 1
},
"forcemerge": {
"max_num_segments": 1
}
}
},
"cold": {
"min_age": "14d",
"actions": {
"allocate": {
"require": {
"data": "cold"
}
},
"freeze": {}
}
},
"delete": {
"min_age": "30d",
"actions": {
"delete": {}
}
}
}
}
}
Index Template with ILM
Create index template:
PUT _index_template/logs-template
{
"index_patterns": ["logs-*"],
"template": {
"settings": {
"number_of_shards": 3,
"number_of_replicas": 1,
"index.lifecycle.name": "production-logs",
"index.lifecycle.rollover_alias": "logs-write",
"codec": "best_compression"
},
"mappings": {
"properties": {
"@timestamp": {
"type": "date"
},
"level": {
"type": "keyword"
},
"message": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"service": {
"type": "keyword"
},
"environment": {
"type": "keyword"
}
}
}
}
}
Bootstrap first index:
PUT logs-000001
{
"aliases": {
"logs-write": {
"is_write_index": true
}
}
}
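Applications should index through the logs-write alias, never the concrete logs-000001 name, so that ILM rollover can swap the backing index transparently. A small sketch with Python and requests, again assuming a local, unsecured cluster; the document fields reuse the mapping above:
import datetime
import requests

ES = "http://localhost:9200"  # assumption: local, unsecured cluster

doc = {
    "@timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    "level": "INFO",
    "service": "myapp",
    "environment": "production",
    "message": "user login succeeded",
}

# Index via the write alias; rollover decides which backing index receives the doc.
resp = requests.post(f"{ES}/logs-write/_doc", json=doc, timeout=10)
resp.raise_for_status()
print(resp.json()["_index"])  # e.g. logs-000001, or a later rollover index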
Node Roles for Hot-Warm-Cold
elasticsearch.yml (hot node):
node.roles: [master, data_hot, ingest]
node.attr.data: hot
elasticsearch.yml (warm node):
node.roles: [data_warm]
node.attr.data: warm
elasticsearch.yml (cold node):
node.roles: [data_cold]
node.attr.data: cold
Monitoring ILM
Check policy:
# Get policy
GET _ilm/policy/production-logs
# Check index lifecycle status
GET logs-*/_ilm/explain
# Check specific index
GET logs-000001/_ilm/explain
Common ILM issues:
# Index stuck in a phase
GET logs-*/_ilm/explain?filter_path=*.step,*.failed_step
# Retry failed step
POST logs-000001/_ilm/retry
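A small script can poll the explain API and surface indices stuck on a failed step so they can be retried. A sketch in Python with requests, assuming a local, unsecured cluster:
import requests

ES = "http://localhost:9200"  # assumption: local, unsecured cluster

resp = requests.get(f"{ES}/logs-*/_ilm/explain", timeout=10)
resp.raise_for_status()

for name, info in resp.json().get("indices", {}).items():
    if info.get("failed_step"):
        print(f"{name}: phase={info.get('phase')} failed_step={info['failed_step']}")
        # The failed step can then be retried:
        # requests.post(f"{ES}/{name}/_ilm/retry", timeout=10)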
Elasticsearch Performance Tuning
1. Shard Sizing
Guidelines:
- Shard size: 10-50 GB
- Max shards per node: 20 per GB of heap
- Avoid over-sharding
Calculate shards:
# Daily index calculation
daily_data_gb = 100 # GB per day
shard_size_gb = 30 # Target shard size
shards_needed = daily_data_gb / shard_size_gb
# Result: ~3 shards (≈33 GB each, within the 10-50 GB target)
replicas = 1 # One replica for HA
total_shards = shards_needed * (1 + replicas)
# Result: 6 shards total (3 primary + 3 replica)
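The cluster-wide shard count matters as much as per-shard size. A small sketch that checks the number of open shards against the 20-shards-per-GB-of-heap guideline above; the retention window, node count, and heap size are illustrative assumptions:
import math

# Assumptions for illustration; daily figures taken from the calculation above
primary_shards_per_day = 3
replicas = 1
retention_days = 30        # days of indices kept open on the cluster
data_nodes = 3
heap_gb_per_node = 30

shards_per_day = primary_shards_per_day * (1 + replicas)   # 6
open_shards = shards_per_day * retention_days               # 180
shards_per_node = open_shards / data_nodes                  # 60
limit_per_node = 20 * heap_gb_per_node                      # 600 (20 shards per GB heap)

assert shards_per_node <= limit_per_node, "cluster is over-sharded for this heap size"
print(f"{shards_per_node:.0f} shards per node, limit {limit_per_node}")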
Update shard count (index templates are replaced wholesale, so re-PUT the full template; settings shown abridged):
PUT _index_template/logs-template
{
  "index_patterns": ["logs-*"],
  "template": {
    "settings": {
      "number_of_shards": 3,   // Calculated above
      "number_of_replicas": 1
    }
  }
}
2. Memory and Heap
JVM heap sizing:
# elasticsearch.yml or jvm.options
# Set heap to 50% of RAM, max 32GB
# 64GB RAM server
-Xms30g
-Xmx30g
# 16GB RAM server
-Xms8g
-Xmx8g
Why not more than 32GB:
- Compressed object pointers are disabled above ~32GB of heap
- Memory usage becomes less efficient
- Better to scale horizontally with more nodes
Memory formula:
Total RAM ≈ JVM heap + filesystem cache (≈ heap) + OS overhead
Example (64GB server):
- JVM heap: 30GB
- Filesystem cache: ~30GB
- OS overhead: ~4GB
3. Query Optimization
Use filter context instead of query context:
// Slow (scored)
GET logs-*/_search
{
"query": {
"match": {
"level": "ERROR"
}
}
}
// Fast (filtered, cached)
GET logs-*/_search
{
"query": {
"bool": {
"filter": [
{
"term": {
"level": "ERROR"
}
}
]
}
}
}
Use date ranges:
GET logs-*/_search
{
"query": {
"bool": {
"filter": [
{
"range": {
"@timestamp": {
"gte": "now-1h",
"lte": "now"
}
}
}
]
}
}
}
Limit results:
GET logs-*/_search
{
"size": 100, // Limit results
"from": 0, // Pagination
"_source": ["@timestamp", "message", "level"], // Only needed fields
"query": {
"match_all": {}
}
}
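The same filtered, time-bounded, size-limited search can be issued from application code. A sketch with Python and requests, assuming a local, unsecured cluster; the field names reuse the mapping above:
import requests

ES = "http://localhost:9200"  # assumption: local, unsecured cluster

query = {
    "size": 100,
    "_source": ["@timestamp", "message", "level"],
    "query": {
        "bool": {
            "filter": [
                {"term": {"level": "ERROR"}},
                {"range": {"@timestamp": {"gte": "now-1h", "lte": "now"}}},
            ]
        }
    },
}

resp = requests.post(f"{ES}/logs-*/_search", json=query, timeout=30)
resp.raise_for_status()
for hit in resp.json()["hits"]["hits"]:
    src = hit["_source"]
    print(src["@timestamp"], src["level"], src["message"])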
4. Mapping Optimization
Disable _source for metrics (this can only be set at index creation, so put it in the index template):
PUT _index_template/metrics-template
{
  "index_patterns": ["metrics-*"],
  "template": {
    "mappings": {
      "_source": {
        "enabled": false
      },
      "properties": {
        "cpu_usage": {
          "type": "float",
          "index": false,   // Not searchable, only for aggregations
          "doc_values": true
        }
      }
    }
  }
}
Note: with _source disabled, reindex, update, and update_by_query no longer work on those indices.
Use appropriate field types:
{
"mappings": {
"properties": {
"ip_address": {
"type": "ip" // Not text
},
"status_code": {
"type": "short" // Not integer
},
"is_error": {
"type": "boolean" // Not keyword
},
"user_id": {
"type": "keyword" // Not text (exact match)
},
"message": {
"type": "text", // Analyzed text
"index": true
},
"tags": {
"type": "keyword" // Array of keywords
}
}
}
}
Disable indexing for unused fields:
{
"mappings": {
"properties": {
"raw_log": {
"type": "text",
"index": false, // Don't index, only store
"store": true
}
}
}
}
5. Bulk Indexing
Optimize bulk requests:
POST _bulk
{"index": {"_index": "logs-write"}}
{"@timestamp": "2025-10-15T10:00:00", "level": "INFO", "message": "Log 1"}
{"index": {"_index": "logs-write"}}
{"@timestamp": "2025-10-15T10:00:01", "level": "ERROR", "message": "Log 2"}
Bulk sizing:
# Optimal bulk size: 5-15 MB
# Test to find sweet spot
# Too small: Overhead
# Too large: Memory pressure, timeouts
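A size-bounded bulk sender illustrates the point: batch NDJSON lines until the payload approaches the target size (~10 MB here, inside the 5-15 MB window above), then flush. A sketch in Python with requests; the URL and target size are assumptions, and the index name reuses the logs-write alias:
import json
import requests

ES = "http://localhost:9200"          # assumption: local, unsecured cluster
TARGET_BYTES = 10 * 1024 * 1024       # aim for ~10 MB bulk bodies
HEADERS = {"Content-Type": "application/x-ndjson"}

def flush(lines):
    if not lines:
        return
    body = "\n".join(lines) + "\n"    # _bulk requires a trailing newline
    resp = requests.post(f"{ES}/_bulk", data=body, headers=HEADERS, timeout=60)
    resp.raise_for_status()
    if resp.json().get("errors"):
        print("some bulk items failed; inspect the per-item responses")

def bulk_index(docs):
    lines, size = [], 0
    for doc in docs:
        action = json.dumps({"index": {"_index": "logs-write"}})
        source = json.dumps(doc)
        lines.extend([action, source])
        size += len(action) + len(source) + 2
        if size >= TARGET_BYTES:      # flush once the batch reaches the target size
            flush(lines)
            lines, size = [], 0
    flush(lines)                      # flush the remainder

bulk_index({"@timestamp": "2025-10-15T10:00:00", "level": "INFO", "message": f"Log {i}"} for i in range(1000))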
Bulk settings:
# elasticsearch.yml
thread_pool.write.queue_size: 1000
Logstash Pipeline Optimization
1. Pipeline Configuration
logstash.yml:
# Worker threads (# of CPU cores)
pipeline.workers: 8
# Batch size (events per batch)
pipeline.batch.size: 125
# Batch delay (ms)
pipeline.batch.delay: 50
# Enable persistent queue
queue.type: persisted
queue.max_bytes: 1gb
2. Input Optimization
Beats input:
input {
beats {
port => 5044
host => "0.0.0.0"
# Connection settings
client_inactivity_timeout => 300
# TLS (enable when beats traffic crosses untrusted networks)
ssl => false
}
}
Kafka input (high throughput):
input {
kafka {
bootstrap_servers => "kafka:9092"
topics => ["logs"]
group_id => "logstash"
# Performance
consumer_threads => 4
fetch_min_bytes => "1024"
fetch_max_wait_ms => "500"
# Batch processing
max_poll_records => "500"
codec => json
}
}
3. Filter Optimization
Efficient grok patterns:
filter {
# Bad: Too many grok attempts
grok {
match => {
"message" => [
"%{PATTERN1}",
"%{PATTERN2}",
"%{PATTERN3}",
"%{PATTERN4}"
]
}
}
# Good: Pre-filter before grok
if [service] == "nginx" {
grok {
match => {
"message" => "%{NGINX_ACCESS_LOG}"
}
}
} else if [service] == "app" {
grok {
match => {
"message" => "%{APP_LOG}"
}
}
}
}
Custom patterns:
# patterns/nginx
NGINX_ACCESS_LOG %{IPORHOST:remote_addr} - %{DATA:remote_user} \[%{HTTPDATE:time_local}\] "%{WORD:request_method} %{DATA:request_uri} HTTP/%{NUMBER:http_version}" %{INT:status} %{INT:body_bytes_sent}
Use dissect for simple parsing:
filter {
# Dissect (faster than grok)
dissect {
mapping => {
"message" => "%{timestamp} %{level} %{service} %{msg}"
}
}
# Instead of grok
# grok {
# match => {
# "message" => "%{TIMESTAMP_ISO8601:timestamp} %{LOGLEVEL:level} %{WORD:service} %{GREEDYDATA:msg}"
# }
# }
}
Conditional processing:
filter {
# Skip debug logs in production
if [level] == "DEBUG" and [environment] == "production" {
drop {}
}
# Only parse errors with grok
if [level] in ["ERROR", "FATAL"] {
grok {
match => {
"message" => "%{STACK_TRACE}"
}
}
}
# Add tags for routing
if [status_code] >= 500 {
mutate {
add_tag => ["error"]
add_field => {
"severity" => "high"
}
}
}
}
4. Output Optimization
Elasticsearch output:
output {
elasticsearch {
hosts => ["es1:9200", "es2:9200", "es3:9200"]
# Index naming
index => "logs-%{[@metadata][environment]}-%{+YYYY.MM.dd}"
# Performance (the old output-level `workers` option is obsolete in modern
# Logstash; concurrency is controlled by pipeline.workers instead)
bulk_path => "/_bulk"
# Retry settings
retry_on_conflict => 3
retry_max_interval => 5
# Connection pool
pool_max => 500
pool_max_per_route => 100
# Template management
manage_template => true
template_name => "logs"
template_overwrite => true
}
}
Routing failed events to a file:
output {
elasticsearch {
hosts => ["es:9200"]
index => "logs-%{+YYYY.MM.dd}"
}
# Write parse failures to a local file (a simple alternative to Logstash's built-in dead letter queue)
if "_grokparsefailure" in [tags] {
file {
path => "/var/log/logstash/failed-events.log"
codec => json_lines
}
}
}
5. Multiple Pipelines
pipelines.yml:
- pipeline.id: nginx-logs
  path.config: "/etc/logstash/conf.d/nginx.conf"
  pipeline.workers: 4
  queue.type: persisted

- pipeline.id: app-logs
  path.config: "/etc/logstash/conf.d/app.conf"
  pipeline.workers: 8
  queue.type: persisted

- pipeline.id: metrics
  path.config: "/etc/logstash/conf.d/metrics.conf"
  pipeline.workers: 2
Benefits:
- Isolate workloads
- Different configurations per pipeline
- Better resource allocation
- Easier troubleshooting
6. Monitoring Logstash
Monitoring API:
# Pipeline stats
curl -XGET 'localhost:9600/_node/stats/pipelines?pretty'
# JVM stats
curl -XGET 'localhost:9600/_node/stats/jvm?pretty'
# Hot threads
curl -XGET 'localhost:9600/_node/hot_threads?pretty'
Key metrics:
{
"pipeline": {
"events": {
"in": 1000000,
"filtered": 1000000,
"out": 950000,
"duration_in_millis": 5000
},
"queue": {
"events": 500,
"max_queue_size_in_bytes": 1073741824
}
}
}
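The counters above are cumulative, so a throughput number requires sampling twice and dividing by the interval. A sketch in Python with requests, assuming the Logstash monitoring API on its default port 9600:
import time
import requests

LS = "http://localhost:9600"   # assumption: monitoring API on the default port

def event_counts():
    stats = requests.get(f"{LS}/_node/stats/pipelines", timeout=10).json()
    totals = {"in": 0, "filtered": 0, "out": 0}
    for pipeline in stats.get("pipelines", {}).values():
        events = pipeline.get("events", {})
        for key in totals:
            totals[key] += events.get(key, 0)
    return totals

before = event_counts()
time.sleep(10)                 # sample interval
after = event_counts()

for key in ("in", "filtered", "out"):
    rate = (after[key] - before[key]) / 10
    print(f"events {key}: {rate:.1f}/s")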
Grafana dashboard queries:
# Events per second
rate(logstash_events_in[5m])
# Events filtered (parsing)
rate(logstash_events_filtered[5m])
# Events out (to Elasticsearch)
rate(logstash_events_out[5m])
# Queue size
logstash_queue_events
Filebeat Configuration
Efficient Filebeat Setup
filebeat.yml:
filebeat.inputs:
  - type: log
    enabled: true
    paths:
      - /var/log/app/*.log

    # Multiline logs (stack traces)
    multiline.type: pattern
    multiline.pattern: '^[0-9]{4}-[0-9]{2}-[0-9]{2}'
    multiline.negate: true
    multiline.match: after

    # Fields
    fields:
      service: myapp
      environment: production
    fields_under_root: true

    # Performance
    close_inactive: 5m
    clean_removed: true

    # Exclude
    exclude_lines: ['^DEBUG']

# Processors
processors:
  - drop_event:
      when:
        regexp:
          message: '^DEBUG'
  - add_host_metadata:
      netinfo.enabled: false
  - add_kubernetes_metadata:
      in_cluster: true

# Output
output.logstash:
  hosts: ["logstash:5044"]
  # Load balancing
  loadbalance: true
  # Bulk settings
  bulk_max_size: 2048
  # Compression
  compression_level: 3
  # Worker threads
  worker: 2

# Monitoring
monitoring.enabled: true
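The multiline settings are a frequent source of confusion. The following self-contained Python sketch mimics what negate: true with match: after does — lines that do not start with a date are appended to the preceding event — using hypothetical sample lines:
import re

# Same pattern as multiline.pattern above
PATTERN = re.compile(r'^[0-9]{4}-[0-9]{2}-[0-9]{2}')

lines = [
    "2025-10-15 10:00:00 ERROR boom",
    "Traceback (most recent call last):",
    '  File "app.py", line 10, in main',
    "2025-10-15 10:00:01 INFO next request",
]

events, current = [], []
for line in lines:
    if PATTERN.match(line):      # a dated line starts a new event
        if current:
            events.append("\n".join(current))
        current = [line]
    else:                        # negate: true + match: after -> continuation line
        current.append(line)
if current:
    events.append("\n".join(current))

for event in events:
    print("---\n" + event)
# Two events: the ERROR with its stack trace, and the INFO line.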
Complete Example Pipeline
Application Logs
logstash/app-logs.conf:
input {
beats {
port => 5044
client_inactivity_timeout => 300
}
}
filter {
# Parse JSON logs
if [message] =~ /^\{/ {
json {
source => "message"
target => "log"
}
}
# Parse timestamp
date {
match => ["[log][timestamp]", "ISO8601"]
target => "@timestamp"
}
# Extract level
mutate {
add_field => {
"level" => "%{[log][level]}"
"service" => "%{[log][service]}"
}
}
# Parse stack traces for errors
if [level] in ["ERROR", "FATAL"] {
mutate {
add_field => {
"alert" => "true"
}
}
# Extract error details
if [log][error] {
mutate {
add_field => {
"error_type" => "%{[log][error][type]}"
"error_message" => "%{[log][error][message]}"
}
}
}
}
# Cleanup
mutate {
remove_field => ["message", "log"]
}
}
output {
elasticsearch {
hosts => ["es:9200"]
index => "app-logs-%{+YYYY.MM.dd}"
}
# Send high priority errors to separate index
if [level] == "FATAL" {
elasticsearch {
hosts => ["es:9200"]
index => "critical-errors-%{+YYYY.MM.dd}"
}
}
}
Performance Monitoring
Elasticsearch Metrics
Key metrics to monitor:
# Cluster health
GET _cluster/health
# Node stats
GET _nodes/stats
# Index stats
GET logs-*/_stats
# Pending tasks
GET _cluster/pending_tasks
# Thread pool
GET _nodes/stats/thread_pool
Grafana queries:
{
"panels": [
{
"title": "Indexing Rate",
"target": "rate(elasticsearch_indices_indexing_index_total[5m])"
},
{
"title": "Search Rate",
"target": "rate(elasticsearch_indices_search_query_total[5m])"
},
{
"title": "JVM Heap Usage",
"target": "elasticsearch_jvm_memory_used_bytes{area=\"heap\"} / elasticsearch_jvm_memory_max_bytes{area=\"heap\"} * 100"
},
{
"title": "Disk Usage",
"target": "elasticsearch_filesystem_data_used_bytes / elasticsearch_filesystem_data_size_bytes * 100"
}
]
}
Alerting Rules
Prometheus alerts:
groups:
  - name: elasticsearch
    rules:
      - alert: ElasticsearchClusterNotHealthy
        expr: elasticsearch_cluster_health_status{color="red"} == 1
        for: 5m
        labels:
          severity: critical
      - alert: ElasticsearchHighHeapUsage
        expr: |
          elasticsearch_jvm_memory_used_bytes{area="heap"}
          /
          elasticsearch_jvm_memory_max_bytes{area="heap"}
          > 0.9
        for: 15m
        labels:
          severity: warning
      - alert: ElasticsearchHighDiskUsage
        expr: |
          elasticsearch_filesystem_data_used_bytes
          /
          elasticsearch_filesystem_data_size_bytes
          > 0.85
        for: 10m
        labels:
          severity: warning
Troubleshooting
Common Issues
Slow indexing:
# Check thread pool
GET _nodes/stats/thread_pool
# Check pending tasks
GET _cluster/pending_tasks
# Increase refresh interval
PUT logs-*/_settings
{
"index": {
"refresh_interval": "30s"
}
}
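A common mitigation is to relax refresh_interval only for the duration of a heavy load and restore it afterwards. A sketch with Python and requests; the URL and the 1s restore value (Elasticsearch's default) are assumptions:
import requests

ES = "http://localhost:9200"   # assumption: local, unsecured cluster

def set_refresh(index_pattern: str, interval: str) -> None:
    resp = requests.put(
        f"{ES}/{index_pattern}/_settings",
        json={"index": {"refresh_interval": interval}},
        timeout=10,
    )
    resp.raise_for_status()

set_refresh("logs-*", "30s")       # or "-1" to disable refresh entirely during the load
try:
    pass                           # ... run the heavy bulk load here ...
finally:
    set_refresh("logs-*", "1s")    # restore the default interval afterwards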
Out of memory:
# Check heap usage
GET _nodes/stats/jvm
# Check field data
GET _nodes/stats/indices/fielddata
# Clear field data cache
POST _cache/clear?fielddata=true
Slow searches:
# Profile query
GET logs-*/_search
{
"profile": true,
"query": {
"match_all": {}
}
}
# Use filters
# Limit time range
# Reduce result size
Best Practices
1. Index patterns:
- Time-based: logs-YYYY.MM.dd
- Service-based: logs-{service}-YYYY.MM.dd
- Environment-based: logs-{env}-YYYY.MM.dd
2. Retention:
- Hot: 7 days
- Warm: 7-30 days
- Cold: 30-90 days
- Delete: > 90 days
3. Shard sizing:
- 10-50 GB per shard
- Max 20 shards per GB heap
4. Replicas:
- Production: 1 replica minimum
- Development: 0 replicas OK
5. Monitoring:
- Cluster health
- Heap usage
- Disk usage
- Indexing/search rates
Conclusion
Tuning the ELK stack requires:
- ILM policies - Automated index lifecycle management
- Shard optimization - Proper sizing and distribution
- Logstash pipelines - Efficient filtering and output
- Monitoring - Track performance metrics
- Resource allocation - Proper heap and memory sizing
Key takeaways:
- Use ILM for automatic index management
- Size shards between 10-50 GB
- Set JVM heap to 50% of RAM, max 32GB
- Optimize Logstash filters with conditionals
- Use dissect instead of grok when possible
- Monitor cluster health and resource usage
- Implement proper retention policies
A well-tuned ELK stack handles high log volumes efficiently while maintaining fast search performance.