Introduction

The ELK stack (Elasticsearch, Logstash, Kibana) is powerful for log aggregation and analysis, but requires proper tuning for production workloads. This guide covers Elasticsearch index lifecycle management, Logstash pipeline optimization, and performance best practices.

Elasticsearch Index Lifecycle Management (ILM)

Understanding ILM

ILM automates index management through lifecycle phases:

Phases:

  1. Hot - Actively writing and querying
  2. Warm - No longer writing, still querying
  3. Cold - Rarely queried, compressed
  4. Frozen - Very rarely queried, minimal resources
  5. Delete - Removed from cluster

Basic ILM Policy

Create policy:

PUT _ilm/policy/logs-policy
{
  "policy": {
    "phases": {
      "hot": {
        "min_age": "0ms",
        "actions": {
          "rollover": {
            "max_age": "7d",
            "max_primary_shard_size": "50gb"
          },
          "set_priority": {
            "priority": 100
          }
        }
      },
      "warm": {
        "min_age": "7d",
        "actions": {
          "shrink": {
            "number_of_shards": 1
          },
          "forcemerge": {
            "max_num_segments": 1
          },
          "set_priority": {
            "priority": 50
          }
        }
      },
      "cold": {
        "min_age": "30d",
        "actions": {
          "searchable_snapshot": {
            "snapshot_repository": "backup-repo"
          },
          "set_priority": {
            "priority": 0
          }
        }
      },
      "delete": {
        "min_age": "90d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}

Production ILM Policy

High-volume logs:

PUT _ilm/policy/production-logs
{
  "policy": {
    "phases": {
      "hot": {
        "min_age": "0ms",
        "actions": {
          "rollover": {
            "max_age": "1d",
            "max_primary_shard_size": "30gb",
            "max_docs": 10000000
          },
          "set_priority": {
            "priority": 100
          }
        }
      },
      "warm": {
        "min_age": "2d",
        "actions": {
          "readonly": {},
          "allocate": {
            "require": {
              "data": "warm"
            }
          },
          "shrink": {
            "number_of_shards": 1
          },
          "forcemerge": {
            "max_num_segments": 1
          }
        }
      },
      "cold": {
        "min_age": "14d",
        "actions": {
          "allocate": {
            "require": {
              "data": "cold"
            }
          },
          "freeze": {}
        }
      },
      "delete": {
        "min_age": "30d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}

Index Template with ILM

Create index template:

PUT _index_template/logs-template
{
  "index_patterns": ["logs-*"],
  "template": {
    "settings": {
      "number_of_shards": 3,
      "number_of_replicas": 1,
      "index.lifecycle.name": "production-logs",
      "index.lifecycle.rollover_alias": "logs-write",
      "codec": "best_compression"
    },
    "mappings": {
      "properties": {
        "@timestamp": {
          "type": "date"
        },
        "level": {
          "type": "keyword"
        },
        "message": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "service": {
          "type": "keyword"
        },
        "environment": {
          "type": "keyword"
        }
      }
    }
  }
}

Bootstrap first index:

PUT logs-000001
{
  "aliases": {
    "logs-write": {
      "is_write_index": true
    }
  }
}
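
A quick sanity check that the write alias resolves to the bootstrap index:

# Should return logs-000001 with "is_write_index": true
GET _alias/logs-write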

Node Roles for Hot-Warm-Cold

elasticsearch.yml (hot node):

node.roles: [master, data_hot, ingest]
node.attr.data: hot

elasticsearch.yml (warm node):

node.roles: [data_warm]
node.attr.data: warm

elasticsearch.yml (cold node):

node.roles: [data_cold]
node.attr.data: cold
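
Because the warm and cold phases above allocate on the custom "data" attribute, it is worth confirming that roles and attributes took effect (standard _cat APIs):

# Node roles
GET _cat/nodes?v&h=name,node.role

# Custom node attributes
GET _cat/nodeattrs?v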

Monitoring ILM

Check policy:

# Get policy
GET _ilm/policy/production-logs

# Check index lifecycle status
GET logs-*/_ilm/explain

# Check specific index
GET logs-2025.10.15/_ilm/explain

Common ILM issues:

# Index stuck in a phase
GET logs-*/_ilm/explain?filter_path=indices.*.step,indices.*.failed_step

# Retry failed step
POST logs-2025.10.15/_ilm/retry

Elasticsearch Performance Tuning

1. Shard Sizing

Guidelines:

  • Shard size: 10-50 GB
  • At most ~20 shards per GB of JVM heap on each node
  • Avoid over-sharding (many small shards waste heap and slow searches)

Calculate shards:

# Daily index calculation
daily_data_gb = 100  # GB per day
shard_size_gb = 30   # Target shard size

shards_needed = daily_data_gb / shard_size_gb
# Result: ~3 shards

replicas = 1  # One replica for HA

total_shards = shards_needed * (1 + replicas)
# Result: 6 shards total (3 primary + 3 replica)
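
Once data is flowing, actual shard sizes can be compared against the target with the _cat API:

# Store size per shard, largest first
GET _cat/shards/logs-*?v&h=index,shard,prirep,store&s=store:desc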

Update shard count (note that PUT replaces the existing template, so resubmit the full definition including index_patterns):

PUT _index_template/logs-template
{
  "index_patterns": ["logs-*"],
  "template": {
    "settings": {
      "number_of_shards": 3,      // Calculated above
      "number_of_replicas": 1
    }
  }
}

2. Memory and Heap

JVM heap sizing:

# jvm.options (or a custom file under jvm.options.d/)
# Set heap to 50% of RAM, max 32GB

# 64GB RAM server
-Xms30g
-Xmx30g

# 16GB RAM server
-Xms8g
-Xmx8g

Why not more than 32GB:

  • Above ~32GB the JVM disables compressed object pointers (oops)
  • Memory usage becomes less efficient per GB of heap
  • Better to scale out with more nodes instead

Memory formula:

Total RAM ≈ (JVM heap × 2) + OS overhead

Example:
- JVM heap: 30GB
- Filesystem cache: ~30GB (the other half of RAM)
- OS overhead: ~4GB
- Total: 64GB RAM
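
To confirm the configured heap stays under the compressed-oops threshold, check the startup log line ("heap size [...], compressed ordinary object pointers [true]") or the nodes info API (exact field name may vary by version):

GET _nodes/jvm?filter_path=nodes.*.jvm.using_compressed_ordinary_object_pointers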

3. Query Optimization

Use filters instead of queries:

// Slow (scored)
GET logs-*/_search
{
  "query": {
    "match": {
      "level": "ERROR"
    }
  }
}

// Fast (filtered, cached)
GET logs-*/_search
{
  "query": {
    "bool": {
      "filter": [
        {
          "term": {
            "level": "ERROR"
          }
        }
      ]
    }
  }
}

Use date ranges:

GET logs-*/_search
{
  "query": {
    "bool": {
      "filter": [
        {
          "range": {
            "@timestamp": {
              "gte": "now-1h",
              "lte": "now"
            }
          }
        }
      ]
    }
  }
}

Limit results:

GET logs-*/_search
{
  "size": 100,              // Limit results
  "from": 0,                // Pagination
  "_source": ["@timestamp", "message", "level"],  // Only needed fields
  "query": {
    "match_all": {}
  }
}

4. Mapping Optimization

Disable _source for metrics (_source can only be disabled at index creation time, so set it in the index template rather than on existing indices):

PUT _index_template/metrics-template
{
  "index_patterns": ["metrics-*"],
  "template": {
    "mappings": {
      "_source": {
        "enabled": false
      },
      "properties": {
        "cpu_usage": {
          "type": "float",
          "index": false,  // Not searchable, only for aggregations
          "doc_values": true
        }
      }
    }
  }
}

Note: with _source disabled, documents cannot be reindexed or updated, so reserve this for pure metrics indices.

Use appropriate field types:

{
  "mappings": {
    "properties": {
      "ip_address": {
        "type": "ip"              // Not text
      },
      "status_code": {
        "type": "short"           // Not integer
      },
      "is_error": {
        "type": "boolean"         // Not keyword
      },
      "user_id": {
        "type": "keyword"         // Not text (exact match)
      },
      "message": {
        "type": "text",           // Analyzed text
        "index": true
      },
      "tags": {
        "type": "keyword"         // Array of keywords
      }
    }
  }
}

Disable indexing for unused fields:

{
  "mappings": {
    "properties": {
      "raw_log": {
        "type": "text",
        "index": false,     // Don't index, only store
        "store": true
      }
    }
  }
}

5. Bulk Indexing

Optimize bulk requests:

POST _bulk
{"index": {"_index": "logs-write"}}
{"@timestamp": "2025-10-15T10:00:00", "level": "INFO", "message": "Log 1"}
{"index": {"_index": "logs-write"}}
{"@timestamp": "2025-10-15T10:00:01", "level": "ERROR", "message": "Log 2"}

Bulk sizing:

# Optimal bulk size: 5-15 MB
# Test to find sweet spot

# Too small: Overhead
# Too large: Memory pressure, timeouts
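
One rough way to find the sweet spot is to time the same bulk load at different payload sizes; the .ndjson files below are hypothetical fixtures generated from real log data:

# ~5 MB payload
time curl -s -H 'Content-Type: application/x-ndjson' \
  -XPOST 'localhost:9200/_bulk' --data-binary @bulk-5mb.ndjson > /dev/null

# ~15 MB payload
time curl -s -H 'Content-Type: application/x-ndjson' \
  -XPOST 'localhost:9200/_bulk' --data-binary @bulk-15mb.ndjson > /dev/null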

Bulk settings:

# elasticsearch.yml
thread_pool.write.queue_size: 1000

Logstash Pipeline Optimization

1. Pipeline Configuration

logstash.yml:

# Worker threads (# of CPU cores)
pipeline.workers: 8

# Batch size (events per batch)
pipeline.batch.size: 125

# Batch delay (ms)
pipeline.batch.delay: 50

# Enable persistent queue
queue.type: persisted
queue.max_bytes: 1gb

2. Input Optimization

Beats input:

input {
  beats {
    port => 5044
    host => "0.0.0.0"

    # Connection settings
    client_inactivity_timeout => 300

    # TLS (enable with certificates in production)
    ssl => false
  }
}

Kafka input (high throughput):

input {
  kafka {
    bootstrap_servers => "kafka:9092"
    topics => ["logs"]
    group_id => "logstash"

    # Performance
    consumer_threads => 4
    fetch_min_bytes => "1024"
    fetch_max_wait_ms => "500"

    # Batch processing
    max_poll_records => "500"

    codec => json
  }
}

3. Filter Optimization

Efficient grok patterns:

filter {
  # Bad: Too many grok attempts
  grok {
    match => {
      "message" => [
        "%{PATTERN1}",
        "%{PATTERN2}",
        "%{PATTERN3}",
        "%{PATTERN4}"
      ]
    }
  }

  # Good: Pre-filter before grok
  if [service] == "nginx" {
    grok {
      match => {
        "message" => "%{NGINX_ACCESS_LOG}"
      }
    }
  } else if [service] == "app" {
    grok {
      match => {
        "message" => "%{APP_LOG}"
      }
    }
  }
}

Custom patterns:

# patterns/nginx
NGINX_ACCESS_LOG %{IPORHOST:remote_addr} - %{DATA:remote_user} \[%{HTTPDATE:time_local}\] "%{WORD:request_method} %{DATA:request_uri} HTTP/%{NUMBER:http_version}" %{INT:status} %{INT:body_bytes_sent}
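
To use the pattern file above, point grok at the directory that contains it (the path here is an example; patterns_dir is a standard grok option):

filter {
  grok {
    patterns_dir => ["/etc/logstash/patterns"]
    match => {
      "message" => "%{NGINX_ACCESS_LOG}"
    }
  }
}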

Use dissect for simple parsing:

filter {
  # Dissect (faster than grok)
  dissect {
    mapping => {
      "message" => "%{timestamp} %{level} %{service} %{msg}"
    }
  }

  # Instead of grok
  # grok {
  #   match => {
  #     "message" => "%{TIMESTAMP_ISO8601:timestamp} %{LOGLEVEL:level} %{WORD:service} %{GREEDYDATA:msg}"
  #   }
  # }
}

Conditional processing:

filter {
  # Skip debug logs in production
  if [level] == "DEBUG" and [environment] == "production" {
    drop {}
  }

  # Only parse errors with grok
  if [level] in ["ERROR", "FATAL"] {
    grok {
      match => {
        "message" => "%{STACK_TRACE}"
      }
    }
  }

  # Add tags for routing
  if [status_code] >= 500 {
    mutate {
      add_tag => ["error"]
      add_field => {
        "severity" => "high"
      }
    }
  }
}

4. Output Optimization

Elasticsearch output:

output {
  elasticsearch {
    hosts => ["es1:9200", "es2:9200", "es3:9200"]

    # Index naming
    index => "logs-%{[@metadata][environment]}-%{+YYYY.MM.dd}"

    # Performance
    workers => 4
    bulk_path => "/_bulk"

    # Retry settings
    retry_on_conflict => 3
    retry_max_interval => 5

    # Connection pool
    pool_max => 500
    pool_max_per_route => 100

    # Template management
    manage_template => true
    template_name => "logs"
    template_overwrite => true
  }
}

Handling parse failures:

output {
  elasticsearch {
    hosts => ["es:9200"]
    index => "logs-%{+YYYY.MM.dd}"
  }

  # Write events that failed grok parsing to a local file for inspection
  if "_grokparsefailure" in [tags] {
    file {
      path => "/var/log/logstash/failed-events.log"
      codec => json_lines
    }
  }
}
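
For events that the Elasticsearch output itself rejects (for example mapping conflicts), Logstash also provides a built-in dead letter queue. A minimal sketch, with an example path:

# logstash.yml
dead_letter_queue.enable: true
path.dead_letter_queue: /var/lib/logstash/dlq

# Separate pipeline to reprocess DLQ events
input {
  dead_letter_queue {
    path => "/var/lib/logstash/dlq"
    commit_offsets => true
  }
}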

5. Multiple Pipelines

pipelines.yml:

- pipeline.id: nginx-logs
  path.config: "/etc/logstash/conf.d/nginx.conf"
  pipeline.workers: 4
  queue.type: persisted

- pipeline.id: app-logs
  path.config: "/etc/logstash/conf.d/app.conf"
  pipeline.workers: 8
  queue.type: persisted

- pipeline.id: metrics
  path.config: "/etc/logstash/conf.d/metrics.conf"
  pipeline.workers: 2
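
Confirm the pipelines were loaded via the node info API (same port as the stats API in the next section):

curl -XGET 'localhost:9600/_node/pipelines?pretty'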

Benefits:

  • Isolate workloads
  • Different configurations per pipeline
  • Better resource allocation
  • Easier troubleshooting

6. Monitoring Logstash

Monitoring API:

# Pipeline stats
curl -XGET 'localhost:9600/_node/stats/pipelines?pretty'

# JVM stats
curl -XGET 'localhost:9600/_node/stats/jvm?pretty'

# Hot threads
curl -XGET 'localhost:9600/_node/hot_threads?pretty'

Key metrics:

{
  "pipeline": {
    "events": {
      "in": 1000000,
      "filtered": 1000000,
      "out": 950000,
      "duration_in_millis": 5000
    },
    "queue": {
      "events": 500,
      "max_queue_size_in_bytes": 1073741824
    }
  }
}

Grafana dashboard queries:

# Events per second
rate(logstash_events_in[5m])

# Events filtered (parsing)
rate(logstash_events_filtered[5m])

# Events out (to Elasticsearch)
rate(logstash_events_out[5m])

# Queue size
logstash_queue_events

Filebeat Configuration

Efficient Filebeat Setup

filebeat.yml:

filebeat.inputs:
- type: log
  enabled: true
  paths:
    - /var/log/app/*.log

  # Multiline logs (stack traces)
  multiline.type: pattern
  multiline.pattern: '^[0-9]{4}-[0-9]{2}-[0-9]{2}'
  multiline.negate: true
  multiline.match: after

  # Fields
  fields:
    service: myapp
    environment: production
  fields_under_root: true

  # Performance
  close_inactive: 5m
  clean_removed: true

  # Exclude
  exclude_lines: ['^DEBUG']

# Processors
processors:
  - drop_event:
      when:
        regexp:
          message: '^DEBUG'

  - add_host_metadata:
      netinfo.enabled: false

  - add_kubernetes_metadata:
      in_cluster: true

# Output
output.logstash:
  hosts: ["logstash:5044"]

  # Load balancing
  loadbalance: true

  # Bulk settings
  bulk_max_size: 2048

  # Compression
  compression_level: 3

  # Worker threads
  worker: 2

# Monitoring
monitoring.enabled: true
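
Before rolling the configuration out, Filebeat's built-in checks catch most mistakes (the config path shown is the default package location):

# Validate configuration syntax
filebeat test config -c /etc/filebeat/filebeat.yml

# Verify connectivity to the Logstash output
filebeat test output -c /etc/filebeat/filebeat.yml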

Complete Example Pipeline

Application Logs

logstash/app-logs.conf:

input {
  beats {
    port => 5044
    client_inactivity_timeout => 300
  }
}

filter {
  # Parse JSON logs
  if [message] =~ /^\{/ {
    json {
      source => "message"
      target => "log"
    }
  }

  # Parse timestamp
  date {
    match => ["[log][timestamp]", "ISO8601"]
    target => "@timestamp"
  }

  # Extract level
  mutate {
    add_field => {
      "level" => "%{[log][level]}"
      "service" => "%{[log][service]}"
    }
  }

  # Parse stack traces for errors
  if [level] in ["ERROR", "FATAL"] {
    mutate {
      add_field => {
        "alert" => "true"
      }
    }

    # Extract error details
    if [log][error] {
      mutate {
        add_field => {
          "error_type" => "%{[log][error][type]}"
          "error_message" => "%{[log][error][message]}"
        }
      }
    }
  }

  # Cleanup
  mutate {
    remove_field => ["message", "log"]
  }
}

output {
  elasticsearch {
    hosts => ["es:9200"]
    index => "app-logs-%{+YYYY.MM.dd}"
    workers => 4
  }

  # Send high priority errors to separate index
  if [level] == "FATAL" {
    elasticsearch {
      hosts => ["es:9200"]
      index => "critical-errors-%{+YYYY.MM.dd}"
    }
  }
}
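
Validate the pipeline before deploying it (standard Logstash flag; the binary and config paths depend on how Logstash was installed):

/usr/share/logstash/bin/logstash -f /etc/logstash/conf.d/app-logs.conf --config.test_and_exit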

Performance Monitoring

Elasticsearch Metrics

Key metrics to monitor:

# Cluster health
GET _cluster/health

# Node stats
GET _nodes/stats

# Index stats
GET logs-*/_stats

# Pending tasks
GET _cluster/pending_tasks

# Thread pool
GET _nodes/stats/thread_pool

Grafana queries:

{
  "panels": [
    {
      "title": "Indexing Rate",
      "target": "rate(elasticsearch_indices_indexing_index_total[5m])"
    },
    {
      "title": "Search Rate",
      "target": "rate(elasticsearch_indices_search_query_total[5m])"
    },
    {
      "title": "JVM Heap Usage",
      "target": "elasticsearch_jvm_memory_used_bytes{area=\"heap\"} / elasticsearch_jvm_memory_max_bytes{area=\"heap\"} * 100"
    },
    {
      "title": "Disk Usage",
      "target": "elasticsearch_filesystem_data_used_bytes / elasticsearch_filesystem_data_size_bytes * 100"
    }
  ]
}

Alerting Rules

Prometheus alerts:

groups:
- name: elasticsearch
  rules:
  - alert: ElasticsearchClusterNotHealthy
    expr: elasticsearch_cluster_health_status{color="red"} == 1
    for: 5m
    labels:
      severity: critical

  - alert: ElasticsearchHighHeapUsage
    expr: |
      elasticsearch_jvm_memory_used_bytes{area="heap"}
      /
      elasticsearch_jvm_memory_max_bytes{area="heap"}
      > 0.9
    for: 15m
    labels:
      severity: warning

  - alert: ElasticsearchHighDiskUsage
    expr: |
      elasticsearch_filesystem_data_used_bytes
      /
      elasticsearch_filesystem_data_size_bytes
      > 0.85
    for: 10m
    labels:
      severity: warning

Troubleshooting

Common Issues

Slow indexing:

# Check thread pool
GET _nodes/stats/thread_pool

# Check pending tasks
GET _cluster/pending_tasks

# Increase refresh interval
PUT logs-*/_settings
{
  "index": {
    "refresh_interval": "30s"
  }
}

Out of memory:

# Check heap usage
GET _nodes/stats/jvm

# Check field data
GET _nodes/stats/indices/fielddata

# Clear field data cache
POST _cache/clear?fielddata=true

Slow searches:

# Profile query
GET logs-*/_search
{
  "profile": true,
  "query": {
    "match_all": {}
  }
}

# Use filters
# Limit time range
# Reduce result size

Best Practices

1. Index patterns:

  • Time-based: logs-YYYY.MM.dd
  • Service-based: logs-{service}-YYYY.MM.dd
  • Environment-based: logs-{env}-YYYY.MM.dd

2. Retention:

  • Hot: 7 days
  • Warm: 7-30 days
  • Cold: 30-90 days
  • Delete: > 90 days

3. Shard sizing:

  • 10-50 GB per shard
  • Max 20 shards per GB heap

4. Replicas:

  • Production: 1 replica minimum
  • Development: 0 replicas OK

5. Monitoring:

  • Cluster health
  • Heap usage
  • Disk usage
  • Indexing/search rates

Conclusion

Tuning the ELK stack requires:

  1. ILM policies - Automated index lifecycle management
  2. Shard optimization - Proper sizing and distribution
  3. Logstash pipelines - Efficient filtering and output
  4. Monitoring - Track performance metrics
  5. Resource allocation - Proper heap and memory sizing

Key takeaways:

  • Use ILM for automatic index management
  • Size shards between 10-50 GB
  • Set JVM heap to 50% of RAM, max 32GB
  • Optimize Logstash filters with conditionals
  • Use dissect instead of grok when possible
  • Monitor cluster health and resource usage
  • Implement proper retention policies

A well-tuned ELK stack handles high log volumes efficiently while maintaining fast search performance.