Executive Summary

Neo4j is a native graph database that stores data as nodes (entities) connected by relationships (edges). Where a relational database normalizes data into tables and reassembles it at query time with JOINs, Neo4j stores relationships directly, so traversing them is cheap.

Quick decision:

  • Use Neo4j for: Knowledge graphs, authorization/identity, recommendations, fraud detection, network topology, impact analysis
  • Don't use for: Heavy OLAP analytics, simple key-value workloads, document storage

Production deployment: Kubernetes + Helm (managed) or Docker Compose + Causal Cluster (self-managed)


What is Neo4j?

Core Concepts

Labeled Property Graph Model:

Node (entity)            Relationship (connection)
┌──────────────┐         ┌────────────────┐
│ :Person      │         │ :KNOWS         │
│ name: "Bob"  │────────▶│ since: 2020    │
│ age: 30      │         └────────────────┘
└──────────────┘
  • Nodes: Entities with labels and properties
  • Relationships: Typed, directed connections between nodes
  • Labels: Categories (e.g., :Person, :Company)
  • Properties: Key-value data on nodes and relationships
  • Cypher: SQL-like query language for graphs
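The bullets above can be made concrete with a toy (non-Neo4j) Python sketch of the labeled property graph model; the data structures and sample data are illustrative, not Neo4j internals:

```python
# Toy sketch of the labeled property graph model (NOT Neo4j code):
# nodes carry labels and properties; relationships are typed, directed,
# and can carry properties of their own.

nodes = {
    1: {"labels": {"Person"}, "props": {"name": "Alice", "age": 30}},
    2: {"labels": {"Person"}, "props": {"name": "Bob", "age": 30}},
}

# Each relationship references its endpoints directly -- the
# "index-free adjacency" idea behind fast traversals.
rels = [
    {"type": "KNOWS", "start": 1, "end": 2, "props": {"since": 2020}},
]

def neighbors(node_id, rel_type):
    """Follow outgoing relationships of one type from a node."""
    return [nodes[r["end"]] for r in rels
            if r["start"] == node_id and r["type"] == rel_type]

print([n["props"]["name"] for n in neighbors(1, "KNOWS")])  # ['Bob']
```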

When to Use Neo4j

Best use cases:

Use Case               | Example                                   | Why Neo4j
-----------------------|-------------------------------------------|----------------------------------------
Authorization/Identity | Who can access what?                      | Fast relationship traversal (no JOINs)
Knowledge Graphs       | Knowledge bases, semantic search          | Hierarchies + connections
Recommendations        | "Customers like you also liked…"          | Pattern matching on user behavior
Fraud Detection        | Ring detection, money flows               | Detect loops & suspicious patterns
Network/Topology       | Infrastructure dependencies, blast radius | Fast path queries
Impact Analysis        | "If this service fails, what breaks?"     | Upstream/downstream traversal

Don't use Neo4j for:

  • Heavy OLAP analytics (use ClickHouse, BigQuery)
  • Simple key-value cache (use Redis)
  • Document storage (use MongoDB)
  • Time-series data (use InfluxDB, Prometheus)

Neo4j vs Other Databases

Aspect                  | Neo4j                         | PostgreSQL                 | MongoDB                        | Redis
------------------------|-------------------------------|----------------------------|--------------------------------|------------------
Data Model              | Graph (nodes + relationships) | Relational (tables)        | Document (JSON)                | Key-value
Query Pattern           | Traverse relationships        | JOIN heavy tables          | Nested documents               | Simple keys
Join Cost               | O(1) (direct pointers)        | O(n log n) (index scan)    | O(n) (scan)                    | N/A
Real-time graph queries | ✓ Fast                        | ✗ Slow (many JOINs)        | ✗ Slow                         | ✗ Can't do this
Transactions            | ✓ ACID                        | ✓ ACID                     | ✓ ACID                         | Limited
Scalability             | Vertical (clusters available) | Horizontal (sharding hard) | Horizontal (sharding built-in) | Horizontal (easy)

Core Model & Queries

Cypher Essentials

CREATE nodes and relationships:

// Create node with label and properties
CREATE (n:Person {name: "Alice", age: 30})

// Create relationship between existing nodes
MATCH (a:Person {name: "Alice"}), (b:Person {name: "Bob"})
CREATE (a)-[:KNOWS {since: 2020}]->(b)

// Shorthand: CREATE with multiple nodes
CREATE (alice:Person {name: "Alice"})-[:KNOWS]->(bob:Person {name: "Bob"})

MATCH queries (read patterns):

// Simple pattern
MATCH (p:Person {name: "Alice"})
RETURN p

// Traverse relationships
MATCH (alice:Person {name: "Alice"})-[:KNOWS]->(friend)
RETURN friend.name

// Multi-hop traversal (1 to 3 steps)
MATCH (alice:Person {name: "Alice"})-[:KNOWS*1..3]->(distant_friend)
RETURN DISTINCT distant_friend.name

// Find shortest path
MATCH path = shortestPath((a:Person)-[:KNOWS*]-(b:Person))
WHERE a.name = "Alice" AND b.name = "Charlie"
RETURN path

MERGE (upsert):

// Update if exists, create if not
MERGE (p:Person {email: "[email protected]"})
ON CREATE SET p.created_at = timestamp()
ON MATCH SET p.updated_at = timestamp()
SET p.age = 31
RETURN p

Using parameters (IMPORTANT for security):

// GOOD: Parameterized (prevents injection)
MATCH (p:Person {email: $email})
RETURN p
// Run with parameters: {email: "[email protected]"}

// BAD: Pasting user input into the query string (vulnerable)
// query = 'MATCH (p:Person {email: "' + userInput + '"}) RETURN p'
// Don't do this!
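The contrast can be sketched outside of any driver; `parameterized` and `interpolated` below are hypothetical helpers for illustration, not Neo4j APIs:

```python
# Hypothetical sketch of why parameters matter. A real driver call like
# session.run(query, params) keeps the query text and values separate.

def parameterized(email):
    # Query text is a constant; the value travels separately as a parameter.
    query = "MATCH (p:Person {email: $email}) RETURN p"
    params = {"email": email}
    return query, params

def interpolated(email):
    # BAD: user input becomes part of the query text itself.
    return f'MATCH (p:Person {{email: "{email}"}}) RETURN p'

malicious = 'x" OR true RETURN p //'

safe_query, _ = parameterized(malicious)
print(safe_query)            # query text unchanged; no injection possible

print(interpolated(malicious))  # the payload is now executable Cypher
```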

Data Modeling Best Practices

Anti-pattern: Super-nodes

// BAD: Everything connects to one node
(:Transaction)-[:INVOLVES]->(:Company)
(:Account)-[:BELONGS_TO]->(:Company)
(:Customer)-[:WORKS_FOR]->(:Company)
// This creates a bottleneck; traversing through the "Company" node is slow

// GOOD: Use intermediate nodes and directions
(:Transaction)-[:TO_ACCOUNT]->(:Account)
(:Account)-[:AT_COMPANY]->(:Company)
(:Customer)-[:HAS_ACCOUNT]->(:Account)

Relationship directions matter:

// Bad: Bidirectional or unclear
(a)-[:RELATED]-(b)

// Good: Clear, purposeful direction (Cypher can traverse either way,
// so one well-named relationship is enough)
(author:Person)-[:WROTE]->(post:Post)

Use constraints and indexes:

// Create uniqueness constraint (also creates an index) -- Neo4j 5 syntax
CREATE CONSTRAINT email_unique FOR (p:Person) REQUIRE p.email IS UNIQUE

// Create index for faster lookups
CREATE INDEX person_name FOR (p:Person) ON (p.name)

// Create composite index
CREATE INDEX user_company FOR (u:User) ON (u.company, u.role)

Deployment Options

1. Managed: Neo4j Aura

Pros:

  • Zero operations; fully managed
  • Auto backups, updates, scaling
  • Global deployment (multi-region)

Cons:

  • Higher cost ($100+/month minimum)
  • Less control (no custom plugins)
  • API-driven only (no direct cluster access)

Best for: SaaS companies, startups, teams without Kubernetes expertise


2. Self-Managed: Docker Compose (Development)

Quick start for local testing:

# docker-compose.yml
version: '3.8'
services:
  neo4j:
    image: neo4j:5.15-enterprise  # Use community for non-production
    container_name: neo4j
    ports:
      - "7474:7474"  # Browser UI
      - "7687:7687"  # Bolt protocol
    environment:
      NEO4J_AUTH: neo4j/password  # Change this!
      NEO4J_server_memory_heap_initial__size: 2G
      NEO4J_server_memory_heap_max__size: 4G
      NEO4J_server_memory_pagecache_size: 4G
    volumes:
      - neo4j_data:/var/lib/neo4j/data
      - neo4j_logs:/var/lib/neo4j/logs
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:7474"]
      interval: 10s
      timeout: 5s
      retries: 5

volumes:
  neo4j_data:
  neo4j_logs:

Run:

docker-compose up -d
# Access at http://localhost:7474
# Default username: neo4j, password: password

3. Self-Managed: Kubernetes + Helm (Production)

Architecture: Causal Cluster (3 nodes)

                    ┌─────────────────────────┐
                    │   Kubernetes Service    │
                    │ (Load Balancer/Ingress) │
                    └────────────┬────────────┘
                                 │
                 ┌───────────────┼───────────────┐
                 │               │               │
        ┌────────▼───────┐ ┌─────▼─────────┐ ┌───▼───────────┐
        │ PRIMARY/LEADER │ │   FOLLOWER    │ │   FOLLOWER    │
        │  (Write ops)   │ │ (Read replica)│ │ (Read replica)│
        └────────┬───────┘ └─────┬─────────┘ └───────┬───────┘
                 │               │                   │
                 └───────────────┴───────────────────┘
                    Replication (15 sec catchup)

                   ┌──────────────────────────┐
                   │  Backup (Neo4j Backup    │
                   │  sidecar pod)            │
                   └──────────────────────────┘

Helm values (minimal production setup):

# values-prod.yaml

# Image
image:
  repository: neo4j
  tag: 5.15-enterprise
  pullPolicy: IfNotPresent

# Licensing (required for Enterprise)
neo4j:
  # Get this from Neo4j: neo4j-com/neo4j-licensing
  licenseKey: "YOUR_LICENSE_KEY_HERE"

# Cluster mode
mode: CLUSTER
clusterSize: 3

# Memory allocation
jvm:
  heapInitialSize: 2G
  heapMaxSize: 4G
  pagecacheSize: 4G

# Persistent volumes
volumes:
  data:
    mode: volumeClaimTemplate
    spec:
      storageClassName: fast-ssd  # Use SSD for better performance
      accessModes:
        - ReadWriteOnce
      resources:
        requests:
          storage: 100Gi
  logs:
    mode: volumeClaimTemplate
    spec:
      storageClassName: standard
      accessModes:
        - ReadWriteOnce
      resources:
        requests:
          storage: 20Gi

# Resource limits
resources:
  requests:
    cpu: 2
    memory: 10Gi
  limits:
    cpu: 4
    memory: 12Gi

# Authentication
auth:
  enabled: true
  username: neo4j
  password: "SecurePassword123!"  # Use Kubernetes secret instead!

# TLS for in-flight encryption
tls:
  enabled: true
  privateKey:
    secretName: neo4j-tls-key
    keyName: tls.key
  certificate:
    secretName: neo4j-tls-cert
    certName: tls.crt

# Health checks
healthCheck:
  liveness:
    enabled: true
    initialDelaySeconds: 30
    periodSeconds: 10
  readiness:
    enabled: true
    initialDelaySeconds: 15
    periodSeconds: 10

# Service
service:
  type: ClusterIP
  ports:
    http: 7474
    https: 7473
    bolt: 7687
    boltSSL: 7688

# Ingress
ingress:
  enabled: true
  className: nginx
  hosts:
    - host: neo4j.example.com
      paths:
        - path: /
          pathType: Prefix

# Backups (sidecar)
backup:
  enabled: true
  schedule: "0 2 * * *"  # 2 AM daily
  volumeSize: 50Gi

Installation:

# Add Neo4j Helm repository
helm repo add neo4j https://helm.neo4j.com/neo4j
helm repo update

# Install with production values; the admin password should come from a
# Kubernetes secret referenced in values-prod.yaml (see the Security section)
helm install neo4j neo4j/neo4j \
  --namespace neo4j \
  --create-namespace \
  -f values-prod.yaml

# Verify cluster
kubectl logs -n neo4j neo4j-0
kubectl logs -n neo4j neo4j-1
kubectl logs -n neo4j neo4j-2

# Port-forward to test
kubectl port-forward -n neo4j neo4j-0 7687:7687
# Connect with Cypher Shell: cypher-shell -a bolt://localhost:7687 -u neo4j -p password

Rolling Upgrades (Zero Downtime)

# 1. Check current version
kubectl get statefulset neo4j -n neo4j -o jsonpath='{.spec.template.spec.containers[0].image}'

# 2. Trigger rolling update (Kubernetes handles replicas automatically)
kubectl set image statefulset/neo4j \
  neo4j=neo4j:5.16-enterprise \
  -n neo4j

# 3. Watch rollout
kubectl rollout status statefulset/neo4j -n neo4j

# 4. Verify all nodes rejoined the cluster (Neo4j 5)
kubectl exec -it neo4j-0 -n neo4j -- cypher-shell \
  -u neo4j -p "$PASSWORD" "SHOW SERVERS"

Performance & Sizing

Memory Configuration

Neo4j has two main memory regions:

  • Heap: runtime objects and query execution
  • Page cache: on-disk data cached in memory (like the OS filesystem cache)

Sizing formula:

Total machine memory:     32GB
Heap:                     4GB  (roughly 1/4 of the page cache)
Page cache:              16GB  (sized to your hot dataset)
OS + overhead:           12GB  (system + other processes)

Calculation:

# If dataset is 50GB, ~10GB hot:
heap_size: 2G
pagecache_size: 8G
total_needed: 10G per instance

Check page cache hit rate:

Page cache hits and faults are exposed through Neo4j's metrics subsystem
(counter names like page_cache.hits and page_cache.page_faults) rather than
through a Cypher call; scrape them with Prometheus (see Observability below)
and alert when hits / (hits + faults) drops below ~95%.
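However the counters are collected, the hit-rate arithmetic is simple; a small sketch with illustrative numbers:

```python
# Page cache hit rate from hit/fault counters.

def page_cache_hit_rate(hits, faults):
    """Fraction of page accesses served from memory (1.0 if no traffic)."""
    total = hits + faults
    return hits / total if total else 1.0

# e.g. 9,900,000 hits vs 100,000 faults -> 99%, above the ~95% target
print(f"{page_cache_hit_rate(9_900_000, 100_000):.2%}")  # 99.00%
```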

Query Optimization

Always use EXPLAIN or PROFILE:

// EXPLAIN: Show execution plan (no execution)
EXPLAIN MATCH (p:Person)-[:KNOWS]-(f) WHERE f.age > 30 RETURN f

// PROFILE: Execute and show actual stats
PROFILE MATCH (p:Person)-[:KNOWS]-(f) WHERE f.age > 30 RETURN f
// Look for: db hits, rows, execution time

Common anti-patterns:

// BAD: Scanning all nodes (expensive)
MATCH (p:Person) WHERE p.age > 30 RETURN p
// Better: Create an index on age

// BAD: Cartesian product (massive fan-out) -- don't do this!
MATCH (a), (b) RETURN a, b

// BAD: Deep traversal with late filtering
MATCH (p:Person)-[:LIKES*10]-(other) RETURN other
// Better: Filter earlier and bound the depth to prune the search

// GOOD: Index + early filtering
MATCH (p:Person {age: 30})-[:KNOWS]-(friend)
RETURN friend

Bulk Data Import

Option 1: LOAD CSV (flexible, slower)

// Load from URL or local file
LOAD CSV WITH HEADERS FROM "file:///data/people.csv" AS row
CREATE (p:Person {
  name: row.name,
  age: toInteger(row.age),
  email: row.email
})

// Batch commits (Neo4j 5 replacement for PERIODIC COMMIT)
:auto LOAD CSV WITH HEADERS FROM "file:///data/relationships.csv" AS row
CALL {
  WITH row
  MERGE (a:Person {id: row.from})
  MERGE (b:Person {id: row.to})
  CREATE (a)-[:KNOWS]->(b)
} IN TRANSACTIONS OF 1000 ROWS

Option 2: neo4j-admin import (fastest, bulk only)

# Prepare CSV files with specific format
# nodes-people.csv:
# id:ID,name:STRING,age:INT,:LABEL
# 1,Alice,30,Person
# 2,Bob,25,Person

# relationships-knows.csv:
# :START_ID,:END_ID,:TYPE,since:INT
# 1,2,KNOWS,2020
# 2,1,KNOWS,2020

# Stop Neo4j, run import (Neo4j 5 syntax)
docker exec neo4j neo4j-admin database import full neo4j \
  --nodes=/data/nodes-people.csv \
  --relationships=/data/relationships-knows.csv \
  --overwrite-destination

# Restart
docker restart neo4j
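The CSV layout above can be generated with a short Python sketch; file names and sample rows are illustrative:

```python
# Generate CSVs in the header format neo4j-admin import expects:
# typed columns (name:STRING, age:INT) plus :ID/:LABEL for nodes and
# :START_ID/:END_ID/:TYPE for relationships.
import csv

def write_nodes(path, people):
    with open(path, "w", newline="") as f:
        w = csv.writer(f)
        w.writerow(["id:ID", "name:STRING", "age:INT", ":LABEL"])
        for pid, name, age in people:
            w.writerow([pid, name, age, "Person"])

def write_rels(path, knows):
    with open(path, "w", newline="") as f:
        w = csv.writer(f)
        w.writerow([":START_ID", ":END_ID", ":TYPE", "since:INT"])
        for start, end, since in knows:
            w.writerow([start, end, "KNOWS", since])

write_nodes("nodes-people.csv", [(1, "Alice", 30), (2, "Bob", 25)])
write_rels("relationships-knows.csv", [(1, 2, 2020), (2, 1, 2020)])
```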

Security

Authentication & Authorization

// Create role with minimal permissions
CREATE ROLE analyst;
// MATCH combines TRAVERSE + READ on the matched properties
GRANT MATCH {*} ON GRAPH neo4j TO analyst;

// Create user with role
CREATE USER alice SET PASSWORD 'SecurePass123!' CHANGE REQUIRED;
GRANT ROLE analyst TO alice;

// Grant database-level access
GRANT ACCESS ON DATABASE neo4j TO analyst;

// Test by reconnecting as alice
// (Neo4j Browser: :server disconnect, then log in with alice's credentials)

Kubernetes Secret Management

Store credentials in Kubernetes secrets:

# Create secret
kubectl create secret generic neo4j-auth \
  --from-literal=username=neo4j \
  --from-literal=password=$(openssl rand -base64 32) \
  -n neo4j

# Reference in Helm values
auth:
  enabled: true
  username: neo4j
  passwordFromSecret:
    name: neo4j-auth
    key: password

TLS / mTLS

# values-prod.yaml
tls:
  enabled: true
  privateKey:
    secretName: neo4j-tls-key
    keyName: tls.key
  certificate:
    secretName: neo4j-tls-cert
    certName: tls.crt

# Generate self-signed cert
openssl req -x509 -newkey rsa:4096 -keyout tls.key -out tls.crt -days 365 -nodes \
  -subj "/CN=neo4j.default.svc.cluster.local"

kubectl create secret tls neo4j-tls-cert \
  --cert=tls.crt --key=tls.key -n neo4j

Network Policies

# NetworkPolicy: Only allow from application pods
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: neo4j-network-policy
  namespace: neo4j
spec:
  podSelector:
    matchLabels:
      app: neo4j
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              name: applications
        - podSelector:
            matchLabels:
              role: neo4j-client
      ports:
        - protocol: TCP
          port: 7687

Backups & Disaster Recovery

Why Backups Matter

Real scenario: Database corruption or data loss

Monday 2 AM: Automated backup completes (50GB compressed)
Monday 3 PM: Accidental DELETE query wipes 30% of nodes
Monday 4 PM: Incident discovered via monitoring alert
Monday 5 PM: Restore from backup, validate data
Monday 6 PM: System back online

Without backup: 
- Data permanently lost (30% of production data gone)
- Potential legal liability
- Manual recovery from logs (days of work, error-prone)

What is an online backup?

  • Snapshot of database while it's running
  • No downtime (clients can still read/write)
  • Consistent point-in-time copy
  • Can be stored locally, S3, NFS, etc.

Note: true online backups require the Enterprise command neo4j-admin database
backup. The neo4j-admin dump command used in the scripts below is simpler but
offline: it requires the database to be stopped, so schedule it in a
maintenance window or switch to database backup for hot backups.

Step 1: Configure Backup Storage

Option A: Local Storage (Docker)

# Create backup directory on host
mkdir -p /backups/neo4j
chmod 755 /backups/neo4j

# Mount in docker-compose.yml
volumes:
  - neo4j_data:/var/lib/neo4j/data
  - /backups/neo4j:/backups  # Mount backup directory

Option B: Kubernetes PersistentVolume

# Create a backup volume for storing dumps
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: neo4j-backups
  namespace: neo4j
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: standard
  resources:
    requests:
      storage: 500Gi  # room for ~1 week of daily ~50GB dumps

Step 2: Create Backup Script

The backup process:

#!/bin/bash
# File: neo4j-backup.sh
# Purpose: Daily backup of Neo4j database

set -e  # Exit on error

BACKUP_DIR="/backups"
DATE=$(date +%Y%m%d-%H%M%S)
BACKUP_FILE="neo4j-${DATE}.dump"
LOG_FILE="/var/log/neo4j-backup-${DATE}.log"

# Step 1: Check disk space before starting
AVAILABLE_SPACE=$(df $BACKUP_DIR | tail -1 | awk '{print $4}')
ESTIMATED_SIZE=50000000  # ~50GB in KB

if [ $AVAILABLE_SPACE -lt $ESTIMATED_SIZE ]; then
    echo "ERROR: Insufficient disk space! Available: ${AVAILABLE_SPACE}KB, Need: ${ESTIMATED_SIZE}KB" | tee $LOG_FILE
    # Send alert to monitoring
    curl -X POST $ALERT_WEBHOOK -d "Neo4j backup failed: disk full"
    exit 1
fi

# Step 2: Create backup
echo "[$(date)] Starting Neo4j backup..." | tee $LOG_FILE

# Neo4j 5 syntax: dump writes ${BACKUP_DIR}/neo4j.dump; rename it afterwards
docker exec neo4j neo4j-admin database dump neo4j \
    --to-path=${BACKUP_DIR} \
    2>&1 | tee -a $LOG_FILE

mv ${BACKUP_DIR}/neo4j.dump ${BACKUP_DIR}/${BACKUP_FILE}

BACKUP_SIZE=$(du -h ${BACKUP_DIR}/${BACKUP_FILE} | cut -f1)

# Step 3: Verify backup integrity
echo "[$(date)] Verifying backup integrity..." | tee -a $LOG_FILE

if [ -f "${BACKUP_DIR}/${BACKUP_FILE}" ]; then
    echo "[$(date)] Backup successful: ${BACKUP_FILE} (${BACKUP_SIZE})" | tee -a $LOG_FILE
else
    echo "ERROR: Backup file not created!" | tee -a $LOG_FILE
    exit 1
fi

# Step 4: Upload to remote storage (S3)
echo "[$(date)] Uploading to S3..." | tee -a $LOG_FILE

# STANDARD_IA is cheaper for infrequently accessed archives
aws s3 cp ${BACKUP_DIR}/${BACKUP_FILE} \
    s3://company-backups/neo4j/ \
    --storage-class STANDARD_IA \
    2>&1 | tee -a $LOG_FILE

# Step 5: Clean up old local backups (keep 7 days)
echo "[$(date)] Cleaning up old backups (>7 days)..." | tee -a $LOG_FILE

find ${BACKUP_DIR} -name "neo4j-*.dump" -mtime +7 -delete

# Step 6: Update backup manifest
echo "[$(date)] Updating backup manifest..." | tee -a $LOG_FILE

echo "${DATE} ${BACKUP_FILE} ${BACKUP_SIZE}" >> ${BACKUP_DIR}/manifest.log

# Step 7: Send success notification
echo "[$(date)] Backup completed successfully!" | tee -a $LOG_FILE

curl -X POST $SUCCESS_WEBHOOK \
    -d "Neo4j backup successful: ${BACKUP_FILE} (${BACKUP_SIZE})"

What each step does:

  1. Check disk space - Prevent backup failures due to full disk
  2. Create dump - neo4j-admin dump creates consistent snapshot
  3. Verify file exists - Confirm backup actually created
  4. Upload to S3 - Remote copy for disaster recovery
  5. Clean up old backups - Free local disk space (keep 7 days)
  6. Update manifest - Track backup history
  7. Alert on success - Monitor that backups are happening
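The 7-day cleanup in step 5 (`find … -mtime +7 -delete`) can be sketched and unit-tested in Python; `prune_backups` is a hypothetical helper mirroring that one-liner:

```python
# Delete backup dumps older than a retention window.
import pathlib
import time

def prune_backups(backup_dir, keep_days=7, now=None):
    """Delete neo4j-*.dump files older than keep_days; return deleted names."""
    now = now if now is not None else time.time()
    cutoff = now - keep_days * 86400
    deleted = []
    for p in pathlib.Path(backup_dir).glob("neo4j-*.dump"):
        if p.stat().st_mtime < cutoff:  # file modified before the cutoff
            p.unlink()
            deleted.append(p.name)
    return sorted(deleted)
```

Encoding the policy as a function makes the retention window a reviewable, testable parameter instead of a flag buried in a cron script.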

Step 3: Automate with Kubernetes CronJob

---
apiVersion: v1
kind: ConfigMap
metadata:
  name: neo4j-backup-script
  namespace: neo4j
data:
  backup.sh: |
    #!/bin/bash
    set -e
    BACKUP_DIR="/backups"
    
    echo "Starting backup at $(date)"
    
    # Online backup over the Enterprise backup protocol (default port 6362);
    # unlike dump, this works while the database is running
    neo4j-admin database backup neo4j \
      --from=neo4j-0.neo4j.neo4j.svc.cluster.local:6362 \
      --to-path=${BACKUP_DIR}
    
    echo "Backup artifacts written to ${BACKUP_DIR}"
    
    # Upload to S3
    aws s3 cp ${BACKUP_DIR}/ s3://company-neo4j-backups/ --recursive
    
    # Cleanup old backup artifacts (keep 7 days)
    find ${BACKUP_DIR} -name "*.backup" -mtime +7 -delete

---
# CronJob that runs daily at 2 AM UTC
apiVersion: batch/v1
kind: CronJob
metadata:
  name: neo4j-daily-backup
  namespace: neo4j
spec:
  # Schedule: minute hour day month dayOfWeek
  # 0 2 * * * = Every day at 2 AM
  schedule: "0 2 * * *"
  
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: neo4j-backup
          containers:
          - name: neo4j-admin
            image: neo4j:5.15-enterprise
            command:
              - /bin/bash
              - -c
              - |
                #!/bin/bash
                set -e
                
                # Wait for Neo4j to be ready (NEO4J_PASSWORD supplied via
                # a secret, like the AWS keys below)
                until cypher-shell -a neo4j-0.neo4j.neo4j.svc.cluster.local:7687 \
                  -u neo4j -p "$NEO4J_PASSWORD" "RETURN 1"; do
                  echo "Waiting for Neo4j to be ready..."
                  sleep 10
                done
                
                # Online backup (Enterprise backup protocol, default port 6362)
                neo4j-admin database backup neo4j \
                  --from=neo4j-0.neo4j.neo4j.svc.cluster.local:6362 \
                  --to-path=/backups
                
                # Upload to S3
                aws s3 cp /backups/ s3://company-neo4j-backups/ --recursive
                
                # Cleanup old artifacts
                find /backups -name "*.backup" -mtime +7 -delete
                
                echo "Backup completed successfully"
            
            env:
            - name: AWS_ACCESS_KEY_ID
              valueFrom:
                secretKeyRef:
                  name: aws-credentials
                  key: access_key
            - name: AWS_SECRET_ACCESS_KEY
              valueFrom:
                secretKeyRef:
                  name: aws-credentials
                  key: secret_key
            
            volumeMounts:
            - name: backups
              mountPath: /backups
          
          volumes:
          - name: backups
            persistentVolumeClaim:
              claimName: neo4j-backups
          
          # Don't retry if backup fails (manual investigation needed)
          restartPolicy: Never
  
  # Keep backup jobs for 7 days
  successfulJobsHistoryLimit: 7
  failedJobsHistoryLimit: 7

Snapshot Strategy (Kubernetes - Faster Recovery)

When to use snapshots:

  • Faster recovery than full dump restoration
  • Point-in-time consistency
  • Good for infrastructure failures

When to use dumps:

  • Cross-region disaster recovery
  • Long-term archival
  • Database corruption (need to analyze)

Taking Snapshots

---
# StorageClass for snapshots (CSI driver must be installed)
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: neo4j-snapshot-class
driver: ebs.csi.aws.com
deletionPolicy: Delete

---
# Manual snapshot trigger (or use automation)
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: neo4j-data-snapshot-20250101  # use a dated name per snapshot
  namespace: neo4j
spec:
  volumeSnapshotClassName: neo4j-snapshot-class
  source:
    persistentVolumeClaimName: neo4j-data-neo4j-0  # Snapshot primary node

---
# Automated snapshots with CronJob
apiVersion: batch/v1
kind: CronJob
metadata:
  name: neo4j-hourly-snapshot
  namespace: neo4j
spec:
  schedule: "0 * * * *"  # Every hour
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: neo4j-snapshot
          containers:
          - name: kubectl
            image: bitnami/kubectl:latest
            command:
            - /bin/bash
            - -c
            - |
              DATE=$(date +%Y%m%d-%H%M%S)
              kubectl apply -f - <<EOF
              apiVersion: snapshot.storage.k8s.io/v1
              kind: VolumeSnapshot
              metadata:
                name: neo4j-data-snapshot-${DATE}
                namespace: neo4j
              spec:
                volumeSnapshotClassName: neo4j-snapshot-class
                source:
                  persistentVolumeClaimName: neo4j-data-neo4j-0
              EOF
          restartPolicy: Never

Snapshot cleanup:

# Keep only the newest 72 snapshots (~72 hours at hourly cadence)
kubectl delete volumesnapshot \
  -n neo4j \
  $(kubectl get volumesnapshot -n neo4j \
    --sort-by=.metadata.creationTimestamp \
    -o name | head -n -72)
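The keep-only-recent policy behind that cleanup command can be sketched in Python; `snapshots_to_delete` is a hypothetical helper for illustration:

```python
# Given snapshot names sorted by creation time (oldest first), pick the
# ones to delete so only the newest `keep` remain.

def snapshots_to_delete(names_oldest_first, keep=72):
    if len(names_oldest_first) <= keep:
        return []
    return names_oldest_first[:-keep]  # everything except the newest `keep`

snaps = [f"neo4j-data-snapshot-{i:03d}" for i in range(100)]
print(len(snapshots_to_delete(snaps)))  # 28 -> the newest 72 are kept
```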

Recovery Test (Quarterly Mandatory)

Why test recovery?

  • Backups are worthless if you can't restore
  • Discover issues before real disaster
  • Validate RPO/RTO targets
  • Train team on recovery procedures

Step 1: Pre-Recovery Checklist

#!/bin/bash
# Pre-recovery validation

echo "=== Pre-Recovery Validation ==="

# Check backup integrity
BACKUP_FILE="/backups/neo4j-20250101.dump"

if [ ! -f "$BACKUP_FILE" ]; then
    echo "ERROR: Backup file not found: $BACKUP_FILE"
    exit 1
fi

BACKUP_SIZE=$(du -h "$BACKUP_FILE" | cut -f1)
BACKUP_TIME=$(stat -c '%y' "$BACKUP_FILE")  # GNU stat (BSD: stat -f '%Sm')
echo "✓ Backup exists: $BACKUP_FILE ($BACKUP_SIZE, created $BACKUP_TIME)"

# Check staging environment has space (all sizes in KB)
BACKUP_KB=$(du -k "$BACKUP_FILE" | cut -f1)
STAGING_SPACE=$(df /staging | tail -1 | awk '{print $4}')
REQUIRED_SPACE=$((BACKUP_KB * 3))  # Need 3x space (dump + unpacked + logs)

if [ $STAGING_SPACE -lt $REQUIRED_SPACE ]; then
    echo "ERROR: Insufficient space on staging!"
    exit 1
fi

echo "✓ Staging has sufficient disk space"

# Verify staging Neo4j is stopped
docker ps | grep -q neo4j-staging && docker stop neo4j-staging

echo "✓ Staging Neo4j stopped"

echo ""
echo "Pre-recovery checks complete. Ready to restore."

Step 2: Restore to Staging

#!/bin/bash
# Restore backup to staging environment

BACKUP_FILE="/backups/neo4j-20250101.dump"
START_TIME=$(date +%s)

echo "=== Starting Restore Process ==="
echo "Backup: $BACKUP_FILE"
echo "Start time: $(date)"

# Step 1: Clear staging database
echo ""
echo "Step 1: Clearing staging database..."
rm -rf /staging/neo4j/data/databases/*
rm -rf /staging/neo4j/data/transactions/*

# Step 2: Load backup
echo "Step 2: Loading backup into staging (this may take 10-30 minutes)..."

# Neo4j 5: load reads a file named <db>.dump from --from-path
# (assumes /staging is mounted into the container)
mkdir -p /staging/restore
cp $BACKUP_FILE /staging/restore/neo4j.dump

docker exec neo4j-staging neo4j-admin database load neo4j \
    --from-path=/staging/restore \
    --overwrite-destination 2>&1 | tee restore.log

LOAD_STATUS=${PIPESTATUS[0]}  # exit status of neo4j-admin, not tee

if [ $LOAD_STATUS -ne 0 ]; then
    echo "ERROR: Restore failed! Check restore.log"
    exit 1
fi

echo "✓ Backup loaded successfully"

# Step 3: Start staging Neo4j
echo "Step 3: Starting staging Neo4j..."
docker start neo4j-staging

# Wait for startup
sleep 30

# Step 4: Verify connectivity
echo "Step 4: Verifying Neo4j is responding..."
RETRY_COUNT=0
until docker exec neo4j-staging cypher-shell \
    -u neo4j -p password "RETURN 1" > /dev/null 2>&1; do
    
    if [ $RETRY_COUNT -gt 30 ]; then
        echo "ERROR: Neo4j failed to start!"
        exit 1
    fi
    
    echo "  Waiting for Neo4j startup... (attempt $((RETRY_COUNT+1))/30)"
    sleep 10
    ((RETRY_COUNT++))
done

echo "✓ Neo4j is responding to queries"

END_TIME=$(date +%s)
DURATION=$((END_TIME - START_TIME))

echo ""
echo "=== Restore Complete ==="
echo "Duration: $((DURATION / 60)) minutes $((DURATION % 60)) seconds"
echo "Staging database ready for validation"

Step 3: Validate Data Integrity

#!/bin/bash
# Validate restored data

echo "=== Validating Restored Data ==="

# Validation 1: Count total nodes
echo "Validation 1: Node count..."
NODE_COUNT=$(docker exec neo4j-staging cypher-shell \
    -u neo4j -p password \
    "MATCH (n) RETURN count(n) as count" \
    --format plain | tail -1)

echo "  Total nodes: $NODE_COUNT"
[ "$NODE_COUNT" -gt 0 ] && echo "  ✓ Nodes present" || exit 1

# Validation 2: Check relationships
echo "Validation 2: Relationship count..."
REL_COUNT=$(docker exec neo4j-staging cypher-shell \
    -u neo4j -p password \
    "MATCH ()-[r]->() RETURN count(r) as count" \
    --format plain | tail -1)

echo "  Total relationships: $REL_COUNT"
[ "$REL_COUNT" -gt 0 ] && echo "  ✓ Relationships present" || exit 1

# Validation 3: Check critical data
echo "Validation 3: Critical data check (Person nodes)..."
PERSON_COUNT=$(docker exec neo4j-staging cypher-shell \
    -u neo4j -p password \
    "MATCH (p:Person) RETURN count(p) as count" \
    --format plain | tail -1)

echo "  Person nodes: $PERSON_COUNT"
[ "$PERSON_COUNT" -gt 0 ] && echo "  ✓ Critical data intact" || exit 1

# Validation 4: Check no orphaned data
echo "Validation 4: Checking for data consistency..."
docker exec neo4j-staging cypher-shell \
    -u neo4j -p password \
    "MATCH (n) WHERE n.created_at IS NULL AND labels(n) <> [] 
     RETURN count(n) as orphaned"

# Validation 5: Performance spot-check
echo "Validation 5: Performance check..."
docker exec neo4j-staging cypher-shell \
    -u neo4j -p password \
    "PROFILE MATCH (p:Person)-[:KNOWS*1..3]-(friend)
     RETURN count(DISTINCT friend) as count"

echo "  ✓ Queries executing normally"

echo ""
echo "=== All Validations Passed ==="

Step 4: Document RPO/RTO

#!/bin/bash
# Document actual recovery metrics

echo "=== Disaster Recovery Metrics ==="
echo ""
echo "RPO (Recovery Point Objective):"
echo "  - Daily backups at 2 AM UTC"
echo "  - Maximum data loss: 24 hours"
echo "  - Last backup: $(ls -lt /backups/neo4j-*.dump | head -1 | awk '{print $6, $7, $8}')"
echo ""

echo "RTO (Recovery Time Objective):"
echo "  - Restore time: ~20 minutes (for 50GB backup)"
echo "  - Validation time: ~5 minutes"
echo "  - Total RTO: ~30 minutes from decision to restore"
echo ""

echo "DR Readiness Checklist:"
echo "  ✓ Backups automated (daily)"
echo "  ✓ Backups tested (quarterly)"
echo "  ✓ Restore procedure documented"
echo "  ✓ Team trained on recovery"
echo "  ✓ RPO/RTO targets defined and met"
echo ""

echo "Next DR Test: $(date -d '+3 months' +%Y-%m-%d)"
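The RPO arithmetic is worth making explicit; a small sketch whose timestamps mirror the Monday backup/incident timeline earlier in this section:

```python
# Worst-case and actual data-loss windows for a fixed backup schedule.
from datetime import datetime, timedelta

def worst_case_rpo(backup_interval_hours):
    """With periodic backups, you can lose up to one full interval."""
    return timedelta(hours=backup_interval_hours)

def data_loss_window(last_backup, failure_time):
    """Actual loss for a concrete incident."""
    return failure_time - last_backup

print(worst_case_rpo(24))  # 1 day, 0:00:00 -- daily backups

# Backup at Monday 2 AM, destructive query at Monday 3 PM
loss = data_loss_window(datetime(2025, 1, 6, 2, 0), datetime(2025, 1, 6, 15, 0))
print(loss)  # 13:00:00
```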

Complete Backup & Recovery Runbook

Quick reference for incidents:

# neo4j-dr-runbook.yaml

backup:
  frequency: daily
  time: 02:00 UTC
  method: neo4j-admin dump
  destination: s3://company-neo4j-backups/
  retention: 30 days
  size: ~50 GB (compressed)
  location: /backups/neo4j-YYYYMMDD.dump

restore_rto: 30 minutes
restore_steps:
  1_prepare: "Free disk space (100GB), stop staging Neo4j"
  2_load: "neo4j-admin load --from=backup.dump --force"
  3_startup: "Start Neo4j, wait for ready status"
  4_validate: "Run integrity checks, spot-check queries"
  5_switchover: "Update DNS/LB to point to staging"

testing_schedule:
  frequency: quarterly
  date: "First Thursday of each quarter"
  duration: "2 hours"
  participants: "SRE team, database lead, on-call engineer"

contacts:
  database_lead: "[email protected]"
  on_call: "See PagerDuty schedule"
  escalation: "#database-incidents on Slack"

Observability

Metrics (Prometheus + Grafana)

Enable JMX exporter:

# values-prod.yaml
jmx:
  enabled: true
  port: 9090

# ServiceMonitor (Prometheus Operator)
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: neo4j
  namespace: neo4j
spec:
  selector:
    matchLabels:
      app: neo4j
  endpoints:
    - port: metrics
      interval: 30s

Key metrics to track:

# Neo4j-specific
neo4j_jvm_heap_used_bytes
neo4j_jvm_gc_time_seconds
neo4j_database_transactions_open
neo4j_query_execution_seconds  # P50, P99
neo4j_page_cache_fault_count
neo4j_cluster_replication_lag_seconds

# Application-level SLOs
query_latency_p50: 50ms
query_latency_p99: 500ms
query_timeout_rate: <0.1%
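The latency SLOs can be checked directly from raw samples; a sketch using Python's statistics module (sample data is illustrative):

```python
# Check P50/P99 latency SLOs from a list of samples in milliseconds.
import statistics

def slo_ok(latencies_ms, p50_target=50, p99_target=500):
    # quantiles(n=100) returns 99 cut points: qs[49] ~ P50, qs[98] ~ P99
    qs = statistics.quantiles(latencies_ms, n=100)
    return qs[49] <= p50_target and qs[98] <= p99_target

samples = [10] * 95 + [400] * 5   # mostly fast, with a slow tail
print(slo_ok(samples))  # True
```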

Logging

# values-prod.yaml
logging:
  enabled: true
  level: INFO
  slowLogThreshold: 1000  # Log queries slower than 1s

# Structured logging (JSON)
environment:
  NEO4J_db_logs_query_enabled: "INFO"   # Neo4j 5 setting (4.x: dbms.logs.query.enabled)
  NEO4J_db_logs_query_threshold: "1s"

Sample slow queries:

timestamp=2025-01-15T10:30:45Z query="MATCH (n)-[:KNOWS*10]-(m) RETURN m" 
duration=2450ms parameters={} client=127.0.0.1

Best Practices Checklist (Top 12)

  • Use parameterized queries to prevent injection attacks
  • Index frequently queried properties (name, email, external IDs)
  • Create uniqueness constraints where applicable (prevents duplicates)
  • Size page cache to ~80% of dataset for optimal performance
  • Monitor page cache hit rate (target >95%)
  • Run PROFILE on slow queries to identify inefficiencies
  • Batch writes in 1000-5000 row transactions (CALL { … } IN TRANSACTIONS in Neo4j 5)
  • Use Causal Cluster for HA (3+ nodes recommended)
  • Enable TLS for all client connections (in-flight encryption)
  • Implement network policies (restrict pod-to-pod access)
  • Daily backups with quarterly restore tests (verify RPO/RTO)
  • Set resource limits (heap, pagecache, CPU) to prevent OOM

Top Pitfalls to Avoid

| Pitfall | Impact | Solution |
|---------|--------|----------|
| Super-nodes (too many relationships) | Traversal slowdown | Redesign to use intermediate nodes |
| Missing indexes | Query timeouts | Profile queries, add indexes proactively |
| Unbounded traversals (no LIMIT) | Memory exhaustion | Use LIMIT, constrain depth (`*1..3`) |
| String-concatenated queries | Cypher injection | Use parameters always |
| Wrong memory split (too much heap) | Page cache thrashing | Follow sizing formula: ~1/5 heap, rest page cache |
| No backups | Data loss | Automate daily backups, test recovery |
| Running Enterprise without license | Legal/support issues | Purchase a license or use Community edition |
| Insufficient replication | Split-brain scenarios | Use 3+ node Causal Cluster, monitor replication lag |
| Skipping version upgrades | Security vulnerabilities | Plan quarterly upgrades, test on staging |
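The memory-split row can be turned into a small calculator. A sketch applying this guide's 1/5-heap rule of thumb (the 1 GiB OS reserve is an assumption, and the env-var names follow Neo4j 5's `server.memory.*` settings):

```python
def memory_split(total_gib: float, os_reserve_gib: float = 1.0):
    """Rule of thumb from this guide: ~1/5 of available memory to JVM heap,
    the rest to the page cache, after reserving headroom for the OS."""
    available = total_gib - os_reserve_gib
    if available <= 0:
        raise ValueError("container too small for the OS reserve")
    heap = round(available / 5, 1)
    page_cache = round(available - heap, 1)
    return heap, page_cache

heap, page_cache = memory_split(10.0)  # e.g. a 10 GiB pod
print(f"NEO4J_server_memory_heap_max__size={heap}G")
print(f"NEO4J_server_memory_pagecache_size={page_cache}G")
```

Note this is a starting point, not a fixed law: write-heavy workloads may want more heap, and the real target is the page-cache hit rate (>95%) from the checklist.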

CI/CD Example (Helm Deployment)

GitOps with ArgoCD:

# argocd-neo4j-app.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: neo4j
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://helm.neo4j.com/neo4j
    chart: neo4j
    targetRevision: "5.15.*"  # Pin minor version
    helm:
      releaseName: neo4j
      valuesObject:
        clusterSize: 3
        mode: CLUSTER
        jvm:
          heapInitialSize: 2G
          heapMaxSize: 4G
  destination:
    server: https://kubernetes.default.svc
    namespace: neo4j
  syncPolicy:
    automated:
      prune: false  # Manual approval for destructive changes
      selfHeal: true
    syncOptions:
      - Validate=true

Blue/Green Cluster Switch:

# Deploy new cluster (blue)
helm install neo4j-blue neo4j/neo4j \
  -f values-blue.yaml -n neo4j-blue

# Run smoke tests (cypher-shell ships in the neo4j image)
kubectl run smoke-test --rm -it --restart=Never --image=neo4j:5.15 -- \
  cypher-shell -a neo4j://neo4j-blue-0.neo4j-blue.neo4j-blue.svc.cluster.local:7687 \
  -u neo4j -p "$NEO4J_PASSWORD" "SHOW SERVERS"

# Switch service endpoint to blue
kubectl patch service neo4j-router -p '{"spec":{"selector":{"app":"neo4j-blue"}}}'

# Keep green running for quick rollback; uninstall only after a bake period
helm uninstall neo4j-green

Architecture Diagram (Causal Cluster + K8s)

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                        Kubernetes Cluster                           β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚                                                                     β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
β”‚  β”‚            Ingress (HTTPS/TLS Termination)                  β”‚ β”‚
β”‚  β”‚  neo4j.example.com:443 β†’ ClusterIP:7687                     β”‚ β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
β”‚                       β”‚ Load balance                               β”‚
β”‚       β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                       β”‚
β”‚       β”‚               β”‚                     β”‚                       β”‚
β”‚  β”Œβ”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”           β”‚
β”‚  β”‚ neo4j-0        β”‚ β”‚ neo4j-1        β”‚ β”‚ neo4j-2        β”‚           β”‚
β”‚  β”‚ PRIMARY        β”‚ β”‚ FOLLOWER       β”‚ β”‚ FOLLOWER       β”‚           β”‚
β”‚  β”‚ neo4j:5.15     β”‚ β”‚ neo4j:5.15     β”‚ β”‚ neo4j:5.15     β”‚           β”‚
β”‚  β”‚ CPU: 2         β”‚ β”‚ CPU: 2         β”‚ β”‚ CPU: 2         β”‚           β”‚
β”‚  β”‚ Mem: 10Gi      β”‚ β”‚ Mem: 10Gi      β”‚ β”‚ Mem: 10Gi      β”‚           β”‚
β”‚  β”‚ Heap: 4Gi      β”‚ β”‚ Heap: 4Gi      β”‚ β”‚ Heap: 4Gi      β”‚           β”‚
β”‚  β”‚ PageCache: 4Gi β”‚ β”‚ PageCache: 4Gi β”‚ β”‚ PageCache: 4Gi β”‚           β”‚
β”‚  β”‚ PVC: 100Gi SSD β”‚ β”‚ PVC: 100Gi SSD β”‚ β”‚ PVC: 100Gi SSD β”‚           β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜           β”‚
β”‚                                                                     β”‚
β”‚  StatefulSet headless service (per-pod DNS):                        β”‚
β”‚    neo4j-0.neo4j.svc.cluster.local                                  β”‚
β”‚    neo4j-1.neo4j.svc.cluster.local                                  β”‚
β”‚    neo4j-2.neo4j.svc.cluster.local                                  β”‚
β”‚  Causal Cluster (Raft replication), Follower-to-Primary lag: <1s    β”‚
β”‚                                                                     β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
β”‚  β”‚           Backup Sidecar (CronJob)                           β”‚ β”‚
β”‚  β”‚  Runs daily 2 AM                                             β”‚ β”‚
β”‚  β”‚  neo4j-admin dump β†’ S3/NFS β†’ 30 GB compressed              β”‚ β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
β”‚                                                                     β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
β”‚  β”‚           Observability Stack                                β”‚ β”‚
β”‚  β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”      β”‚ β”‚
β”‚  β”‚  β”‚ Prometheus   β”‚  β”‚ Grafana      β”‚  β”‚ Loki Logs    β”‚      β”‚ β”‚
β”‚  β”‚  β”‚ (Metrics)    β”‚  β”‚ (Dashboards) β”‚  β”‚ (Query logs) β”‚      β”‚ β”‚
β”‚  β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜      β”‚ β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
β”‚                                                                     β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Conclusion

Neo4j excels for relationship-heavy workloads at scale. Key takeaways:

  1. Production deployment: Kubernetes + Causal Cluster (3+ nodes)
  2. Performance: Size page cache to dataset, monitor hit rate
  3. Security: TLS + parameterized queries + RBAC
  4. Backups: Daily dumps, test recovery quarterly
  5. Observability: JMX metrics, slow query logs, health probes

For SRE teams: Treat Neo4j like any production databaseβ€”version pin, backup test, alert on replication lag, and assume failure will happen.