Executive Summary
Neo4j is a native graph database that stores data as nodes (entities) connected by relationships (edges). Unlike relational databases, which normalize data into tables and reassemble it with JOINs at query time, Neo4j stores relationships directly and excels at traversing them.
Quick decision:
- Use Neo4j for: Knowledge graphs, authorization/identity, recommendations, fraud detection, network topology, impact analysis
- Don’t use for: Heavy OLAP analytics, simple key-value workloads, document storage
Production deployment: Neo4j Aura (managed) or Kubernetes + Helm running a Causal Cluster (self-managed); Docker Compose for local development
What is Neo4j?
Core Concepts
Labeled Property Graph Model:
```
Node (entity)              Relationship (connection)
┌──────────────┐           ┌────────────────┐
│ :Person      │           │ :KNOWS         │
│ name: "Bob"  │──────────▶│ since: 5 years │
│ age: 30      │           └────────────────┘
└──────────────┘
```
- Nodes: Entities with labels and properties
- Relationships: Typed, directed connections between nodes
- Labels: Categories (e.g., :Person, :Company)
- Properties: Key-value data on nodes and relationships
- Cypher: SQL-like query language for graphs
When to Use Neo4j
Best use cases:
| Use Case | Example | Why Neo4j |
|---|---|---|
| Authorization/Identity | Who can access what? | Fast relationship traversal (no JOINs) |
| Knowledge Graphs | Knowledge bases, semantic search | Hierarchies + connections |
| Recommendations | "Customers like you also liked…" | Pattern matching on user behavior |
| Fraud Detection | Ring detection, money flows | Detect loops & suspicious patterns |
| Network/Topology | Infrastructure dependencies, blast radius | Fast path queries |
| Impact Analysis | "If this service fails, what breaks?" | Upstream/downstream traversal |
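The impact-analysis row boils down to graph traversal: in Cypher a pattern like `(failed)<-[:DEPENDS_ON*]-(dependent)` finds everything that transitively breaks. As a rough sketch of what that traversal does, here is the same reverse-dependency walk in plain Python (the service names and `blast_radius` helper are hypothetical):

```python
from collections import deque

# Toy impact analysis: which services break if `failed` goes down?
# depends_on maps each service to the services it depends on.
def blast_radius(depends_on, failed):
    # Invert the edges: who depends directly on each service
    dependents = {}
    for svc, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(svc)
    # Breadth-first walk over the reversed edges
    seen, queue = set(), deque([failed])
    while queue:
        current = queue.popleft()
        for svc in dependents.get(current, []):
            if svc not in seen:
                seen.add(svc)
                queue.append(svc)
    return seen

graph = {"web": ["api"], "api": ["db"], "batch": ["db"], "db": []}
blast_radius(graph, "db")  # everything transitively depending on db
```

Neo4j does this pointer-chasing natively on disk-backed structures, which is why the traversal stays fast even when the dependency graph no longer fits in application memory.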
Don’t use Neo4j for:
- Heavy OLAP analytics (use ClickHouse, BigQuery)
- Simple key-value cache (use Redis)
- Document storage (use MongoDB)
- Time-series data (use InfluxDB, Prometheus)
Neo4j vs Other Databases
| Aspect | Neo4j | PostgreSQL | MongoDB | Redis |
|---|---|---|---|---|
| Data Model | Graph (nodes + relationships) | Relational (tables) | Document (JSON) | Key-value |
| Query Pattern | Traverse relationships | JOIN heavy tables | Nested documents | Simple keys |
| Join Cost | O(1): direct pointers | O(n log n): index scan | O(n): scan | N/A |
| Real-time graph queries | Fast | Slow (many JOINs) | Slow | Not applicable |
| Transactions | ACID | ACID | ACID (multi-document since 4.0) | Limited |
| Scalability | Vertical (clusters available) | Horizontal (sharding hard) | Horizontal (sharding built-in) | Horizontal (easy) |
Core Model & Queries
Cypher Essentials
CREATE nodes and relationships:
```cypher
// Create node with label and properties
CREATE (n:Person {name: "Alice", age: 30})

// Create relationship
MATCH (a:Person {name: "Alice"}), (b:Person {name: "Bob"})
CREATE (a)-[:KNOWS {since: 2020}]->(b)

// Shorthand: CREATE multiple nodes and the relationship in one pattern
CREATE (alice:Person {name: "Alice"})-[:KNOWS]->(bob:Person {name: "Bob"})
```
MATCH queries (read patterns):
```cypher
// Simple pattern
MATCH (p:Person {name: "Alice"})
RETURN p

// Traverse relationships
MATCH (alice:Person {name: "Alice"})-[:KNOWS]->(friend)
RETURN friend.name

// Multi-hop traversal (1 to 3 steps), returning the hop count
MATCH path = (alice:Person {name: "Alice"})-[:KNOWS*1..3]->(distant_friend)
RETURN distant_friend.name, length(path) AS distance

// Find shortest path
MATCH path = shortestPath((a:Person)-[:KNOWS*]-(b:Person))
WHERE a.name = "Alice" AND b.name = "Charlie"
RETURN path
```
MERGE (upsert):
```cypher
// Update if exists, create if not
MERGE (p:Person {email: "[email protected]"})
ON CREATE SET p.created_at = timestamp()
ON MATCH SET p.updated_at = timestamp()
SET p.age = 31
RETURN p
```
Using parameters (IMPORTANT for security):
```cypher
// GOOD: Parameterized (the driver passes $email separately from the query text)
MATCH (p:Person {email: $email})
RETURN p
// Run with parameters: {email: "[email protected]"}

// BAD: Building the query by string concatenation (injection-prone)
// "MATCH (p:Person {email: '" + userInput + "'}) RETURN p"  ← don't do this!
```
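From application code the same rule applies: hand parameters to the driver, never format them into the query string. A minimal sketch (the `find_person_query` helper is illustrative, not part of the Neo4j driver API):

```python
# Build a (query, parameters) pair; the server substitutes $email as a value,
# so hostile input can never change the query's structure.
def find_person_query(email: str):
    query = "MATCH (p:Person {email: $email}) RETURN p"
    return query, {"email": email}

# With the official Python driver this would run roughly as:
#   with driver.session() as session:
#       session.run(*find_person_query("[email protected]"))
query, params = find_person_query("[email protected]' OR '1'='1")
# The payload stays inert inside the parameter map; the query text is unchanged.
```

Parameterized queries also let Neo4j cache execution plans across calls, so this is a performance win as well as a security one.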
Data Modeling Best Practices
Anti-pattern: Super-nodes
```cypher
// BAD: Everything connects to one node
(:Transaction)-[:INVOLVES]->(:Company)
(:Account)-[:BELONGS_TO]->(:Company)
(:Customer)-[:WORKS_FOR]->(:Company)
// This creates a super-node; any traversal through "Company" is slow

// GOOD: Use intermediate nodes and specific relationship types
(:Transaction)-[:TO_ACCOUNT]->(:Account)
(:Account)-[:AT_COMPANY]->(:Company)
(:Customer)-[:HAS_ACCOUNT]->(:Account)
```
Relationship directions matter:
```cypher
// Bad: Undirected or vague relationship
(a)-[:RELATED]-(b)

// Good: Clear, purposeful direction; store one direction only,
// since Cypher can traverse a relationship either way
(author:Person)-[:WROTE]->(post:Post)
// (post)-[:POSTED_BY]->(author) would be a redundant mirror
```
Use constraints and indexes:
```cypher
// Create uniqueness constraint (also creates an index) — Neo4j 5.x syntax
CREATE CONSTRAINT email_unique FOR (p:Person) REQUIRE p.email IS UNIQUE;

// Create index for faster lookups
CREATE INDEX person_name FOR (p:Person) ON (p.name);

// Create composite index
CREATE INDEX user_company FOR (u:User) ON (u.company, u.role);
```
Deployment Options
1. Managed: Neo4j Aura
Pros:
- Zero operations; fully managed
- Auto backups, updates, scaling
- Global deployment (multi-region)
Cons:
- Higher cost ($100+/month minimum)
- Less control (no custom plugins)
- API-driven only (no direct cluster access)
Best for: SaaS companies, startups, teams without Kubernetes expertise
2. Self-Managed: Docker Compose (Development)
Quick start for local testing:
```yaml
# docker-compose.yml
version: '3.8'
services:
  neo4j:
    image: neo4j:5.15-enterprise   # Use neo4j:5.15 (community) unless you have an Enterprise license
    container_name: neo4j
    ports:
      - "7474:7474"   # Browser UI
      - "7687:7687"   # Bolt protocol
    environment:
      NEO4J_AUTH: neo4j/password               # Change this!
      NEO4J_ACCEPT_LICENSE_AGREEMENT: "yes"    # Required by the enterprise image
      NEO4J_server_memory_heap_initial__size: 2G
      NEO4J_server_memory_heap_max__size: 4G
      NEO4J_server_memory_pagecache_size: 4G
    volumes:
      - neo4j_data:/var/lib/neo4j/data
      - neo4j_logs:/var/lib/neo4j/logs
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:7474"]
      interval: 10s
      timeout: 5s
      retries: 5

volumes:
  neo4j_data:
  neo4j_logs:
```
Run:
```shell
docker-compose up -d
# Browser UI: http://localhost:7474
# Default login: neo4j / password (as set in NEO4J_AUTH)
```
3. Self-Managed: Kubernetes + Helm (Production)
Architecture: Causal Cluster (3 nodes)
```
                 ┌─────────────────────────┐
                 │   Kubernetes Service    │
                 │ (Load Balancer/Ingress) │
                 └────────────┬────────────┘
                              │
            ┌─────────────────┼─────────────────┐
            │                 │                 │
 ┌──────────▼──────┐ ┌────────▼────────┐ ┌──────▼──────────┐
 │ PRIMARY/LEADER  │ │    FOLLOWER     │ │    FOLLOWER     │
 │   (Write ops)   │ │ (Read replica)  │ │ (Read replica)  │
 └──────────┬──────┘ └────────┬────────┘ └──────┬──────────┘
            │                 │                 │
            └─────────────────┴─────────────────┘
              Raft replication (~15 s catch-up)

                 ┌──────────────────────────┐
                 │  Backup (neo4j-admin in  │
                 │  a sidecar/CronJob pod)  │
                 └──────────────────────────┘
```
Helm values (minimal production setup):
```yaml
# values-prod.yaml
# NOTE: key names vary between chart versions; check `helm show values neo4j/neo4j`

# Image
image:
  repository: neo4j
  tag: 5.15-enterprise
  pullPolicy: IfNotPresent

# Licensing (required for Enterprise)
neo4j:
  # Obtain a license key from Neo4j
  licenseKey: "YOUR_LICENSE_KEY_HERE"
  # Cluster mode
  mode: CLUSTER
  clusterSize: 3

# Memory allocation
jvm:
  heapInitialSize: 2G
  heapMaxSize: 4G
  pagecacheSize: 4G

# Persistent volumes
volumes:
  data:
    mode: volumeClaimTemplate
    spec:
      storageClassName: fast-ssd   # Use SSD for better performance
      accessModes:
        - ReadWriteOnce
      resources:
        requests:
          storage: 100Gi
  logs:
    mode: volumeClaimTemplate
    spec:
      storageClassName: standard
      accessModes:
        - ReadWriteOnce
      resources:
        requests:
          storage: 20Gi

# Resource limits
resources:
  requests:
    cpu: 2
    memory: 10Gi
  limits:
    cpu: 4
    memory: 12Gi

# Authentication
auth:
  enabled: true
  username: neo4j
  password: "SecurePassword123!"   # Use a Kubernetes secret instead!

# TLS for in-flight encryption
tls:
  enabled: true
  privateKey:
    secretName: neo4j-tls-key
    keyName: tls.key
  certificate:
    secretName: neo4j-tls-cert
    certName: tls.crt

# Health checks
healthCheck:
  liveness:
    enabled: true
    initialDelaySeconds: 30
    periodSeconds: 10
  readiness:
    enabled: true
    initialDelaySeconds: 15
    periodSeconds: 10

# Service
service:
  type: ClusterIP
  ports:
    http: 7474
    https: 7473
    bolt: 7687
    boltSSL: 7688

# Ingress
ingress:
  enabled: true
  className: nginx
  hosts:
    - host: neo4j.example.com
      paths:
        - path: /
          pathType: Prefix

# Backups (sidecar)
backup:
  enabled: true
  schedule: "0 2 * * *"   # 2 AM daily
  volumeSize: 50Gi
```
Installation:
```shell
# Add Neo4j Helm repository
helm repo add neo4j https://helm.neo4j.com/neo4j
helm repo update

# Store the admin password in a secret, then install with production values
kubectl create namespace neo4j
kubectl create secret generic neo4j-auth \
  --from-literal=password="$PASSWORD" -n neo4j
helm install neo4j neo4j/neo4j \
  --namespace neo4j \
  -f values-prod.yaml

# Verify cluster
kubectl logs -n neo4j neo4j-0
kubectl logs -n neo4j neo4j-1
kubectl logs -n neo4j neo4j-2

# Port-forward to test
kubectl port-forward -n neo4j neo4j-0 7687:7687
# Connect with Cypher Shell:
# cypher-shell -a bolt://localhost:7687 -u neo4j -p "$PASSWORD"
```
Rolling Upgrades (Zero Downtime)
```shell
# 1. Check current version
kubectl get statefulset neo4j -n neo4j \
  -o jsonpath='{.spec.template.spec.containers[0].image}'

# 2. Trigger rolling update (Kubernetes replaces pods one at a time)
kubectl set image statefulset/neo4j \
  neo4j=neo4j:5.16-enterprise \
  -n neo4j

# 3. Watch rollout
kubectl rollout status statefulset/neo4j -n neo4j

# 4. Verify all members rejoined the cluster
#    (SHOW SERVERS replaces dbms.cluster.overview() in Neo4j 5)
kubectl exec -it neo4j-0 -n neo4j -- cypher-shell \
  -u neo4j -p "$PASSWORD" "SHOW SERVERS"
```
Performance & Sizing
Memory Configuration
Neo4j has two main memory regions:
- Heap: runtime objects and query execution
- Page cache: graph data from disk cached in memory (like the OS filesystem cache)
Sizing formula:
```
Total machine memory: 32GB
  Heap:           4GB  (1/5 of the memory left after the OS reserve)
  Page cache:    16GB  (size of your hot dataset)
  OS + overhead: 12GB  (system + other processes)
```
Calculation:
```yaml
# If the dataset is 50GB with ~10GB hot:
heap_size: 2G
pagecache_size: 8G
total_needed: 10G per instance
```
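The split can be captured in a small helper. A sketch following this section's rule of thumb (the function and its defaults are illustrative, not official Neo4j guidance):

```python
def size_memory(total_gb: float, os_reserve_gb: float = 12.0):
    """Split the RAM left after the OS reserve: ~1/5 heap, the rest page cache."""
    available = total_gb - os_reserve_gb
    if available <= 0:
        raise ValueError("not enough memory after the OS reserve")
    heap = available / 5
    pagecache = available - heap
    return heap, pagecache

heap, pagecache = size_memory(32)  # (4.0, 16.0), matching the 32GB example above
```

In practice round to whole gigabytes and set the results via `server.memory.heap.max_size` and `server.memory.pagecache.size`.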
Check the page cache hit rate: enable metrics and watch the `page_cache.hit_ratio` gauge (together with `page_cache.hits` and `page_cache.page_faults`). A ratio below ~95% means the hot dataset does not fit in the page cache. The same counters are exposed as JMX beans if you prefer `dbms.queryJmx` on Neo4j 4.x.
Query Optimization
Always use EXPLAIN or PROFILE:
```cypher
// EXPLAIN: show the execution plan without running the query
EXPLAIN MATCH (p:Person)-[:KNOWS]-(f) WHERE f.age > 30 RETURN f

// PROFILE: execute and show actual stats
PROFILE MATCH (p:Person)-[:KNOWS]-(f) WHERE f.age > 30 RETURN f
// Look for: db hits, rows, execution time
```
Common anti-patterns:
```cypher
// BAD: Scanning all Person nodes (no index on age)
MATCH (p:Person) WHERE p.age > 30 RETURN p
// Better: create an index on :Person(age)

// BAD: Cartesian product (every a paired with every b)
MATCH (a), (b) RETURN a, b   // Don't do this!

// BAD: Deep, unfiltered traversal with massive fan-out
MATCH (p:Person)-[:LIKES*10]-(other) RETURN other
// Better: bound the depth and add WHERE clauses early to prune

// GOOD: Index lookup + early filtering
MATCH (p:Person {age: 30})-[:KNOWS]-(friend)
RETURN friend
```
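The cost of an unbounded traversal grows geometrically with depth, which is why bounds like `*1..3` matter. A back-of-envelope helper (illustrative numbers, assuming a uniform average branching factor):

```python
def worst_case_frontier(branching: int, depth: int) -> int:
    """Upper bound on paths explored by a variable-length traversal:
    b + b^2 + ... + b^depth for average branching factor b."""
    return sum(branching ** d for d in range(1, depth + 1))

worst_case_frontier(50, 3)   # 127,550 paths — manageable
worst_case_frontier(50, 10)  # ~1e17 paths — effectively never finishes
```

Real graphs deduplicate revisited nodes, so actual work is usually lower, but the shape of the curve is the same: every extra hop multiplies the frontier.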
Bulk Data Import
Option 1: LOAD CSV (flexible, slower)
```cypher
// Load from URL or local file
LOAD CSV WITH HEADERS FROM "file:///data/people.csv" AS row
CREATE (p:Person {
  name: row.name,
  age: toInteger(row.age),
  email: row.email
})

// Batch writes: Neo4j 5 replaces PERIODIC COMMIT with CALL ... IN TRANSACTIONS
// (prefix with :auto in Browser or cypher-shell)
:auto LOAD CSV WITH HEADERS FROM "file:///data/relationships.csv" AS row
CALL {
  WITH row
  MERGE (a:Person {id: row.from})
  MERGE (b:Person {id: row.to})
  CREATE (a)-[:KNOWS]->(b)
} IN TRANSACTIONS OF 1000 ROWS
```
Option 2: neo4j-admin import (fastest, bulk only)
```shell
# Prepare CSV files in the import header format
# nodes-people.csv:
#   id:ID,name:STRING,age:INT,:LABEL
#   1,Alice,30,Person
#   2,Bob,25,Person
# relationships-knows.csv:
#   :START_ID,:END_ID,:TYPE,since:INT
#   1,2,KNOWS,2020
#   2,1,KNOWS,2020

# Stop Neo4j, then run the import (Neo4j 5 syntax)
docker exec neo4j neo4j-admin database import full neo4j \
  --nodes=/data/nodes-people.csv \
  --relationships=/data/relationships-knows.csv \
  --overwrite-destination

# Restart
docker restart neo4j
```
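Generating those files from application data is straightforward. A sketch that renders the node file in the header format shown above (`nodes_csv` is a hypothetical helper, not part of any Neo4j tooling):

```python
import csv
import io

def nodes_csv(people):
    """Render (id, name, age) tuples in neo4j-admin import's header format."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["id:ID", "name:STRING", "age:INT", ":LABEL"])
    for pid, name, age in people:
        writer.writerow([pid, name, age, "Person"])
    return buf.getvalue()

print(nodes_csv([(1, "Alice", 30), (2, "Bob", 25)]))
```

Using the `csv` module (rather than string joins) keeps quoting correct when names contain commas or quotes, which is a common cause of failed bulk imports.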
Security
Authentication & Authorization
```cypher
// Create role with minimal permissions
CREATE ROLE analyst;
GRANT ACCESS ON DATABASE neo4j TO analyst;
GRANT MATCH {*} ON GRAPH neo4j ELEMENTS * TO analyst;   // read-only graph access

// Create user with role
CREATE USER alice SET PASSWORD 'SecurePass123!' CHANGE REQUIRED;
GRANT ROLE analyst TO alice;

// Test as the new user: log out in Browser, then log back in as alice
:logout
```
Kubernetes Secret Management
Store credentials in Kubernetes secrets:
```shell
# Create secret
kubectl create secret generic neo4j-auth \
  --from-literal=username=neo4j \
  --from-literal=password=$(openssl rand -base64 32) \
  -n neo4j
```

```yaml
# Reference in Helm values
auth:
  enabled: true
  username: neo4j
  passwordFromSecret:
    name: neo4j-auth
    key: password
```
TLS/mTLS
```yaml
# values-prod.yaml
tls:
  enabled: true
  privateKey:
    secretName: neo4j-tls-key
    keyName: tls.key
  certificate:
    secretName: neo4j-tls-cert
    certName: tls.crt
```

```shell
# Generate a self-signed cert (testing only; use cert-manager or a real CA in production)
openssl req -x509 -newkey rsa:4096 -keyout tls.key -out tls.crt -days 365 -nodes \
  -subj "/CN=neo4j.default.svc.cluster.local"
kubectl create secret tls neo4j-tls-cert \
  --cert=tls.crt --key=tls.key -n neo4j
```
Network Policies
```yaml
# NetworkPolicy: only allow Bolt traffic from application pods
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: neo4j-network-policy
  namespace: neo4j
spec:
  podSelector:
    matchLabels:
      app: neo4j
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              name: applications
        - podSelector:
            matchLabels:
              role: neo4j-client
      ports:
        - protocol: TCP
          port: 7687
```
Backups & Disaster Recovery
Why Backups Matter
Real scenario: Database corruption or data loss
- Monday 2 AM: Automated backup completes (50GB compressed)
- Monday 3 PM: Accidental DELETE query wipes 30% of nodes
- Monday 4 PM: Incident discovered via monitoring alert
- Monday 5 PM: Restore from backup, validate data
- Monday 6 PM: System back online
Without backup:
- Data permanently lost (30% of production data gone)
- Potential legal liability
- Manual recovery from logs (days of work, error-prone)
Online Backups (Recommended)
What is an online backup?
- A snapshot of the database taken while it's running: no downtime, clients can still read and write
- A consistent point-in-time copy that can be stored locally, in S3, on NFS, etc.
- In Neo4j, true online backups use `neo4j-admin database backup` (Enterprise only); the `dump` command used in the scripts below is an offline operation that needs the database stopped (or a quiet maintenance window), but works on Community too
Step 1: Configure Backup Storage
Option A: Local Storage (Docker)
```shell
# Create backup directory on host
mkdir -p /backups/neo4j
chmod 755 /backups/neo4j
```

```yaml
# Mount it in docker-compose.yml
volumes:
  - neo4j_data:/var/lib/neo4j/data
  - /backups/neo4j:/backups   # Mount backup directory
```
Option B: Kubernetes PersistentVolume
```yaml
# Backup volume for storing dumps
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: neo4j-backups
  namespace: neo4j
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: standard
  resources:
    requests:
      storage: 100Gi   # size for your retention window (daily dump size × days kept)
```
Step 2: Create Backup Script
The backup process:
```bash
#!/bin/bash
# File: neo4j-backup.sh
# Purpose: daily backup of the Neo4j database
# Assumes the container's $BACKUP_DIR is bind-mounted at the same path on the host
set -euo pipefail

BACKUP_DIR="/backups"
DATE=$(date +%Y%m%d-%H%M%S)
BACKUP_FILE="neo4j-${DATE}.dump"
LOG_FILE="/var/log/neo4j-backup-${DATE}.log"

# Step 1: Check disk space before starting
AVAILABLE_SPACE=$(df "$BACKUP_DIR" | tail -1 | awk '{print $4}')
ESTIMATED_SIZE=50000000   # ~50GB in KB
if [ "$AVAILABLE_SPACE" -lt "$ESTIMATED_SIZE" ]; then
  echo "ERROR: Insufficient disk space! Available: ${AVAILABLE_SPACE}KB, Need: ${ESTIMATED_SIZE}KB" | tee "$LOG_FILE"
  curl -X POST "$ALERT_WEBHOOK" -d "Neo4j backup failed: disk full"   # Alert monitoring
  exit 1
fi

# Step 2: Create the dump (Neo4j 5 syntax; writes ${BACKUP_DIR}/neo4j.dump).
# Note: neo4j-admin database dump needs the database stopped; on Enterprise,
# use `neo4j-admin database backup` for a true online backup instead.
echo "[$(date)] Starting Neo4j backup..." | tee "$LOG_FILE"
docker exec neo4j neo4j-admin database dump neo4j \
  --to-path="$BACKUP_DIR" 2>&1 | tee -a "$LOG_FILE"
docker exec neo4j mv "${BACKUP_DIR}/neo4j.dump" "${BACKUP_DIR}/${BACKUP_FILE}"
BACKUP_SIZE=$(du -h "${BACKUP_DIR}/${BACKUP_FILE}" | cut -f1)

# Step 3: Verify the backup file exists
echo "[$(date)] Verifying backup..." | tee -a "$LOG_FILE"
if [ -f "${BACKUP_DIR}/${BACKUP_FILE}" ]; then
  echo "[$(date)] Backup successful: ${BACKUP_FILE} (${BACKUP_SIZE})" | tee -a "$LOG_FILE"
else
  echo "ERROR: Backup file not created!" | tee -a "$LOG_FILE"
  exit 1
fi

# Step 4: Upload to remote storage (S3); STANDARD_IA is cheaper for archival
echo "[$(date)] Uploading to S3..." | tee -a "$LOG_FILE"
aws s3 cp "${BACKUP_DIR}/${BACKUP_FILE}" \
  s3://company-backups/neo4j/ \
  --storage-class STANDARD_IA 2>&1 | tee -a "$LOG_FILE"

# Step 5: Clean up old local backups (keep 7 days)
echo "[$(date)] Cleaning up old backups (>7 days)..." | tee -a "$LOG_FILE"
find "$BACKUP_DIR" -name "neo4j-*.dump" -mtime +7 -delete

# Step 6: Update backup manifest
echo "${DATE} ${BACKUP_FILE} ${BACKUP_SIZE}" >> "${BACKUP_DIR}/manifest.log"

# Step 7: Send success notification
echo "[$(date)] Backup completed successfully!" | tee -a "$LOG_FILE"
curl -X POST "$SUCCESS_WEBHOOK" \
  -d "Neo4j backup successful: ${BACKUP_FILE} (${BACKUP_SIZE})"
```
What each step does:
- Check disk space: prevent backup failures due to a full disk
- Create dump: neo4j-admin database dump creates a consistent snapshot
- Verify file exists: confirm the backup was actually created
- Upload to S3: remote copy for disaster recovery
- Clean up old backups: free local disk space (keep 7 days)
- Update manifest: track backup history
- Alert on success: monitor that backups are happening
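The 7-day cleanup step mirrors `find -mtime +7`. The same retention policy as a pure function, useful if the backup job runs somewhere without GNU find (a sketch, keyed to the neo4j-YYYYMMDD-HHMMSS.dump naming used in the script):

```python
from datetime import datetime, timedelta

def expired_backups(filenames, now, keep_days=7):
    """Return dump files older than keep_days, parsed from
    the neo4j-YYYYMMDD-HHMMSS.dump filename convention."""
    cutoff = now - timedelta(days=keep_days)
    expired = []
    for name in filenames:
        stamp = name.removeprefix("neo4j-").removesuffix(".dump")
        taken = datetime.strptime(stamp, "%Y%m%d-%H%M%S")
        if taken < cutoff:
            expired.append(name)
    return expired

files = ["neo4j-20250101-020000.dump", "neo4j-20250114-020000.dump"]
expired_backups(files, now=datetime(2025, 1, 15))  # only the Jan 1 dump
```

Parsing the timestamp out of the filename (rather than trusting file mtimes) keeps retention correct even after files are copied between hosts, which resets mtime.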
Step 3: Automate with Kubernetes CronJob
```yaml
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: neo4j-backup-script
  namespace: neo4j
data:
  backup.sh: |
    #!/bin/bash
    set -e
    BACKUP_DIR="/backups"
    echo "Starting backup at $(date)"
    # Online backup over the backup port (Enterprise). neo4j-admin database dump
    # would need the store files locally and the database stopped.
    neo4j-admin database backup neo4j \
      --from=neo4j-0.neo4j.neo4j.svc.cluster.local:6362 \
      --to-path=${BACKUP_DIR}
    echo "Backup completed"
    # Upload to S3
    aws s3 cp ${BACKUP_DIR}/ s3://company-neo4j-backups/ --recursive
    # Cleanup old backups
    find ${BACKUP_DIR} -name "neo4j*" -mtime +7 -delete
---
# CronJob that runs daily at 2 AM UTC
apiVersion: batch/v1
kind: CronJob
metadata:
  name: neo4j-daily-backup
  namespace: neo4j
spec:
  # Schedule: minute hour day month dayOfWeek
  schedule: "0 2 * * *"   # every day at 2 AM
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: neo4j-backup
          containers:
            - name: neo4j-admin
              # Image must also include the aws CLI; build a small custom image in practice
              image: neo4j:5.15-enterprise
              command:
                - /bin/bash
                - -c
                - |
                  set -e
                  # Wait for Neo4j to answer queries
                  until cypher-shell -a neo4j-0.neo4j.neo4j.svc.cluster.local \
                      -u neo4j -p "$NEO4J_PASSWORD" "RETURN 1"; do
                    echo "Waiting for Neo4j to be ready..."
                    sleep 10
                  done
                  # Online backup (Enterprise) over the backup port
                  neo4j-admin database backup neo4j \
                    --from=neo4j-0.neo4j.neo4j.svc.cluster.local:6362 \
                    --to-path=/backups
                  # Upload to S3
                  aws s3 cp /backups/ s3://company-neo4j-backups/ --recursive
                  # Cleanup
                  find /backups -name "neo4j*" -mtime +7 -delete
                  echo "Backup completed successfully"
              env:
                - name: NEO4J_PASSWORD
                  valueFrom:
                    secretKeyRef:
                      name: neo4j-auth
                      key: password
                - name: AWS_ACCESS_KEY_ID
                  valueFrom:
                    secretKeyRef:
                      name: aws-credentials
                      key: access_key
                - name: AWS_SECRET_ACCESS_KEY
                  valueFrom:
                    secretKeyRef:
                      name: aws-credentials
                      key: secret_key
              volumeMounts:
                - name: backups
                  mountPath: /backups
          volumes:
            - name: backups
              persistentVolumeClaim:
                claimName: neo4j-backups
          # Don't retry if backup fails (manual investigation needed)
          restartPolicy: Never
  # Keep job history for 7 runs
  successfulJobsHistoryLimit: 7
  failedJobsHistoryLimit: 7
```
Snapshot Strategy (Kubernetes - Faster Recovery)
When to use snapshots:
- Faster recovery than full dump restoration
- Point-in-time consistency
- Good for infrastructure failures
When to use dumps:
- Cross-region disaster recovery
- Long-term archival
- Database corruption (need to analyze)
Taking Snapshots
```yaml
---
# VolumeSnapshotClass (a CSI driver with snapshot support must be installed)
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: neo4j-snapshot-class
driver: ebs.csi.aws.com
deletionPolicy: Delete
---
# Manual snapshot (pick a unique name; $(date ...) does not expand in static YAML)
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: neo4j-data-snapshot-20250115
  namespace: neo4j
spec:
  volumeSnapshotClassName: neo4j-snapshot-class
  source:
    persistentVolumeClaimName: neo4j-data-neo4j-0   # snapshot the primary's PVC
---
# Automated snapshots with a CronJob
apiVersion: batch/v1
kind: CronJob
metadata:
  name: neo4j-hourly-snapshot
  namespace: neo4j
spec:
  schedule: "0 * * * *"   # every hour
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: neo4j-snapshot
          containers:
            - name: kubectl
              image: bitnami/kubectl:latest
              command:
                - /bin/bash
                - -c
                - |
                  DATE=$(date +%Y%m%d-%H%M%S)
                  kubectl apply -f - <<EOF
                  apiVersion: snapshot.storage.k8s.io/v1
                  kind: VolumeSnapshot
                  metadata:
                    name: neo4j-data-snapshot-${DATE}
                    namespace: neo4j
                  spec:
                    volumeSnapshotClassName: neo4j-snapshot-class
                    source:
                      persistentVolumeClaimName: neo4j-data-neo4j-0
                  EOF
          restartPolicy: Never
```
Snapshot cleanup:
```shell
# Keep only the newest 72 hourly snapshots (~72 hours); delete the rest
kubectl delete volumesnapshot -n neo4j \
  $(kubectl get volumesnapshot -n neo4j \
    --sort-by=.metadata.creationTimestamp \
    -o name | head -n -72)
```
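The same retention rule as a pure function, if the cleanup runs from a controller instead of a shell one-liner (a sketch; it assumes names sorted oldest-first, as `kubectl --sort-by=.metadata.creationTimestamp` returns them):

```python
def snapshots_to_delete(names_oldest_first, keep=72):
    """Keep the newest `keep` snapshots (72 hourly ones ≈ 72 hours of history)."""
    if len(names_oldest_first) <= keep:
        return []
    return names_oldest_first[:-keep]

names = [f"neo4j-data-snapshot-{i:03d}" for i in range(75)]
snapshots_to_delete(names)  # the 3 oldest: ...-000, ...-001, ...-002
```

Counting snapshots (rather than parsing ages) is robust to missed CronJob runs: you always keep the newest N you actually have.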
Recovery Test (Quarterly Mandatory)
Why test recovery?
- Backups are worthless if you can’t restore
- Discover issues before real disaster
- Validate RPO/RTO targets
- Train team on recovery procedures
Step 1: Pre-Recovery Checklist
```bash
#!/bin/bash
# Pre-recovery validation
set -e
echo "=== Pre-Recovery Validation ==="

# Check the backup file exists
BACKUP_FILE="/backups/neo4j-20250101.dump"
if [ ! -f "$BACKUP_FILE" ]; then
  echo "ERROR: Backup file not found: $BACKUP_FILE"
  exit 1
fi
BACKUP_SIZE_KB=$(du -k "$BACKUP_FILE" | cut -f1)
BACKUP_TIME=$(stat -c %y "$BACKUP_FILE")   # GNU stat; use `stat -f %Sm` on macOS
echo "OK: Backup exists: $BACKUP_FILE (${BACKUP_SIZE_KB}KB, created $BACKUP_TIME)"

# Check staging has space: need ~3x (dump + unpacked store + logs)
STAGING_SPACE_KB=$(df /staging | tail -1 | awk '{print $4}')
REQUIRED_SPACE_KB=$((BACKUP_SIZE_KB * 3))
if [ "$STAGING_SPACE_KB" -lt "$REQUIRED_SPACE_KB" ]; then
  echo "ERROR: Insufficient space on staging!"
  exit 1
fi
echo "OK: Staging has sufficient disk space"

# Verify staging Neo4j is stopped
if docker ps | grep -q neo4j-staging; then
  docker stop neo4j-staging
fi
echo "OK: Staging Neo4j stopped"

echo ""
echo "Pre-recovery checks complete. Ready to restore."
```
Step 2: Restore to Staging
```bash
#!/bin/bash
# Restore backup to the staging environment
set -euo pipefail
BACKUP_FILE="/backups/neo4j-20250101.dump"
START_TIME=$(date +%s)

echo "=== Starting Restore Process ==="
echo "Backup: $BACKUP_FILE"
echo "Start time: $(date)"

# Step 1: Clear the staging database
echo ""
echo "Step 1: Clearing staging database..."
rm -rf /staging/neo4j/data/databases/*
rm -rf /staging/neo4j/data/transactions/*

# Step 2: Load the backup (Neo4j 5 syntax; `database load` reads <db>.dump
# from --from-path, so stage the file under the expected name first)
echo "Step 2: Loading backup into staging (this may take 10-30 minutes)..."
cp "$BACKUP_FILE" /backups/neo4j.dump
if ! docker exec neo4j-staging neo4j-admin database load neo4j \
    --from-path=/backups --overwrite-destination 2>&1 | tee restore.log; then
  echo "ERROR: Restore failed! Check restore.log"
  exit 1
fi
echo "OK: Backup loaded successfully"

# Step 3: Start staging Neo4j
echo "Step 3: Starting staging Neo4j..."
docker start neo4j-staging

# Step 4: Wait for Neo4j to answer queries
echo "Step 4: Verifying Neo4j is responding..."
RETRY_COUNT=0
until docker exec neo4j-staging cypher-shell \
    -u neo4j -p password "RETURN 1" > /dev/null 2>&1; do
  if [ "$RETRY_COUNT" -gt 30 ]; then
    echo "ERROR: Neo4j failed to start!"
    exit 1
  fi
  echo "  Waiting for Neo4j startup... (attempt $((RETRY_COUNT+1))/30)"
  sleep 10
  RETRY_COUNT=$((RETRY_COUNT + 1))   # avoid ((i++)), which trips set -e at 0
done
echo "OK: Neo4j is responding to queries"

END_TIME=$(date +%s)
DURATION=$((END_TIME - START_TIME))
echo ""
echo "=== Restore Complete ==="
echo "Duration: $((DURATION / 60)) minutes $((DURATION % 60)) seconds"
echo "Staging database ready for validation"
```
Step 3: Validate Data Integrity
```bash
#!/bin/bash
# Validate restored data
set -euo pipefail
CYPHER() {
  docker exec neo4j-staging cypher-shell -u neo4j -p password --format plain "$1"
}

echo "=== Validating Restored Data ==="

# Validation 1: count total nodes
echo "Validation 1: Node count..."
NODE_COUNT=$(CYPHER "MATCH (n) RETURN count(n)" | tail -1)
echo "  Total nodes: $NODE_COUNT"
[ "$NODE_COUNT" -gt 0 ] && echo "  OK: Nodes present" || exit 1

# Validation 2: count relationships
echo "Validation 2: Relationship count..."
REL_COUNT=$(CYPHER "MATCH ()-[r]->() RETURN count(r)" | tail -1)
echo "  Total relationships: $REL_COUNT"
[ "$REL_COUNT" -gt 0 ] && echo "  OK: Relationships present" || exit 1

# Validation 3: critical data spot-check (Person nodes)
echo "Validation 3: Critical data check (Person nodes)..."
PERSON_COUNT=$(CYPHER "MATCH (p:Person) RETURN count(p)" | tail -1)
echo "  Person nodes: $PERSON_COUNT"
[ "$PERSON_COUNT" -gt 0 ] && echo "  OK: Critical data intact" || exit 1

# Validation 4: data consistency (nodes missing expected properties)
echo "Validation 4: Checking for data consistency..."
CYPHER "MATCH (n) WHERE n.created_at IS NULL RETURN count(n) AS missing_created_at"

# Validation 5: performance spot-check
echo "Validation 5: Performance check..."
CYPHER "PROFILE MATCH (p:Person)-[:KNOWS*1..3]-(friend)
        RETURN count(DISTINCT friend)" > /dev/null
echo "  OK: Queries executing normally"

echo ""
echo "=== All Validations Passed ==="
```
Step 4: Document RPO/RTO
```bash
#!/bin/bash
# Document actual recovery metrics
echo "=== Disaster Recovery Metrics ==="
echo ""
echo "RPO (Recovery Point Objective):"
echo "  - Daily backups at 2 AM UTC"
echo "  - Maximum data loss: 24 hours"
echo "  - Last backup: $(ls -lt /backups/neo4j-*.dump | head -1 | awk '{print $6, $7, $8}')"
echo ""
echo "RTO (Recovery Time Objective):"
echo "  - Restore time: ~20 minutes (for a 50GB backup)"
echo "  - Validation time: ~5 minutes"
echo "  - Total RTO: ~30 minutes from decision to restored service"
echo ""
echo "DR Readiness Checklist:"
echo "  [x] Backups automated (daily)"
echo "  [x] Backups tested (quarterly)"
echo "  [x] Restore procedure documented"
echo "  [x] Team trained on recovery"
echo "  [x] RPO/RTO targets defined and met"
echo ""
echo "Next DR Test: $(date -d '+3 months' +%Y-%m-%d)"   # GNU date
```
Complete Backup & Recovery Runbook
Quick reference for incidents:
```yaml
# neo4j-dr-runbook.yaml
backup:
  frequency: daily
  time: "02:00 UTC"
  method: neo4j-admin database dump
  destination: s3://company-neo4j-backups/
  retention: 30 days
  size: ~50 GB (compressed)
  location: /backups/neo4j-YYYYMMDD.dump

restore_rto: 30 minutes
restore_steps:
  1_prepare: "Free disk space (100GB), stop staging Neo4j"
  2_load: "neo4j-admin database load --from-path=<dir> --overwrite-destination"
  3_startup: "Start Neo4j, wait for ready status"
  4_validate: "Run integrity checks, spot-check queries"
  5_switchover: "Update DNS/LB to point to staging"

testing_schedule:
  frequency: quarterly
  date: "First Thursday of each quarter"
  duration: "2 hours"
  participants: "SRE team, database lead, on-call engineer"

contacts:
  database_lead: "[email protected]"
  on_call: "See PagerDuty schedule"
  escalation: "#database-incidents on Slack"
```
Observability
Metrics (Prometheus + Grafana)
Enable JMX exporter:
```yaml
# values-prod.yaml
jmx:
  enabled: true
  port: 9090
```

```yaml
# ServiceMonitor (Prometheus Operator)
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: neo4j
  namespace: neo4j
spec:
  selector:
    matchLabels:
      app: neo4j
  endpoints:
    - port: metrics
      interval: 30s
```
Key metrics to track:
```
# Neo4j-specific
neo4j_jvm_heap_used_bytes
neo4j_jvm_gc_time_seconds
neo4j_database_transactions_open
neo4j_query_execution_seconds      # track P50 and P99
neo4j_page_cache_fault_count
neo4j_cluster_replication_lag_seconds

# Application-level SLOs
query_latency_p50: 50ms
query_latency_p99: 500ms
query_timeout_rate: <0.1%
```
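A quick way to check the P99 target against scraped latency samples, using the nearest-rank percentile (a monitoring sketch, not tied to any Neo4j API; the 500ms budget comes from the SLO above):

```python
def percentile(samples_ms, pct):
    """Nearest-rank percentile of a list of latency samples (ms)."""
    ordered = sorted(samples_ms)
    index = min(len(ordered) - 1, int(round(pct / 100 * (len(ordered) - 1))))
    return ordered[index]

def meets_slo(samples_ms, p99_target_ms=500):
    return percentile(samples_ms, 99) <= p99_target_ms

meets_slo(list(range(1, 101)))       # True: p99 is 99ms
meets_slo([450] * 98 + [2000] * 2)   # False: the slow tail blows the budget
```

In production you would read the precomputed histogram quantile from Prometheus instead; the point is to alert on the tail, not the average.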
Logging
```yaml
# values-prod.yaml
logging:
  enabled: true
  level: INFO
  slowLogThreshold: 1000   # log queries slower than 1s

# Structured query logging (Neo4j 5 setting names)
environment:
  NEO4J_db_logs_query_enabled: "INFO"
  NEO4J_db_logs_query_threshold: "1000ms"
```
Sample slow query entry:
```
timestamp=2025-01-15T10:30:45Z query="MATCH (n)-[:KNOWS*10]-(m) RETURN m"
duration=2450ms parameters={} client=127.0.0.1
```
Best Practices Checklist (Top 12)
- Use parameterized queries to prevent injection attacks
- Index frequently queried properties (name, email, external IDs)
- Create uniqueness constraints where applicable (prevents duplicates)
- Size page cache to ~80% of dataset for optimal performance
- Monitor page cache hit rate (target >95%)
- Run PROFILE on slow queries to identify inefficiencies
- Batch writes with CALL { … } IN TRANSACTIONS (1000-5000 rows per batch; replaces PERIODIC COMMIT in Neo4j 5)
- Use Causal Cluster for HA (3+ nodes recommended)
- Enable TLS for all client connections (in-flight encryption)
- Implement network policies (restrict pod-to-pod access)
- Daily backups with monthly restore tests (verify RPO/RTO)
- Set resource limits (heap, pagecache, CPU) to prevent OOM
Top Pitfalls to Avoid
| Pitfall | Impact | Solution |
|---|---|---|
| Super-nodes (too many relationships) | Traversal slowdown | Redesign with intermediate nodes |
| Missing indexes | Query timeouts | Profile queries, add indexes proactively |
| Unbounded traversals (no LIMIT) | Memory exhaustion | Use LIMIT, constrain depth (*1..3) |
| Building queries by string concatenation | Cypher injection | Always use parameters |
| Wrong memory split (too much heap) | Page cache thrashing | Follow the sizing formula: ~1/5 heap, rest page cache |
| No backups | Data loss | Automate daily backups, test recovery |
| Running Enterprise without a license | Legal/support issues | Purchase a license or use Community edition |
| Insufficient replication | Split-brain scenarios | Use a 3+ node Causal Cluster, monitor replication lag |
| Skipping version upgrades | Security vulnerabilities | Plan quarterly upgrades, test on staging |
CI/CD Example (Helm Deployment)
GitOps with ArgoCD:
```yaml
# argocd-neo4j-app.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: neo4j
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://helm.neo4j.com/neo4j
    chart: neo4j
    targetRevision: "5.15.*"   # pin the minor version
    helm:
      releaseName: neo4j
      valuesObject:
        clusterSize: 3
        mode: CLUSTER
        jvm:
          heapInitialSize: 2G
          heapMaxSize: 4G
  destination:
    server: https://kubernetes.default.svc
    namespace: neo4j
  syncPolicy:
    automated:
      prune: false   # manual approval for destructive changes
      selfHeal: true
    syncOptions:
      - Validate=true
```
Blue/Green Cluster Switch:
```shell
# Deploy the new cluster (blue)
helm install neo4j-blue neo4j/neo4j \
  -f values-blue.yaml -n neo4j-blue

# Run smoke tests against blue
kubectl run smoke-test --rm -it --restart=Never --image=neo4j:5.15 -- \
  cypher-shell -a neo4j-blue-0.neo4j-blue.neo4j-blue.svc.cluster.local \
  -u neo4j -p "$PASSWORD" "SHOW SERVERS"

# Switch the service endpoint to blue
kubectl patch service neo4j-router \
  -p '{"spec":{"selector":{"app":"neo4j-blue"}}}'

# Keep green running for quick rollback; uninstall it only after the
# validation window has passed
helm uninstall neo4j-green
```
Architecture Diagram (Causal Cluster + K8s)
```
┌──────────────────────────────────────────────────────────────────────┐
│                          Kubernetes Cluster                          │
│                                                                      │
│  ┌────────────────────────────────────────────────────────────────┐  │
│  │               Ingress (HTTPS/TLS termination)                  │  │
│  │            neo4j.example.com:443 → ClusterIP:7687              │  │
│  └───────────────────────────┬────────────────────────────────────┘  │
│                              │ load balancing                        │
│          ┌───────────────────┼───────────────────┐                   │
│          │                   │                   │                   │
│  ┌───────▼───────┐   ┌───────▼───────┐   ┌───────▼───────┐           │
│  │   neo4j-0     │   │   neo4j-1     │   │   neo4j-2     │           │
│  │   PRIMARY     │   │   FOLLOWER    │   │   FOLLOWER    │           │
│  │  neo4j:5.15   │   │  CPU: 2       │   │  CPU: 2       │           │
│  │  Mem: 10Gi    │   │  Mem: 10Gi    │   │  Mem: 10Gi    │           │
│  │  Heap: 4Gi    │   │               │   │               │           │
│  │  Cache: 4Gi   │   │               │   │               │           │
│  │  PVC: 100Gi   │   │  PVC: 100Gi   │   │  PVC: 100Gi   │           │
│  │  (SSD)        │   │  (SSD)        │   │  (SSD)        │           │
│  └───────────────┘   └───────────────┘   └───────────────┘           │
│   Headless service: neo4j-{0,1,2}.neo4j.svc.cluster.local            │
│   Causal Cluster (Raft replication), follower lag < 1s               │
│                                                                      │
│  ┌────────────────────────────────────────────────────────────────┐  │
│  │  Backup CronJob: daily 2 AM,                                   │  │
│  │  neo4j-admin dump → S3/NFS (~30 GB compressed)                 │  │
│  └────────────────────────────────────────────────────────────────┘  │
│                                                                      │
│  ┌────────────────────────────────────────────────────────────────┐  │
│  │  Observability: Prometheus (metrics) · Grafana (dashboards) ·  │  │
│  │  Loki (query logs)                                             │  │
│  └────────────────────────────────────────────────────────────────┘  │
└──────────────────────────────────────────────────────────────────────┘
```
Conclusion
Neo4j excels for relationship-heavy workloads at scale. Key takeaways:
- Production deployment: Kubernetes + Causal Cluster (3+ nodes)
- Performance: Size page cache to dataset, monitor hit rate
- Security: TLS + parameterized queries + RBAC
- Backups: Daily dumps, test recovery quarterly
- Observability: JMX metrics, slow query logs, health probes
For SRE teams: treat Neo4j like any production database: pin versions, test your backups, alert on replication lag, and assume failure will happen.