Executive Summary
Neo4j is a native graph database that stores data as nodes (entities) connected by relationships (edges). Unlike relational databases, which normalize data into tables and reassemble it with JOINs at query time, Neo4j stores relationships directly and excels at traversing them.
Quick decision:
- Use Neo4j for: Knowledge graphs, authorization/identity, recommendations, fraud detection, network topology, impact analysis
- Don’t use for: Heavy OLAP analytics, simple key-value workloads, document storage
Production deployment: Neo4j Aura (managed) or Kubernetes + Helm running a Causal Cluster (self-managed); Docker Compose for local development
What is Neo4j?
Core Concepts
Labeled Property Graph Model:
```
Node (entity)              Relationship (connection)
┌──────────────┐           ┌────────────────┐
│ :Person      │           │ :KNOWS         │
│ name: "Bob"  │──────────▶│ since: 5 years │
│ age: 30      │           └────────────────┘
└──────────────┘
```
- Nodes: Entities with labels and properties
- Relationships: Typed, directed connections between nodes
- Labels: Categories (e.g., :Person, :Company)
- Properties: Key-value data on nodes and relationships
- Cypher: SQL-like query language for graphs
When to Use Neo4j
Best use cases:
| Use Case | Example | Why Neo4j |
|---|---|---|
| Authorization/Identity | Who can access what? | Fast relationship traversal (no JOINs) |
| Knowledge Graphs | Knowledge bases, semantic search | Hierarchies + connections |
| Recommendations | "Customers like you also liked…" | Pattern matching on user behavior |
| Fraud Detection | Ring detection, money flows | Detect loops & suspicious patterns |
| Network/Topology | Infrastructure dependencies, blast radius | Fast path queries |
| Impact Analysis | "If this service fails, what breaks?" | Upstream/downstream traversal |
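The impact-analysis row boils down to graph traversal: in Cypher a pattern like `(failed)<-[:DEPENDS_ON*]-(dependent)` finds everything that transitively breaks. As a rough sketch of what that traversal does, here is the same reverse-dependency walk in plain Python (the service names and `blast_radius` helper are hypothetical):

```python
from collections import deque

# Toy impact analysis: which services break if `failed` goes down?
# depends_on maps each service to the services it depends on.
def blast_radius(depends_on, failed):
    # Invert the edges: who depends directly on each service
    dependents = {}
    for svc, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(svc)
    # Breadth-first walk over the reversed edges
    seen, queue = set(), deque([failed])
    while queue:
        current = queue.popleft()
        for svc in dependents.get(current, []):
            if svc not in seen:
                seen.add(svc)
                queue.append(svc)
    return seen

graph = {"web": ["api"], "api": ["db"], "batch": ["db"], "db": []}
blast_radius(graph, "db")  # everything transitively depending on db
```

Neo4j does this pointer-chasing natively on disk-backed structures, which is why the traversal stays fast even when the dependency graph no longer fits in application memory.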
Don’t use Neo4j for:
- Heavy OLAP analytics (use ClickHouse, BigQuery)
- Simple key-value cache (use Redis)
- Document storage (use MongoDB)
- Time-series data (use InfluxDB, Prometheus)
Neo4j vs Other Databases
| Aspect | Neo4j | PostgreSQL | MongoDB | Redis |
|---|---|---|---|---|
| Data Model | Graph (nodes + relationships) | Relational (tables) | Document (JSON) | Key-value |
| Query Pattern | Traverse relationships | JOIN heavy tables | Nested documents | Simple keys |
| Join Cost | O(1): direct pointers | O(n log n): index scan | O(n): scan | N/A |
| Real-time graph queries | Fast | Slow (many JOINs) | Slow | Not applicable |
| Transactions | ACID | ACID | ACID (multi-document since 4.0) | Limited |
| Scalability | Vertical (clusters available) | Horizontal (sharding hard) | Horizontal (sharding built-in) | Horizontal (easy) |
Core Model & Queries
Cypher Essentials
CREATE nodes and relationships:
```cypher
// Create node with label and properties
CREATE (n:Person {name: "Alice", age: 30})

// Create relationship
MATCH (a:Person {name: "Alice"}), (b:Person {name: "Bob"})
CREATE (a)-[:KNOWS {since: 2020}]->(b)

// Shorthand: CREATE multiple nodes and the relationship in one pattern
CREATE (alice:Person {name: "Alice"})-[:KNOWS]->(bob:Person {name: "Bob"})
```
MATCH queries (read patterns):
```cypher
// Simple pattern
MATCH (p:Person {name: "Alice"})
RETURN p

// Traverse relationships
MATCH (alice:Person {name: "Alice"})-[:KNOWS]->(friend)
RETURN friend.name

// Multi-hop traversal (1 to 3 steps), returning the hop count
MATCH path = (alice:Person {name: "Alice"})-[:KNOWS*1..3]->(distant_friend)
RETURN distant_friend.name, length(path) AS distance

// Find shortest path
MATCH path = shortestPath((a:Person)-[:KNOWS*]-(b:Person))
WHERE a.name = "Alice" AND b.name = "Charlie"
RETURN path
```
MERGE (upsert):
```cypher
// Update if exists, create if not
MERGE (p:Person {email: "[email protected]"})
ON CREATE SET p.created_at = timestamp()
ON MATCH SET p.updated_at = timestamp()
SET p.age = 31
RETURN p
```
Using parameters (IMPORTANT for security):
```cypher
// GOOD: Parameterized (the driver passes $email separately from the query text)
MATCH (p:Person {email: $email})
RETURN p
// Run with parameters: {email: "[email protected]"}

// BAD: Building the query by string concatenation (injection-prone)
// "MATCH (p:Person {email: '" + userInput + "'}) RETURN p"  ← don't do this!
```
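From application code the same rule applies: hand parameters to the driver, never format them into the query string. A minimal sketch (the `find_person_query` helper is illustrative, not part of the Neo4j driver API):

```python
# Build a (query, parameters) pair; the server substitutes $email as a value,
# so hostile input can never change the query's structure.
def find_person_query(email: str):
    query = "MATCH (p:Person {email: $email}) RETURN p"
    return query, {"email": email}

# With the official Python driver this would run roughly as:
#   with driver.session() as session:
#       session.run(*find_person_query("[email protected]"))
query, params = find_person_query("[email protected]' OR '1'='1")
# The payload stays inert inside the parameter map; the query text is unchanged.
```

Parameterized queries also let Neo4j cache execution plans across calls, so this is a performance win as well as a security one.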
Data Modeling Best Practices
Anti-pattern: Super-nodes
```cypher
// BAD: Everything connects to one node
(:Transaction)-[:INVOLVES]->(:Company)
(:Account)-[:BELONGS_TO]->(:Company)
(:Customer)-[:WORKS_FOR]->(:Company)
// This creates a super-node; any traversal through "Company" is slow

// GOOD: Use intermediate nodes and specific relationship types
(:Transaction)-[:TO_ACCOUNT]->(:Account)
(:Account)-[:AT_COMPANY]->(:Company)
(:Customer)-[:HAS_ACCOUNT]->(:Account)
```
Relationship directions matter:
```cypher
// Bad: Undirected or vague relationship
(a)-[:RELATED]-(b)

// Good: Clear, purposeful direction; store one direction only,
// since Cypher can traverse a relationship either way
(author:Person)-[:WROTE]->(post:Post)
// (post)-[:POSTED_BY]->(author) would be a redundant mirror
```
Use constraints and indexes:
```cypher
// Create uniqueness constraint (also creates an index) — Neo4j 5.x syntax
CREATE CONSTRAINT email_unique FOR (p:Person) REQUIRE p.email IS UNIQUE;

// Create index for faster lookups
CREATE INDEX person_name FOR (p:Person) ON (p.name);

// Create composite index
CREATE INDEX user_company FOR (u:User) ON (u.company, u.role);
```
Deployment Options
1. Managed: Neo4j Aura
Pros:
- Zero operations; fully managed
- Auto backups, updates, scaling
- Global deployment (multi-region)
Cons:
- Higher cost ($100+/month minimum)
- Less control (no custom plugins)
- API-driven only (no direct cluster access)
Best for: SaaS companies, startups, teams without Kubernetes expertise
2. Self-Managed: Docker Compose (Development)
Quick start for local testing:
```yaml
# docker-compose.yml
version: '3.8'
services:
  neo4j:
    image: neo4j:5.15-enterprise   # Use neo4j:5.15 (community) unless you have an Enterprise license
    container_name: neo4j
    ports:
      - "7474:7474"   # Browser UI
      - "7687:7687"   # Bolt protocol
    environment:
      NEO4J_AUTH: neo4j/password               # Change this!
      NEO4J_ACCEPT_LICENSE_AGREEMENT: "yes"    # Required by the enterprise image
      NEO4J_server_memory_heap_initial__size: 2G
      NEO4J_server_memory_heap_max__size: 4G
      NEO4J_server_memory_pagecache_size: 4G
    volumes:
      - neo4j_data:/var/lib/neo4j/data
      - neo4j_logs:/var/lib/neo4j/logs
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:7474"]
      interval: 10s
      timeout: 5s
      retries: 5

volumes:
  neo4j_data:
  neo4j_logs:
```
Run:
```shell
docker-compose up -d
# Browser UI: http://localhost:7474
# Default login: neo4j / password (as set in NEO4J_AUTH)
```
3. Self-Managed: Kubernetes + Helm (Production)
Architecture: Causal Cluster (3 nodes)
```
                 ┌─────────────────────────┐
                 │   Kubernetes Service    │
                 │ (Load Balancer/Ingress) │
                 └────────────┬────────────┘
                              │
            ┌─────────────────┼─────────────────┐
            │                 │                 │
 ┌──────────▼──────┐ ┌────────▼────────┐ ┌──────▼──────────┐
 │ PRIMARY/LEADER  │ │    FOLLOWER     │ │    FOLLOWER     │
 │   (Write ops)   │ │ (Read replica)  │ │ (Read replica)  │
 └──────────┬──────┘ └────────┬────────┘ └──────┬──────────┘
            │                 │                 │
            └─────────────────┴─────────────────┘
              Raft replication (~15 s catch-up)

                 ┌──────────────────────────┐
                 │  Backup (neo4j-admin in  │
                 │  a sidecar/CronJob pod)  │
                 └──────────────────────────┘
```
Helm values (minimal production setup):
```yaml
# values-prod.yaml
# NOTE: key names vary between chart versions; check `helm show values neo4j/neo4j`

# Image
image:
  repository: neo4j
  tag: 5.15-enterprise
  pullPolicy: IfNotPresent

# Licensing (required for Enterprise)
neo4j:
  # Obtain a license key from Neo4j
  licenseKey: "YOUR_LICENSE_KEY_HERE"
  # Cluster mode
  mode: CLUSTER
  clusterSize: 3

# Memory allocation
jvm:
  heapInitialSize: 2G
  heapMaxSize: 4G
  pagecacheSize: 4G

# Persistent volumes
volumes:
  data:
    mode: volumeClaimTemplate
    spec:
      storageClassName: fast-ssd   # Use SSD for better performance
      accessModes:
        - ReadWriteOnce
      resources:
        requests:
          storage: 100Gi
  logs:
    mode: volumeClaimTemplate
    spec:
      storageClassName: standard
      accessModes:
        - ReadWriteOnce
      resources:
        requests:
          storage: 20Gi

# Resource limits
resources:
  requests:
    cpu: 2
    memory: 10Gi
  limits:
    cpu: 4
    memory: 12Gi

# Authentication
auth:
  enabled: true
  username: neo4j
  password: "SecurePassword123!"   # Use a Kubernetes secret instead!

# TLS for in-flight encryption
tls:
  enabled: true
  privateKey:
    secretName: neo4j-tls-key
    keyName: tls.key
  certificate:
    secretName: neo4j-tls-cert
    certName: tls.crt

# Health checks
healthCheck:
  liveness:
    enabled: true
    initialDelaySeconds: 30
    periodSeconds: 10
  readiness:
    enabled: true
    initialDelaySeconds: 15
    periodSeconds: 10

# Service
service:
  type: ClusterIP
  ports:
    http: 7474
    https: 7473
    bolt: 7687
    boltSSL: 7688

# Ingress
ingress:
  enabled: true
  className: nginx
  hosts:
    - host: neo4j.example.com
      paths:
        - path: /
          pathType: Prefix

# Backups (sidecar)
backup:
  enabled: true
  schedule: "0 2 * * *"   # 2 AM daily
  volumeSize: 50Gi
```
Installation:
```shell
# Add Neo4j Helm repository
helm repo add neo4j https://helm.neo4j.com/neo4j
helm repo update

# Store the admin password in a secret, then install with production values
kubectl create namespace neo4j
kubectl create secret generic neo4j-auth \
  --from-literal=password="$PASSWORD" -n neo4j
helm install neo4j neo4j/neo4j \
  --namespace neo4j \
  -f values-prod.yaml

# Verify cluster
kubectl logs -n neo4j neo4j-0
kubectl logs -n neo4j neo4j-1
kubectl logs -n neo4j neo4j-2

# Port-forward to test
kubectl port-forward -n neo4j neo4j-0 7687:7687
# Connect with Cypher Shell:
# cypher-shell -a bolt://localhost:7687 -u neo4j -p "$PASSWORD"
```
Rolling Upgrades (Zero Downtime)
```shell
# 1. Check current version
kubectl get statefulset neo4j -n neo4j \
  -o jsonpath='{.spec.template.spec.containers[0].image}'

# 2. Trigger rolling update (Kubernetes replaces pods one at a time)
kubectl set image statefulset/neo4j \
  neo4j=neo4j:5.16-enterprise \
  -n neo4j

# 3. Watch rollout
kubectl rollout status statefulset/neo4j -n neo4j

# 4. Verify all members rejoined the cluster
#    (SHOW SERVERS replaces dbms.cluster.overview() in Neo4j 5)
kubectl exec -it neo4j-0 -n neo4j -- cypher-shell \
  -u neo4j -p "$PASSWORD" "SHOW SERVERS"
```
Performance & Sizing
Memory Configuration
Neo4j has two main memory regions:
- Heap: runtime objects and query execution
- Page cache: graph data from disk cached in memory (like the OS filesystem cache)
Sizing formula:
```
Total machine memory: 32GB
  Heap:           4GB  (1/5 of the memory left after the OS reserve)
  Page cache:    16GB  (size of your hot dataset)
  OS + overhead: 12GB  (system + other processes)
```
Calculation:
```yaml
# If the dataset is 50GB with ~10GB hot:
heap_size: 2G
pagecache_size: 8G
total_needed: 10G per instance
```
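The split can be captured in a small helper. A sketch following this section's rule of thumb (the function and its defaults are illustrative, not official Neo4j guidance):

```python
def size_memory(total_gb: float, os_reserve_gb: float = 12.0):
    """Split the RAM left after the OS reserve: ~1/5 heap, the rest page cache."""
    available = total_gb - os_reserve_gb
    if available <= 0:
        raise ValueError("not enough memory after the OS reserve")
    heap = available / 5
    pagecache = available - heap
    return heap, pagecache

heap, pagecache = size_memory(32)  # (4.0, 16.0), matching the 32GB example above
```

In practice round to whole gigabytes and set the results via `server.memory.heap.max_size` and `server.memory.pagecache.size`.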
Check the page cache hit rate: enable metrics and watch the `page_cache.hit_ratio` gauge (together with `page_cache.hits` and `page_cache.page_faults`). A ratio below ~95% means the hot dataset does not fit in the page cache. The same counters are exposed as JMX beans if you prefer `dbms.queryJmx` on Neo4j 4.x.
Query Optimization
Always use EXPLAIN or PROFILE:
```cypher
// EXPLAIN: show the execution plan without running the query
EXPLAIN MATCH (p:Person)-[:KNOWS]-(f) WHERE f.age > 30 RETURN f

// PROFILE: execute and show actual stats
PROFILE MATCH (p:Person)-[:KNOWS]-(f) WHERE f.age > 30 RETURN f
// Look for: db hits, rows, execution time
```
Common anti-patterns:
```cypher
// BAD: Scanning all Person nodes (no index on age)
MATCH (p:Person) WHERE p.age > 30 RETURN p
// Better: create an index on :Person(age)

// BAD: Cartesian product (every a paired with every b)
MATCH (a), (b) RETURN a, b   // Don't do this!

// BAD: Deep, unfiltered traversal with massive fan-out
MATCH (p:Person)-[:LIKES*10]-(other) RETURN other
// Better: bound the depth and add WHERE clauses early to prune

// GOOD: Index lookup + early filtering
MATCH (p:Person {age: 30})-[:KNOWS]-(friend)
RETURN friend
```
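The cost of an unbounded traversal grows geometrically with depth, which is why bounds like `*1..3` matter. A back-of-envelope helper (illustrative numbers, assuming a uniform average branching factor):

```python
def worst_case_frontier(branching: int, depth: int) -> int:
    """Upper bound on paths explored by a variable-length traversal:
    b + b^2 + ... + b^depth for average branching factor b."""
    return sum(branching ** d for d in range(1, depth + 1))

worst_case_frontier(50, 3)   # 127,550 paths — manageable
worst_case_frontier(50, 10)  # ~1e17 paths — effectively never finishes
```

Real graphs deduplicate revisited nodes, so actual work is usually lower, but the shape of the curve is the same: every extra hop multiplies the frontier.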
Bulk Data Import
Option 1: LOAD CSV (flexible, slower)
```cypher
// Load from URL or local file
LOAD CSV WITH HEADERS FROM "file:///data/people.csv" AS row
CREATE (p:Person {
  name: row.name,
  age: toInteger(row.age),
  email: row.email
})

// Batch writes: Neo4j 5 replaces PERIODIC COMMIT with CALL ... IN TRANSACTIONS
// (prefix with :auto in Browser or cypher-shell)
:auto LOAD CSV WITH HEADERS FROM "file:///data/relationships.csv" AS row
CALL {
  WITH row
  MERGE (a:Person {id: row.from})
  MERGE (b:Person {id: row.to})
  CREATE (a)-[:KNOWS]->(b)
} IN TRANSACTIONS OF 1000 ROWS
```
Option 2: neo4j-admin import (fastest, bulk only)
```shell
# Prepare CSV files in the import header format
# nodes-people.csv:
#   id:ID,name:STRING,age:INT,:LABEL
#   1,Alice,30,Person
#   2,Bob,25,Person
# relationships-knows.csv:
#   :START_ID,:END_ID,:TYPE,since:INT
#   1,2,KNOWS,2020
#   2,1,KNOWS,2020

# Stop Neo4j, then run the import (Neo4j 5 syntax)
docker exec neo4j neo4j-admin database import full neo4j \
  --nodes=/data/nodes-people.csv \
  --relationships=/data/relationships-knows.csv \
  --overwrite-destination

# Restart
docker restart neo4j
```
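Generating those files from application data is straightforward. A sketch that renders the node file in the header format shown above (`nodes_csv` is a hypothetical helper, not part of any Neo4j tooling):

```python
import csv
import io

def nodes_csv(people):
    """Render (id, name, age) tuples in neo4j-admin import's header format."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["id:ID", "name:STRING", "age:INT", ":LABEL"])
    for pid, name, age in people:
        writer.writerow([pid, name, age, "Person"])
    return buf.getvalue()

print(nodes_csv([(1, "Alice", 30), (2, "Bob", 25)]))
```

Using the `csv` module (rather than string joins) keeps quoting correct when names contain commas or quotes, which is a common cause of failed bulk imports.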
Security
Authentication & Authorization
```cypher
// Create role with minimal permissions
CREATE ROLE analyst;
GRANT ACCESS ON DATABASE neo4j TO analyst;
GRANT MATCH {*} ON GRAPH neo4j ELEMENTS * TO analyst;   // read-only graph access

// Create user with role
CREATE USER alice SET PASSWORD 'SecurePass123!' CHANGE REQUIRED;
GRANT ROLE analyst TO alice;

// Test as the new user: log out in Browser, then log back in as alice
:logout
```
Kubernetes Secret Management
Store credentials in Kubernetes secrets:
```shell
# Create secret
kubectl create secret generic neo4j-auth \
  --from-literal=username=neo4j \
  --from-literal=password=$(openssl rand -base64 32) \
  -n neo4j
```

```yaml
# Reference in Helm values
auth:
  enabled: true
  username: neo4j
  passwordFromSecret:
    name: neo4j-auth
    key: password
```
TLS/mTLS
```yaml
# values-prod.yaml
tls:
  enabled: true
  privateKey:
    secretName: neo4j-tls-key
    keyName: tls.key
  certificate:
    secretName: neo4j-tls-cert
    certName: tls.crt
```

```shell
# Generate a self-signed cert (testing only; use cert-manager or a real CA in production)
openssl req -x509 -newkey rsa:4096 -keyout tls.key -out tls.crt -days 365 -nodes \
  -subj "/CN=neo4j.default.svc.cluster.local"
kubectl create secret tls neo4j-tls-cert \
  --cert=tls.crt --key=tls.key -n neo4j
```
Network Policies
```yaml
# NetworkPolicy: only allow Bolt traffic from application pods
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: neo4j-network-policy
  namespace: neo4j
spec:
  podSelector:
    matchLabels:
      app: neo4j
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              name: applications
        - podSelector:
            matchLabels:
              role: neo4j-client
      ports:
        - protocol: TCP
          port: 7687
```
Backups & Disaster Recovery
Why Backups Matter
Real scenario: Database corruption or data loss
- Monday 2 AM: Automated backup completes (50GB compressed)
- Monday 3 PM: Accidental DELETE query wipes 30% of nodes
- Monday 4 PM: Incident discovered via monitoring alert
- Monday 5 PM: Restore from backup, validate data
- Monday 6 PM: System back online
Without backup:
- Data permanently lost (30% of production data gone)
- Potential legal liability
- Manual recovery from logs (days of work, error-prone)
Online Backups (Recommended)
What is an online backup?
- A snapshot of the database taken while it's running: no downtime, clients can still read and write
- A consistent point-in-time copy that can be stored locally, in S3, on NFS, etc.
- In Neo4j, true online backups use `neo4j-admin database backup` (Enterprise only); the `dump` command used in the scripts below is an offline operation that needs the database stopped (or a quiet maintenance window), but works on Community too
Step 1: Configure Backup Storage
Option A: Local Storage (Docker)
```shell
# Create backup directory on host
mkdir -p /backups/neo4j
chmod 755 /backups/neo4j
```

```yaml
# Mount it in docker-compose.yml
volumes:
  - neo4j_data:/var/lib/neo4j/data
  - /backups/neo4j:/backups   # Mount backup directory
```
Option B: Kubernetes PersistentVolume
```yaml
# Backup volume for storing dumps
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: neo4j-backups
  namespace: neo4j
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: standard
  resources:
    requests:
      storage: 100Gi   # size for your retention window (daily dump size × days kept)
```
Step 2: Create Backup Script
The backup process:
```bash
#!/bin/bash
# File: neo4j-backup.sh
# Purpose: daily backup of the Neo4j database
# Assumes the container's $BACKUP_DIR is bind-mounted at the same path on the host
set -euo pipefail

BACKUP_DIR="/backups"
DATE=$(date +%Y%m%d-%H%M%S)
BACKUP_FILE="neo4j-${DATE}.dump"
LOG_FILE="/var/log/neo4j-backup-${DATE}.log"

# Step 1: Check disk space before starting
AVAILABLE_SPACE=$(df "$BACKUP_DIR" | tail -1 | awk '{print $4}')
ESTIMATED_SIZE=50000000   # ~50GB in KB
if [ "$AVAILABLE_SPACE" -lt "$ESTIMATED_SIZE" ]; then
  echo "ERROR: Insufficient disk space! Available: ${AVAILABLE_SPACE}KB, Need: ${ESTIMATED_SIZE}KB" | tee "$LOG_FILE"
  curl -X POST "$ALERT_WEBHOOK" -d "Neo4j backup failed: disk full"   # Alert monitoring
  exit 1
fi

# Step 2: Create the dump (Neo4j 5 syntax; writes ${BACKUP_DIR}/neo4j.dump).
# Note: neo4j-admin database dump needs the database stopped; on Enterprise,
# use `neo4j-admin database backup` for a true online backup instead.
echo "[$(date)] Starting Neo4j backup..." | tee "$LOG_FILE"
docker exec neo4j neo4j-admin database dump neo4j \
  --to-path="$BACKUP_DIR" 2>&1 | tee -a "$LOG_FILE"
docker exec neo4j mv "${BACKUP_DIR}/neo4j.dump" "${BACKUP_DIR}/${BACKUP_FILE}"
BACKUP_SIZE=$(du -h "${BACKUP_DIR}/${BACKUP_FILE}" | cut -f1)

# Step 3: Verify the backup file exists
echo "[$(date)] Verifying backup..." | tee -a "$LOG_FILE"
if [ -f "${BACKUP_DIR}/${BACKUP_FILE}" ]; then
  echo "[$(date)] Backup successful: ${BACKUP_FILE} (${BACKUP_SIZE})" | tee -a "$LOG_FILE"
else
  echo "ERROR: Backup file not created!" | tee -a "$LOG_FILE"
  exit 1
fi

# Step 4: Upload to remote storage (S3); STANDARD_IA is cheaper for archival
echo "[$(date)] Uploading to S3..." | tee -a "$LOG_FILE"
aws s3 cp "${BACKUP_DIR}/${BACKUP_FILE}" \
  s3://company-backups/neo4j/ \
  --storage-class STANDARD_IA 2>&1 | tee -a "$LOG_FILE"

# Step 5: Clean up old local backups (keep 7 days)
echo "[$(date)] Cleaning up old backups (>7 days)..." | tee -a "$LOG_FILE"
find "$BACKUP_DIR" -name "neo4j-*.dump" -mtime +7 -delete

# Step 6: Update backup manifest
echo "${DATE} ${BACKUP_FILE} ${BACKUP_SIZE}" >> "${BACKUP_DIR}/manifest.log"

# Step 7: Send success notification
echo "[$(date)] Backup completed successfully!" | tee -a "$LOG_FILE"
curl -X POST "$SUCCESS_WEBHOOK" \
  -d "Neo4j backup successful: ${BACKUP_FILE} (${BACKUP_SIZE})"
```
What each step does:
- Check disk space: prevent backup failures due to a full disk
- Create dump: neo4j-admin database dump creates a consistent snapshot
- Verify file exists: confirm the backup was actually created
- Upload to S3: remote copy for disaster recovery
- Clean up old backups: free local disk space (keep 7 days)
- Update manifest: track backup history
- Alert on success: monitor that backups are happening
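The 7-day cleanup step mirrors `find -mtime +7`. The same retention policy as a pure function, useful if the backup job runs somewhere without GNU find (a sketch, keyed to the neo4j-YYYYMMDD-HHMMSS.dump naming used in the script):

```python
from datetime import datetime, timedelta

def expired_backups(filenames, now, keep_days=7):
    """Return dump files older than keep_days, parsed from
    the neo4j-YYYYMMDD-HHMMSS.dump filename convention."""
    cutoff = now - timedelta(days=keep_days)
    expired = []
    for name in filenames:
        stamp = name.removeprefix("neo4j-").removesuffix(".dump")
        taken = datetime.strptime(stamp, "%Y%m%d-%H%M%S")
        if taken < cutoff:
            expired.append(name)
    return expired

files = ["neo4j-20250101-020000.dump", "neo4j-20250114-020000.dump"]
expired_backups(files, now=datetime(2025, 1, 15))  # only the Jan 1 dump
```

Parsing the timestamp out of the filename (rather than trusting file mtimes) keeps retention correct even after files are copied between hosts, which resets mtime.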
Step 3: Automate with Kubernetes CronJob
```yaml
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: neo4j-backup-script
  namespace: neo4j
data:
  backup.sh: |
    #!/bin/bash
    set -e
    BACKUP_DIR="/backups"
    echo "Starting backup at $(date)"
    # Online backup over the backup port (Enterprise). neo4j-admin database dump
    # would need the store files locally and the database stopped.
    neo4j-admin database backup neo4j \
      --from=neo4j-0.neo4j.neo4j.svc.cluster.local:6362 \
      --to-path=${BACKUP_DIR}
    echo "Backup completed"
    # Upload to S3
    aws s3 cp ${BACKUP_DIR}/ s3://company-neo4j-backups/ --recursive
    # Cleanup old backups
    find ${BACKUP_DIR} -name "neo4j*" -mtime +7 -delete
---
# CronJob that runs daily at 2 AM UTC
apiVersion: batch/v1
kind: CronJob
metadata:
  name: neo4j-daily-backup
  namespace: neo4j
spec:
  # Schedule: minute hour day month dayOfWeek
  schedule: "0 2 * * *"   # every day at 2 AM
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: neo4j-backup
          containers:
            - name: neo4j-admin
              # Image must also include the aws CLI; build a small custom image in practice
              image: neo4j:5.15-enterprise
              command:
                - /bin/bash
                - -c
                - |
                  set -e
                  # Wait for Neo4j to answer queries
                  until cypher-shell -a neo4j-0.neo4j.neo4j.svc.cluster.local \
                      -u neo4j -p "$NEO4J_PASSWORD" "RETURN 1"; do
                    echo "Waiting for Neo4j to be ready..."
                    sleep 10
                  done
                  # Online backup (Enterprise) over the backup port
                  neo4j-admin database backup neo4j \
                    --from=neo4j-0.neo4j.neo4j.svc.cluster.local:6362 \
                    --to-path=/backups
                  # Upload to S3
                  aws s3 cp /backups/ s3://company-neo4j-backups/ --recursive
                  # Cleanup
                  find /backups -name "neo4j*" -mtime +7 -delete
                  echo "Backup completed successfully"
              env:
                - name: NEO4J_PASSWORD
                  valueFrom:
                    secretKeyRef:
                      name: neo4j-auth
                      key: password
                - name: AWS_ACCESS_KEY_ID
                  valueFrom:
                    secretKeyRef:
                      name: aws-credentials
                      key: access_key
                - name: AWS_SECRET_ACCESS_KEY
                  valueFrom:
                    secretKeyRef:
                      name: aws-credentials
                      key: secret_key
              volumeMounts:
                - name: backups
                  mountPath: /backups
          volumes:
            - name: backups
              persistentVolumeClaim:
                claimName: neo4j-backups
          # Don't retry if backup fails (manual investigation needed)
          restartPolicy: Never
  # Keep job history for 7 runs
  successfulJobsHistoryLimit: 7
  failedJobsHistoryLimit: 7
```
Snapshot Strategy (Kubernetes - Faster Recovery)
When to use snapshots:
- Faster recovery than full dump restoration
- Point-in-time consistency
- Good for infrastructure failures
When to use dumps:
- Cross-region disaster recovery
- Long-term archival
- Database corruption (need to analyze)
Taking Snapshots
```yaml
---
# VolumeSnapshotClass (a CSI driver with snapshot support must be installed)
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: neo4j-snapshot-class
driver: ebs.csi.aws.com
deletionPolicy: Delete
---
# Manual snapshot (pick a unique name; $(date ...) does not expand in static YAML)
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: neo4j-data-snapshot-20250115
  namespace: neo4j
spec:
  volumeSnapshotClassName: neo4j-snapshot-class
  source:
    persistentVolumeClaimName: neo4j-data-neo4j-0   # snapshot the primary's PVC
---
# Automated snapshots with a CronJob
apiVersion: batch/v1
kind: CronJob
metadata:
  name: neo4j-hourly-snapshot
  namespace: neo4j
spec:
  schedule: "0 * * * *"   # every hour
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: neo4j-snapshot
          containers:
            - name: kubectl
              image: bitnami/kubectl:latest
              command:
                - /bin/bash
                - -c
                - |
                  DATE=$(date +%Y%m%d-%H%M%S)
                  kubectl apply -f - <<EOF
                  apiVersion: snapshot.storage.k8s.io/v1
                  kind: VolumeSnapshot
                  metadata:
                    name: neo4j-data-snapshot-${DATE}
                    namespace: neo4j
                  spec:
                    volumeSnapshotClassName: neo4j-snapshot-class
                    source:
                      persistentVolumeClaimName: neo4j-data-neo4j-0
                  EOF
          restartPolicy: Never
```
Snapshot cleanup:
```shell
# Keep only the newest 72 hourly snapshots (~72 hours); delete the rest
kubectl delete volumesnapshot -n neo4j \
  $(kubectl get volumesnapshot -n neo4j \
    --sort-by=.metadata.creationTimestamp \
    -o name | head -n -72)
```
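The same retention rule as a pure function, if the cleanup runs from a controller instead of a shell one-liner (a sketch; it assumes names sorted oldest-first, as `kubectl --sort-by=.metadata.creationTimestamp` returns them):

```python
def snapshots_to_delete(names_oldest_first, keep=72):
    """Keep the newest `keep` snapshots (72 hourly ones ≈ 72 hours of history)."""
    if len(names_oldest_first) <= keep:
        return []
    return names_oldest_first[:-keep]

names = [f"neo4j-data-snapshot-{i:03d}" for i in range(75)]
snapshots_to_delete(names)  # the 3 oldest: ...-000, ...-001, ...-002
```

Counting snapshots (rather than parsing ages) is robust to missed CronJob runs: you always keep the newest N you actually have.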
Recovery Test (Quarterly Mandatory)
Why test recovery?
- Backups are worthless if you can’t restore
- Discover issues before real disaster
- Validate RPO/RTO targets
- Train team on recovery procedures
Step 1: Pre-Recovery Checklist
```bash
#!/bin/bash
# Pre-recovery validation
set -e
echo "=== Pre-Recovery Validation ==="

# Check the backup file exists
BACKUP_FILE="/backups/neo4j-20250101.dump"
if [ ! -f "$BACKUP_FILE" ]; then
  echo "ERROR: Backup file not found: $BACKUP_FILE"
  exit 1
fi
BACKUP_SIZE_KB=$(du -k "$BACKUP_FILE" | cut -f1)
BACKUP_TIME=$(stat -c %y "$BACKUP_FILE")   # GNU stat; use `stat -f %Sm` on macOS
echo "OK: Backup exists: $BACKUP_FILE (${BACKUP_SIZE_KB}KB, created $BACKUP_TIME)"

# Check staging has space: need ~3x (dump + unpacked store + logs)
STAGING_SPACE_KB=$(df /staging | tail -1 | awk '{print $4}')
REQUIRED_SPACE_KB=$((BACKUP_SIZE_KB * 3))
if [ "$STAGING_SPACE_KB" -lt "$REQUIRED_SPACE_KB" ]; then
  echo "ERROR: Insufficient space on staging!"
  exit 1
fi
echo "OK: Staging has sufficient disk space"

# Verify staging Neo4j is stopped
if docker ps | grep -q neo4j-staging; then
  docker stop neo4j-staging
fi
echo "OK: Staging Neo4j stopped"

echo ""
echo "Pre-recovery checks complete. Ready to restore."
```
Step 2: Restore to Staging
```bash
#!/bin/bash
# Restore backup to the staging environment
set -euo pipefail
BACKUP_FILE="/backups/neo4j-20250101.dump"
START_TIME=$(date +%s)

echo "=== Starting Restore Process ==="
echo "Backup: $BACKUP_FILE"
echo "Start time: $(date)"

# Step 1: Clear the staging database
echo ""
echo "Step 1: Clearing staging database..."
rm -rf /staging/neo4j/data/databases/*
rm -rf /staging/neo4j/data/transactions/*

# Step 2: Load the backup (Neo4j 5 syntax; `database load` reads <db>.dump
# from --from-path, so stage the file under the expected name first)
echo "Step 2: Loading backup into staging (this may take 10-30 minutes)..."
cp "$BACKUP_FILE" /backups/neo4j.dump
if ! docker exec neo4j-staging neo4j-admin database load neo4j \
    --from-path=/backups --overwrite-destination 2>&1 | tee restore.log; then
  echo "ERROR: Restore failed! Check restore.log"
  exit 1
fi
echo "OK: Backup loaded successfully"

# Step 3: Start staging Neo4j
echo "Step 3: Starting staging Neo4j..."
docker start neo4j-staging

# Step 4: Wait for Neo4j to answer queries
echo "Step 4: Verifying Neo4j is responding..."
RETRY_COUNT=0
until docker exec neo4j-staging cypher-shell \
    -u neo4j -p password "RETURN 1" > /dev/null 2>&1; do
  if [ "$RETRY_COUNT" -gt 30 ]; then
    echo "ERROR: Neo4j failed to start!"
    exit 1
  fi
  echo "  Waiting for Neo4j startup... (attempt $((RETRY_COUNT+1))/30)"
  sleep 10
  RETRY_COUNT=$((RETRY_COUNT + 1))   # avoid ((i++)), which trips set -e at 0
done
echo "OK: Neo4j is responding to queries"

END_TIME=$(date +%s)
DURATION=$((END_TIME - START_TIME))
echo ""
echo "=== Restore Complete ==="
echo "Duration: $((DURATION / 60)) minutes $((DURATION % 60)) seconds"
echo "Staging database ready for validation"
```
Step 3: Validate Data Integrity
```bash
#!/bin/bash
# Validate restored data
set -euo pipefail
CYPHER() {
  docker exec neo4j-staging cypher-shell -u neo4j -p password --format plain "$1"
}

echo "=== Validating Restored Data ==="

# Validation 1: count total nodes
echo "Validation 1: Node count..."
NODE_COUNT=$(CYPHER "MATCH (n) RETURN count(n)" | tail -1)
echo "  Total nodes: $NODE_COUNT"
[ "$NODE_COUNT" -gt 0 ] && echo "  OK: Nodes present" || exit 1

# Validation 2: count relationships
echo "Validation 2: Relationship count..."
REL_COUNT=$(CYPHER "MATCH ()-[r]->() RETURN count(r)" | tail -1)
echo "  Total relationships: $REL_COUNT"
[ "$REL_COUNT" -gt 0 ] && echo "  OK: Relationships present" || exit 1

# Validation 3: critical data spot-check (Person nodes)
echo "Validation 3: Critical data check (Person nodes)..."
PERSON_COUNT=$(CYPHER "MATCH (p:Person) RETURN count(p)" | tail -1)
echo "  Person nodes: $PERSON_COUNT"
[ "$PERSON_COUNT" -gt 0 ] && echo "  OK: Critical data intact" || exit 1

# Validation 4: data consistency (nodes missing expected properties)
echo "Validation 4: Checking for data consistency..."
CYPHER "MATCH (n) WHERE n.created_at IS NULL RETURN count(n) AS missing_created_at"

# Validation 5: performance spot-check
echo "Validation 5: Performance check..."
CYPHER "PROFILE MATCH (p:Person)-[:KNOWS*1..3]-(friend)
        RETURN count(DISTINCT friend)" > /dev/null
echo "  OK: Queries executing normally"

echo ""
echo "=== All Validations Passed ==="
```
Step 4: Document RPO/RTO
```bash
#!/bin/bash
# Document actual recovery metrics
echo "=== Disaster Recovery Metrics ==="
echo ""
echo "RPO (Recovery Point Objective):"
echo "  - Daily backups at 2 AM UTC"
echo "  - Maximum data loss: 24 hours"
echo "  - Last backup: $(ls -lt /backups/neo4j-*.dump | head -1 | awk '{print $6, $7, $8}')"
echo ""
echo "RTO (Recovery Time Objective):"
echo "  - Restore time: ~20 minutes (for a 50GB backup)"
echo "  - Validation time: ~5 minutes"
echo "  - Total RTO: ~30 minutes from decision to restored service"
echo ""
echo "DR Readiness Checklist:"
echo "  [x] Backups automated (daily)"
echo "  [x] Backups tested (quarterly)"
echo "  [x] Restore procedure documented"
echo "  [x] Team trained on recovery"
echo "  [x] RPO/RTO targets defined and met"
echo ""
echo "Next DR Test: $(date -d '+3 months' +%Y-%m-%d)"   # GNU date
```
Complete Backup & Recovery Runbook
Quick reference for incidents:
```yaml
# neo4j-dr-runbook.yaml
backup:
  frequency: daily
  time: "02:00 UTC"
  method: neo4j-admin database dump
  destination: s3://company-neo4j-backups/
  retention: 30 days
  size: ~50 GB (compressed)
  location: /backups/neo4j-YYYYMMDD.dump

restore_rto: 30 minutes
restore_steps:
  1_prepare: "Free disk space (100GB), stop staging Neo4j"
  2_load: "neo4j-admin database load --from-path=<dir> --overwrite-destination"
  3_startup: "Start Neo4j, wait for ready status"
  4_validate: "Run integrity checks, spot-check queries"
  5_switchover: "Update DNS/LB to point to staging"

testing_schedule:
  frequency: quarterly
  date: "First Thursday of each quarter"
  duration: "2 hours"
  participants: "SRE team, database lead, on-call engineer"

contacts:
  database_lead: "[email protected]"
  on_call: "See PagerDuty schedule"
  escalation: "#database-incidents on Slack"
```
Observability
Metrics (Prometheus + Grafana)
Enable JMX exporter:
```yaml
# values-prod.yaml
jmx:
  enabled: true
  port: 9090
```

```yaml
# ServiceMonitor (Prometheus Operator)
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: neo4j
  namespace: neo4j
spec:
  selector:
    matchLabels:
      app: neo4j
  endpoints:
    - port: metrics
      interval: 30s
```
Key metrics to track:
```
# Neo4j-specific
neo4j_jvm_heap_used_bytes
neo4j_jvm_gc_time_seconds
neo4j_database_transactions_open
neo4j_query_execution_seconds      # track P50 and P99
neo4j_page_cache_fault_count
neo4j_cluster_replication_lag_seconds

# Application-level SLOs
query_latency_p50: 50ms
query_latency_p99: 500ms
query_timeout_rate: <0.1%
```
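A quick way to check the P99 target against scraped latency samples, using the nearest-rank percentile (a monitoring sketch, not tied to any Neo4j API; the 500ms budget comes from the SLO above):

```python
def percentile(samples_ms, pct):
    """Nearest-rank percentile of a list of latency samples (ms)."""
    ordered = sorted(samples_ms)
    index = min(len(ordered) - 1, int(round(pct / 100 * (len(ordered) - 1))))
    return ordered[index]

def meets_slo(samples_ms, p99_target_ms=500):
    return percentile(samples_ms, 99) <= p99_target_ms

meets_slo(list(range(1, 101)))       # True: p99 is 99ms
meets_slo([450] * 98 + [2000] * 2)   # False: the slow tail blows the budget
```

In production you would read the precomputed histogram quantile from Prometheus instead; the point is to alert on the tail, not the average.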
Logging
```yaml
# values-prod.yaml
logging:
  enabled: true
  level: INFO
  slowLogThreshold: 1000   # log queries slower than 1s

# Structured query logging (Neo4j 5 setting names)
environment:
  NEO4J_db_logs_query_enabled: "INFO"
  NEO4J_db_logs_query_threshold: "1000ms"
```
Sample slow query entry:
```
timestamp=2025-01-15T10:30:45Z query="MATCH (n)-[:KNOWS*10]-(m) RETURN m"
duration=2450ms parameters={} client=127.0.0.1
```
Best Practices Checklist (Top 12)
- Use parameterized queries to prevent injection attacks
- Index frequently queried properties (name, email, external IDs)
- Create uniqueness constraints where applicable (prevents duplicates)
- Size page cache to ~80% of dataset for optimal performance
- Monitor page cache hit rate (target >95%)
- Run PROFILE on slow queries to identify inefficiencies
- Batch writes with CALL { … } IN TRANSACTIONS (1000-5000 rows per batch; replaces PERIODIC COMMIT in Neo4j 5)
- Use Causal Cluster for HA (3+ nodes recommended)
- Enable TLS for all client connections (in-flight encryption)
- Implement network policies (restrict pod-to-pod access)
- Daily backups with monthly restore tests (verify RPO/RTO)
- Set resource limits (heap, pagecache, CPU) to prevent OOM
Top Pitfalls to Avoid
| Pitfall | Impact | Solution |
|---|---|---|
| Super-nodes (too many relationships) | Traversal slowdown | Redesign with intermediate nodes |
| Missing indexes | Query timeouts | Profile queries, add indexes proactively |
| Unbounded traversals (no LIMIT) | Memory exhaustion | Use LIMIT, constrain depth (*1..3) |
| Building queries by string concatenation | Cypher injection | Always use parameters |
| Wrong memory split (too much heap) | Page cache thrashing | Follow the sizing formula: ~1/5 heap, rest page cache |
| No backups | Data loss | Automate daily backups, test recovery |
| Running Enterprise without a license | Legal/support issues | Purchase a license or use Community edition |
| Insufficient replication | Split-brain scenarios | Use a 3+ node Causal Cluster, monitor replication lag |
| Skipping version upgrades | Security vulnerabilities | Plan quarterly upgrades, test on staging |
CI/CD Example (Helm Deployment)
GitOps with ArgoCD:
```yaml
# argocd-neo4j-app.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: neo4j
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://helm.neo4j.com/neo4j
    chart: neo4j
    targetRevision: "5.15.*"   # pin the minor version
    helm:
      releaseName: neo4j
      valuesObject:
        clusterSize: 3
        mode: CLUSTER
        jvm:
          heapInitialSize: 2G
          heapMaxSize: 4G
  destination:
    server: https://kubernetes.default.svc
    namespace: neo4j
  syncPolicy:
    automated:
      prune: false   # manual approval for destructive changes
      selfHeal: true
    syncOptions:
      - Validate=true
```
Blue/Green Cluster Switch:
```shell
# Deploy the new cluster (blue)
helm install neo4j-blue neo4j/neo4j \
  -f values-blue.yaml -n neo4j-blue

# Run smoke tests against blue
kubectl run smoke-test --rm -it --restart=Never --image=neo4j:5.15 -- \
  cypher-shell -a neo4j-blue-0.neo4j-blue.neo4j-blue.svc.cluster.local \
  -u neo4j -p "$PASSWORD" "SHOW SERVERS"

# Switch the service endpoint to blue
kubectl patch service neo4j-router \
  -p '{"spec":{"selector":{"app":"neo4j-blue"}}}'

# Keep green running for quick rollback; uninstall it only after the
# validation window has passed
helm uninstall neo4j-green
```
Architecture Diagram (Causal Cluster + K8s)
```
┌──────────────────────────────────────────────────────────────────────┐
│                          Kubernetes Cluster                          │
│                                                                      │
│  ┌────────────────────────────────────────────────────────────────┐  │
│  │               Ingress (HTTPS/TLS termination)                  │  │
│  │            neo4j.example.com:443 → ClusterIP:7687              │  │
│  └───────────────────────────┬────────────────────────────────────┘  │
│                              │ load balancing                        │
│          ┌───────────────────┼───────────────────┐                   │
│          │                   │                   │                   │
│  ┌───────▼───────┐   ┌───────▼───────┐   ┌───────▼───────┐           │
│  │   neo4j-0     │   │   neo4j-1     │   │   neo4j-2     │           │
│  │   PRIMARY     │   │   FOLLOWER    │   │   FOLLOWER    │           │
│  │  neo4j:5.15   │   │  CPU: 2       │   │  CPU: 2       │           │
│  │  Mem: 10Gi    │   │  Mem: 10Gi    │   │  Mem: 10Gi    │           │
│  │  Heap: 4Gi    │   │               │   │               │           │
│  │  Cache: 4Gi   │   │               │   │               │           │
│  │  PVC: 100Gi   │   │  PVC: 100Gi   │   │  PVC: 100Gi   │           │
│  │  (SSD)        │   │  (SSD)        │   │  (SSD)        │           │
│  └───────────────┘   └───────────────┘   └───────────────┘           │
│   Headless service: neo4j-{0,1,2}.neo4j.svc.cluster.local            │
│   Causal Cluster (Raft replication), follower lag < 1s               │
│                                                                      │
│  ┌────────────────────────────────────────────────────────────────┐  │
│  │  Backup CronJob: daily 2 AM,                                   │  │
│  │  neo4j-admin dump → S3/NFS (~30 GB compressed)                 │  │
│  └────────────────────────────────────────────────────────────────┘  │
│                                                                      │
│  ┌────────────────────────────────────────────────────────────────┐  │
│  │  Observability: Prometheus (metrics) · Grafana (dashboards) ·  │  │
│  │  Loki (query logs)                                             │  │
│  └────────────────────────────────────────────────────────────────┘  │
└──────────────────────────────────────────────────────────────────────┘
```
Conclusion
Neo4j excels for relationship-heavy workloads at scale. Key takeaways:
- Production deployment: Kubernetes + Causal Cluster (3+ nodes)
- Performance: Size page cache to dataset, monitor hit rate
- Security: TLS + parameterized queries + RBAC
- Backups: Daily dumps, test recovery quarterly
- Observability: JMX metrics, slow query logs, health probes
For SRE teams: treat Neo4j like any production database: pin versions, test your backups, alert on replication lag, and assume failure will happen.