This guide provides comprehensive troubleshooting and recovery procedures for Neo4j cluster split-brain scenarios when using the Neo4j Kubernetes Operator with server-based architecture.
Split-brain occurs when Neo4j cluster servers lose communication and form separate, independent clusters instead of one unified cluster. This can lead to data inconsistencies and cluster instability if not properly detected and resolved.
The Neo4j Kubernetes Operator includes automatic split-brain detection and repair to prevent and resolve these issues proactively.
Split-brain happens when:
- Network partitions separate cluster servers
- Servers cannot communicate with each other
- Multiple independent "clusters" form within the same deployment
- Each partition believes it is the authoritative cluster
Common causes include:
- Network partitions between Kubernetes nodes
- Resource constraints causing pod communication failures
- DNS resolution issues preventing server discovery
- Storage problems affecting cluster state persistence
- Configuration errors in discovery or networking
The operator includes comprehensive split-brain detection that runs automatically during cluster health checks.
- Multi-Pod Analysis: Connects to each server pod individually
- Cluster View Comparison: Compares each server's view of cluster membership
- Inconsistency Detection: Identifies servers with conflicting cluster views
- Automatic Repair: Restarts orphaned pods to rejoin the main cluster
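The view comparison described above can be sketched in shell. This is an illustration, not the operator's actual implementation. It assumes each pod's server list has already been saved to one file per pod (a hypothetical layout in which the file name is the pod name), treats the most common view as the main cluster, and prints every pod whose view differs. The name `find_orphaned_pods` is invented for this example.

```shell
# Sketch only: find_orphaned_pods and the one-file-per-pod layout are assumptions.
# Usage: find_orphaned_pods DIR
#   DIR holds one file per pod; each file lists the server names that pod sees.
find_orphaned_pods() {
  local dir="$1" majority f
  # Normalise each view to a sorted, comma-joined fingerprint, then keep the
  # fingerprint that occurs most often (ties are broken arbitrarily).
  majority=$(for f in "$dir"/*; do sort "$f" | paste -sd, -; done \
    | sort | uniq -c | sort -rn | head -n 1 | sed 's/^ *[0-9]* //')
  # Print every pod whose fingerprint differs from the majority view.
  for f in "$dir"/*; do
    [ "$(sort "$f" | paste -sd, -)" = "$majority" ] || basename "$f"
  done
}
```

The files could be filled from `SHOW SERVERS YIELD name` output on each pod, with the header row removed. Because the views are sorted before comparison, two pods that list the same servers in a different order still count as agreeing.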
Monitor operator logs for split-brain detection:
# Check for split-brain detection logs
kubectl logs -n neo4j-operator-system deployment/neo4j-operator-controller-manager | grep -i "split.*brain"
# Expected detection logs:
# Starting split-brain detection for cluster production-cluster, expectedServers: 3
# Split-brain analysis results: isSplitBrain: true, orphanedPods: 1, repairAction: RestartPods
# Split-brain automatically repaired by restarting orphaned pods: [production-cluster-server-2]
The operator generates events for split-brain scenarios:
# Check for split-brain events
kubectl get events --field-selector reason=SplitBrainDetected
kubectl get events --field-selector reason=SplitBrainRepaired
# Example events:
# Warning SplitBrainDetected Neo4jEnterpriseCluster/production-cluster Split-brain detected: 1 orphaned servers
# Normal SplitBrainRepaired Neo4jEnterpriseCluster/production-cluster Split-brain repaired: restarted orphaned pods
- Check Server Status:
  # Connect to each server and check cluster membership
  for i in 0 1 2; do
    echo "=== Server $i ==="
    kubectl exec production-cluster-server-$i -- cypher-shell -u neo4j -p password \
      "SHOW SERVERS YIELD name, state, health ORDER BY name"
    echo
  done
- Compare Cluster Views: Look for inconsistencies in the server lists reported by different pods. In a healthy cluster, every server sees the same cluster membership.
- Check Database Allocation:
  # Verify database distribution consistency
  kubectl exec production-cluster-server-0 -- cypher-shell -u neo4j -p password \
    "SHOW DATABASES YIELD name, currentStatus, role, address"
Indicators of Split-Brain:
- Different server counts reported by different pods
- Inconsistent database allocations across servers
- Some servers showing as "offline" from others' perspectives
- Database creation failures with "insufficient servers" errors
- Application connection failures to some databases
The operator automatically repairs split-brain scenarios by:
- Detection: Identifying orphaned servers with inconsistent cluster views
- Analysis: Determining the main cluster and orphaned servers
- Restart: Gracefully restarting orphaned pods to rejoin the main cluster
- Verification: Confirming successful cluster reformation
No manual intervention is required; the operator performs these steps automatically.
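The analysis step needs a rule for deciding which partition counts as the main cluster. This guide does not state the rule the operator applies. One common convention, assumed here purely for illustration, is that the main cluster is the partition holding a strict majority of the expected servers. The helper name `has_majority` is hypothetical.

```shell
# Sketch only: has_majority is a hypothetical helper, not an operator API.
# Usage: has_majority PARTITION_SIZE TOTAL
# Succeeds when PARTITION_SIZE is a strict majority (more than half) of TOTAL.
has_majority() {
  local partition_size="$1" total="$2"
  [ "$partition_size" -gt $(( total / 2 )) ]
}
```

Under this convention, a three-server cluster split 2/1 has a clear main partition of two servers and one orphan. An even split such as 2/2 leaves neither side with a majority, which is one reason odd server counts are usually preferred.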
If automatic repair fails or you need to intervene manually:
# Check which partition has the majority of servers
kubectl exec production-cluster-server-0 -- cypher-shell -u neo4j -p password \
"SHOW SERVERS YIELD name, state ORDER BY name"
# Count active servers in each partition
kubectl exec production-cluster-server-1 -- cypher-shell -u neo4j -p password \
"SHOW SERVERS YIELD name, state ORDER BY name"
# Restart the server(s) that show inconsistent cluster views
kubectl delete pod production-cluster-server-2
# Wait for pod to restart and rejoin
kubectl wait --for=condition=Ready pod/production-cluster-server-2 --timeout=300s
# Confirm all servers show consistent cluster membership
kubectl exec production-cluster-server-0 -- cypher-shell -u neo4j -p password \
"SHOW SERVERS YIELD name, state, health ORDER BY name"
# Check database status
kubectl exec production-cluster-server-0 -- cypher-shell -u neo4j -p password \
"SHOW DATABASES"
If a standard restart doesn't work, use a cluster-wide restart:
# Delete all server pods simultaneously (data preserved in PVCs)
kubectl delete pods -l app.kubernetes.io/name=neo4j,neo4j.com/cluster=production-cluster
# Monitor cluster reformation
kubectl get pods -l app.kubernetes.io/name=neo4j -w
- Node Affinity Configuration:
  spec:
    topology:
      servers: 3
      placement:
        antiAffinity:
          enabled: true
          type: preferred  # Allow scheduling on same node if necessary
          topologyKey: kubernetes.io/hostname
- Multi-Zone Deployment:
  spec:
    topology:
      servers: 3
      placement:
        topologySpread:
          enabled: true
          topologyKey: topology.kubernetes.io/zone
          maxSkew: 1
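To check whether the placement settings actually spread the servers, it can help to count how many server pods landed on each node. The sketch below is a hypothetical helper (`max_pods_per_node` is a name invented here) that reads "POD NODE" pairs from standard input and prints the highest pod count on any single node.

```shell
# Sketch only: max_pods_per_node is a hypothetical helper.
# Reads "POD NODE" pairs on stdin and prints the highest pod count on any node.
# One way to produce the input (standard kubectl flags):
#   kubectl get pods -l app.kubernetes.io/name=neo4j --no-headers \
#     -o custom-columns=POD:.metadata.name,NODE:.spec.nodeName
max_pods_per_node() {
  awk '{ count[$2]++ }
       END { max = 0; for (n in count) if (count[n] > max) max = count[n]; print max }'
}
```

A result of 1 means no two servers share a node. With `type: preferred` a higher value is possible when the scheduler has no alternative, so a result above 1 is a prompt to look at node capacity rather than proof of misconfiguration.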
spec:
  resources:
    requests:
      memory: "4Gi"  # Adequate memory to prevent OOM
      cpu: "2"
    limits:
      memory: "8Gi"
      cpu: "4"

spec:
  config:
    # Optimize discovery timeouts
    dbms.kubernetes.discovery.v2.refresh_rate: "10s"
    dbms.cluster.discovery.resolution_timeout: "30s"
    # Cluster communication resilience (Neo4j 5.26+)
    dbms.cluster.raft.election_timeout: "7s"
    dbms.cluster.raft.leader_failure_detection_window: "30s"
Monitor these key metrics for early split-brain detection:
# Cluster health metrics
neo4j_cluster_servers_total
neo4j_cluster_servers_online
neo4j_database_allocation_inconsistency
# Alert rules
groups:
  - name: neo4j.split-brain
    rules:
      - alert: Neo4jSplitBrainDetected
        expr: neo4j_cluster_servers_online < neo4j_cluster_servers_total
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Neo4j cluster split-brain detected"
          description: "Cluster {{ $labels.cluster }} has only {{ $value }} online servers, fewer than its total server count"
Set up log monitoring for split-brain events:
# Alert on split-brain detection logs
kubectl logs -f -n neo4j-operator-system deployment/neo4j-operator-controller-manager | \
grep -E "(split.*brain|Split.*Brain)" --line-buffered | \
while IFS= read -r line; do
  echo "ALERT: $line"
  # Send to monitoring system
done
#!/bin/bash
# Automated cluster health check script
CLUSTER_NAME="production-cluster"
NAMESPACE="default"
check_cluster_health() {
  local expected_servers=3
  local consistent_views=0
  for i in $(seq 0 $((expected_servers-1))); do
    # Plain output starts with a header row, so skip it before counting
    local server_count=$(kubectl exec "${CLUSTER_NAME}-server-$i" -n "$NAMESPACE" -- \
      cypher-shell -u neo4j -p password --format plain \
      "SHOW SERVERS YIELD name" 2>/dev/null | tail -n +2 | wc -l)
    if [ "$server_count" -eq "$expected_servers" ]; then
      consistent_views=$((consistent_views + 1))
    fi
  done
  if [ "$consistent_views" -eq "$expected_servers" ]; then
    echo "✅ Cluster health: OK"
    return 0
  else
    echo "❌ Split-brain detected: $consistent_views/$expected_servers servers have consistent views"
    return 1
  fi
}
# Run health check
if ! check_cluster_health; then
  echo "🔄 Triggering operator reconciliation..."
  # --overwrite lets the annotation be updated on repeated runs
  kubectl annotate neo4jenterprisecluster "$CLUSTER_NAME" -n "$NAMESPACE" --overwrite \
    "operator.neo4j.com/force-reconcile=$(date +%s)"
fi
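A single failed health check can be transient, for example while a pod is restarting, so a script like the one above may trigger reconciliation more often than needed. One possible refinement, sketched here with a hypothetical helper named `failure_persists`, is to act only when the check fails on several consecutive attempts.

```shell
# Sketch only: failure_persists is a hypothetical helper.
# Usage: failure_persists ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Succeeds only if COMMAND fails on every attempt.
failure_persists() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 1  # the check passed at least once, so the failure is not persistent
    fi
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 0
}
```

With this in place, the final condition of the script could become `if failure_persists 3 10 check_cluster_health; then ...`, which waits for three failures ten seconds apart before annotating the cluster. The attempt count and delay are arbitrary example values.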
- Check Operator Logs:
  kubectl logs -n neo4j-operator-system deployment/neo4j-operator-controller-manager --tail=100
- Verify RBAC Permissions:
  kubectl auth can-i get pods --as=system:serviceaccount:neo4j-operator-system:neo4j-operator-controller-manager
  kubectl auth can-i create pods --subresource=exec --as=system:serviceaccount:neo4j-operator-system:neo4j-operator-controller-manager
- Check Neo4j Connectivity:
  # Test if operator can connect to Neo4j
  kubectl exec production-cluster-server-0 -- cypher-shell -u neo4j -p password "RETURN 'test'"
If the operator incorrectly identifies split-brain:
- Check Resource Constraints:
  kubectl describe pods -l app.kubernetes.io/name=neo4j
  kubectl top pods -l app.kubernetes.io/name=neo4j
- Verify Network Connectivity:
  # Test inter-pod communication
  kubectl exec production-cluster-server-0 -- nc -zv production-cluster-server-1 5000
- Review Configuration:
  kubectl get neo4jenterprisecluster production-cluster -o yaml | grep -A 20 "spec:"
If automatic recovery fails:
- Check Pod Status:
  kubectl get pods -l app.kubernetes.io/name=neo4j
  kubectl describe pod production-cluster-server-0
- Review Events:
  kubectl get events --sort-by=.metadata.creationTimestamp | tail -20
- Inspect Storage:
  kubectl get pvc -l app.kubernetes.io/name=neo4j
  kubectl describe pvc data-production-cluster-server-0
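When inspecting storage, the claims worth a closer look are usually those that are not in the `Bound` state. As a small sketch, the hypothetical helper `unbound_pvcs` below filters `kubectl get pvc --no-headers` output, where the claim name is the first column and its status the second.

```shell
# Sketch only: unbound_pvcs is a hypothetical helper.
# Reads `kubectl get pvc --no-headers` output on stdin (NAME in column 1,
# STATUS in column 2) and prints the names of claims that are not Bound.
unbound_pvcs() {
  awk '$2 != "Bound" { print $1 }'
}
```

For example, piping `kubectl get pvc -l app.kubernetes.io/name=neo4j --no-headers` into `unbound_pvcs` lists only the claims to pass to `kubectl describe pvc`. Empty output means every claim is bound.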
# 1. Scale down the cluster
kubectl patch neo4jenterprisecluster production-cluster --type='json' \
-p='[{"op": "replace", "path": "/spec/topology/servers", "value": 0}]'
# 2. Wait for pods to terminate
kubectl wait --for=delete pod -l app.kubernetes.io/name=neo4j --timeout=300s
# 3. Clean up cluster state (if necessary)
# Note: This may cause data loss - only do if cluster is completely corrupted
# kubectl delete pvc -l app.kubernetes.io/name=neo4j
# 4. Scale back up
kubectl patch neo4jenterprisecluster production-cluster --type='json' \
-p='[{"op": "replace", "path": "/spec/topology/servers", "value": 3}]'
# 5. Monitor recovery
kubectl get pods -l app.kubernetes.io/name=neo4j -w
If split-brain causes data corruption:
# 1. Create restoration cluster
kubectl apply -f - <<EOF
apiVersion: neo4j.neo4j.com/v1alpha1
kind: Neo4jEnterpriseCluster
metadata:
  name: recovery-cluster
spec:
  topology:
    servers: 3
  # ... same configuration as original cluster
EOF
# 2. Restore from backup
kubectl apply -f - <<EOF
apiVersion: neo4j.neo4j.com/v1alpha1
kind: Neo4jRestore
metadata:
  name: split-brain-recovery
spec:
  clusterRef: recovery-cluster
  backupRef: latest-backup
  options:
    force: true
EOF
# 3. Verify data integrity
kubectl exec recovery-cluster-server-0 -- cypher-shell -u neo4j -p password \
"MATCH (n) RETURN count(n) as node_count"
- Prevention:
  - Use adequate resource allocation
  - Deploy across multiple zones
  - Configure proper network policies
  - Monitor cluster health continuously
- Detection:
  - Rely on automatic split-brain detection
  - Set up monitoring and alerting
  - Run regular health checks
- Recovery:
  - Trust automatic repair mechanisms
  - Intervene manually only when necessary
  - Always verify cluster health after recovery
- Monitoring:
  - Monitor operator logs for split-brain events
  - Set up Kubernetes event alerting
  - Track cluster consistency metrics
For additional troubleshooting help, see: