This guide provides comprehensive troubleshooting information for the Neo4j Kubernetes Operator, covering both Neo4jEnterpriseCluster and Neo4jEnterpriseStandalone deployments.
# Check deployment status
kubectl get neo4jenterprisecluster
kubectl get neo4jenterprisestandalone
kubectl get neo4jdatabase
# View detailed information
kubectl describe neo4jenterprisecluster <cluster-name>
kubectl describe neo4jenterprisestandalone <standalone-name>
kubectl describe neo4jdatabase <database-name>
# Check pod status
# Clusters
kubectl get pods -l neo4j.com/cluster=<cluster-name>
kubectl logs -l neo4j.com/cluster=<cluster-name>
# Standalone
kubectl get pods -l app=<standalone-name>
kubectl logs -l app=<standalone-name>
# Check events
kubectl get events --sort-by=.metadata.creationTimestamp
# Check operator logs
kubectl logs -n neo4j-operator deployment/neo4j-operator-controller-manager

# For clusters
kubectl port-forward svc/<cluster-name>-client 7474:7474 7687:7687
# For standalone deployments
kubectl port-forward svc/<standalone-name>-service 7474:7474 7687:7687

This is most common with TLS-enabled clusters where nodes fail to join during initial formation.
Quick Check:
# Check each node's view of the cluster
for i in 0 1 2; do
kubectl exec <cluster>-server-$i -- cypher-shell -u neo4j -p <password> "SHOW SERVERS" | wc -l
done

Solution: See the comprehensive Split-Brain Recovery Guide or use the Quick Reference.
Quick Fix:
# Restart minority cluster nodes (orphaned pods)
kubectl delete pod <cluster>-server-1 <cluster>-server-2

Test namespaces get stuck in "Terminating" state due to resources with finalizers.
Solution: Ensure proper cleanup in test code:
// Always remove finalizers before deletion
if len(resource.GetFinalizers()) > 0 {
	resource.SetFinalizers([]string{})
	_ = k8sClient.Update(ctx, resource)
}
_ = k8sClient.Delete(ctx, resource)

The test waits on the wrong readiness field for standalone deployments.
Solution: Check the correct status field:
// For standalone deployments
return standalone.Status.Ready // NOT Status.Conditions
// Correct pod label selector
client.MatchingLabels{"app": standalone.Name}

Integration tests fail because the operator is not running.
Solution: Deploy operator before running tests:
kubectl config use-context kind-neo4j-operator-test
make operator-setup # Deploy operator to cluster
make test-integration

GitHub Actions CI often fails with "Unschedulable - 0/1 nodes are available: 1 Insufficient memory" when running integration tests.
Root Cause: CI environments have limited memory (~7GB total), but tests request 1Gi+ per Neo4j pod.
Solution - Use CI Workflow Emulation:
# Reproduce CI environment locally with debug logging
make test-ci-local

What CI Emulation Provides:
- Identical Environment: Sets CI=true and GITHUB_ACTIONS=true variables
- Memory Constraints: Uses 512Mi memory limits (same as CI)
- Debug Logging: Comprehensive logs saved to logs/ci-local-*.log
- Complete Workflow: Unit tests → Integration tests → Cleanup
- Troubleshooting: Auto-provided diagnostic commands on failure
Generated Debug Files:
- logs/ci-local-unit.log - Unit test output with environment info
- logs/ci-local-integration.log - Integration test output with cluster setup
- logs/ci-local-cleanup.log - Environment cleanup output
Manual Resource Debugging:
# Check memory allocation in CI logs
cat logs/ci-local-integration.log | grep -E "(memory|Memory|512Mi)"
# Check pod resource requests
kubectl describe pod <pod-name> | grep -A10 "Requests"
# Monitor real-time memory usage
kubectl top pod <pod-name> --containers
# Check for OOMKilled pods
kubectl get events | grep OOMKilled

Key Resource Requirements:
- CI Environment: 512Mi memory limits per pod
- Local Development: 1.5Gi memory limits per pod (Neo4j Enterprise minimum)
- Automatic Detection: Tests use the getCIAppropriateResourceRequirements() function (sketched below)
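The helper itself is not shown in this guide. As a rough, hypothetical sketch of how such a function could work (the request values and the CI check are illustrative assumptions; only the 512Mi CI and 1.5Gi local limits come from the list above):

```go
import (
	"os"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// getCIAppropriateResourceRequirements returns constrained memory settings when
// running in CI and the larger local-development defaults otherwise.
// Illustrative sketch only; the real helper in the test suite may differ.
func getCIAppropriateResourceRequirements() corev1.ResourceRequirements {
	if os.Getenv("CI") == "true" {
		return corev1.ResourceRequirements{
			Requests: corev1.ResourceList{corev1.ResourceMemory: resource.MustParse("256Mi")},
			Limits:   corev1.ResourceList{corev1.ResourceMemory: resource.MustParse("512Mi")},
		}
	}
	return corev1.ResourceRequirements{
		Requests: corev1.ResourceList{corev1.ResourceMemory: resource.MustParse("1Gi")},
		Limits:   corev1.ResourceList{corev1.ResourceMemory: resource.MustParse("1536Mi")},
	}
}
```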
Prevention:
# Always test with CI constraints before pushing
make test-ci-local
# If CI emulation passes, CI should pass too
echo "✅ Ready for CI deployment"Error: Neo4jEnterpriseCluster requires minimum 2 servers for clustering. For single-node deployments, use Neo4jEnterpriseStandalone instead
Solution: Use the correct CRD for your deployment type:
For development/testing (single-node):
apiVersion: neo4j.neo4j.com/v1alpha1
kind: Neo4jEnterpriseStandalone
metadata:
  name: dev-neo4j
spec:
  image:
    repo: neo4j
    tag: "5.26-enterprise"
  storage:
    className: standard
    size: "10Gi"

For production (minimum cluster):
apiVersion: neo4j.neo4j.com/v1alpha1
kind: Neo4jEnterpriseCluster
metadata:
  name: prod-cluster
spec:
  topology:
    servers: 2  # Minimum required for clustering
  image:
    repo: neo4j
    tag: "5.26-enterprise"
  storage:
    className: standard
    size: "10Gi"

Error: Neo4j version 5.25.0 is not supported. Minimum required version is 5.26.0
Solution: Update to a supported version:
spec:
  image:
    tag: "5.26-enterprise"  # or later

Supported versions:
- Semver: 5.26.0, 5.26.1 (5.26.x is the last semver LTS — no 5.27+ exists)
- Calver: 2025.01.0, 2025.06.1, 2026.01.0+
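To confirm which Neo4j version a running pod actually reports (as opposed to the tag in the CR spec), you can query the binary directly; a quick check, assuming the standard Neo4j image layout:

```bash
# Print the Neo4j version from inside the pod
kubectl exec -it <pod-name> -- neo4j --version
```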
# Check pod events
kubectl describe pod <pod-name>
# Common causes:
# - Insufficient resources
# - Storage issues
# - Image pull issues

Solutions:
- Check Resource Availability:
  kubectl describe nodes
  kubectl get pv
- Verify Storage Class:
  kubectl get storageclass
  kubectl describe storageclass <storage-class-name>
- Check Image Pull:
  kubectl describe pod <pod-name> | grep -A 5 "Events:"
# Check pod logs
kubectl logs <pod-name> --previous

Common causes and solutions:
- Memory Issues:
  spec:
    resources:
      requests:
        memory: "2Gi"
      limits:
        memory: "4Gi"
- Configuration Issues:
  # Check ConfigMap
  kubectl get configmap <cluster-name>-config -o yaml
- License Issues:
  # Check license secret
  kubectl get secret <license-secret> -o yaml
# Test connectivity
kubectl port-forward svc/<service-name> 7474:7474 7687:7687
curl http://localhost:7474
# Check service
kubectl get svc -l app.kubernetes.io/name=neo4j
kubectl describe svc <service-name>

Solutions:
- Check Service Configuration:
  # For clusters
  service: <cluster-name>-client
  # For standalone
  service: <standalone-name>-service
- Verify Network Policies:
  kubectl get networkpolicies
  kubectl describe networkpolicy <policy-name>
- Check TLS Configuration:
  # For TLS-enabled deployments
  kubectl get certificates
  kubectl describe certificate <cert-name>
# Check cluster status
kubectl get neo4jenterprisecluster <cluster-name> -o yaml
# Check individual pod logs
kubectl logs <cluster-name>-server-0
kubectl logs <cluster-name>-server-1

Solutions:
- 🔧 Verify LIST Discovery Configuration:
  The operator uses LIST discovery with static pod FQDNs (port 6000). Check the startup script in the cluster ConfigMap:
  kubectl get configmap <cluster-name>-config -o yaml | grep -A 3 "resolver_type"
  # Neo4j 5.26.x should show:
  # dbms.cluster.discovery.resolver_type=LIST
  # dbms.cluster.discovery.version=V2_ONLY
  # dbms.cluster.discovery.v2.endpoints=<cluster>-server-0.<cluster>-headless.<ns>.svc.cluster.local:6000,...
  # Neo4j 2025.x+ should show:
  # dbms.cluster.discovery.resolver_type=LIST
  # dbms.cluster.endpoints=<cluster>-server-0.<cluster>-headless.<ns>.svc.cluster.local:6000,...
  If K8S or wrong ports appear: upgrade to the latest operator version — this was fixed in favour of LIST discovery.
- Verify Cluster Topology:
  # Ensure minimum topology requirements
  kubectl get neo4jenterprisecluster <cluster-name> -o jsonpath='{.spec.topology}'
- Check Inter-Pod Communication:
  # Test DNS resolution to headless service
  kubectl exec -it <pod-name> -- nslookup <cluster-name>-headless
  # Test cluster port connectivity (5000 = discovery, 6000 = V2 tcp-tx)
  kubectl exec -it <pod-name> -- nc -zv localhost 5000
  kubectl exec -it <pod-name> -- nc -zv localhost 6000
- Verify Discovery Labels:
  # Check that only the discovery service has the clustering label
  kubectl get svc -l neo4j.com/cluster=<cluster-name> -o yaml | grep -A 3 -B 3 "neo4j.com/clustering"
#### Problem: Scaling Issues
```bash
# Check scaling validation
kubectl get events | grep -i scale
```

Solutions:
- Verify Minimum Topology:
  # Scaling cannot violate minimum requirements
  spec:
    topology:
      primaries: 1
      secondaries: 1  # Cannot scale below this
- Check Resource Limits:
  spec:
    resources:
      requests:
        cpu: "500m"
        memory: "2Gi"
# Check standalone status
kubectl get neo4jenterprisestandalone <standalone-name> -o yaml
# Check pod events
kubectl describe pod <standalone-name>-0

Solutions:
- Check Standalone Configuration:
  # Uses unified clustering infrastructure (Neo4j 5.26+)
  # No manual configuration needed for single-node operation
- Verify Storage Configuration:
  spec:
    storage:
      className: standard
      size: "10Gi"
# Create backup first
kubectl apply -f backup.yaml
# Deploy standalone
kubectl apply -f standalone.yaml
# Restore data
kubectl apply -f restore.yaml

# Check resource usage
kubectl top pods
kubectl top nodes
# Check Neo4j metrics
kubectl port-forward svc/<service-name> 7474:7474
# Access http://localhost:7474/metrics

Solutions:
- Adjust Memory Settings:
  spec:
    config:
      server.memory.heap.initial_size: "2G"
      server.memory.heap.max_size: "4G"
      server.memory.pagecache.size: "2G"
- Enable Query Logging:
  spec:
    config:
      dbms.logs.query.enabled: "true"
      dbms.logs.query.threshold: "1s"
- Check Storage Performance:
  # Test storage I/O
  kubectl exec -it <pod-name> -- dd if=/dev/zero of=/data/test bs=1M count=1000
# Check PVC status
kubectl get pvc
kubectl describe pvc <pvc-name>
# Check storage class
kubectl get storageclass

Solutions:
- Verify Storage Class:
  spec:
    storage:
      className: fast-ssd  # Ensure this exists
      size: "50Gi"
- Check Node Storage:
  kubectl describe nodes
  df -h  # On nodes
# Check Neo4j consistency
kubectl exec -it <pod-name> -- neo4j-admin database check neo4j

Solutions:
- Run Consistency Check:
  kubectl exec -it <pod-name> -- neo4j-admin database check neo4j
- Restore from Backup:
  kubectl apply -f restore-from-backup.yaml
Backup jobs fail with "permission denied" or "cannot exec into pod" errors.
Solution: The operator now automatically creates RBAC resources. If you're upgrading:
# Ensure operator has latest permissions
make install # After cloning the repository
# Check operator has pods/exec and pods/log permissions
kubectl describe clusterrole neo4j-operator-manager-role | grep -E "pods/exec|pods/log"

Note: Starting with the latest version, the operator automatically creates:
- Service accounts for backup jobs
- Roles with pods/exec and pods/log permissions (see the sketch below)
- Role bindings for secure backup execution
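For reference, a Role granting those backup permissions would look roughly like the following sketch (the name is illustrative; the operator generates and names its own resources):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: neo4j-backup-role  # illustrative name
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list"]
  - apiGroups: [""]
    resources: ["pods/exec"]
    verbs: ["create"]
  - apiGroups: [""]
    resources: ["pods/log"]
    verbs: ["get"]
```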
Neo4j 5.26+ requires the backup destination path to exist.
Solution: The operator's backup pod (clusters) or backup sidecar (standalone) creates these paths automatically. Check that the backup container is running:
# Cluster backup pod
kubectl get pod <cluster>-backup-0 -o yaml | grep backup
kubectl logs <cluster>-backup-0 -c backup
# Standalone backup sidecar
kubectl logs <neo4j-pod> -c backup-sidecar

# Check auth secret
kubectl get secret <auth-secret> -o yaml
# Check Neo4j auth logs
kubectl logs <pod-name> | grep -i auth

Solutions:
- Verify Admin Secret:
  apiVersion: v1
  kind: Secret
  metadata:
    name: neo4j-admin-secret
  data:
    username: bmVvNGo=      # base64 encoded
    password: cGFzc3dvcmQ=  # base64 encoded
- Check Password Policy:
  spec:
    auth:
      passwordPolicy:
        minLength: 8
        requireUppercase: true
# Check certificate status
kubectl get certificates
kubectl describe certificate <cert-name>
# Check cert-manager logs
kubectl logs -n cert-manager deployment/cert-manager

Solutions:
- Verify Issuer:
  spec:
    tls:
      mode: cert-manager
      issuerRef:
        name: ca-cluster-issuer
        kind: ClusterIssuer
- Check Certificate Details:
  kubectl get secret <tls-secret> -o yaml
- TLS Cluster Formation Issues:
  TLS-enabled clusters are prone to split-brain during initial formation. If you see partial cluster formation:
  # Check for split clusters
  kubectl exec <cluster>-server-0 -- cypher-shell -u neo4j -p <password> "SHOW SERVERS"
  kubectl exec <cluster>-server-1 -- cypher-shell -u neo4j -p <password> "SHOW SERVERS"
  Prevention:
  spec:
    config:
      # Increase discovery timeouts for TLS clusters
      dbms.cluster.discovery.v2.initial_timeout: "10s"
      dbms.cluster.discovery.v2.retry_timeout: "20s"
      # Note: Do NOT override dbms.cluster.raft.membership.join_timeout
      # The operator sets it to 10m, which is optimal
See Split-Brain Recovery Guide for detailed recovery procedures.
# Check database status
kubectl get neo4jdatabase <database-name> -o yaml
kubectl describe neo4jdatabase <database-name>
# Check events specific to the database
kubectl get events --field-selector involvedObject.name=<database-name>

Common causes and solutions:
- Cluster Not Ready:
  # Error: Referenced cluster my-cluster not found
  # Solution: Ensure cluster exists and is ready
  spec:
    clusterRef: existing-cluster-name  # Must match actual cluster
- Topology Exceeds Cluster Capacity:
  # Error: database topology requires 5 servers but cluster only has 3 servers available
  # Solution: Adjust topology to fit cluster capacity
  spec:
    topology:
      primaries: 2   # Reduce from 3
      secondaries: 1 # Reduce from 2
- Invalid Configuration Conflicts:
  # Error: seedURI and initialData cannot be specified together
  # Solution: Choose one data source method
  spec:
    seedURI: "s3://my-backups/db.backup"
    # initialData: null  # Remove this section
# Check validation errors
kubectl describe neo4jdatabase <database-name>
# Check operator logs for seed URI specific errors
kubectl logs -n neo4j-operator deployment/neo4j-operator-controller-manager | grep -i seed

Common seed URI issues:
- Authentication Failures:
  # Check credentials secret exists
  kubectl get secret <credentials-secret> -o yaml
  # Verify required keys for your URI scheme
  # S3: AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY
  # GCS: GOOGLE_APPLICATION_CREDENTIALS
  # Azure: AZURE_STORAGE_ACCOUNT + (AZURE_STORAGE_KEY or AZURE_STORAGE_SAS_TOKEN)
- URI Access Issues:
  # Test URI access from a pod
  kubectl run test-pod --rm -it --image=amazon/aws-cli \
    -- aws s3 ls s3://my-bucket/backup.backup
  # For GCS
  kubectl run test-pod --rm -it --image=google/cloud-sdk:slim \
    -- gsutil ls gs://my-bucket/backup.backup
- Invalid URI Format:
  # Error: URI must specify a host
  # Bad:  s3:///path/backup.backup
  # Good: s3://bucket-name/path/backup.backup
  # Error: URI must specify a path to the backup file
  # Bad:  s3://bucket-name/
  # Good: s3://bucket-name/backup.backup
- Point-in-Time Recovery Issues:
  # Warning: Point-in-time recovery (restoreUntil) is only available with Neo4j 2025.x
  # Solution: Only use restoreUntil with Neo4j 2025.x clusters
  seedConfig:
    restoreUntil: "2025-01-15T10:30:00Z"  # Neo4j 2025.x only
- Performance Issues with Seed URI:
  # Warning: Using dump file format. For better performance with large databases,
  # consider using Neo4j backup format (.backup) instead.
  # Solution: Use .backup format for large datasets
  seedURI: "s3://my-backups/database.backup"  # Instead of .dump
  # Optimize seed configuration for better performance
  seedConfig:
    config:
      compression: "lz4"     # Faster than gzip
      bufferSize: "256MB"    # Larger buffer for big files
      validation: "lenient"  # Skip intensive validation
# Check database status conditions
kubectl get neo4jdatabase <database-name> -o jsonpath='{.status.conditions[*].message}'
# Monitor database creation progress
kubectl get events -w --field-selector involvedObject.name=<database-name>

Solutions:
- Check Cluster Connectivity:
  # Ensure operator can connect to Neo4j cluster
  kubectl logs -n neo4j-operator deployment/neo4j-operator-controller-manager | grep -i "connection failed"
- Large Backup Restoration:
  # Monitor restoration progress (seed URI databases)
  kubectl logs <cluster-pod> | grep -i "restore\|seed"
  # For large backups, restoration may take significant time
  # Ensure adequate pod resources
- Network Connectivity Issues:
  # For seed URI, test network access from Neo4j pods
  kubectl exec -it <cluster-pod> -- curl -I <your-backup-url>
# Connect to database and check
kubectl exec -it <cluster-pod> -- cypher-shell -u neo4j -p <password> -d <database-name> "MATCH (n) RETURN count(n)"

Solutions:
- Initial Data Not Applied:
  # Check if initial data import completed
  kubectl get neo4jdatabase <database-name> -o jsonpath='{.status.dataImported}'
  # Check for import errors in operator logs
  kubectl logs -n neo4j-operator deployment/neo4j-operator-controller-manager | grep -i "initial data\|import"
- Seed URI Data Not Restored:
  # Check if seed restoration completed
  kubectl get events --field-selector involvedObject.name=<database-name> | grep -i "DataSeeded"
  # Verify seed URI is accessible and contains data
Enable debug logging in the operator:
kubectl patch deployment neo4j-operator-controller-manager \
-n neo4j-operator \
-p '{"spec":{"template":{"spec":{"containers":[{"name":"manager","args":["--zap-log-level=debug"]}]}}}}'Monitor resource usage:
# Watch resource usage
watch kubectl top pods
watch kubectl top nodes
# Check resource limits
kubectl describe limitrange
kubectl describe resourcequota

Test network connectivity:
# DNS resolution
kubectl exec -it <pod-name> -- nslookup <service-name>
# Port connectivity
kubectl exec -it <pod-name> -- telnet <service-name> 7687
# Network policies
kubectl get networkpolicies --all-namespaces

Use this script to collect comprehensive diagnostic information:
#!/bin/bash
# neo4j-debug.sh - Collect diagnostic information
echo "=== Neo4j Kubernetes Operator Diagnostic Report ==="
echo "Generated: $(date)"
echo
echo "=== Cluster Resources ==="
kubectl get neo4jenterprisecluster
echo
echo "=== Standalone Resources ==="
kubectl get neo4jenterprisestandalone
echo
echo "=== Cluster Pods ==="
kubectl get pods -l neo4j.com/cluster=<cluster-name>
echo
echo "=== Standalone Pods ==="
kubectl get pods -l app=<standalone-name>
echo
echo "=== Services ==="
kubectl get svc -l app.kubernetes.io/name=neo4j
echo
echo "=== PVCs ==="
kubectl get pvc
echo
echo "=== ConfigMaps ==="
kubectl get configmap
echo
echo "=== Secrets ==="
kubectl get secret
echo
echo "=== Recent Events ==="
kubectl get events --sort-by=.metadata.creationTimestamp | tail -20
echo
echo "=== Operator Logs (last 100 lines) ==="
kubectl logs -n neo4j-operator deployment/neo4j-operator-controller-manager --tail=100
echo
echo "=== Storage Classes ==="
kubectl get storageclass
echo
echo "=== Node Resources ==="
kubectl describe nodes | grep -A 5 "Allocated resources:"

- Documentation: User Guide
- API Reference: Neo4jEnterpriseCluster, Neo4jEnterpriseStandalone
- Migration Guide: Migration Guide
- Community: Neo4j Community Forum
- Issues: GitHub Issues
Contact support when:
- Data corruption is suspected
- Cluster formation consistently fails
- Performance is significantly degraded
- Security incidents occur
- Migration issues cannot be resolved
Always provide the diagnostic report and specific error messages when contacting support.