Troubleshooting Guide

This guide provides comprehensive troubleshooting information for the Neo4j Kubernetes Operator, covering both Neo4jEnterpriseCluster and Neo4jEnterpriseStandalone deployments.

Quick Reference

Diagnostic Commands

# Check deployment status
kubectl get neo4jenterprisecluster
kubectl get neo4jenterprisestandalone
kubectl get neo4jdatabase

# View detailed information
kubectl describe neo4jenterprisecluster <cluster-name>
kubectl describe neo4jenterprisestandalone <standalone-name>
kubectl describe neo4jdatabase <database-name>

# Check pod status
kubectl get pods -l app.kubernetes.io/name=neo4j
kubectl logs -l app.kubernetes.io/name=neo4j

# Check events
kubectl get events --sort-by=.metadata.creationTimestamp

# Check operator logs
kubectl logs -n neo4j-operator deployment/neo4j-operator-controller-manager

Common Port Forwarding Commands

# For clusters
kubectl port-forward svc/<cluster-name>-client 7474:7474 7687:7687

# For standalone deployments
kubectl port-forward svc/<standalone-name>-service 7474:7474 7687:7687

Common Issues and Solutions

1. Split-Brain Scenarios

Problem: Cluster nodes form multiple independent clusters

This is most common with TLS-enabled clusters where nodes fail to join during initial formation.

Quick Check:

# Check each node's view of the cluster
for i in 0 1 2; do
  kubectl exec <cluster>-server-$i -- cypher-shell -u neo4j -p <password> "SHOW SERVERS" | wc -l
done

Solution: See the comprehensive Split-Brain Recovery Guide or use the Quick Reference.

Quick Fix:

# Restart minority cluster nodes (orphaned pods)
kubectl delete pod <cluster>-server-1 <cluster>-server-2
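The quick check above can be scripted so disagreement is flagged automatically. A minimal sketch: the `counts` array stands in for the per-node `SHOW SERVERS` line counts that the loop would collect from a live cluster.

```shell
# Hardcoded for illustration: line counts of "SHOW SERVERS" from each node.
# In a healthy 3-node cluster all three counts match.
counts=(4 2 2)
expected=${counts[0]}
split=false
for c in "${counts[@]}"; do
  if [ "$c" -ne "$expected" ]; then
    split=true
  fi
done
if [ "$split" = true ]; then
  echo "SPLIT-BRAIN: nodes disagree on cluster membership"
else
  echo "OK: all nodes agree on cluster membership"
fi
```

A mismatch identifies the minority nodes to restart with the quick fix above.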

2. Test Environment Issues

Problem: Integration tests failing with namespace termination issues

Test namespaces get stuck in "Terminating" state due to resources with finalizers.

Solution: Ensure proper cleanup in test code:

// Always remove finalizers before deletion
if len(resource.GetFinalizers()) > 0 {
    resource.SetFinalizers([]string{})
    _ = k8sClient.Update(ctx, resource)
}
_ = k8sClient.Delete(ctx, resource)

Problem: Backup sidecar test timeout

Test waits for wrong readiness field on standalone deployments.

Solution: Check the correct status field:

// For standalone deployments
return standalone.Status.Ready  // NOT Status.Conditions

// Correct pod label selector
client.MatchingLabels{"app": standalone.Name}

Problem: Operator not deployed in test cluster

Integration tests fail because operator is not running.

Solution: Deploy operator before running tests:

kubectl config use-context kind-neo4j-operator-test
make operator-setup  # Deploy operator to cluster
make test-integration

Problem: CI Failures Due to Resource Constraints (Added 2025-08-22)

GitHub Actions CI often fails with "Unschedulable - 0/1 nodes are available: 1 Insufficient memory" when running integration tests.

Root Cause: CI environments have limited memory (~7GB total), but tests request 1Gi+ per Neo4j pod.

Solution - Use CI Workflow Emulation:

# Reproduce CI environment locally with debug logging
make test-ci-local

What CI Emulation Provides:

  • Identical Environment: Sets the CI=true and GITHUB_ACTIONS=true environment variables
  • Memory Constraints: Uses 512Mi memory limits (same as CI)
  • Debug Logging: Comprehensive logs saved to logs/ci-local-*.log
  • Complete Workflow: Unit tests → Integration tests → Cleanup
  • Troubleshooting: Auto-provided diagnostic commands on failure

Generated Debug Files:

  • logs/ci-local-unit.log - Unit test output with environment info
  • logs/ci-local-integration.log - Integration test output with cluster setup
  • logs/ci-local-cleanup.log - Environment cleanup output

Manual Resource Debugging:

# Check memory allocation in CI logs
grep -E "(memory|Memory|512Mi)" logs/ci-local-integration.log

# Check pod resource requests
kubectl describe pod <pod-name> | grep -A10 "Requests"

# Monitor real-time memory usage
kubectl top pod <pod-name> --containers

# Check for OOMKilled pods
kubectl get events | grep OOMKilled

Key Resource Requirements:

  • CI Environment: 512Mi memory limits per pod
  • Local Development: 1.5Gi memory limits per pod (Neo4j Enterprise minimum)
  • Automatic Detection: Tests use getCIAppropriateResourceRequirements() function

Prevention:

# Always test with CI constraints before pushing
make test-ci-local

# If CI emulation passes, CI should pass too
echo "✅ Ready for CI deployment"

3. Deployment Validation Errors

Problem: Single-Node Cluster Not Allowed

Error: Neo4jEnterpriseCluster requires minimum 2 servers for clustering. For single-node deployments, use Neo4jEnterpriseStandalone instead

Solution: Use the correct CRD for your deployment type:

For development/testing (single-node):

apiVersion: neo4j.neo4j.com/v1alpha1
kind: Neo4jEnterpriseStandalone
metadata:
  name: dev-neo4j
spec:
  image:
    repo: neo4j
    tag: "5.26-enterprise"
  storage:
    className: standard
    size: "10Gi"

For production (minimum cluster):

apiVersion: neo4j.neo4j.com/v1alpha1
kind: Neo4jEnterpriseCluster
metadata:
  name: prod-cluster
spec:
  topology:
    servers: 2  # Minimum required for clustering
  image:
    repo: neo4j
    tag: "5.26-enterprise"
  storage:
    className: standard
    size: "10Gi"

Problem: Invalid Neo4j Version

Error: Neo4j version 5.25.0 is not supported. Minimum required version is 5.26.0

Solution: Update to a supported version:

spec:
  image:
    tag: "5.26-enterprise"  # or later

Supported versions:

  • Semver: 5.26.0, 5.26.1, 5.27.0, 6.0.0+
  • Calver: 2025.01.0, 2025.06.1, 2026.01.0+
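As a pre-flight check, the version rule can be encoded in a small script before applying a manifest. This is a sketch, not operator code; `version_supported` is an illustrative helper that handles the semver and calver forms listed above.

```shell
# Returns "supported" if an image tag meets the 5.26.0 minimum.
# Calver tags (2025.x and later) always pass.
version_supported() {
  local tag="${1%-enterprise}"   # strip the -enterprise suffix if present
  local major="${tag%%.*}"
  local rest="${tag#*.}"
  local minor="${rest%%.*}"
  if [ "$major" -ge 2025 ]; then
    echo "supported"             # calver scheme
    return 0
  fi
  if [ "$major" -gt 5 ] || { [ "$major" -eq 5 ] && [ "$minor" -ge 26 ]; }; then
    echo "supported"
  else
    echo "unsupported: minimum is 5.26.0"
  fi
}

version_supported "5.25.0"           # unsupported
version_supported "5.26-enterprise"  # supported
version_supported "2025.01.0"        # supported
```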

4. Pod Startup Issues

Problem: Pods Stuck in Pending State

# Check pod events
kubectl describe pod <pod-name>

# Common causes:
# - Insufficient resources
# - Storage issues
# - Image pull issues

Solutions:

  1. Check Resource Availability:

    kubectl describe nodes
    kubectl get pv
  2. Verify Storage Class:

    kubectl get storageclass
    kubectl describe storageclass <storage-class-name>
  3. Check Image Pull:

    kubectl describe pod <pod-name> | grep -A 5 "Events:"

Problem: Pods Crashing (CrashLoopBackOff)

# Check pod logs
kubectl logs <pod-name> --previous

Common causes and solutions:

  1. Memory Issues:

    spec:
      resources:
        requests:
          memory: "2Gi"
        limits:
          memory: "4Gi"
  2. Configuration Issues:

    # Check ConfigMap
    kubectl get configmap <cluster-name>-config -o yaml
  3. License Issues:

    # Check license secret
    kubectl get secret <license-secret> -o yaml

5. Connectivity Issues

Problem: Cannot Connect to Neo4j

# Test connectivity
kubectl port-forward svc/<service-name> 7474:7474 7687:7687
curl http://localhost:7474

# Check service
kubectl get svc -l app.kubernetes.io/name=neo4j
kubectl describe svc <service-name>

Solutions:

  1. Check Service Configuration:

    # For clusters
    service: <cluster-name>-client
    
    # For standalone
    service: <standalone-name>-service
  2. Verify Network Policies:

    kubectl get networkpolicies
    kubectl describe networkpolicy <policy-name>
  3. Check TLS Configuration:

    # For TLS-enabled deployments
    kubectl get certificates
    kubectl describe certificate <cert-name>

6. Cluster-Specific Issues

Problem: Cluster Formation Fails

# Check cluster status
kubectl get neo4jenterprisecluster <cluster-name> -o yaml

# Check individual pod logs
kubectl logs <cluster-name>-server-0
kubectl logs <cluster-name>-server-1

Solutions:

  1. 🔧 CRITICAL FIX: V2_ONLY Discovery Configuration

    Issue: Neo4j 5.26+ and 2025.x use V2_ONLY discovery mode which disables the discovery port (6000) and only uses the cluster port (5000).

    Verification: Check that the operator is using the correct configuration:

    # Check ConfigMap for correct discovery configuration
    kubectl get configmap <cluster-name>-config -o yaml | grep -A 5 -B 5 "tcp-discovery"
    
    # Should show (Neo4j 5.26+):
    # dbms.kubernetes.discovery.v2.service_port_name=tcp-discovery
    # dbms.cluster.discovery.version=V2_ONLY
    
    # Should show (Neo4j 2025.x):
    # dbms.kubernetes.discovery.service_port_name=tcp-discovery
    # (V2_ONLY is default, not explicitly set)

    Fix: Ensure operator version includes the V2_ONLY discovery fix. If using older version, upgrade to latest.

  2. Verify Cluster Topology:

    # Ensure minimum topology requirements
    kubectl get neo4jenterprisecluster <cluster-name> -o jsonpath='{.spec.topology}'
  3. Check Inter-Pod Communication:

    # Test DNS resolution to headless service
    kubectl exec -it <pod-name> -- nslookup <cluster-name>-headless
    
    # Test cluster port connectivity (5000)
    kubectl exec -it <pod-name> -- timeout 2 bash -c "</dev/tcp/localhost/5000"
  4. Verify Discovery Labels:

    # Check that only headless service has clustering label
    kubectl get svc -l neo4j.com/cluster=<cluster-name> -o yaml | grep -A 3 -B 3 "neo4j.com/clustering"

Problem: Scaling Issues

# Check scaling validation
kubectl get events | grep -i scale

Solutions:

  1. Verify Minimum Topology:

    # Scaling cannot violate minimum requirements
    spec:
      topology:
        primaries: 1
        secondaries: 1  # Cannot scale below this
  2. Check Resource Limits:

    spec:
      resources:
        requests:
          cpu: "500m"
          memory: "2Gi"

7. Standalone-Specific Issues

Problem: Standalone Pod Won't Start

# Check standalone status
kubectl get neo4jenterprisestandalone <standalone-name> -o yaml

# Check pod events
kubectl describe pod <standalone-name>-0

Solutions:

  1. Check Standalone Configuration:

    # Uses unified clustering infrastructure (Neo4j 5.26+)
    # No manual configuration needed for single-node operation
  2. Verify Storage Configuration:

    spec:
      storage:
        className: standard
        size: "10Gi"

Problem: Migration from Cluster to Standalone

# Create backup first
kubectl apply -f backup.yaml

# Deploy standalone
kubectl apply -f standalone.yaml

# Restore data
kubectl apply -f restore.yaml

8. Performance Issues

Problem: Slow Query Performance

# Check resource usage
kubectl top pods
kubectl top nodes

# Check Neo4j metrics
kubectl port-forward svc/<service-name> 7474:7474
# Access http://localhost:7474/metrics

Solutions:

  1. Adjust Memory Settings:

    spec:
      config:
        server.memory.heap.initial_size: "2G"
        server.memory.heap.max_size: "4G"
        server.memory.pagecache.size: "2G"
  2. Enable Query Logging:

    spec:
      config:
        dbms.logs.query.enabled: "true"
        dbms.logs.query.threshold: "1s"
  3. Check Storage Performance:

    # Test storage I/O
    kubectl exec -it <pod-name> -- sh -c 'dd if=/dev/zero of=/data/test bs=1M count=1000 conv=fsync && rm -f /data/test'
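The heap and page-cache values in step 1 can be derived from the pod's memory limit. The sketch below uses a common rule of thumb (reserve roughly 1G for OS and JVM overhead, split the remainder between heap and page cache); it is not an official Neo4j sizing formula, and the 8G pod limit is an example.

```shell
# Example: derive memory settings from an 8G pod memory limit.
pod_memory_gb=8
overhead_gb=1                              # OS + JVM overhead (rule of thumb)
usable=$((pod_memory_gb - overhead_gb))
heap=$((usable / 2))
pagecache=$((usable - heap))
echo "server.memory.heap.max_size: ${heap}G"
echo "server.memory.pagecache.size: ${pagecache}G"
```

Always verify the resulting settings against the actual workload; graph-heavy read workloads often benefit from a larger page cache at the expense of heap.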

9. Storage Issues

Problem: PVC Issues

# Check PVC status
kubectl get pvc
kubectl describe pvc <pvc-name>

# Check storage class
kubectl get storageclass

Solutions:

  1. Verify Storage Class:

    spec:
      storage:
        className: fast-ssd  # Ensure this exists
        size: "50Gi"
  2. Check Node Storage:

    kubectl describe nodes
    df -h  # On nodes

Problem: Data Corruption

# Check Neo4j consistency
kubectl exec -it <pod-name> -- neo4j-admin database check neo4j

Solutions:

  1. Run Consistency Check:

    kubectl exec -it <pod-name> -- neo4j-admin database check neo4j
  2. Restore from Backup:

    kubectl apply -f restore-from-backup.yaml

10. Backup and Restore Issues

Problem: Backup failing with permission denied

Backup jobs fail with "permission denied" or "cannot exec into pod" errors.

Solution: The operator now automatically creates RBAC resources. If you're upgrading:

# Ensure operator has latest permissions
make install  # After cloning the repository

# Check operator has pods/exec and pods/log permissions
kubectl describe clusterrole neo4j-operator-manager-role | grep -E "pods/exec|pods/log"

Note: Starting with the latest version, the operator automatically creates:

  • Service accounts for backup jobs
  • Roles with pods/exec and pods/log permissions
  • Role bindings for secure backup execution

Problem: Backup path not found

Neo4j 5.26+ requires backup destination path to exist.

Solution: The operator's backup sidecar automatically creates paths. Check sidecar is running:

# Check backup sidecar is present
kubectl get pod <neo4j-pod> -o yaml | grep backup-sidecar

# Check sidecar logs
kubectl logs <neo4j-pod> -c backup-sidecar

11. Security Issues

Problem: Authentication Failures

# Check auth secret
kubectl get secret <auth-secret> -o yaml

# Check Neo4j auth logs
kubectl logs <pod-name> | grep -i auth

Solutions:

  1. Verify Admin Secret:

    apiVersion: v1
    kind: Secret
    metadata:
      name: neo4j-admin-secret
    data:
      username: bmVvNGo=  # base64 encoded
      password: cGFzc3dvcmQ=  # base64 encoded
  2. Check Password Policy:

    spec:
      auth:
        passwordPolicy:
          minLength: 8
          requireUppercase: true
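The base64 values in the Secret manifest above can be generated from the shell, or sidestepped entirely by letting kubectl encode them; the `neo4j-admin-secret` name matches the example manifest.

```shell
# Encode credentials exactly as they appear in the Secret manifest.
# Use printf rather than echo to avoid embedding a trailing newline.
username_b64=$(printf '%s' 'neo4j' | base64)
password_b64=$(printf '%s' 'password' | base64)
echo "username: $username_b64"
echo "password: $password_b64"

# Equivalent imperative creation (kubectl handles the encoding):
# kubectl create secret generic neo4j-admin-secret \
#   --from-literal=username=neo4j \
#   --from-literal=password=password
```

A common authentication failure is a stray newline in the encoded value, which `echo` (without `-n`) introduces silently.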

Problem: TLS Certificate Issues

# Check certificate status
kubectl get certificates
kubectl describe certificate <cert-name>

# Check cert-manager logs
kubectl logs -n cert-manager deployment/cert-manager

Solutions:

  1. Verify Issuer:

    spec:
      tls:
        mode: cert-manager
        issuerRef:
          name: ca-cluster-issuer
          kind: ClusterIssuer
  2. Check Certificate Details:

    kubectl get secret <tls-secret> -o yaml
  3. TLS Cluster Formation Issues:

    TLS-enabled clusters are prone to split-brain during initial formation. If you see partial cluster formation:

    # Check for split clusters
    kubectl exec <cluster>-server-0 -- cypher-shell -u neo4j -p <password> "SHOW SERVERS"
    kubectl exec <cluster>-server-1 -- cypher-shell -u neo4j -p <password> "SHOW SERVERS"

    Prevention:

    spec:
      config:
        # Increase discovery timeouts for TLS clusters
        dbms.cluster.discovery.v2.initial_timeout: "10s"
        dbms.cluster.discovery.v2.retry_timeout: "20s"
        # Note: Do NOT override dbms.cluster.raft.membership.join_timeout
        # The operator sets it to 10m which is optimal

    See Split-Brain Recovery Guide for detailed recovery procedures.

12. Database Creation Issues

Problem: Neo4jDatabase Creation Fails

# Check database status
kubectl get neo4jdatabase <database-name> -o yaml
kubectl describe neo4jdatabase <database-name>

# Check events specific to the database
kubectl get events --field-selector involvedObject.name=<database-name>

Common causes and solutions:

  1. Cluster Not Ready:

    # Error: Referenced cluster my-cluster not found
    # Solution: Ensure cluster exists and is ready
    spec:
      clusterRef: existing-cluster-name  # Must match actual cluster
  2. Topology Exceeds Cluster Capacity:

    # Error: database topology requires 5 servers but cluster only has 3 servers available
    # Solution: Adjust topology to fit cluster capacity
    spec:
      topology:
        primaries: 2     # Reduce from 3
        secondaries: 1   # Reduce from 2
  3. Invalid Configuration Conflicts:

    # Error: seedURI and initialData cannot be specified together
    # Solution: Choose one data source method
    spec:
      seedURI: "s3://my-backups/db.backup"
      # initialData: null  # Remove this section

Problem: Seed URI Database Creation Fails

# Check validation errors
kubectl describe neo4jdatabase <database-name>

# Check operator logs for seed URI specific errors
kubectl logs -n neo4j-operator deployment/neo4j-operator-controller-manager | grep -i seed

Common seed URI issues:

  1. Authentication Failures:

    # Check credentials secret exists
    kubectl get secret <credentials-secret> -o yaml
    
    # Verify required keys for your URI scheme
    # S3: AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY
    # GCS: GOOGLE_APPLICATION_CREDENTIALS
    # Azure: AZURE_STORAGE_ACCOUNT + (AZURE_STORAGE_KEY or AZURE_STORAGE_SAS_TOKEN)
  2. URI Access Issues:

    # Test URI access from a pod
    kubectl run test-pod --rm -it --image=amazon/aws-cli \
      -- aws s3 ls s3://my-bucket/backup.backup
    
    # For GCS
    kubectl run test-pod --rm -it --image=google/cloud-sdk:slim \
      -- gsutil ls gs://my-bucket/backup.backup
  3. Invalid URI Format:

    # Error: URI must specify a host
    # Bad: s3:///path/backup.backup
    # Good: s3://bucket-name/path/backup.backup
    
    # Error: URI must specify a path to the backup file
    # Bad: s3://bucket-name/
    # Good: s3://bucket-name/backup.backup
  4. Point-in-Time Recovery Issues:

    # Warning: Point-in-time recovery (restoreUntil) is only available with Neo4j 2025.x
    # Solution: Only use restoreUntil with Neo4j 2025.x clusters
    seedConfig:
      restoreUntil: "2025-01-15T10:30:00Z"  # Neo4j 2025.x only
  5. Performance Issues with Seed URI:

    # Warning: Using dump file format. For better performance with large databases, consider using Neo4j backup format (.backup) instead.
    # Solution: Use .backup format for large datasets
    seedURI: "s3://my-backups/database.backup"  # Instead of .dump
    
    # Optimize seed configuration for better performance
    seedConfig:
      config:
        compression: "lz4"      # Faster than gzip
        bufferSize: "256MB"     # Larger buffer for big files
        validation: "lenient"   # Skip intensive validation
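The URI format rules from item 3 can be checked before applying the manifest. This is a sketch that mirrors the validation errors described above; `validate_seed_uri` is an illustrative helper, not part of the operator.

```shell
# Validate that a seed URI has both a host (bucket) and a backup file path.
validate_seed_uri() {
  local uri="$1"
  local rest="${uri#*://}"      # strip the scheme, e.g. s3://
  local host="${rest%%/*}"      # everything up to the first slash
  local path="${rest#*/}"       # everything after the first slash
  if [ -z "$host" ]; then
    echo "INVALID: URI must specify a host"
    return 1
  fi
  if [ "$rest" = "$host" ] || [ -z "$path" ]; then
    echo "INVALID: URI must specify a path to the backup file"
    return 1
  fi
  echo "OK: $uri"
}

validate_seed_uri "s3:///path/backup.backup"        # missing host
validate_seed_uri "s3://bucket-name/"               # missing path
validate_seed_uri "s3://bucket-name/backup.backup"  # valid
```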

Problem: Database Stuck in Creating State

# Check database status conditions
kubectl get neo4jdatabase <database-name> -o jsonpath='{.status.conditions[*].message}'

# Monitor database creation progress
kubectl get events -w --field-selector involvedObject.name=<database-name>

Solutions:

  1. Check Cluster Connectivity:

    # Ensure operator can connect to Neo4j cluster
    kubectl logs -n neo4j-operator deployment/neo4j-operator-controller-manager | grep -i "connection failed"
  2. Large Backup Restoration:

    # Monitor restoration progress (seed URI databases)
    kubectl logs <cluster-pod> | grep -i "restore\|seed"
    
    # For large backups, restoration may take significant time
    # Ensure adequate pod resources
  3. Network Connectivity Issues:

    # For seed URI, test network access from Neo4j pods
    kubectl exec -it <cluster-pod> -- curl -I <your-backup-url>

Problem: Database Ready But No Data

# Connect to database and check
kubectl exec -it <cluster-pod> -- cypher-shell -u neo4j -p <password> -d <database-name> "MATCH (n) RETURN count(n)"

Solutions:

  1. Initial Data Not Applied:

    # Check if initial data import completed
    kubectl get neo4jdatabase <database-name> -o jsonpath='{.status.dataImported}'
    
    # Check for import errors in operator logs
    kubectl logs -n neo4j-operator deployment/neo4j-operator-controller-manager | grep -i "initial data\|import"
  2. Seed URI Data Not Restored:

    # Check if seed restoration completed
    kubectl get events --field-selector involvedObject.name=<database-name> | grep -i "DataSeeded"
    
    # Verify seed URI is accessible and contains data

Advanced Troubleshooting

Debug Mode

Enable debug logging in the operator:

kubectl patch deployment neo4j-operator-controller-manager \
  -n neo4j-operator \
  -p '{"spec":{"template":{"spec":{"containers":[{"name":"manager","args":["--zap-log-level=debug"]}]}}}}'

Resource Monitoring

Monitor resource usage:

# Watch resource usage
watch kubectl top pods
watch kubectl top nodes

# Check resource limits
kubectl describe limitrange
kubectl describe resourcequota

Network Debugging

Test network connectivity:

# DNS resolution
kubectl exec -it <pod-name> -- nslookup <service-name>

# Port connectivity
kubectl exec -it <pod-name> -- telnet <service-name> 7687

# Network policies
kubectl get networkpolicies --all-namespaces
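Since telnet is often missing from minimal container images, the port check can be done with bash alone via /dev/tcp, matching the technique used in the cluster section above. `check_port` is an illustrative helper.

```shell
# Bash-only TCP port check; no telnet or netcat required.
check_port() {
  local host="$1" port="$2"
  if timeout 2 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null; then
    echo "open: ${host}:${port}"
  else
    echo "closed: ${host}:${port}"
  fi
}

# Inside a pod, for example:
#   kubectl exec -it <pod-name> -- bash -c '... check_port <service-name> 7687'
check_port localhost 1   # port 1 is almost certainly closed
```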

Collecting Diagnostic Information

Use this script to collect comprehensive diagnostic information:

#!/bin/bash
# neo4j-debug.sh - Collect diagnostic information

echo "=== Neo4j Kubernetes Operator Diagnostic Report ==="
echo "Generated: $(date)"
echo

echo "=== Cluster Resources ==="
kubectl get neo4jenterprisecluster
echo

echo "=== Standalone Resources ==="
kubectl get neo4jenterprisestandalone
echo

echo "=== Pods ==="
kubectl get pods -l app.kubernetes.io/name=neo4j
echo

echo "=== Services ==="
kubectl get svc -l app.kubernetes.io/name=neo4j
echo

echo "=== PVCs ==="
kubectl get pvc -l app.kubernetes.io/name=neo4j
echo

echo "=== ConfigMaps ==="
kubectl get configmap -l app.kubernetes.io/name=neo4j
echo

echo "=== Secrets ==="
kubectl get secret -l app.kubernetes.io/name=neo4j
echo

echo "=== Recent Events ==="
kubectl get events --sort-by=.metadata.creationTimestamp | tail -20
echo

echo "=== Operator Logs (last 100 lines) ==="
kubectl logs -n neo4j-operator deployment/neo4j-operator-controller-manager --tail=100
echo

echo "=== Storage Classes ==="
kubectl get storageclass
echo

echo "=== Node Resources ==="
kubectl describe nodes | grep -A 5 "Allocated resources:"

Getting Help

When to Contact Support

Contact support when:

  • Data corruption is suspected
  • Cluster formation consistently fails
  • Performance is significantly degraded
  • Security incidents occur
  • Migration issues cannot be resolved

Always provide the diagnostic report and specific error messages when contacting support.