Backup and Restore Troubleshooting Guide

This guide covers common issues, diagnostic steps, and solutions for Neo4j backup and restore operations managed by the Kubernetes operator.

Note: Cluster deployments use a centralized {cluster}-backup pod (with a container named backup). References to the backup sidecar in this guide apply to standalone deployments only.

Prerequisites

Before troubleshooting, ensure you have:

  • Neo4j Enterprise cluster running version 5.26.0+ (semver) or 2025.01.0+ (calver)
  • Appropriate RBAC permissions for backup/restore operations
  • Access to cluster logs and events
  • Understanding of your storage backend configuration

Quick Diagnostic Commands

General Status Check

# Check backup resource status
kubectl get neo4jbackups
kubectl get neo4jrestores

# View detailed resource information
kubectl describe neo4jbackup <backup-name>
kubectl describe neo4jrestore <restore-name>

# Check events
kubectl get events --sort-by=.metadata.creationTimestamp

# View operator logs
kubectl logs -n neo4j-operator deployment/neo4j-operator-controller-manager

Job and Pod Status

# List backup/restore jobs
kubectl get jobs -l app.kubernetes.io/component=backup
kubectl get jobs -l app.kubernetes.io/component=restore

# Check job logs
kubectl logs job/<backup-name>-backup
kubectl logs job/<restore-name>-restore

# Check pod status and logs
kubectl get pods -l app.kubernetes.io/component=backup
kubectl logs <backup-pod-name>

Common Issues and Solutions

1. Version Compatibility Issues

Problem: Neo4j Version Not Supported

Error: Neo4j version 5.25.0 is not supported. Minimum required version is 5.26.0

Diagnosis:

# Check cluster image version
kubectl get neo4jenterprisecluster <cluster-name> -o jsonpath='{.spec.image.tag}'

# Check backup/restore resource events
kubectl describe neo4jbackup <backup-name>

Solutions:

  1. Update Neo4j Version:

    spec:
      image:
        tag: "5.26.0-enterprise"  # or later version
  2. Verify Supported Versions:

    • Semver: 5.26.0, 5.26.1 (5.26.x is the last semver LTS — no 5.27+ exists)
    • Calver: 2025.01.0, 2025.06.1, 2026.01.0+
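As a sketch of how this version gate behaves (illustrative only; the operator's actual validation code may differ), an image tag can be checked like this:

```python
import re

def is_supported(tag: str) -> bool:
    """Return True if a Neo4j image tag meets the operator's minimum version.

    Accepts semver (5.26.0+) or calver (2025.01.0+); suffixes such as
    '-enterprise' are ignored. Hypothetical helper, not the operator's code.
    """
    m = re.match(r"^(\d+)\.(\d+)\.(\d+)", tag)
    if not m:
        return False  # e.g. 'latest' has no parseable version
    major, minor, patch = (int(g) for g in m.groups())
    if major == 5:                # semver line: 5.26.x is the minimum (and last) LTS
        return (minor, patch) >= (26, 0)
    if major >= 2025:             # calver line: 2025.01.0 or later
        return (major, minor) >= (2025, 1)
    return False
```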

Problem: Invalid Version Format

Error: invalid Neo4j version format: latest. Expected semver (5.26+) or calver (2025.01+)

Solution: Use specific version tags instead of latest:

spec:
  image:
    tag: "5.26.0-enterprise"

2. Storage Backend Issues

Problem: S3 Access Denied

Error: AccessDenied: Access Denied

Diagnosis:

# Check AWS credentials
kubectl get secret aws-credentials -o yaml

# Verify IAM permissions
aws sts get-caller-identity
aws s3 ls s3://your-backup-bucket/

Solutions:

  1. Verify IAM Permissions:

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Action": [
            "s3:GetObject",
            "s3:PutObject",
            "s3:DeleteObject",
            "s3:ListBucket"
          ],
          "Resource": [
            "arn:aws:s3:::your-backup-bucket",
            "arn:aws:s3:::your-backup-bucket/*"
          ]
        }
      ]
    }
  2. Update Service Account Annotations:

    metadata:
      annotations:
        eks.amazonaws.com/role-arn: arn:aws:iam::ACCOUNT:role/neo4j-backup-role
  3. Check Secret Format:

    apiVersion: v1
    kind: Secret
    metadata:
      name: aws-credentials
    data:
      AWS_ACCESS_KEY_ID: <base64-key>
      AWS_SECRET_ACCESS_KEY: <base64-secret>
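A common pitfall with the .data format above is a stray trailing newline sneaking into the base64-encoded values. Encoding with printf avoids this, or you can let kubectl do the encoding (the create secret command is shown with placeholder values):

```shell
# Encode credential values without a trailing newline:
printf '%s' 'EXAMPLEKEY' | base64    # RVhBTVBMRUtFWQ==

# Or have kubectl base64-encode for you (values are placeholders):
# kubectl create secret generic aws-credentials \
#   --from-literal=AWS_ACCESS_KEY_ID=<access-key-id> \
#   --from-literal=AWS_SECRET_ACCESS_KEY=<secret-access-key>
```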

Problem: GCS Permission Denied

Error: 403 Forbidden: Permission denied

Solutions:

  1. Verify Service Account Key:

    # Check service account secret
    kubectl get secret gcs-credentials -o yaml
    
    # Test GCS access
    gsutil ls gs://your-backup-bucket/
  2. Required GCS Permissions:

    • storage.objects.create
    • storage.objects.delete
    • storage.objects.get
    • storage.objects.list
    • storage.buckets.get

Problem: Azure Storage Authentication Failed

Error: AuthenticationFailed: Server failed to authenticate the request

Solutions:

  1. Check Storage Account Key:

    apiVersion: v1
    kind: Secret
    metadata:
      name: azure-credentials
    data:
      AZURE_STORAGE_ACCOUNT: <base64-account-name>
      AZURE_STORAGE_KEY: <base64-storage-key>
  2. Verify Container Permissions:

    # Test Azure CLI access
    az storage blob list --container-name your-container --account-name your-account

Problem: PVC Storage Issues

Error: pod has unbound immediate PersistentVolumeClaims

Diagnosis:

# Check PVC status
kubectl get pvc
kubectl describe pvc <pvc-name>

# Check storage class
kubectl get storageclass

Solutions:

  1. Verify Storage Class:

    spec:
      storage:
        type: pvc
        pvc:
          name: backup-storage
          size: 100Gi
          storageClassName: fast-ssd  # Ensure this storage class exists
  2. Check Available Storage:

    # List nodes and storage
    kubectl describe nodes
    kubectl get pv

3. Backup Operation Issues

Problem: Backup Job Fails to Start

Status: Failed
Message: Failed to create backup job: pods "backup-job-xyz" is forbidden

Diagnosis:

# Check RBAC permissions for backup job service account
kubectl auth can-i create pods/exec --as=system:serviceaccount:<namespace>:neo4j-backup-sa

# Check service account
kubectl get serviceaccount neo4j-backup-sa -o yaml

Solutions:

  1. Verify RBAC:

    apiVersion: rbac.authorization.k8s.io/v1
    kind: Role
    metadata:
      name: neo4j-backup-role
    rules:
    - apiGroups: ["batch"]
      resources: ["jobs", "cronjobs"]
      verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
  2. Check Service Account Binding:

    apiVersion: rbac.authorization.k8s.io/v1
    kind: RoleBinding
    metadata:
      name: neo4j-backup-rolebinding
    subjects:
    - kind: ServiceAccount
      name: neo4j-backup-sa
      namespace: <namespace>
    roleRef:
      kind: Role
      name: neo4j-backup-role
      apiGroup: rbac.authorization.k8s.io

Problem: Backup Path Does Not Exist (Older Operator Versions)

Status: Failed
Message: org.neo4j.cli.CommandFailedException: Path '/data/backups/test-backup' does not exist

Note: This issue has been fixed in the latest operator version. The backup sidecar now automatically creates the backup path before executing the backup command.

If you encounter this with an older operator version:

Diagnosis (standalone only):

# Check operator version
kubectl get deployment -n neo4j-operator neo4j-operator-controller-manager -o jsonpath='{.spec.template.spec.containers[0].image}'

# Check backup sidecar logs
kubectl logs <neo4j-pod> -c backup-sidecar

Solutions:

  1. Upgrade to Latest Operator:

    • The latest operator version automatically creates backup paths
    • Neo4j 5.26+ and 2025.x+ require paths to exist
  2. Temporary Workaround (if upgrade not possible):

    # Manually create backup directory in pod
    kubectl exec <neo4j-pod> -c backup-sidecar -- mkdir -p /data/backups/<backup-name>
  3. Verify Fix in New Version:

    # Check that backup sidecar includes mkdir command
    kubectl get pod <neo4j-pod> -o jsonpath='{.spec.containers[?(@.name=="backup-sidecar")].command}' | grep "mkdir -p"

For clusters, check the centralized backup pod instead:

kubectl logs <cluster>-backup-0 -c backup
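The manual workaround above is safe to repeat: mkdir -p creates any missing parent directories and exits successfully even when the path already exists, so it can be re-applied before every backup. A local demonstration:

```shell
# mkdir -p is idempotent: missing parents are created, and re-running
# against an existing path still exits 0.
mkdir -p /tmp/data/backups/test-backup
mkdir -p /tmp/data/backups/test-backup   # second run also succeeds
ls -d /tmp/data/backups/test-backup
```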

Problem: Backup Times Out

Status: Failed
Message: Backup job timed out after 2h0m0s

Solutions:

  1. Increase Timeout:

    spec:
      timeout: "4h"  # Increase timeout for large databases
  2. Check Resource Limits:

    spec:
      options:
        additionalArgs:
          - "--parallel-recovery"
          - "--temp-path=/tmp/backup"
  3. Monitor Disk I/O:

    # Check node resources
    kubectl top nodes
    kubectl top pods
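When sizing spec.timeout, a useful rule of thumb is to set it comfortably above the longest observed backup duration. The Go-style duration strings in these messages ("4h", "2h0m0s") can be converted to seconds with a small helper (a hypothetical utility, not part of the operator):

```python
import re

def parse_duration(s: str) -> int:
    """Parse a Go-style duration string ('4h', '90m', '2h0m0s') to seconds.

    Illustrative helper for comparing a configured timeout against
    observed backup durations; not operator code.
    """
    units = {"h": 3600, "m": 60, "s": 1}
    parts = re.findall(r"(\d+)([hms])", s)
    # Reject strings with leftover characters (e.g. '4 hours', '2h30'):
    if not parts or "".join(n + u for n, u in parts) != s:
        raise ValueError(f"invalid duration: {s}")
    return sum(int(n) * units[u] for n, u in parts)
```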

Problem: Backup Verification Fails

Status: Failed
Message: Backup verification failed: inconsistent data detected

Solutions:

  1. Check Database Consistency:

    # Run the offline consistency checker (Neo4j 5+ admin CLI)
    neo4j-admin database check <database-name>
  2. Disable Verification Temporarily:

    spec:
      options:
        verify: false  # Disable for problematic databases
  3. Use Force Flag:

    spec:
      force: true

4. Restore Operation Issues

Problem: Target Cluster Not Ready

Status: Waiting
Message: Target cluster is not ready

Diagnosis:

# Check cluster status
kubectl get neo4jenterprisecluster <cluster-name>
kubectl describe neo4jenterprisecluster <cluster-name>

# Check pod status
kubectl get pods -l app.kubernetes.io/instance=<cluster-name>

Solutions:

  1. Wait for Cluster Readiness:

    # Monitor cluster status
    kubectl get neo4jenterprisecluster <cluster-name> -w
  2. Check Cluster Configuration:

    # Ensure cluster has proper resources
    spec:
      resources:
        requests:
          memory: "2Gi"
          cpu: "1"

Problem: Database Already Exists

Status: Failed
Message: database myapp already exists. Use replaceExisting option or force flag

Solutions:

  1. Use Replace Existing:

    spec:
      options:
        replaceExisting: true
  2. Use Force Flag:

    spec:
      force: true
  3. Drop Database First:

    // Connect to Neo4j and run
    DROP DATABASE myapp IF EXISTS

Problem: PITR Transaction Log Issues

Status: Failed
Message: transaction log validation failed: missing log segment

Diagnosis:

# Check transaction log storage
kubectl describe neo4jrestore <restore-name>

# Verify log storage accessibility
aws s3 ls s3://transaction-logs/production/logs/

Solutions:

  1. Check Log Retention:

    spec:
      source:
        pitr:
          logRetention: "14d"  # Increase retention period
  2. Disable Log Validation:

    spec:
      source:
        pitr:
          validateLogIntegrity: false
  3. Use Different Recovery Point:

    spec:
      source:
        pointInTime: "2025-01-04T10:00:00Z"  # Earlier time
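The interplay between logRetention and pointInTime above can be sketched as a window check (illustrative; field names follow the examples above, and the operator's real validation may differ):

```python
from datetime import datetime, timedelta, timezone

def point_in_retention_window(point_in_time: str, retention_days: int,
                              now: datetime) -> bool:
    """True if the PITR target still falls inside the retained
    transaction-log window, i.e. within retention_days before now."""
    target = datetime.fromisoformat(point_in_time.replace("Z", "+00:00"))
    return now - timedelta(days=retention_days) <= target <= now
```

A recovery point older than the retention window is a likely cause of "missing log segment" errors, which is why increasing retention or choosing a more recent point both help.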

5. Networking and Connectivity Issues

Problem: Cannot Connect to Neo4j During Restore

Status: Failed
Message: failed to create Neo4j client: connection refused

Diagnosis:

# Check Neo4j service
kubectl get svc -l app.kubernetes.io/instance=<cluster-name>

# Test connectivity
kubectl port-forward svc/<cluster-name>-client 7687:7687
cypher-shell -a bolt://localhost:7687 -u neo4j -p <password>

Solutions:

  1. Check Service Configuration:

    # Ensure service is properly exposed
    spec:
      services:
        neo4j:
          enabled: true
          type: ClusterIP
  2. Verify Network Policies:

    kubectl get networkpolicies
    kubectl describe networkpolicy <policy-name>
  3. Check Firewall Rules:

    # Ensure port 7687 is accessible
    telnet <cluster-ip> 7687

6. Resource and Performance Issues

Problem: Out of Memory During Backup

Status: Failed
Message: backup job killed due to memory limit

Solutions:

  1. Increase Job Resources:

    # Add to backup job template (requires operator modification)
    resources:
      requests:
        memory: "4Gi"
        cpu: "2"
      limits:
        memory: "8Gi"
        cpu: "4"
  2. Use Incremental Backup:

    spec:
      options:
        additionalArgs:
          - "--incremental"
  3. Optimize Backup Path:

    spec:
      options:
        additionalArgs:
          - "--temp-path=/tmp/backup"
          - "--parallel-recovery"

Problem: Slow Backup Performance

Status: Running (for an extended time)

Solutions:

  1. Enable Compression:

    spec:
      options:
        compress: true
  2. Use Parallel Processing:

    spec:
      options:
        additionalArgs:
          - "--parallel-recovery"
  3. Check Storage Performance:

    # Test storage I/O
    kubectl exec -it <backup-pod> -- dd if=/dev/zero of=/backup/test bs=1M count=1000

7. Hook Execution Issues

Problem: Pre-restore Hook Fails

Status: Failed
Message: Pre-restore hooks failed: hook job failed

Diagnosis:

# Check hook job status
kubectl get jobs -l app.kubernetes.io/component=pre-restore

# Check hook job logs
kubectl logs job/<restore-name>-pre-restore-hook

Solutions:

  1. Increase Hook Timeout:

    spec:
      options:
        preRestore:
          job:
            timeout: "30m"  # Increase timeout
  2. Fix Hook Script:

    spec:
      options:
        preRestore:
          job:
            template:
              container:
                command: ["/bin/sh"]
                args: ["-c", "set -e; /scripts/pre-restore.sh"]  # Add error handling
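The value of set -e in a hook script is that the first failing step propagates to the Job's exit code; without it, the script keeps running and Kubernetes records the hook as successful even though a step failed. A quick local demonstration:

```shell
# With 'set -e', the first failing command aborts the script and the
# hook Job pod exits non-zero, so the failure is reported:
sh -c 'set -e; echo "step 1 ok"; false; echo "never reached"' \
  || echo "with set -e, hook exit code: $?"

# Without it, execution continues past the failure and the last
# command's success masks it:
sh -c 'echo "step 1 ok"; false; echo "still ran"' \
  && echo "without set -e, hook exit code: $?"
```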

Problem: Cypher Hook Execution Fails

Status: Failed
Message: failed to execute Cypher statement: syntax error

Solutions:

  1. Validate Cypher Syntax:

    spec:
      options:
        postRestore:
          cypherStatements:
            - "CALL db.awaitIndexes(600)"  # Add timeout
            - "MATCH (n:User) WHERE n.created IS NULL SET n.created = datetime()"
  2. Check Database State:

    // Verify database is accessible
    CALL db.ping()

Advanced Troubleshooting

Debug Mode

Enable debug logging in the operator:

# Restart operator with debug logging
kubectl patch deployment neo4j-operator-controller-manager \
  -n neo4j-operator \
  -p '{"spec":{"template":{"spec":{"containers":[{"name":"manager","args":["--zap-log-level=debug"]}]}}}}'

Resource Monitoring

Monitor resource usage during operations:

# Watch resource usage
watch kubectl top pods
watch kubectl top nodes

# Monitor storage usage
kubectl exec -it <backup-pod> -- df -h

Network Debugging

Test network connectivity:

# DNS resolution
kubectl exec -it <backup-pod> -- nslookup <cluster-name>-client

# Port connectivity
kubectl exec -it <backup-pod> -- telnet <cluster-name>-client 7687

# Network policies
kubectl get networkpolicies --all-namespaces

Prevention and Best Practices

Monitoring Setup

  1. Set up Alerts:

    # Prometheus alert for backup failures
    - alert: BackupFailed
      expr: increase(neo4j_backup_failures_total[1h]) > 0
      for: 5m
      annotations:
        summary: "Neo4j backup failed"
  2. Regular Health Checks:

    # Weekly backup validation
    kubectl get neo4jbackups -o json | jq '.items[] | select(.status.phase != "Completed")'

Capacity Planning

  1. Storage Monitoring:

    # Monitor backup storage growth
    kubectl get pvc -o jsonpath='{.items[*].status.capacity.storage}'
  2. Performance Baselines:

    # Establish backup performance baselines
    kubectl get neo4jbackup -o jsonpath='{.items[*].status.stats.duration}'

Regular Testing

  1. Backup Validation:

    # Monthly restore tests
    kubectl apply -f test-restore.yaml
  2. Disaster Recovery Drills:

    # Quarterly DR tests
    kubectl apply -f disaster-recovery-test.yaml

Getting Help

Collecting Diagnostic Information

#!/bin/bash
# backup-restore-debug.sh - Collect diagnostic information

echo "=== Neo4j Backup/Restore Diagnostic Report ==="
echo "Generated: $(date)"
echo

echo "=== Cluster Information ==="
kubectl get neo4jenterpriseclusters
echo

echo "=== Backup Resources ==="
kubectl get neo4jbackups
echo

echo "=== Restore Resources ==="
kubectl get neo4jrestores
echo

echo "=== Recent Events ==="
kubectl get events --sort-by=.metadata.creationTimestamp | tail -20
echo

echo "=== Operator Logs (last 100 lines) ==="
kubectl logs -n neo4j-operator deployment/neo4j-operator-controller-manager --tail=100
echo

echo "=== Storage Classes ==="
kubectl get storageclass
echo

echo "=== PVCs ==="
kubectl get pvc

Support Resources

When to Contact Support

Contact support when:

  • Data corruption is suspected
  • Backup/restore operations consistently fail
  • Performance is significantly degraded
  • Security incidents occur
  • Complex PITR scenarios need assistance

Provide the diagnostic report and specific error messages when contacting support.