Backup & Restore Troubleshooting Guide

This guide covers common issues with Neo4j backup and restore operations when using the Neo4j Kubernetes Operator.

Overview

The Neo4j Kubernetes Operator provides comprehensive backup and restore capabilities including:

  • Automated backups with scheduling and retention policies
  • Point-in-Time Recovery (PITR) for Neo4j 2025.x
  • Multi-cloud storage support (S3, GCS, Azure Blob)
  • Backup sidecars automatically added to all pods
  • Automatic RBAC management for backup operations
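As a starting point, a minimal Neo4jBackup manifest might look like the sketch below. This is an illustration only: the API group is inferred from the Neo4jEnterpriseCluster and Neo4jRestore examples later in this guide, and the `cluster` reference field name is an assumption to verify against the operator's CRD reference.

```yaml
# Minimal scheduled backup sketch (field names inferred from examples
# later in this guide; verify against the Neo4jBackup CRD reference)
apiVersion: neo4j.neo4j.com/v1alpha1
kind: Neo4jBackup
metadata:
  name: production-backup
spec:
  cluster: production-cluster     # assumed cluster-reference field
  schedule: "0 2 * * *"           # daily at 2 AM
  storage:
    s3:
      bucket: "your-backup-bucket"
      region: "us-west-2"
```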

Common Backup Issues

Backup Job Failures

Symptom: Backup job fails to start

kubectl get jobs -l app.kubernetes.io/component=backup
# STATUS: Failed or no jobs created

Diagnosis:

# Check backup resource status
kubectl get neo4jbackup
kubectl describe neo4jbackup production-backup

# Check operator logs for backup controller errors
kubectl logs -n neo4j-operator-system deployment/neo4j-operator-controller-manager | grep -i backup

# Verify RBAC permissions
kubectl auth can-i create jobs --as=system:serviceaccount:default:production-cluster-backup

Common Causes & Solutions:

  1. Missing RBAC Permissions:

    # The operator automatically creates RBAC - check if it exists
    kubectl get serviceaccount production-cluster-backup
    kubectl get role production-cluster-backup-role
    kubectl get rolebinding production-cluster-backup-binding
    
    # If missing, trigger operator reconciliation
    kubectl annotate neo4jenterprisecluster production-cluster operator.neo4j.com/force-reconcile="$(date +%s)"
  2. Storage Configuration Issues:

    # Verify storage configuration in backup spec
    spec:
      storage:
        s3:
          bucket: "valid-bucket-name"    # Must exist
          region: "us-west-2"            # Correct region
          # Credentials must be valid
  3. Cluster Reference Problems:

    # Verify cluster exists and is ready
    kubectl get neo4jenterprisecluster production-cluster
    kubectl get pods -l neo4j.com/cluster=production-cluster

Symptom: Backup job starts but fails during execution

Diagnosis:

# Check backup job logs
kubectl logs job/production-backup-$(date +%Y%m%d)-001

# Check backup sidecar logs
kubectl logs production-cluster-server-0 -c backup-sidecar

# Check Neo4j server logs for backup-related errors
kubectl logs production-cluster-server-0 -c neo4j | grep -i backup

Common Solutions:

  1. Insufficient Disk Space:

    # Check available storage
    kubectl exec production-cluster-server-0 -c backup-sidecar -- df -h /backup-staging
    
    # Solution: Increase backup sidecar storage or cleanup old backups
  2. Database Lock Issues:

    # Check for long-running transactions (Neo4j 5+/2025.x syntax)
    kubectl exec production-cluster-server-0 -- cypher-shell -u neo4j -p password \
      "SHOW TRANSACTIONS YIELD transactionId, elapsedTime WHERE elapsedTime > duration('PT30S') RETURN transactionId, elapsedTime"
    
    # Solution: Wait for transactions to complete or consider using secondary for backup
  3. Memory Issues in Backup Process:

    # Increase backup sidecar resources
    spec:
      backups:
        sidecar:
          resources:
            requests:
              memory: "1Gi"      # Increase from default 512Mi
            limits:
              memory: "2Gi"      # Increase from default 1Gi
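When disk space is the suspect, it can help to compare the store size against free staging space before a backup runs. A small sketch, assuming both numbers are in KB (e.g. from `du -sk /data/databases` and `df -k /backup-staging`); the 20% headroom is a rule of thumb, not an operator requirement:

```shell
# Hypothetical helper: does the staging volume have room for a backup?
# Both arguments are sizes in KB; 20% headroom is a rule of thumb.
check_staging_space() {
  needed_kb="$1"
  free_kb="$2"
  if [ "$free_kb" -ge $(( needed_kb + needed_kb / 5 )) ]; then
    echo ok
  else
    echo insufficient
  fi
}

check_staging_space 1000 1500   # ok: 1500 KB free >= 1200 KB needed
```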

Cloud Storage Issues

S3 Backup Failures

Authentication Issues:

# Check AWS credentials
kubectl exec production-cluster-server-0 -c backup-sidecar -- aws sts get-caller-identity

# Test S3 access
kubectl exec production-cluster-server-0 -c backup-sidecar -- aws s3 ls s3://your-backup-bucket/

Solutions:

  1. IAM Role Issues:

    # Use IAM roles for service accounts (IRSA)
    spec:
      serviceAccount:
        name: production-cluster-backup
        annotations:
          eks.amazonaws.com/role-arn: "arn:aws:iam::123456789:role/Neo4jBackupRole"
  2. Bucket Policy Problems:

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Principal": {
            "AWS": "arn:aws:iam::123456789:role/Neo4jBackupRole"
          },
          "Action": [
            "s3:GetObject",
            "s3:PutObject",
            "s3:DeleteObject",
            "s3:ListBucket"
          ],
          "Resource": [
            "arn:aws:s3:::your-backup-bucket",
            "arn:aws:s3:::your-backup-bucket/*"
          ]
        }
      ]
    }

Google Cloud Storage Issues

Service Account Problems:

# Check GCP credentials
kubectl exec production-cluster-server-0 -c backup-sidecar -- gcloud auth list

# Test GCS access
kubectl exec production-cluster-server-0 -c backup-sidecar -- gsutil ls gs://your-backup-bucket/

Solutions:

# Use Workload Identity
spec:
  serviceAccount:
    name: production-cluster-backup
    annotations:
      iam.gke.io/gcp-service-account: "neo4j-backup@project.iam.gserviceaccount.com"

Azure Blob Storage Issues

Authentication Problems:

# Check Azure credentials
kubectl exec production-cluster-server-0 -c backup-sidecar -- az account show

# Test storage access
kubectl exec production-cluster-server-0 -c backup-sidecar -- az storage blob list --account-name storageaccount --container-name backups

Scheduled Backup Issues

Symptom: Scheduled backups not running

Diagnosis:

# Check CronJob status
kubectl get cronjob
kubectl describe cronjob production-backup-schedule

# Check backup schedule configuration
kubectl get neo4jbackup production-backup -o yaml | grep -A 10 schedule

Common Solutions:

  1. Invalid Cron Expression:

    # Correct cron syntax
    spec:
      schedule: "0 2 * * *"    # Daily at 2 AM
      # NOT: "0 2 * * * *"     # Invalid - too many fields
  2. Timezone Issues:

    spec:
      schedule: "0 2 * * *"
      timezone: "UTC"          # Explicitly set timezone
  3. Backup Window Conflicts:

    # Check for overlapping backup jobs
    kubectl get jobs -l app.kubernetes.io/component=backup --sort-by=.metadata.creationTimestamp
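Since Kubernetes CronJob schedules use exactly five fields, a quick field count catches the most common mistake before the manifest is applied. A minimal sketch (field count only; it does not validate ranges or step syntax):

```shell
# Quick sanity check: a Kubernetes CronJob schedule has exactly five fields.
# This only counts fields; it does not validate ranges or step expressions.
validate_cron() {
  # shellcheck disable=SC2086  # intentional word splitting
  set -- $1
  if [ "$#" -eq 5 ]; then echo valid; else echo invalid; fi
}

validate_cron "0 2 * * *"     # valid
validate_cron "0 2 * * * *"   # invalid: six fields (seconds are not supported)
```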

Common Restore Issues

Restore Job Failures

Symptom: Restore job fails to start

Diagnosis:

# Check restore resource status
kubectl get neo4jrestore
kubectl describe neo4jrestore production-restore

# Check operator logs
kubectl logs -n neo4j-operator-system deployment/neo4j-operator-controller-manager | grep -i restore

Common Solutions:

  1. Invalid Backup Reference:

    # Verify backup exists
    kubectl get neo4jbackup production-backup
    
    # Check backup completion status
    kubectl get neo4jbackup production-backup -o jsonpath='{.status.phase}'
  2. Target Cluster Issues:

    # Ensure target cluster is ready
    kubectl get neo4jenterprisecluster target-cluster
    kubectl get pods -l neo4j.com/cluster=target-cluster
  3. Storage Access Problems:

    # Test access to backup storage location
    kubectl exec target-cluster-server-0 -c backup-sidecar -- \
      aws s3 ls s3://backup-bucket/path/to/backup/

Symptom: Restore job fails during execution

Diagnosis:

# Check restore job logs
kubectl logs job/production-restore-$(date +%Y%m%d)

# Check target cluster logs during restore
kubectl logs target-cluster-server-0 | grep -i restore

Common Solutions:

  1. Insufficient Storage Space:

    # Check available space on target cluster
    kubectl exec target-cluster-server-0 -- df -h /data
    
    # Solution: Increase PVC size before restore
  2. Database Already Exists:

    # Use force option to overwrite
    spec:
      options:
        force: true
  3. Version Incompatibility:

    # Check Neo4j versions
    kubectl exec source-cluster-server-0 -- neo4j version
    kubectl exec target-cluster-server-0 -- neo4j version

Point-in-Time Recovery (PITR) Issues

Symptom: PITR restore fails with timestamp errors

Diagnosis:

# Check backup logs for transaction timestamps
kubectl logs job/production-backup-latest | grep -i "restore-until"

# Verify PITR capability
kubectl exec production-cluster-server-0 -- neo4j-admin database info system

Solutions:

  1. Invalid Timestamp Format:

    # Correct ISO 8601 format
    spec:
      restoreUntil: "2025-01-15T14:30:00Z"
      # NOT: "2025-01-15 14:30:00"
  2. Timestamp Outside Backup Range:

    # Check backup time range
    kubectl logs job/production-backup-20250115 | grep -E "(start|end).*time"
  3. Neo4j Version Compatibility:

    # PITR only available in Neo4j 2025.x
    spec:
      image:
        repository: "neo4j"
        tag: "2025.01.0-enterprise"
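A `restoreUntil` value can be shape-checked before applying the manifest. A sketch that matches only the exact `YYYY-MM-DDTHH:MM:SSZ` form shown above (a pattern check, not a full ISO 8601 parser):

```shell
# Shape check for restoreUntil values: pattern only, not a full ISO 8601 parser.
is_iso8601_utc() {
  case "$1" in
    [0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]T[0-9][0-9]:[0-9][0-9]:[0-9][0-9]Z)
      echo valid ;;
    *)
      echo invalid ;;
  esac
}

is_iso8601_utc "2025-01-15T14:30:00Z"   # valid
is_iso8601_utc "2025-01-15 14:30:00"    # invalid: space separator, no zone
```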

Backup Sidecar Issues

Sidecar Container Problems

Symptom: Backup sidecar fails to start

Diagnosis:

# Check sidecar status
kubectl get pods -l neo4j.com/cluster=production-cluster -o wide
kubectl describe pod production-cluster-server-0

# Check sidecar logs
kubectl logs production-cluster-server-0 -c backup-sidecar

Common Solutions:

  1. Resource Constraints:

    # Increase sidecar resources
    spec:
      backups:
        sidecar:
          resources:
            requests:
              memory: "512Mi"
              cpu: "200m"
            limits:
              memory: "1Gi"
              cpu: "500m"
  2. Storage Mount Issues:

    # Check volume mounts
    kubectl describe pod production-cluster-server-0 | grep -A 10 "Mounts:"
  3. Permission Problems:

    # Check file permissions
    kubectl exec production-cluster-server-0 -c backup-sidecar -- ls -la /backup-requests
    kubectl exec production-cluster-server-0 -c backup-sidecar -- id

Backup Request Processing Issues

Symptom: Backup requests not processed by sidecar

Diagnosis:

# Check backup request queue
kubectl exec production-cluster-server-0 -c backup-sidecar -- ls -la /backup-requests/

# Test manual backup request
kubectl exec production-cluster-server-0 -c backup-sidecar -- sh -c \
  'echo "{\"path\":\"/data/backups/manual-test\",\"type\":\"FULL\"}" > /backup-requests/test.request'

Solutions:

  1. Request Format Issues:

    // Correct format
    {
      "path": "/data/backups/test",
      "type": "FULL",
      "databases": ["neo4j", "system"]
    }
  2. Sidecar Communication Problems:

    # Check shared volume
    kubectl exec production-cluster-server-0 -c neo4j -- ls -la /backup-requests/
    kubectl exec production-cluster-server-0 -c backup-sidecar -- ls -la /backup-requests/
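When writing request files by hand, a write-then-rename keeps the sidecar from picking up a half-written file. A sketch, assuming (as in the examples above) that the sidecar watches a shared directory for `*.request` files:

```shell
# Sketch: write a backup request atomically. Assumes the sidecar watches a
# shared directory for *.request files, as in the examples above. Writing to
# a temp name and renaming prevents the watcher reading a half-written file.
write_backup_request() {
  dir="$1"; path="$2"; type="$3"
  tmp="$dir/.incoming.$$"
  printf '{"path":"%s","type":"%s"}\n' "$path" "$type" > "$tmp"
  mv "$tmp" "$dir/manual-$(date +%s).request"
}

# Usage (inside the sidecar container):
# write_backup_request /backup-requests /data/backups/manual-test FULL
```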

Performance Issues

Slow Backup Performance

Diagnosis:

# Monitor backup progress
kubectl logs job/production-backup-latest -f

# Check resource utilization during backup
kubectl top pod production-cluster-server-0

Optimization Strategies:

  1. Use Secondary Servers for Backup:

    spec:
      backupSource: "secondary"  # Backup from secondary to reduce primary load
  2. Parallel Backup Processing:

    spec:
      backups:
        parallelism: 2           # Multiple backup jobs can run simultaneously
  3. Storage Performance Tuning:

    # Use high-performance storage for backup staging
    spec:
      backups:
        sidecar:
          storage:
            className: "fast-ssd"
            size: "100Gi"
  4. Network Optimization:

    spec:
      config:
        # Increase memory available during backup (Neo4j 5+/2025.x setting names)
        server.memory.off_heap.max_size: "2g"
        server.memory.pagecache.size: "4g"

Slow Restore Performance

Optimization:

  1. Target Cluster Resources:

    spec:
      resources:
        requests:
          memory: "8Gi"
          cpu: "4"
        limits:
          memory: "16Gi"
          cpu: "8"
  2. Storage Configuration:

    spec:
      storage:
        className: "fast-ssd"
        size: "1Ti"

Monitoring and Alerting

Backup Health Monitoring

Prometheus Metrics:

# Monitor backup success rate
neo4j_backup_success_total
neo4j_backup_failure_total
neo4j_backup_duration_seconds

# Alert rules
groups:
- name: neo4j-backup
  rules:
  - alert: BackupFailure
    expr: increase(neo4j_backup_failure_total[24h]) > 0
    labels:
      severity: critical
    annotations:
      summary: "Neo4j backup failed"
      description: "Backup for cluster {{ $labels.cluster }} failed"

Log Monitoring:

# Monitor backup logs
kubectl logs -f job/production-backup-latest | grep -E "(ERROR|WARN|SUCCESS)"

# Set up log alerts
kubectl logs -f -n neo4j-operator-system deployment/neo4j-operator-controller-manager | \
  grep -i "backup.*failed" --line-buffered | \
  while read line; do
    echo "BACKUP ALERT: $line"
    # Send to alerting system
  done

Backup Validation

Automated Validation Script:

#!/bin/bash
# Validate backup completeness

BACKUP_NAME="production-backup"
NAMESPACE="default"

validate_backup() {
  local backup_status=$(kubectl get neo4jbackup $BACKUP_NAME -n $NAMESPACE -o jsonpath='{.status.phase}')

  if [ "$backup_status" != "Succeeded" ]; then
    echo "❌ Backup failed or incomplete: $backup_status"
    return 1
  fi

  # Check backup size (guard against an empty status field)
  local backup_size=$(kubectl get neo4jbackup $BACKUP_NAME -n $NAMESPACE -o jsonpath='{.status.backupSize}')
  if [ "${backup_size:-0}" -lt 1000000 ]; then  # Less than 1MB
    echo "⚠️  Backup size suspiciously small: ${backup_size:-unknown} bytes"
  fi

  echo "✅ Backup validation passed"
  return 0
}

# Run validation
validate_backup

Emergency Recovery Procedures

Complete Database Recovery

Scenario: Primary database corrupted, need complete restore

# 1. Create new cluster for restoration
kubectl apply -f - <<EOF
apiVersion: neo4j.neo4j.com/v1alpha1
kind: Neo4jEnterpriseCluster
metadata:
  name: recovery-cluster
spec:
  topology:
    servers: 3
  # Use same configuration as original cluster
  storage:
    className: "fast-ssd"
    size: "1Ti"
EOF

# 2. Wait for cluster to be ready
kubectl wait --for=condition=Ready neo4jenterprisecluster/recovery-cluster --timeout=600s

# 3. Restore from latest backup
kubectl apply -f - <<EOF
apiVersion: neo4j.neo4j.com/v1alpha1
kind: Neo4jRestore
metadata:
  name: emergency-restore
spec:
  targetCluster: recovery-cluster
  source:
    backupName: production-backup-latest
  databaseName: neo4j
  force: true
EOF

# 4. Monitor restore progress
kubectl logs -f job/emergency-restore

# 5. Verify data integrity
kubectl exec recovery-cluster-server-0 -- cypher-shell -u neo4j -p password \
  "MATCH (n) RETURN count(n) as total_nodes"

Point-in-Time Emergency Recovery

# Restore to specific point before corruption
kubectl apply -f - <<EOF
apiVersion: neo4j.neo4j.com/v1alpha1
kind: Neo4jRestore
metadata:
  name: pitr-emergency-restore
spec:
  targetCluster: recovery-cluster
  source:
    backupName: production-backup-latest
  databaseName: neo4j
  options:
    restoreUntil: "2025-01-15T10:30:00Z"  # Before corruption occurred
  force: true
EOF

Best Practices Summary

Backup Best Practices

  • Regular Testing: Test backup and restore procedures regularly
  • Multiple Storage Locations: Store backups in multiple locations/regions
  • Retention Policies: Implement appropriate retention policies
  • Monitoring: Set up comprehensive backup monitoring and alerting
  • Documentation: Document recovery procedures and test them
  • Security: Encrypt backups and use secure storage access
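As one illustration of a retention policy, a hedged sketch follows. The retention field names here are hypothetical, not the operator's confirmed schema; check the Neo4jBackup CRD reference for the actual fields.

```yaml
# Illustrative retention sketch -- field names are hypothetical,
# not the operator's confirmed schema; check the Neo4jBackup CRD reference.
spec:
  schedule: "0 2 * * *"
  retention:
    keepDaily: 7      # hypothetical: keep the last 7 daily backups
    keepWeekly: 4     # hypothetical: keep the last 4 weekly backups
```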

Restore Best Practices

  • Validation: Always validate restored data integrity
  • Staging Environment: Test restores in staging before production
  • Downtime Planning: Plan for service interruption during restore
  • Data Consistency: Ensure cluster consistency after restore
  • Application Testing: Test applications after database restore

Performance Best Practices

  • Resource Allocation: Adequate resources for backup/restore operations
  • Storage Performance: Use high-performance storage for operations
  • Network Optimization: Optimize network for data transfer
  • Scheduling: Schedule backups during low-activity periods
  • Parallel Operations: Use parallelism where possible

For additional help, see: