Backup and Restore Troubleshooting Guide

This guide covers common issues, diagnostic steps, and solutions for Neo4j backup and restore operations managed by the Kubernetes operator.

Note: Cluster deployments use a centralized {cluster}-backup pod (with a container named backup). References to the backup sidecar in this guide apply to standalone deployments only.

Prerequisites

Before troubleshooting, ensure you have:

  • Neo4j Enterprise cluster running version 5.26.0+ (semver) or 2025.01.0+ (calver)
  • Appropriate RBAC permissions for backup/restore operations
  • Access to cluster logs and events
  • Understanding of your storage backend configuration

Quick Diagnostic Commands

General Status Check

# Check backup resource status
kubectl get neo4jbackups
kubectl get neo4jrestores

# View detailed resource information
kubectl describe neo4jbackup <backup-name>
kubectl describe neo4jrestore <restore-name>

# Check events
kubectl get events --sort-by=.metadata.creationTimestamp

# View operator logs
kubectl logs -n neo4j-operator deployment/neo4j-operator-controller-manager

Job and Pod Status

# List backup/restore jobs
kubectl get jobs -l app.kubernetes.io/component=backup
kubectl get jobs -l app.kubernetes.io/component=restore

# Check job logs
kubectl logs job/<backup-name>-backup
kubectl logs job/<restore-name>-restore

# Check pod status and logs
kubectl get pods -l app.kubernetes.io/component=backup
kubectl logs <backup-pod-name>

Common Issues and Solutions

1. Version Compatibility Issues

Problem: Neo4j Version Not Supported

Error: Neo4j version 5.25.0 is not supported. Minimum required version is 5.26.0

Diagnosis:

# Check cluster image version
kubectl get neo4jenterprisecluster <cluster-name> -o jsonpath='{.spec.image.tag}'

# Check backup/restore resource events
kubectl describe neo4jbackup <backup-name>

Solutions:

  1. Update Neo4j Version:

    spec:
      image:
        tag: "5.26.0-enterprise"  # or later version
  2. Verify Supported Versions:

    • Semver: 5.26.0, 5.26.1 (5.26.x is the last semver LTS — no 5.27+ exists)
    • Calver: 2025.01.0, 2025.06.1, 2026.01.0+
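As a sketch of how this version gate behaves (illustrative only; the operator's actual validation code may differ), an image tag can be checked like this:

```python
import re

def is_supported(tag: str) -> bool:
    """Return True if a Neo4j image tag meets the operator's minimum version.

    Accepts semver (5.26.0+) or calver (2025.01.0+); suffixes such as
    '-enterprise' are ignored. Hypothetical helper, not the operator's code.
    """
    m = re.match(r"^(\d+)\.(\d+)\.(\d+)", tag)
    if not m:
        return False  # e.g. 'latest' has no parseable version
    major, minor, patch = (int(g) for g in m.groups())
    if major == 5:                # semver line: 5.26.x is the minimum (and last) LTS
        return (minor, patch) >= (26, 0)
    if major >= 2025:             # calver line: 2025.01.0 or later
        return (major, minor) >= (2025, 1)
    return False
```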

Problem: Invalid Version Format

Error: invalid Neo4j version format: latest. Expected semver (5.26+) or calver (2025.01+)

Solution: Use specific version tags instead of latest:

spec:
  image:
    tag: "5.26.0-enterprise"

2. Storage Backend Issues

Problem: S3 Access Denied

Error: AccessDenied: Access Denied

Diagnosis:

# Check AWS credentials
kubectl get secret aws-credentials -o yaml

# Verify IAM permissions
aws sts get-caller-identity
aws s3 ls s3://your-backup-bucket/

Solutions:

  1. Verify IAM Permissions:

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Action": [
            "s3:GetObject",
            "s3:PutObject",
            "s3:DeleteObject",
            "s3:ListBucket"
          ],
          "Resource": [
            "arn:aws:s3:::your-backup-bucket",
            "arn:aws:s3:::your-backup-bucket/*"
          ]
        }
      ]
    }
  2. Update Service Account Annotations:

    metadata:
      annotations:
        eks.amazonaws.com/role-arn: arn:aws:iam::ACCOUNT:role/neo4j-backup-role
  3. Check Secret Format:

    apiVersion: v1
    kind: Secret
    metadata:
      name: aws-credentials
    data:
      AWS_ACCESS_KEY_ID: <base64-key>
      AWS_SECRET_ACCESS_KEY: <base64-secret>
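A common pitfall with the .data format above is a stray trailing newline sneaking into the base64-encoded values. Encoding with printf avoids this, or you can let kubectl do the encoding (the create secret command is shown with placeholder values):

```shell
# Encode credential values without a trailing newline:
printf '%s' 'EXAMPLEKEY' | base64    # RVhBTVBMRUtFWQ==

# Or have kubectl base64-encode for you (values are placeholders):
# kubectl create secret generic aws-credentials \
#   --from-literal=AWS_ACCESS_KEY_ID=<access-key-id> \
#   --from-literal=AWS_SECRET_ACCESS_KEY=<secret-access-key>
```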

Problem: GCS Permission Denied

Error: 403 Forbidden: Permission denied

Solutions:

  1. Verify Service Account Key:

    # Check service account secret
    kubectl get secret gcs-credentials -o yaml
    
    # Test GCS access
    gsutil ls gs://your-backup-bucket/
  2. Required GCS Permissions:

    • storage.objects.create
    • storage.objects.delete
    • storage.objects.get
    • storage.objects.list
    • storage.buckets.get

Problem: Azure Storage Authentication Failed

Error: AuthenticationFailed: Server failed to authenticate the request

Solutions:

  1. Check Storage Account Key:

    apiVersion: v1
    kind: Secret
    metadata:
      name: azure-credentials
    data:
      AZURE_STORAGE_ACCOUNT: <base64-account-name>
      AZURE_STORAGE_KEY: <base64-storage-key>
  2. Verify Container Permissions:

    # Test Azure CLI access
    az storage blob list --container-name your-container --account-name your-account

Problem: PVC Storage Issues

Error: pod has unbound immediate PersistentVolumeClaims

Diagnosis:

# Check PVC status
kubectl get pvc
kubectl describe pvc <pvc-name>

# Check storage class
kubectl get storageclass

Solutions:

  1. Verify Storage Class:

    spec:
      storage:
        type: pvc
        pvc:
          name: backup-storage
          size: 100Gi
          storageClassName: fast-ssd  # Ensure this storage class exists
  2. Check Available Storage:

    # List nodes and storage
    kubectl describe nodes
    kubectl get pv

3. Backup Operation Issues

Problem: Backup Job Fails to Start

Status: Failed
Message: Failed to create backup job: pods "backup-job-xyz" is forbidden

Diagnosis:

# Check RBAC permissions for backup job service account
kubectl auth can-i create pods/exec --as=system:serviceaccount:<namespace>:neo4j-backup-sa

# Check service account
kubectl get serviceaccount neo4j-backup-sa -o yaml

Solutions:

  1. Verify RBAC:

    apiVersion: rbac.authorization.k8s.io/v1
    kind: Role
    metadata:
      name: neo4j-backup-role
    rules:
    - apiGroups: ["batch"]
      resources: ["jobs", "cronjobs"]
      verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
  2. Check Service Account Binding:

    apiVersion: rbac.authorization.k8s.io/v1
    kind: RoleBinding
    metadata:
      name: neo4j-backup-rolebinding
    subjects:
    - kind: ServiceAccount
      name: neo4j-backup-sa
      namespace: <namespace>
    roleRef:
      kind: Role
      name: neo4j-backup-role
      apiGroup: rbac.authorization.k8s.io

Problem: Backup Path Does Not Exist (Older Operator Versions)

Status: Failed
Message: org.neo4j.cli.CommandFailedException: Path '/data/backups/test-backup' does not exist

Note: This issue has been fixed in the latest operator version. The backup sidecar now automatically creates the backup path before executing the backup command.

If you encounter this with an older operator version:

Diagnosis (standalone only):

# Check operator version
kubectl get deployment -n neo4j-operator neo4j-operator-controller-manager -o jsonpath='{.spec.template.spec.containers[0].image}'

# Check backup sidecar logs
kubectl logs <neo4j-pod> -c backup-sidecar

Solutions:

  1. Upgrade to Latest Operator:

    • The latest operator version automatically creates backup paths
    • Neo4j 5.26+ and 2025.x+ require paths to exist
  2. Temporary Workaround (if upgrade not possible):

    # Manually create backup directory in pod
    kubectl exec <neo4j-pod> -c backup-sidecar -- mkdir -p /data/backups/<backup-name>
  3. Verify Fix in New Version:

    # Check that backup sidecar includes mkdir command
    kubectl get pod <neo4j-pod> -o jsonpath='{.spec.containers[?(@.name=="backup-sidecar")].command}' | grep "mkdir -p"

For clusters, check the centralized backup pod instead:

kubectl logs <cluster>-backup-0 -c backup
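The manual workaround above is safe to repeat: mkdir -p creates any missing parent directories and exits successfully even when the path already exists, so it can be re-applied before every backup. A local demonstration:

```shell
# mkdir -p is idempotent: missing parents are created, and re-running
# against an existing path still exits 0.
mkdir -p /tmp/data/backups/test-backup
mkdir -p /tmp/data/backups/test-backup   # second run also succeeds
ls -d /tmp/data/backups/test-backup
```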

Problem: Backup Times Out

Status: Failed
Message: Backup job timed out after 2h0m0s

Solutions:

  1. Increase Timeout:

    spec:
      timeout: "4h"  # Increase timeout for large databases
  2. Check Resource Limits:

    spec:
      options:
        additionalArgs:
          - "--parallel-recovery"
          - "--temp-path=/tmp/backup"
  3. Monitor Disk I/O:

    # Check node resources
    kubectl top nodes
    kubectl top pods
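When sizing spec.timeout, a useful rule of thumb is to set it comfortably above the longest observed backup duration. The Go-style duration strings in these messages ("4h", "2h0m0s") can be converted to seconds with a small helper (a hypothetical utility, not part of the operator):

```python
import re

def parse_duration(s: str) -> int:
    """Parse a Go-style duration string ('4h', '90m', '2h0m0s') to seconds.

    Illustrative helper for comparing a configured timeout against
    observed backup durations; not operator code.
    """
    units = {"h": 3600, "m": 60, "s": 1}
    parts = re.findall(r"(\d+)([hms])", s)
    # Reject strings with leftover characters (e.g. '4 hours', '2h30'):
    if not parts or "".join(n + u for n, u in parts) != s:
        raise ValueError(f"invalid duration: {s}")
    return sum(int(n) * units[u] for n, u in parts)
```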

Problem: Backup Verification Fails

Status: Failed
Message: Backup verification failed: inconsistent data detected

Solutions:

  1. Check Database Consistency:

    # Run the offline consistency checker (Neo4j 5+ admin CLI)
    neo4j-admin database check <database-name>
  2. Disable Verification Temporarily:

    spec:
      options:
        verify: false  # Disable for problematic databases
  3. Use Force Flag:

    spec:
      force: true

4. Restore Operation Issues

Problem: Target Cluster Not Ready

Status: Waiting
Message: Target cluster is not ready

Diagnosis:

# Check cluster status
kubectl get neo4jenterprisecluster <cluster-name>
kubectl describe neo4jenterprisecluster <cluster-name>

# Check pod status
kubectl get pods -l app.kubernetes.io/instance=<cluster-name>

Solutions:

  1. Wait for Cluster Readiness:

    # Monitor cluster status
    kubectl get neo4jenterprisecluster <cluster-name> -w
  2. Check Cluster Configuration:

    # Ensure cluster has proper resources
    spec:
      resources:
        requests:
          memory: "2Gi"
          cpu: "1"

Problem: Database Already Exists

Status: Failed
Message: database myapp already exists. Use replaceExisting option or force flag

Solutions:

  1. Use Replace Existing:

    spec:
      options:
        replaceExisting: true
  2. Use Force Flag:

    spec:
      force: true
  3. Drop Database First:

    // Connect to Neo4j and run
    DROP DATABASE myapp IF EXISTS

Problem: PITR Transaction Log Issues

Status: Failed
Message: transaction log validation failed: missing log segment

Diagnosis:

# Check transaction log storage
kubectl describe neo4jrestore <restore-name>

# Verify log storage accessibility
aws s3 ls s3://transaction-logs/production/logs/

Solutions:

  1. Check Log Retention:

    spec:
      source:
        pitr:
          logRetention: "14d"  # Increase retention period
  2. Disable Log Validation:

    spec:
      source:
        pitr:
          validateLogIntegrity: false
  3. Use Different Recovery Point:

    spec:
      source:
        pointInTime: "2025-01-04T10:00:00Z"  # Earlier time
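The interplay between logRetention and pointInTime above can be sketched as a window check (illustrative; field names follow the examples above, and the operator's real validation may differ):

```python
from datetime import datetime, timedelta, timezone

def point_in_retention_window(point_in_time: str, retention_days: int,
                              now: datetime) -> bool:
    """True if the PITR target still falls inside the retained
    transaction-log window, i.e. within retention_days before now."""
    target = datetime.fromisoformat(point_in_time.replace("Z", "+00:00"))
    return now - timedelta(days=retention_days) <= target <= now
```

A recovery point older than the retention window is a likely cause of "missing log segment" errors, which is why increasing retention or choosing a more recent point both help.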

5. Networking and Connectivity Issues

Problem: Cannot Connect to Neo4j During Restore

Status: Failed
Message: failed to create Neo4j client: connection refused

Diagnosis:

# Check Neo4j service
kubectl get svc -l app.kubernetes.io/instance=<cluster-name>

# Test connectivity
kubectl port-forward svc/<cluster-name>-client 7687:7687
cypher-shell -a bolt://localhost:7687 -u neo4j -p <password>

Solutions:

  1. Check Service Configuration:

    # Ensure service is properly exposed
    spec:
      services:
        neo4j:
          enabled: true
          type: ClusterIP
  2. Verify Network Policies:

    kubectl get networkpolicies
    kubectl describe networkpolicy <policy-name>
  3. Check Firewall Rules:

    # Ensure port 7687 is accessible
    telnet <cluster-ip> 7687

6. Resource and Performance Issues

Problem: Out of Memory During Backup

Status: Failed
Message: backup job killed due to memory limit

Solutions:

  1. Increase Job Resources:

    # Add to backup job template (requires operator modification)
    resources:
      requests:
        memory: "4Gi"
        cpu: "2"
      limits:
        memory: "8Gi"
        cpu: "4"
  2. Use Incremental Backup:

    spec:
      options:
        additionalArgs:
          - "--incremental"
  3. Optimize Backup Path:

    spec:
      options:
        additionalArgs:
          - "--temp-path=/tmp/backup"
          - "--parallel-recovery"

Problem: Slow Backup Performance

Status: Running (for an extended time)

Solutions:

  1. Enable Compression:

    spec:
      options:
        compress: true
  2. Use Parallel Processing:

    spec:
      options:
        additionalArgs:
          - "--parallel-recovery"
  3. Check Storage Performance:

    # Test storage I/O
    kubectl exec -it <backup-pod> -- dd if=/dev/zero of=/backup/test bs=1M count=1000

7. Hook Execution Issues

Problem: Pre-restore Hook Fails

Status: Failed
Message: Pre-restore hooks failed: hook job failed

Diagnosis:

# Check hook job status
kubectl get jobs -l app.kubernetes.io/component=pre-restore

# Check hook job logs
kubectl logs job/<restore-name>-pre-restore-hook

Solutions:

  1. Increase Hook Timeout:

    spec:
      options:
        preRestore:
          job:
            timeout: "30m"  # Increase timeout
  2. Fix Hook Script:

    spec:
      options:
        preRestore:
          job:
            template:
              container:
                command: ["/bin/sh"]
                args: ["-c", "set -e; /scripts/pre-restore.sh"]  # Add error handling
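The value of set -e in a hook script is that the first failing step propagates to the Job's exit code; without it, the script keeps running and Kubernetes records the hook as successful even though a step failed. A quick local demonstration:

```shell
# With 'set -e', the first failing command aborts the script and the
# hook Job pod exits non-zero, so the failure is reported:
sh -c 'set -e; echo "step 1 ok"; false; echo "never reached"' \
  || echo "with set -e, hook exit code: $?"

# Without it, execution continues past the failure and the last
# command's success masks it:
sh -c 'echo "step 1 ok"; false; echo "still ran"' \
  && echo "without set -e, hook exit code: $?"
```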

Problem: Cypher Hook Execution Fails

Status: Failed
Message: failed to execute Cypher statement: syntax error

Solutions:

  1. Validate Cypher Syntax:

    spec:
      options:
        postRestore:
          cypherStatements:
            - "CALL db.awaitIndexes(600)"  # Add timeout
            - "MATCH (n:User) WHERE n.created IS NULL SET n.created = datetime()"
  2. Check Database State:

    // Verify database is accessible
    CALL db.ping()

Advanced Troubleshooting

Debug Mode

Enable debug logging in the operator:

# Restart operator with debug logging
kubectl patch deployment neo4j-operator-controller-manager \
  -n neo4j-operator \
  -p '{"spec":{"template":{"spec":{"containers":[{"name":"manager","args":["--zap-log-level=debug"]}]}}}}'

Resource Monitoring

Monitor resource usage during operations:

# Watch resource usage
watch kubectl top pods
watch kubectl top nodes

# Monitor storage usage
kubectl exec -it <backup-pod> -- df -h

Network Debugging

Test network connectivity:

# DNS resolution
kubectl exec -it <backup-pod> -- nslookup <cluster-name>-client

# Port connectivity
kubectl exec -it <backup-pod> -- telnet <cluster-name>-client 7687

# Network policies
kubectl get networkpolicies --all-namespaces

Prevention and Best Practices

Monitoring Setup

  1. Set up Alerts:

    # Prometheus alert for backup failures
    - alert: BackupFailed
      expr: increase(neo4j_backup_failures_total[1h]) > 0
      for: 5m
      annotations:
        summary: "Neo4j backup failed"
  2. Regular Health Checks:

    # Weekly backup validation
    kubectl get neo4jbackups -o json | jq '.items[] | select(.status.phase != "Completed")'

Capacity Planning

  1. Storage Monitoring:

    # Monitor backup storage growth
    kubectl get pvc -o jsonpath='{.items[*].status.capacity.storage}'
  2. Performance Baselines:

    # Establish backup performance baselines
    kubectl get neo4jbackup -o jsonpath='{.items[*].status.stats.duration}'

Regular Testing

  1. Backup Validation:

    # Monthly restore tests
    kubectl apply -f test-restore.yaml
  2. Disaster Recovery Drills:

    # Quarterly DR tests
    kubectl apply -f disaster-recovery-test.yaml

Getting Help

Collecting Diagnostic Information

#!/bin/bash
# backup-restore-debug.sh - Collect diagnostic information

echo "=== Neo4j Backup/Restore Diagnostic Report ==="
echo "Generated: $(date)"
echo

echo "=== Cluster Information ==="
kubectl get neo4jenterpriseclusters
echo

echo "=== Backup Resources ==="
kubectl get neo4jbackups
echo

echo "=== Restore Resources ==="
kubectl get neo4jrestores
echo

echo "=== Recent Events ==="
kubectl get events --sort-by=.metadata.creationTimestamp | tail -20
echo

echo "=== Operator Logs (last 100 lines) ==="
kubectl logs -n neo4j-operator deployment/neo4j-operator-controller-manager --tail=100
echo

echo "=== Storage Classes ==="
kubectl get storageclass
echo

echo "=== PVCs ==="
kubectl get pvc

Support Resources

When to Contact Support

Contact support when:

  • Data corruption is suspected
  • Backup/restore operations consistently fail
  • Performance is significantly degraded
  • Security incidents occur
  • Complex PITR scenarios need assistance

Provide the diagnostic report and specific error messages when contacting support.