
Backup should immediately fail when nodeAgent pods are not running #9698

@Joeavaikath

Description

What steps did you take and what happened:

When a user creates a DPA with defaultSnapshotMoveData: true but doesn't enable NodeAgent, backups hang in WaitingForPluginOperations for hours until timeout. The same issue exists for defaultVolumesToFSBackup. Both features require the NodeAgent DaemonSet to be running, but the DPA validator doesn't enforce this — the misconfiguration is only caught at backup time (and only for FSB; DataMover just silently hangs).
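A validator-side guard could catch this at DPA reconcile time instead of at backup time. The sketch below uses illustrative stand-in types, not the actual OADP API structs; it only shows the shape of the check being requested:

```go
package main

import (
	"errors"
	"fmt"
)

// DPAConfig is an illustrative stand-in for the relevant DPA spec
// fields; the real OADP types differ.
type DPAConfig struct {
	DefaultSnapshotMoveData  bool
	DefaultVolumesToFSBackup bool
	NodeAgentEnabled         bool
}

// validateNodeAgent rejects a DPA that enables a feature requiring
// the NodeAgent DaemonSet without enabling the NodeAgent itself.
func validateNodeAgent(c DPAConfig) error {
	if (c.DefaultSnapshotMoveData || c.DefaultVolumesToFSBackup) && !c.NodeAgentEnabled {
		return errors.New("defaultSnapshotMoveData and defaultVolumesToFSBackup require the NodeAgent to be enabled")
	}
	return nil
}

func main() {
	// Reproduces the misconfiguration described above: DataMover on,
	// NodeAgent off. The validator flags it immediately.
	err := validateNodeAgent(DPAConfig{DefaultSnapshotMoveData: true})
	fmt.Println(err)
}
```

With a check like this, the DPA would be marked not-reconciled up front rather than letting backups hang.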

What did you expect to happen:

The backup fails fast if the NodeAgent is not running.

Steps to Reproduce:

  1. Create a DPA with CSI enabled. 

  2. Deploy a stateful application

  3. Create a backup with the snapshotMoveData flag set to true:

apiVersion: velero.io/v1
kind: Backup
metadata:
  name: test-backup3
  labels:
    velero.io/storage-location: default
  namespace: openshift-adp
spec:
  includedNamespaces:
  - ocp-mysql
  storageLocation: ts-dpa-1
  snapshotMoveData: true

Actual results:

The backup gets stuck in the WaitingForPluginOperations phase until it hits the timeout.

$ oc get backup test-backup3 -o yaml
apiVersion: velero.io/v1
kind: Backup
metadata:
  annotations:
    velero.io/resource-timeout: 10m0s
    velero.io/source-cluster-k8s-gitversion: v1.27.6+98158f9
    velero.io/source-cluster-k8s-major-version: "1"
    velero.io/source-cluster-k8s-minor-version: "27"
  creationTimestamp: "2023-10-18T13:21:08Z"
  generation: 5
  labels:
    velero.io/storage-location: ts-dpa-1
  name: test-backup3
  namespace: openshift-adp
  resourceVersion: "212060"
  uid: 1f252fb5-eb00-4efb-b576-12c6f7169a92
spec:
  csiSnapshotTimeout: 10m0s
  defaultVolumesToFsBackup: false
  includedNamespaces:
  - ocp-mysql
  itemOperationTimeout: 4h0m0s
  snapshotMoveData: true
  storageLocation: ts-dpa-1
  ttl: 720h0m0s
status:
  backupItemOperationsAttempted: 2
  expiration: "2023-11-17T13:21:08Z"
  formatVersion: 1.1.0
  phase: WaitingForPluginOperations
  progress:
    itemsBackedUp: 31
    totalItems: 31
  startTimestamp: "2023-10-18T13:21:09Z"
  version: 1

 

Expected results:

The backup should fail immediately when the nodeAgent pods are not running.
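A backup-time fail-fast check could compare the node-agent DaemonSet's desired and ready pod counts before entering any wait loop. This is a sketch over plain integers; a real implementation would read these values from appsv1.DaemonSetStatus via the controller's Kubernetes client:

```go
package main

import "fmt"

// nodeAgentReady reports whether the node-agent DaemonSet has all
// desired pods ready. desired == 0 means the DaemonSet is absent
// (or scheduled on no nodes), which should also fail the backup
// rather than letting it wait on plugin operations forever.
func nodeAgentReady(desired, ready int32) bool {
	return desired > 0 && ready == desired
}

func main() {
	// The scenario from this issue: NodeAgent never deployed.
	if !nodeAgentReady(0, 0) {
		fmt.Println("failing backup: node-agent pods are not running")
	}
}
```

Running this check once when a backup with snapshotMoveData or defaultVolumesToFsBackup starts would turn the multi-hour hang into an immediate, actionable failure.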

The following information will help us better understand what's going on:

If you are using velero v1.7.0+:
Please use velero debug --backup <backupname> --restore <restorename> to generate the support bundle and attach it to this issue. For more options, refer to velero debug --help.

If you are using earlier versions:
Please provide the output of the following commands (Pasting long output into a GitHub gist or other pastebin is fine.)

  • kubectl logs deployment/velero -n velero
  • velero backup describe <backupname> or kubectl get backup/<backupname> -n velero -o yaml
  • velero backup logs <backupname>
  • velero restore describe <restorename> or kubectl get restore/<restorename> -n velero -o yaml
  • velero restore logs <restorename>

Anything else you would like to add:

Environment:

  • Velero version (use velero version):
  • Velero features (use velero client config get features):
  • Kubernetes version (use kubectl version):
  • Kubernetes installer & version:
  • Cloud provider or hardware configuration:
  • OS (e.g. from /etc/os-release):

Vote on this issue!

This is an invitation to the Velero community to vote on issues. You can see the project's top voted issues listed here.
Use the "reaction smiley face" up to the right of this comment to vote.

  • 👍 for "I would like to see this bug fixed as soon as possible"
  • 👎 for "There are more important bugs to focus on right now"
