What steps did you take and what happened:
When a user creates a DPA with defaultSnapshotMoveData: true but doesn't enable NodeAgent, backups hang in WaitingForPluginOperations for hours, until the item operation timeout expires (itemOperationTimeout is 4h0m0s in the backup shown under Actual results). The same issue exists for defaultVolumesToFSBackup. Both features require the NodeAgent DaemonSet to be running, but the DPA validator doesn't enforce this. The misconfiguration is only caught at backup time, and only for FSB; DataMover just silently hangs.
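To make the missing check concrete, here is a minimal sketch of the kind of validation the DPA validator could perform. The `DPAConfig` type, its field names, and `ValidateNodeAgent` are hypothetical and simplified for illustration; they are not the operator's actual API.

```go
package main

import (
	"errors"
	"fmt"
)

// DPAConfig is a hypothetical, simplified view of the DPA settings
// discussed in this issue. It is NOT the real OADP type.
type DPAConfig struct {
	NodeAgentEnabled         bool
	DefaultSnapshotMoveData  bool
	DefaultVolumesToFSBackup bool
}

// ValidateNodeAgent rejects configurations that turn on a feature
// depending on the NodeAgent DaemonSet while NodeAgent is disabled.
func ValidateNodeAgent(c DPAConfig) error {
	if c.NodeAgentEnabled {
		return nil
	}
	if c.DefaultSnapshotMoveData {
		return errors.New("defaultSnapshotMoveData is true but NodeAgent is not enabled")
	}
	if c.DefaultVolumesToFSBackup {
		return errors.New("defaultVolumesToFSBackup is true but NodeAgent is not enabled")
	}
	return nil
}

func main() {
	// The misconfiguration from this report: data mover on, NodeAgent off.
	cfg := DPAConfig{DefaultSnapshotMoveData: true}
	if err := ValidateNodeAgent(cfg); err != nil {
		fmt.Println("DPA rejected:", err)
	}
}
```

Rejecting the DPA at reconcile time would surface the problem before any backup is created, rather than hours later at timeout.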
What did you expect to happen:
The backup should fail if the NodeAgent is not running.
Steps to Reproduce:
- Create a DPA with CSI enabled.
- Deploy a stateful application.
- Create a backup with the snapshotMoveData flag set to true:
apiVersion: velero.io/v1
kind: Backup
metadata:
  name: test-backup3
  labels:
    velero.io/storage-location: default
  namespace: openshift-adp
spec:
  includedNamespaces:
  - ocp-mysql
  storageLocation: ts-dpa-1
  snapshotMoveData: true
Actual results:
The backup gets stuck in the WaitingForPluginOperations phase until it hits a timeout error.
$ oc get backup test-backup3 -o yaml
apiVersion: velero.io/v1
kind: Backup
metadata:
  annotations:
    velero.io/resource-timeout: 10m0s
    velero.io/source-cluster-k8s-gitversion: v1.27.6+98158f9
    velero.io/source-cluster-k8s-major-version: "1"
    velero.io/source-cluster-k8s-minor-version: "27"
  creationTimestamp: "2023-10-18T13:21:08Z"
  generation: 5
  labels:
    velero.io/storage-location: ts-dpa-1
  name: test-backup3
  namespace: openshift-adp
  resourceVersion: "212060"
  uid: 1f252fb5-eb00-4efb-b576-12c6f7169a92
spec:
  csiSnapshotTimeout: 10m0s
  defaultVolumesToFsBackup: false
  includedNamespaces:
  - ocp-mysql
  itemOperationTimeout: 4h0m0s
  snapshotMoveData: true
  storageLocation: ts-dpa-1
  ttl: 720h0m0s
status:
  backupItemOperationsAttempted: 2
  expiration: "2023-11-17T13:21:08Z"
  formatVersion: 1.1.0
  phase: WaitingForPluginOperations
  progress:
    itemsBackedUp: 31
    totalItems: 31
  startTimestamp: "2023-10-18T13:21:09Z"
  version: 1
Expected results:
The backup should fail immediately if the NodeAgent pods are not running.
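The expected behavior could be sketched as a pre-flight check run before the backup enters WaitingForPluginOperations. This is only an illustration: `BackupRequest`, `PreflightNodeAgent`, and the `readyNodeAgentPods` count are hypothetical names, and how the ready pod count would actually be obtained is left out.

```go
package main

import (
	"errors"
	"fmt"
)

// BackupRequest is a hypothetical, simplified view of the Backup spec
// fields that depend on NodeAgent. It is NOT the real Velero type.
type BackupRequest struct {
	SnapshotMoveData         bool
	DefaultVolumesToFsBackup bool
}

// PreflightNodeAgent fails fast when the backup needs NodeAgent but
// no NodeAgent pods are ready. readyNodeAgentPods is assumed to be
// supplied by the caller (for example, from the DaemonSet status).
func PreflightNodeAgent(b BackupRequest, readyNodeAgentPods int) error {
	needsNodeAgent := b.SnapshotMoveData || b.DefaultVolumesToFsBackup
	if needsNodeAgent && readyNodeAgentPods == 0 {
		return errors.New("backup requires NodeAgent but no NodeAgent pods are running")
	}
	return nil
}

func main() {
	// Mirrors the backup in this report: snapshotMoveData on, no NodeAgent pods.
	req := BackupRequest{SnapshotMoveData: true}
	if err := PreflightNodeAgent(req, 0); err != nil {
		fmt.Println("backup failed:", err)
	}
}
```

With a check like this, the backup would move to a failed phase right away instead of waiting for the timeout.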
The following information will help us better understand what's going on:
If you are using velero v1.7.0+:
Please use velero debug --backup <backupname> --restore <restorename> to generate the support bundle, and attach to this issue, more options please refer to velero debug --help
If you are using earlier versions:
Please provide the output of the following commands (Pasting long output into a GitHub gist or other pastebin is fine.)
kubectl logs deployment/velero -n velero
velero backup describe <backupname> or kubectl get backup/<backupname> -n velero -o yaml
velero backup logs <backupname>
velero restore describe <restorename> or kubectl get restore/<restorename> -n velero -o yaml
velero restore logs <restorename>
Anything else you would like to add:
Environment:
- Velero version (use velero version):
- Velero features (use velero client config get features):
- Kubernetes version (use kubectl version):
- Kubernetes installer & version:
- Cloud provider or hardware configuration:
- OS (e.g. from /etc/os-release):
Vote on this issue!
This is an invitation to the Velero community to vote on issues, you can see the project's top voted issues listed here.
Use the "reaction smiley face" up to the right of this comment to vote.
- 👍 for "I would like to see this bug fixed as soon as possible"
- 👎 for "There are more important bugs to focus on right now"