Fail backup validation when built-in data mover has no running node-agent #9697
Joeavaikath wants to merge 11 commits into velero-io:main
Conversation
When SnapshotMoveData is enabled but no node-agent pods are running, DataUpload CRs sit unprocessed until timeout. This adds a pre-flight check in the PVC backup item action that verifies running node-agent pods exist before creating DataUpload CRs, mirroring the check FSB already performs. Cleans up the VolumeSnapshot on failure to prevent orphaned resources. Signed-off-by: Joseph <jvaikath@redhat.com>
116fd60 to dca63e7
Codecov Report ❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #9697 +/- ##
==========================================
+ Coverage 60.94% 60.97% +0.02%
==========================================
Files 384 384
Lines 36594 36614 +20
==========================================
+ Hits 22303 22325 +22
+ Misses 12681 12678 -3
- Partials 1610 1611 +1
☔ View full report in Codecov by Sentry.
@Joeavaikath Please check the Velero data mover design: Velero supports customized data movers which don't necessarily rely on node-agent, so it is not reasonable to check the status of node-agent unconditionally.

@Lyndon-Li Could add a check to ensure it's the default data mover before the node-agent pod running check runs.
Custom data movers operate independently of node-agent, so the HasRunningPods check should only run when the built-in "velero" data mover is in use. Adds test case verifying custom data movers bypass the check. Signed-off-by: Joseph <jvaikath@redhat.com>
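For context, the gate this commit adds can be modeled as a small predicate. This is a sketch: in Velero the real check is `datamover.IsBuiltInUploader`, which treats an empty data mover name and the literal "velero" as the built-in data mover.

```go
package main

import "fmt"

// isBuiltInUploader models Velero's datamover.IsBuiltInUploader: an empty
// dataMover field and the literal "velero" both select the built-in data
// mover; any other value names a custom data mover, which must not be
// subject to the node-agent check.
func isBuiltInUploader(dataMover string) bool {
	return dataMover == "" || dataMover == "velero"
}

func main() {
	for _, dm := range []string{"", "velero", "custom-mover"} {
		fmt.Printf("dataMover=%q built-in=%v\n", dm, isBuiltInUploader(dm))
	}
}
```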
```go
if datamover.IsBuiltInUploader(backup.Spec.DataMover) {
	if err := nodeagent.HasRunningPods(context.Background(), backup.Namespace, p.crClient); err != nil {
		dataUploadLog.WithError(err).Error("cannot perform snapshot data movement without running node-agent pods")
		csi.CleanupVolumeSnapshot(vs, p.crClient, p.log)
```
It is not reasonable to do this check after creating the snapshot
Moved the check further up
```go
// HasRunningPods checks if any node-agent pod is running in the namespace through the controller client. If not, return the error found.
func HasRunningPods(ctx context.Context, namespace string, crClient ctrlclient.Client) error {
```
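A minimal, dependency-free sketch of what `HasRunningPods` does. The real function lists node-agent pods through the controller-runtime client and inspects their `status.phase`; here the phases are modeled as plain strings so the logic stands alone.

```go
package main

import (
	"errors"
	"fmt"
)

// hasRunningPods models the check under discussion: given the phases of the
// node-agent pods found in the namespace, succeed if at least one pod is
// Running, otherwise return an error describing what was (not) found.
func hasRunningPods(podPhases []string) error {
	if len(podPhases) == 0 {
		return errors.New("no node-agent pods found")
	}
	for _, phase := range podPhases {
		if phase == "Running" {
			return nil
		}
	}
	return errors.New("no node-agent pod is running")
}

func main() {
	fmt.Println(hasRunningPods(nil))                          // no node-agent pods found
	fmt.Println(hasRunningPods([]string{"Pending", "Running"})) // <nil>
}
```

As the review notes, this only proves that some node-agent pod is running somewhere, not that one runs on the node where the data movement would be scheduled.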
From the plugin, we cannot make a decisive check:
- It doesn't know which node the data mover is going to run on
- So the code here only checks whether any node-agent pod is running
- But that is not enough to guarantee the data mover could be processed by that node-agent pod

Then to what extent would this code change help?
The intention is to fast-fail. If node-agent is not deployed or no pods are running, the DataUpload will time out as the issue describes: the backup hangs for 4h in phase WaitingForPluginOperations, and this check stops it from hanging.
It is a "will any node-agent run this" check.
Yes, but I doubt how this "will any node-agent run this" check would help. As mentioned above, in many cases the data mover pods will run on a dedicated node only, and so require the node-agent pod running on that node specifically.
Signed-off-by: Joseph <jvaikath@redhat.com>
The node-agent running check now lives in prepareBackupRequest() alongside other pre-flight validations (BSL availability, snapshot locations). This fails the entire backup with FailedValidation instead of producing per-PVC errors, and avoids creating snapshots that would need cleanup. Signed-off-by: Joseph <jvaikath@redhat.com>
@Lyndon-Li I think it properly lives in validation now: we are ensuring that there is at least one node-agent pod running before proceeding.
The node-agent validation added in prepareBackupRequest causes test cases with SnapshotMoveData enabled to fail validation. Add a running node-agent pod to the baseline test environment so all cases pass the check. Signed-off-by: Joseph <jvaikath@redhat.com>
Signed-off-by: Joseph <jvaikath@redhat.com>
Signed-off-by: Joseph <jvaikath@redhat.com>
Is this the correct link? It's very old.
Replace HasRunningPods (which listed all node-agent pods and iterated to find a running one) with IsReady, which does a single Get on the node-agent DaemonSet and checks status.numberReady > 0. This is a cheaper check and more semantically appropriate since we only need to know the DaemonSet is healthy, not inspect individual pods. Signed-off-by: Joseph <jvaikath@redhat.com>
Summary

When `snapshotMoveData: true` is set with the built-in data mover but node-agent is not ready, backups hang in `WaitingForPluginOperations` until `itemOperationTimeout` (default 4h) expires. The DataUpload CR is created but never reconciled because the DataUpload controller runs inside node-agent pods.

This PR adds a pre-flight validation check in `prepareBackupRequest()` that fails the backup with `FailedValidation` if the built-in data mover is requested but the node-agent DaemonSet has no ready pods. The check inspects `status.numberReady` on the DaemonSet rather than listing individual pods: a single object lookup instead of scanning all pods. The check is scoped to the built-in data mover only; custom data movers that don't rely on node-agent are unaffected.

Changes:
- `pkg/nodeagent/node_agent.go`: add `IsReady()` function that checks the node-agent DaemonSet's `status.numberReady`
- `pkg/nodeagent/node_agent_test.go`: unit tests for `IsReady()`
- `pkg/controller/backup_controller.go`: call `nodeagent.IsReady()` in `prepareBackupRequest()`, consistent with existing BSL/snapshot location checks
- `pkg/controller/backup_controller_test.go`: 5 test cases (no DS, DS not ready, DS ready, custom mover, disabled)
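The `IsReady()` logic described above can be sketched without the Kubernetes client. This is a model under stated assumptions: `found` stands in for whether the Get on the node-agent DaemonSet succeeded, and `daemonSetStatus` mirrors only the fields of `appsv1.DaemonSetStatus` the check relies on.

```go
package main

import (
	"errors"
	"fmt"
)

// daemonSetStatus models the fields of appsv1.DaemonSetStatus that the
// readiness check cares about.
type daemonSetStatus struct {
	desiredNumberScheduled int32
	numberReady            int32
}

// isReady models the proposed nodeagent.IsReady: a single Get on the
// node-agent DaemonSet, succeeding only when at least one pod is ready.
func isReady(found bool, status daemonSetStatus) error {
	if !found {
		return errors.New("node-agent daemon set not found")
	}
	if status.numberReady == 0 {
		return errors.New("node-agent daemon set has no ready pods")
	}
	return nil
}

func main() {
	fmt.Println(isReady(false, daemonSetStatus{}))                                         // not found
	fmt.Println(isReady(true, daemonSetStatus{desiredNumberScheduled: 3}))                 // no ready pods
	fmt.Println(isReady(true, daemonSetStatus{desiredNumberScheduled: 3, numberReady: 2})) // <nil>
}
```

Compared to the earlier `HasRunningPods`, this trades a pod List plus iteration for one DaemonSet Get, which matches the intent: the validation only needs to know the DaemonSet is healthy.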
Does your change fix a particular issue?

Please indicate you've done the following:
- Added a changelog entry (`make new-changelog`) or commented `/kind changelog-not-required` on this PR.
- Updated the documentation under `site/content/docs/main`.