fix: fail backup early when CSI VolumeSnapshot error persists in async poll #9675
priyansh17 wants to merge 0 commits into velero-io:main
Codecov Report
✅ All modified and coverable lines are covered by tests.

@@            Coverage Diff             @@
##             main    #9675      +/-   ##
==========================================
+ Coverage   60.96%   60.98%   +0.02%
==========================================
  Files         384      384
  Lines       36595    36615      +20
==========================================
+ Hits        22310    22330      +20
  Misses      12676    12676
  Partials     1609     1609
@shubham-pampattiwar This mechanism was intended to tolerate transient provider errors, allowing the provider an opportunity to recover later. However, I can no longer locate this code in the main branch. I am uncertain when this code was removed and whether this functionality is still necessary here.
I too have observed CSI snapshots whose errors persist, so it is better to try again. Another approach to consider is to re-create the VolumeSnapshot a few times, since a persistent error may sometimes be specific to that VolumeSnapshot (its specific UID), such as "the object has been modified; please apply your changes to the latest version and try again" on some drivers.
@blackpiglet The transient error tolerance you're remembering is still here for VS errors; it was never removed:

@priyansh17 Thanks for the fix. The problem is real, but I think we can simplify the approach. The operation start time is already available via

Additionally, the VSC error path currently fails immediately with no transient tolerance at all: https://github.com/vmware-tanzu/velero/blob/e8fa708933b0ca173d319009d230a5316fce6a88/pkg/backup/actions/csi/volumesnapshot_action.go#L327-L332. Ideally we'd apply the same bounded-timeout pattern there too, tolerate errors within the

Would you be open to reworking the approach along these lines?
Thank you for the input here, @blackpiglet @kaovilai & @shubham-pampattiwar
Hello folks, please re-review this.
cc @Lyndon-Li |
Hi, please check this as well. #9674 (comment)
anshulahuja98 left a comment
LGTM. I think we should keep using the CSI specific timeout, the async timeout will not help with early failure detection.
Please add a summary of your change
In pkg/backup/actions/csi/volumesnapshot_action.go, in the Progress() method of volumeSnapshotBackupItemAction:
Progress() reports progress.Completed=false even for permanent errors, so the backup controller keeps polling Progress() for the full itemOperationTimeout (could be hours) instead of failing immediately.
The synchronous path (WaitUntilVSCHandleIsReady) properly fails on timeout, but the async Progress() does not.
The fix is to wait for a certain period of time before failing or skipping it.
Does your change fix a particular issue?
Fixes #(issue)
#9674
Please indicate you've done the following:
(make new-changelog) or comment /kind changelog-not-required on this PR.
site/content/docs/main.