
fix: fail backup early when CSI VolumeSnapshot error persists in async poll #9675

Closed
priyansh17 wants to merge 0 commits into velero-io:main from priyansh17:main

Conversation

@priyansh17
Collaborator

Please add a summary of your change

  • This bug can be reproduced by triggering a scenario where cloud snapshot provisioning fails (e.g., by exhausting Azure Disk capacity or removing the required permissions).
  • The problematic code path is in pkg/backup/actions/csi/volumesnapshot_action.go, in the Progress() method of volumeSnapshotBackupItemAction:
if boolptr.IsSetToTrue(vs.Status.ReadyToUse) {
    // ... continue to check VSC
} else if vs.Status.Error != nil {
    errorMessage := ""
    if vs.Status.Error.Message != nil {
        errorMessage = *vs.Status.Error.Message
    }
    p.log.Warnf("VolumeSnapshot has a temporary error %s. Snapshot controller will retry later.", errorMessage)
    return progress, nil  // <-- Returns NOT completed, no error propagated
}
  • This returns progress.Completed=false even for permanent errors, so the backup controller keeps polling Progress() for the full itemOperationTimeout (which could be hours) instead of failing immediately.
  • The synchronous code path (WaitUntilVSCHandleIsReady) properly fails on timeout, but async Progress() does not.

The fix is to tolerate the error for a bounded period and fail (or skip) the operation once that window elapses, as sketched below.
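
For illustration, a minimal sketch of that bounded-tolerance branch (hypothetical, not the code in this PR: it assumes the operation start time is populated on progress.Started, that the configured CSI snapshot timeout is available as csiSnapshotTimeout, and that github.com/pkg/errors is imported as errors):

} else if vs.Status.Error != nil {
    errorMessage := ""
    if vs.Status.Error.Message != nil {
        errorMessage = *vs.Status.Error.Message
    }
    // Hypothetical deadline check: once the error has persisted longer than the
    // CSI snapshot timeout, treat it as permanent and fail the operation so the
    // backup controller stops polling.
    if !progress.Started.IsZero() && time.Since(progress.Started) >= csiSnapshotTimeout {
        progress.Completed = true
        progress.Err = errorMessage
        return progress, errors.Errorf("VolumeSnapshot error persisted beyond the CSI snapshot timeout: %s", errorMessage)
    }
    // Still within the tolerance window: keep treating the error as transient.
    p.log.Warnf("VolumeSnapshot has a temporary error %s. Snapshot controller will retry later.", errorMessage)
    return progress, nil
}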

Does your change fix a particular issue?

Fixes #9674

Please indicate you've done the following:

@codecov

codecov Bot commented Apr 6, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 60.98%. Comparing base (15db9d2) to head (9149483).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #9675      +/-   ##
==========================================
+ Coverage   60.96%   60.98%   +0.02%     
==========================================
  Files         384      384              
  Lines       36595    36615      +20     
==========================================
+ Hits        22310    22330      +20     
  Misses      12676    12676              
  Partials     1609     1609              


@blackpiglet
Contributor

@shubham-pampattiwar
I recall that the CSI plugin code previously included checks designed to prevent failures when encountering errors in the VolumeSnapshot (VS) and VolumeSnapshotContent (VSC) resources.

This mechanism was intended to tolerate transient provider errors, allowing the provider an opportunity to recover later.

However, I can no longer locate this code in the main branch.

I am uncertain when this code was removed and whether this functionality is still necessary here.

https://github.com/vmware-tanzu/velero/blob/d3f4b2c67e112b238a91007aaca6b4d3a01947ba/pkg/backup/actions/csi/volumesnapshot_action.go#L297-L300

@kaovilai previously approved these changes Apr 7, 2026
Collaborator

@kaovilai left a comment


I too have observed CSI snapshots with persistent errors, so it is better to retry. Another approach to consider is to re-create the VolumeSnapshot a few times, since a persistent error may sometimes be specific to that particular VolumeSnapshot object (i.e., its specific UID), such as "the object has been modified; please apply your changes to the latest version and try again" on some drivers. A sketch of that idea follows.
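
A rough sketch of the recreate idea (purely illustrative and not part of this PR; retry bookkeeping and waiting for the old object's finalizers are elided):

import (
    "context"

    snapshotv1api "github.com/kubernetes-csi/external-snapshotter/client/v7/apis/volumesnapshot/v1"
    snapshotterClientSet "github.com/kubernetes-csi/external-snapshotter/client/v7/clientset/versioned"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// recreateVolumeSnapshot deletes an errored VolumeSnapshot and creates a fresh
// object with the same spec, so the snapshot controller retries under a new UID.
// A real implementation would wait for the old object to be fully removed before
// creating the replacement.
func recreateVolumeSnapshot(
    ctx context.Context,
    client snapshotterClientSet.Interface,
    vs *snapshotv1api.VolumeSnapshot,
) (*snapshotv1api.VolumeSnapshot, error) {
    fresh := &snapshotv1api.VolumeSnapshot{
        ObjectMeta: metav1.ObjectMeta{
            Name:      vs.Name,
            Namespace: vs.Namespace,
            Labels:    vs.Labels,
        },
        Spec: vs.Spec,
    }
    if err := client.SnapshotV1().VolumeSnapshots(vs.Namespace).Delete(ctx, vs.Name, metav1.DeleteOptions{}); err != nil {
        return nil, err
    }
    return client.SnapshotV1().VolumeSnapshots(vs.Namespace).Create(ctx, fresh, metav1.CreateOptions{})
}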

@shubham-pampattiwar
Collaborator

> @shubham-pampattiwar I recall that the CSI plugin code previously included checks designed to prevent failures when encountering errors in the VolumeSnapshot (VS) and VolumeSnapshotContent (VSC) resources. This mechanism was intended to tolerate transient provider errors, allowing the provider an opportunity to recover later. However, I can no longer locate this code in the main branch. I am uncertain when this code was removed and whether this functionality is still necessary here.
>
> https://github.com/vmware-tanzu/velero/blob/d3f4b2c67e112b238a91007aaca6b4d3a01947ba/pkg/backup/actions/csi/volumesnapshot_action.go#L297-L300

@blackpiglet The transient error tolerance you're remembering is still here for VS errors; it was never removed:
https://github.com/vmware-tanzu/velero/blob/e8fa708933b0ca173d319009d230a5316fce6a88/pkg/backup/actions/csi/volumesnapshot_action.go#L292-L300
I also have a draft PR (#8023) to add the same tolerance for VSC errors, but it hasn't been merged yet.

@priyansh17 Thanks for the fix. The problem is real, but I think we can simplify the approach. The operation start time is already available via progress.Started, so we can use time.Since(progress.Started) >= CSISnapshotTimeout to detect persistent errors without needing the annotation. Same result, less code.

Additionally, the VSC error path currently fails immediately with no transient tolerance at all: https://github.com/vmware-tanzu/velero/blob/e8fa708933b0ca173d319009d230a5316fce6a88/pkg/backup/actions/csi/volumesnapshot_action.go#L327-L332. Ideally we'd apply the same bounded-timeout pattern there too: tolerate errors within the CSISnapshotTimeout window, then fail. That would make both paths consistent (see the helper sketch below).

Would you be open to reworking the approach along these lines?
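
One way to keep the VS and VSC paths consistent would be to factor the deadline check into a shared helper, roughly like this (a hypothetical sketch; the helper name failIfErrorPersists is illustrative):

import (
    "time"

    "github.com/pkg/errors"
    "github.com/vmware-tanzu/velero/pkg/plugin/velero"
)

// failIfErrorPersists marks the operation failed once a VS/VSC error has been
// observed for longer than the CSI snapshot timeout; within that window it
// returns the progress unchanged so the backup controller keeps polling.
func failIfErrorPersists(
    progress velero.OperationProgress,
    errorMessage string,
    timeout time.Duration,
) (velero.OperationProgress, error) {
    if !progress.Started.IsZero() && time.Since(progress.Started) >= timeout {
        progress.Completed = true
        progress.Err = errorMessage
        return progress, errors.Errorf("snapshot error persisted beyond the CSI snapshot timeout (%s): %s", timeout, errorMessage)
    }
    return progress, nil
}

Both the VS and VSC error branches could then call this helper with the message extracted from the respective status.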

@priyansh17
Collaborator Author

Thank you for the inputs here, @blackpiglet, @kaovilai, and @shubham-pampattiwar.
I agree we should make a similar fix for VSC as well, and I will update the PR to use time.Since(progress.Started) instead of the annotation.

@priyansh17
Collaborator Author

Hello Folks, please re-review this.
@blackpiglet @shubham-pampattiwar @kaovilai @anshulahuja98

@reasonerjt requested a review from Lyndon-Li on April 13, 2026
@reasonerjt
Contributor

cc @Lyndon-Li for the disagreement about itemOperationTimeout vs. CSISnapshotTimeout.

@priyansh17
Collaborator Author

> cc @Lyndon-Li for the disagreement about itemOperationTimeout vs. CSISnapshotTimeout.

Hi, please check this as well. #9674 (comment)

Collaborator

@anshulahuja98 left a comment


LGTM. I think we should keep using the CSI-specific timeout; the async item operation timeout will not help with early failure detection.
