
fix: fail backup early when CSI VolumeSnapshot error persists in async poll #9675

Closed
priyansh17 wants to merge 0 commits into velero-io:main from priyansh17:main

Conversation

@priyansh17
Collaborator

Please add a summary of your change

  • This bug can be reproduced by triggering a scenario where cloud snapshot provisioning fails (e.g., by exhausting Azure Disk capacity or removing the required permissions).
  • The problematic code path is in pkg/backup/actions/csi/volumesnapshot_action.go, in the Progress() method of volumeSnapshotBackupItemAction:
if boolptr.IsSetToTrue(vs.Status.ReadyToUse) {
    // ... continue to check VSC
} else if vs.Status.Error != nil {
    errorMessage := ""
    if vs.Status.Error.Message != nil {
        errorMessage = *vs.Status.Error.Message
    }
    p.log.Warnf("VolumeSnapshot has a temporary error %s. Snapshot controller will retry later.", errorMessage)
    return progress, nil  // <-- Returns NOT completed, no error propagated
}
  • This returns progress.Completed=false even for permanent errors, so the backup controller keeps polling Progress() for the full itemOperationTimeout (which could be hours) instead of failing immediately.
  • The synchronous code path (WaitUntilVSCHandleIsReady) properly fails on timeout, but async Progress() does not.

The fix is to tolerate the error for a bounded period and fail (or skip) the operation once that window elapses, as sketched below.
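
For illustration, a minimal sketch of that bounded-tolerance branch (hypothetical, not the code in this PR: it assumes the operation start time is populated on progress.Started, that the configured CSI snapshot timeout is available as csiSnapshotTimeout, and that github.com/pkg/errors is imported as errors):

} else if vs.Status.Error != nil {
    errorMessage := ""
    if vs.Status.Error.Message != nil {
        errorMessage = *vs.Status.Error.Message
    }
    // Hypothetical deadline check: once the error has persisted longer than the
    // CSI snapshot timeout, treat it as permanent and fail the operation so the
    // backup controller stops polling.
    if !progress.Started.IsZero() && time.Since(progress.Started) >= csiSnapshotTimeout {
        progress.Completed = true
        progress.Err = errorMessage
        return progress, errors.Errorf("VolumeSnapshot error persisted beyond the CSI snapshot timeout: %s", errorMessage)
    }
    // Still within the tolerance window: keep treating the error as transient.
    p.log.Warnf("VolumeSnapshot has a temporary error %s. Snapshot controller will retry later.", errorMessage)
    return progress, nil
}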

Does your change fix a particular issue?

Fixes #9674

Please indicate you've done the following:

@codecov

codecov Bot commented Apr 6, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 60.98%. Comparing base (15db9d2) to head (9149483).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #9675      +/-   ##
==========================================
+ Coverage   60.96%   60.98%   +0.02%     
==========================================
  Files         384      384              
  Lines       36595    36615      +20     
==========================================
+ Hits        22310    22330      +20     
  Misses      12676    12676              
  Partials     1609     1609              


@blackpiglet
Contributor

@shubham-pampattiwar
I recall that the CSI plugin code previously included checks designed to prevent failures when encountering errors in the VolumeSnapshot (VS) and VolumeSnapshotContent (VSC) resources.

This mechanism was intended to tolerate transient provider errors, allowing the provider an opportunity to recover later.

However, I can no longer locate this code in the main branch.

I am uncertain when this code was removed and whether this functionality is still necessary here.

https://github.com/vmware-tanzu/velero/blob/d3f4b2c67e112b238a91007aaca6b4d3a01947ba/pkg/backup/actions/csi/volumesnapshot_action.go#L297-L300

@kaovilai previously approved these changes Apr 7, 2026
Collaborator

@kaovilai left a comment


I too have observed CSI snapshots with persistent errors, so it is better to retry. Another approach to consider is to re-create the VolumeSnapshot a few times, since a persistent error may sometimes be specific to that particular VolumeSnapshot object (i.e., its specific UID), such as "the object has been modified; please apply your changes to the latest version and try again" on some drivers. A sketch of that idea follows.
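
A rough sketch of the recreate idea (purely illustrative and not part of this PR; retry bookkeeping and waiting for the old object's finalizers are elided):

import (
    "context"

    snapshotv1api "github.com/kubernetes-csi/external-snapshotter/client/v7/apis/volumesnapshot/v1"
    snapshotterClientSet "github.com/kubernetes-csi/external-snapshotter/client/v7/clientset/versioned"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// recreateVolumeSnapshot deletes an errored VolumeSnapshot and creates a fresh
// object with the same spec, so the snapshot controller retries under a new UID.
// A real implementation would wait for the old object to be fully removed before
// creating the replacement.
func recreateVolumeSnapshot(
    ctx context.Context,
    client snapshotterClientSet.Interface,
    vs *snapshotv1api.VolumeSnapshot,
) (*snapshotv1api.VolumeSnapshot, error) {
    fresh := &snapshotv1api.VolumeSnapshot{
        ObjectMeta: metav1.ObjectMeta{
            Name:      vs.Name,
            Namespace: vs.Namespace,
            Labels:    vs.Labels,
        },
        Spec: vs.Spec,
    }
    if err := client.SnapshotV1().VolumeSnapshots(vs.Namespace).Delete(ctx, vs.Name, metav1.DeleteOptions{}); err != nil {
        return nil, err
    }
    return client.SnapshotV1().VolumeSnapshots(vs.Namespace).Create(ctx, fresh, metav1.CreateOptions{})
}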

@shubham-pampattiwar
Collaborator

> @shubham-pampattiwar I recall that the CSI plugin code previously included checks designed to prevent failures when encountering errors in the VolumeSnapshot (VS) and VolumeSnapshotContent (VSC) resources. This mechanism was intended to tolerate transient provider errors, allowing the provider an opportunity to recover later. However, I can no longer locate this code in the main branch. I am uncertain when this code was removed and whether this functionality is still necessary here.
>
> https://github.com/vmware-tanzu/velero/blob/d3f4b2c67e112b238a91007aaca6b4d3a01947ba/pkg/backup/actions/csi/volumesnapshot_action.go#L297-L300

@blackpiglet The transient error tolerance you're remembering is still here for VS errors; it was never removed:
https://github.com/vmware-tanzu/velero/blob/e8fa708933b0ca173d319009d230a5316fce6a88/pkg/backup/actions/csi/volumesnapshot_action.go#L292-L300
I also have a draft PR (#8023) to add the same tolerance for VSC errors, but it hasn't been merged yet.

@priyansh17 Thanks for the fix. The problem is real, but I think we can simplify the approach. The operation start time is already available via progress.Started, so we can use time.Since(progress.Started) >= CSISnapshotTimeout to detect persistent errors without needing the annotation. Same result, less code.

Additionally, the VSC error path currently fails immediately with no transient tolerance at all: https://github.com/vmware-tanzu/velero/blob/e8fa708933b0ca173d319009d230a5316fce6a88/pkg/backup/actions/csi/volumesnapshot_action.go#L327-L332. Ideally we'd apply the same bounded-timeout pattern there too: tolerate errors within the CSISnapshotTimeout window, then fail. That would make both paths consistent (see the helper sketch below).

Would you be open to reworking the approach along these lines?
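
One way to keep the VS and VSC paths consistent would be to factor the deadline check into a shared helper, roughly like this (a hypothetical sketch; the helper name failIfErrorPersists is illustrative):

import (
    "time"

    "github.com/pkg/errors"
    "github.com/vmware-tanzu/velero/pkg/plugin/velero"
)

// failIfErrorPersists marks the operation failed once a VS/VSC error has been
// observed for longer than the CSI snapshot timeout; within that window it
// returns the progress unchanged so the backup controller keeps polling.
func failIfErrorPersists(
    progress velero.OperationProgress,
    errorMessage string,
    timeout time.Duration,
) (velero.OperationProgress, error) {
    if !progress.Started.IsZero() && time.Since(progress.Started) >= timeout {
        progress.Completed = true
        progress.Err = errorMessage
        return progress, errors.Errorf("snapshot error persisted beyond the CSI snapshot timeout (%s): %s", timeout, errorMessage)
    }
    return progress, nil
}

Both the VS and VSC error branches could then call this helper with the message extracted from the respective status.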

@priyansh17
Collaborator Author

Thank you for the inputs here, @blackpiglet, @kaovilai, and @shubham-pampattiwar.
I agree we should make a similar fix for VSC as well, and I will update the PR to use time.Since(progress.Started) instead of the annotation.

@priyansh17
Collaborator Author

Hello Folks, please re-review this.
@blackpiglet @shubham-pampattiwar @kaovilai @anshulahuja98

@reasonerjt requested a review from Lyndon-Li on April 13, 2026
@reasonerjt
Contributor

cc @Lyndon-Li for the disagreement about itemOperationTimeout vs. CSISnapshotTimeout.

@priyansh17
Collaborator Author

> cc @Lyndon-Li for the disagreement about itemOperationTimeout vs. CSISnapshotTimeout.

Hi, please check this as well. #9674 (comment)

Collaborator

@anshulahuja98 left a comment


LGTM. I think we should keep using the CSI-specific timeout; the async item operation timeout will not help with early failure detection.
