Skip to content

Add delay to avoid race conditions during VolumeSnapshotContent deletion#9700

Merged
kaovilai merged 4 commits intovelero-io:mainfrom
priyansh17:priyansh/issue-9699
Apr 14, 2026
Merged

Add delay to avoid race conditions during VolumeSnapshotContent deletion#9700
kaovilai merged 4 commits intovelero-io:mainfrom
priyansh17:priyansh/issue-9699

Conversation

@priyansh17
Copy link
Copy Markdown
Collaborator

Signed-off-by: Priyansh Choudhary im1706@gmail.com
Adda delay to avoid race conditions for a CSI Snapshotter bug in K8s Sidecar controller.
Details added in the issue linked below.

Does your change fix a particular issue?

Fixes #9699

Please indicate you've done the following:

Signed-off-by: Priyansh Choudhary <im1706@gmail.com>
Signed-off-by: Priyansh Choudhary <im1706@gmail.com>
@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 9, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 60.94%. Comparing base (15db9d2) to head (ee477c9).
⚠️ Report is 2 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #9700      +/-   ##
==========================================
- Coverage   60.96%   60.94%   -0.02%     
==========================================
  Files         384      384              
  Lines       36595    36596       +1     
==========================================
- Hits        22310    22305       -5     
- Misses      12676    12681       +5     
- Partials     1609     1610       +1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@reasonerjt reasonerjt removed their request for review April 13, 2026 06:49
blackpiglet
blackpiglet previously approved these changes Apr 13, 2026
Copy link
Copy Markdown
Collaborator

@shubham-pampattiwar shubham-pampattiwar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fix @priyansh17.
The blind time.Sleep(2s) concerns me -- it's not guaranteed sufficient under load, and it adds unnecessary delay on clusters where the upstream fix is already present (external-snapshotter >= v9.2.0). For backups with many CSI snapshots the cumulative cost adds up.

Your issue actually suggests polling for snapshot.storage.kubernetes.io/volumesnapshotcontent-bound-protection finalizer instead. Could we use wait.PollUntilContextTimeout to check for it? That way we return as soon as the sidecar has actually processed the object rather than guessing how long to wait.
Thoughts ?
cc @blackpiglet

Comment thread changelogs/unreleased/9700-priyansh17 Outdated
@priyansh17
Copy link
Copy Markdown
Collaborator Author

Thanks for the fix @priyansh17.
The blind time.Sleep(2s) concerns me -- it's not guaranteed sufficient under load, and it adds unnecessary delay on clusters where the upstream fix is already present (external-snapshotter >= v9.2.0). For backups with many CSI snapshots the cumulative cost adds up.

Your issue actually suggests polling for snapshot.storage.kubernetes.io/volumesnapshotcontent-bound-protection finalizer instead. Could we use wait.PollUntilContextTimeout to check for it? That way we return as soon as the sidecar has actually processed the object rather than guessing how long to wait.
Thoughts ?
cc @blackpiglet

Hi, Thanks @shubham-pampattiwar for reviewing.
The source bug where I pointed out the problem was occurring when the vsc cleanup deletion was triggered already but sidecar did not pick it up, in these cases finaliser was already added to it and it was ready for deletion but never processed.
It did not make to the informer queue of sidecar.

I agree for new K8s clusters it is not required but it prevents any unwanted behaviour if the sidecar logic changes in future as well for handling concurrent CRUD operations.
I also tested this under load where I added 10,000 fake VSCs and did GC of backups with 200 snapshots, with the gap in creation and deletion the csi was able to perform well and it completed successfully on older k8s versions.

Earlier we had a poll on readyToUse which was unnecessary but solving this problem, as we removed it I intend to add this gap.

Signed-off-by: Priyansh Choudhary <im1706@gmail.com>
@priyansh17 priyansh17 force-pushed the priyansh/issue-9699 branch from c5247f5 to 49953ad Compare April 14, 2026 10:45
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of sleeping, should it be watching VSC for status updates/bound/avail phase etc which would be more indicative of storage driver recognizing the operation?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess that discussion was had.. alas better than what we had before.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes Thanks. Shubham had same point which was discussed

@kaovilai kaovilai merged commit df2686c into velero-io:main Apr 14, 2026
57 of 58 checks passed
priyansh17 added a commit to priyansh17/velero that referenced this pull request Apr 15, 2026
…ion (velero-io#9700)

* Add delay to avoid race conditions during VolumeSnapshotContent deletion
Signed-off-by: Priyansh Choudhary <im1706@gmail.com>

* updated changelog
Signed-off-by: Priyansh Choudhary <im1706@gmail.com>

* Updated Changelog
Signed-off-by: Priyansh Choudhary <im1706@gmail.com>
emirot pushed a commit to emirot/velero that referenced this pull request Apr 16, 2026
…ion (velero-io#9700)

* Add delay to avoid race conditions during VolumeSnapshotContent deletion
Signed-off-by: Priyansh Choudhary <im1706@gmail.com>

* updated changelog
Signed-off-by: Priyansh Choudhary <im1706@gmail.com>

* Updated Changelog
Signed-off-by: Priyansh Choudhary <im1706@gmail.com>
Signed-off-by: emirot <emirot.nolan@gmail.com>
kaovilai pushed a commit that referenced this pull request Apr 16, 2026
* perf: better string concatenation

Signed-off-by: emirot <emirot.nolan@gmail.com>
Signed-off-by: nolanemirot <nolan.emirot@broadcom.com>
Signed-off-by: emirot <emirot.nolan@gmail.com>

* fix: backup deletion silently succeeds when tarball download fails (#9693)

* Enhance backup deletion logic to handle tarball download failures and clean up associated CSI VolumeSnapshotContents
Signed-off-by: Priyansh Choudhary <im1706@gmail.com>

* added changelog
Signed-off-by: Priyansh Choudhary <im1706@gmail.com>

* Refactor error handling in backup deletion
Signed-off-by: Priyansh Choudhary <im1706@gmail.com>

* Refactor backup deletion logic to skip CSI snapshot cleanup on tarball download failure
Signed-off-by: Priyansh Choudhary <im1706@gmail.com>

* prevent backup deletion when errors occur
Signed-off-by: Priyansh Choudhary <im1706@gmail.com>

* added logger
Signed-off-by: Priyansh Choudhary <im1706@gmail.com>
Signed-off-by: emirot <emirot.nolan@gmail.com>

* perf: better string concatenation

Signed-off-by: emirot <emirot.nolan@gmail.com>

* Add delay to avoid race conditions during VolumeSnapshotContent deletion (#9700)

* Add delay to avoid race conditions during VolumeSnapshotContent deletion
Signed-off-by: Priyansh Choudhary <im1706@gmail.com>

* updated changelog
Signed-off-by: Priyansh Choudhary <im1706@gmail.com>

* Updated Changelog
Signed-off-by: Priyansh Choudhary <im1706@gmail.com>
Signed-off-by: emirot <emirot.nolan@gmail.com>

* block data mover design

Signed-off-by: Lyndon-Li <lyonghui@vmware.com>
Signed-off-by: emirot <emirot.nolan@gmail.com>

* block data mover design

Signed-off-by: Lyndon-Li <lyonghui@vmware.com>
Signed-off-by: emirot <emirot.nolan@gmail.com>

* irregular volume size

Signed-off-by: Lyndon-Li <lyonghui@vmware.com>
Signed-off-by: emirot <emirot.nolan@gmail.com>

* block data mover design

Signed-off-by: Lyndon-Li <lyonghui@vmware.com>
Signed-off-by: emirot <emirot.nolan@gmail.com>

* Update the "community" page of website (#9722)

Update the community page to add the correct links to community meeting
and meeting notes.
I also removed the referece of google group as I confirmed the last
message was sent 2 years ago.

Signed-off-by: Daniel Jiang <daniel.jiang@broadcom.com>
Signed-off-by: emirot <emirot.nolan@gmail.com>

* perf: better string concatenation

Signed-off-by: emirot <emirot.nolan@gmail.com>

---------

Signed-off-by: emirot <emirot.nolan@gmail.com>
Signed-off-by: nolanemirot <nolan.emirot@broadcom.com>
Signed-off-by: Priyansh Choudhary <im1706@gmail.com>
Signed-off-by: Lyndon-Li <lyonghui@vmware.com>
Signed-off-by: Daniel Jiang <daniel.jiang@broadcom.com>
Co-authored-by: Priyansh Choudhary <im1706@gmail.com>
Co-authored-by: nolanemirot <nolan.emirot@broadcom.com>
Co-authored-by: Lyndon-Li <lyonghui@vmware.com>
Co-authored-by: Daniel Jiang <daniel.jiang@broadcom.com>
priyansh17 added a commit to priyansh17/velero that referenced this pull request Apr 16, 2026
…ion (velero-io#9700)

* Add delay to avoid race conditions during VolumeSnapshotContent deletion
Signed-off-by: Priyansh Choudhary <im1706@gmail.com>

* updated changelog
Signed-off-by: Priyansh Choudhary <im1706@gmail.com>

* Updated Changelog
Signed-off-by: Priyansh Choudhary <im1706@gmail.com>
priyansh17 added a commit to priyansh17/velero that referenced this pull request Apr 16, 2026
…ion (velero-io#9700)

* Add delay to avoid race conditions during VolumeSnapshotContent deletion
Signed-off-by: Priyansh Choudhary <im1706@gmail.com>

* updated changelog
Signed-off-by: Priyansh Choudhary <im1706@gmail.com>

* Updated Changelog
Signed-off-by: Priyansh Choudhary <im1706@gmail.com>
priyansh17 added a commit to priyansh17/velero that referenced this pull request Apr 16, 2026
…ion (velero-io#9700)

* Add delay to avoid race conditions during VolumeSnapshotContent deletion
Signed-off-by: Priyansh Choudhary <im1706@gmail.com>

* updated changelog
Signed-off-by: Priyansh Choudhary <im1706@gmail.com>

* Updated Changelog
Signed-off-by: Priyansh Choudhary <im1706@gmail.com>
blackpiglet pushed a commit to blackpiglet/velero that referenced this pull request Apr 16, 2026
* perf: better string concatenation

Signed-off-by: emirot <emirot.nolan@gmail.com>
Signed-off-by: nolanemirot <nolan.emirot@broadcom.com>
Signed-off-by: emirot <emirot.nolan@gmail.com>

* fix: backup deletion silently succeeds when tarball download fails (velero-io#9693)

* Enhance backup deletion logic to handle tarball download failures and clean up associated CSI VolumeSnapshotContents
Signed-off-by: Priyansh Choudhary <im1706@gmail.com>

* added changelog
Signed-off-by: Priyansh Choudhary <im1706@gmail.com>

* Refactor error handling in backup deletion
Signed-off-by: Priyansh Choudhary <im1706@gmail.com>

* Refactor backup deletion logic to skip CSI snapshot cleanup on tarball download failure
Signed-off-by: Priyansh Choudhary <im1706@gmail.com>

* prevent backup deletion when errors occur
Signed-off-by: Priyansh Choudhary <im1706@gmail.com>

* added logger
Signed-off-by: Priyansh Choudhary <im1706@gmail.com>
Signed-off-by: emirot <emirot.nolan@gmail.com>

* perf: better string concatenation

Signed-off-by: emirot <emirot.nolan@gmail.com>

* Add delay to avoid race conditions during VolumeSnapshotContent deletion (velero-io#9700)

* Add delay to avoid race conditions during VolumeSnapshotContent deletion
Signed-off-by: Priyansh Choudhary <im1706@gmail.com>

* updated changelog
Signed-off-by: Priyansh Choudhary <im1706@gmail.com>

* Updated Changelog
Signed-off-by: Priyansh Choudhary <im1706@gmail.com>
Signed-off-by: emirot <emirot.nolan@gmail.com>

* block data mover design

Signed-off-by: Lyndon-Li <lyonghui@vmware.com>
Signed-off-by: emirot <emirot.nolan@gmail.com>

* block data mover design

Signed-off-by: Lyndon-Li <lyonghui@vmware.com>
Signed-off-by: emirot <emirot.nolan@gmail.com>

* irregular volume size

Signed-off-by: Lyndon-Li <lyonghui@vmware.com>
Signed-off-by: emirot <emirot.nolan@gmail.com>

* block data mover design

Signed-off-by: Lyndon-Li <lyonghui@vmware.com>
Signed-off-by: emirot <emirot.nolan@gmail.com>

* Update the "community" page of website (velero-io#9722)

Update the community page to add the correct links to community meeting
and meeting notes.
I also removed the referece of google group as I confirmed the last
message was sent 2 years ago.

Signed-off-by: Daniel Jiang <daniel.jiang@broadcom.com>
Signed-off-by: emirot <emirot.nolan@gmail.com>

* perf: better string concatenation

Signed-off-by: emirot <emirot.nolan@gmail.com>

---------

Signed-off-by: emirot <emirot.nolan@gmail.com>
Signed-off-by: nolanemirot <nolan.emirot@broadcom.com>
Signed-off-by: Priyansh Choudhary <im1706@gmail.com>
Signed-off-by: Lyndon-Li <lyonghui@vmware.com>
Signed-off-by: Daniel Jiang <daniel.jiang@broadcom.com>
Co-authored-by: Priyansh Choudhary <im1706@gmail.com>
Co-authored-by: nolanemirot <nolan.emirot@broadcom.com>
Co-authored-by: Lyndon-Li <lyonghui@vmware.com>
Co-authored-by: Daniel Jiang <daniel.jiang@broadcom.com>
kaovilai pushed a commit that referenced this pull request Apr 16, 2026
* fix: backup deletion silently succeeds when tarball download fails (#9693)

* Enhance backup deletion logic to handle tarball download failures and clean up associated CSI VolumeSnapshotContents
Signed-off-by: Priyansh Choudhary <im1706@gmail.com>

* added changelog
Signed-off-by: Priyansh Choudhary <im1706@gmail.com>

* Refactor error handling in backup deletion
Signed-off-by: Priyansh Choudhary <im1706@gmail.com>

* Refactor backup deletion logic to skip CSI snapshot cleanup on tarball download failure
Signed-off-by: Priyansh Choudhary <im1706@gmail.com>

* prevent backup deletion when errors occur
Signed-off-by: Priyansh Choudhary <im1706@gmail.com>

* added logger
Signed-off-by: Priyansh Choudhary <im1706@gmail.com>

* Add delay to avoid race conditions during VolumeSnapshotContent deletion (#9700)

* Add delay to avoid race conditions during VolumeSnapshotContent deletion
Signed-off-by: Priyansh Choudhary <im1706@gmail.com>

* updated changelog
Signed-off-by: Priyansh Choudhary <im1706@gmail.com>

* Updated Changelog
Signed-off-by: Priyansh Choudhary <im1706@gmail.com>

* Replace t.context() with context.TODO() for older go versions
Signed-off-by: Priyansh Choudhary <im1706@gmail.com>

* Updated changelog
Signed-off-by: Priyansh Choudhary <im1706@gmail.com>

---------

Signed-off-by: Priyansh Choudhary <im1706@gmail.com>
kaovilai pushed a commit that referenced this pull request Apr 16, 2026
* fix: backup deletion silently succeeds when tarball download fails (#9693)

* Enhance backup deletion logic to handle tarball download failures and clean up associated CSI VolumeSnapshotContents
Signed-off-by: Priyansh Choudhary <im1706@gmail.com>

* added changelog
Signed-off-by: Priyansh Choudhary <im1706@gmail.com>

* Refactor error handling in backup deletion
Signed-off-by: Priyansh Choudhary <im1706@gmail.com>

* Refactor backup deletion logic to skip CSI snapshot cleanup on tarball download failure
Signed-off-by: Priyansh Choudhary <im1706@gmail.com>

* prevent backup deletion when errors occur
Signed-off-by: Priyansh Choudhary <im1706@gmail.com>

* added logger
Signed-off-by: Priyansh Choudhary <im1706@gmail.com>

* Add delay to avoid race conditions during VolumeSnapshotContent deletion (#9700)

* Add delay to avoid race conditions during VolumeSnapshotContent deletion
Signed-off-by: Priyansh Choudhary <im1706@gmail.com>

* updated changelog
Signed-off-by: Priyansh Choudhary <im1706@gmail.com>

* Updated Changelog
Signed-off-by: Priyansh Choudhary <im1706@gmail.com>

* Updated changelog
Signed-off-by: Priyansh Choudhary <im1706@gmail.com>

---------

Signed-off-by: Priyansh Choudhary <im1706@gmail.com>
priyansh17 added a commit to priyansh17/velero that referenced this pull request Apr 23, 2026
* perf: better string concatenation

Signed-off-by: emirot <emirot.nolan@gmail.com>
Signed-off-by: nolanemirot <nolan.emirot@broadcom.com>
Signed-off-by: emirot <emirot.nolan@gmail.com>

* fix: backup deletion silently succeeds when tarball download fails (velero-io#9693)

* Enhance backup deletion logic to handle tarball download failures and clean up associated CSI VolumeSnapshotContents
Signed-off-by: Priyansh Choudhary <im1706@gmail.com>

* added changelog
Signed-off-by: Priyansh Choudhary <im1706@gmail.com>

* Refactor error handling in backup deletion
Signed-off-by: Priyansh Choudhary <im1706@gmail.com>

* Refactor backup deletion logic to skip CSI snapshot cleanup on tarball download failure
Signed-off-by: Priyansh Choudhary <im1706@gmail.com>

* prevent backup deletion when errors occur
Signed-off-by: Priyansh Choudhary <im1706@gmail.com>

* added logger
Signed-off-by: Priyansh Choudhary <im1706@gmail.com>
Signed-off-by: emirot <emirot.nolan@gmail.com>

* perf: better string concatenation

Signed-off-by: emirot <emirot.nolan@gmail.com>

* Add delay to avoid race conditions during VolumeSnapshotContent deletion (velero-io#9700)

* Add delay to avoid race conditions during VolumeSnapshotContent deletion
Signed-off-by: Priyansh Choudhary <im1706@gmail.com>

* updated changelog
Signed-off-by: Priyansh Choudhary <im1706@gmail.com>

* Updated Changelog
Signed-off-by: Priyansh Choudhary <im1706@gmail.com>
Signed-off-by: emirot <emirot.nolan@gmail.com>

* block data mover design

Signed-off-by: Lyndon-Li <lyonghui@vmware.com>
Signed-off-by: emirot <emirot.nolan@gmail.com>

* block data mover design

Signed-off-by: Lyndon-Li <lyonghui@vmware.com>
Signed-off-by: emirot <emirot.nolan@gmail.com>

* irregular volume size

Signed-off-by: Lyndon-Li <lyonghui@vmware.com>
Signed-off-by: emirot <emirot.nolan@gmail.com>

* block data mover design

Signed-off-by: Lyndon-Li <lyonghui@vmware.com>
Signed-off-by: emirot <emirot.nolan@gmail.com>

* Update the "community" page of website (velero-io#9722)

Update the community page to add the correct links to community meeting
and meeting notes.
I also removed the referece of google group as I confirmed the last
message was sent 2 years ago.

Signed-off-by: Daniel Jiang <daniel.jiang@broadcom.com>
Signed-off-by: emirot <emirot.nolan@gmail.com>

* perf: better string concatenation

Signed-off-by: emirot <emirot.nolan@gmail.com>

---------

Signed-off-by: emirot <emirot.nolan@gmail.com>
Signed-off-by: nolanemirot <nolan.emirot@broadcom.com>
Signed-off-by: Priyansh Choudhary <im1706@gmail.com>
Signed-off-by: Lyndon-Li <lyonghui@vmware.com>
Signed-off-by: Daniel Jiang <daniel.jiang@broadcom.com>
Co-authored-by: Priyansh Choudhary <im1706@gmail.com>
Co-authored-by: nolanemirot <nolan.emirot@broadcom.com>
Co-authored-by: Lyndon-Li <lyonghui@vmware.com>
Co-authored-by: Daniel Jiang <daniel.jiang@broadcom.com>
kaovilai pushed a commit that referenced this pull request Apr 23, 2026
* fix: backup deletion silently succeeds when tarball download fails (#9693)

* Enhance backup deletion logic to handle tarball download failures and clean up associated CSI VolumeSnapshotContents
Signed-off-by: Priyansh Choudhary <im1706@gmail.com>

* added changelog
Signed-off-by: Priyansh Choudhary <im1706@gmail.com>

* Refactor error handling in backup deletion
Signed-off-by: Priyansh Choudhary <im1706@gmail.com>

* Refactor backup deletion logic to skip CSI snapshot cleanup on tarball download failure
Signed-off-by: Priyansh Choudhary <im1706@gmail.com>

* prevent backup deletion when errors occur
Signed-off-by: Priyansh Choudhary <im1706@gmail.com>

* added logger
Signed-off-by: Priyansh Choudhary <im1706@gmail.com>

* Add delay to avoid race conditions during VolumeSnapshotContent deletion (#9700)

* Add delay to avoid race conditions during VolumeSnapshotContent deletion
Signed-off-by: Priyansh Choudhary <im1706@gmail.com>

* updated changelog
Signed-off-by: Priyansh Choudhary <im1706@gmail.com>

* Updated Changelog
Signed-off-by: Priyansh Choudhary <im1706@gmail.com>

* Update changelog
Signed-off-by: Priyansh Choudhary <im1706@gmail.com>

* fix: correct typo in comment regarding excluded volumes in TestGetVolumesByPod
Signed-off-by: Priyansh Choudhary <im1706@gmail.com>

* Pin the sigs.k8s.io/controller-runtime to v0.23.2

The tag used to latest. Due to latest tag v0.23.3 already used
Golang v1.26, Velero main still uses v1.25. Build failed.
To fix this, pin the controller-runtime to v0.23.2

Signed-off-by: Xun Jiang <xun.jiang@broadcom.com>
Signed-off-by: Priyansh Choudhary <im1706@gmail.com>

* fix: update setup-envtest version in Dockerfile
Signed-off-by: Priyansh Choudhary <im1706@gmail.com>

---------

Signed-off-by: Priyansh Choudhary <im1706@gmail.com>
Signed-off-by: Xun Jiang <xun.jiang@broadcom.com>
Co-authored-by: Xun Jiang <xun.jiang@broadcom.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Temp VolumeSnapshotContent objects stuck in deletion during backup cleanup CSI external-snapshotter informer race on versions < 1.33

5 participants