Skip to content

fix: do not restore PV verbatim when volume data was skipped via VolumePolicy#9588

Open
mateenali66 wants to merge 5 commits intovelero-io:mainfrom
mateenali66:fix/skip-volume-policy-pv-restore
Open

fix: do not restore PV verbatim when volume data was skipped via VolumePolicy#9588
mateenali66 wants to merge 5 commits intovelero-io:mainfrom
mateenali66:fix/skip-volume-policy-pv-restore

Conversation

@mateenali66
Copy link
Copy Markdown

Summary

Fixes #9318

When a VolumePolicy action=skip is used during backup, the volume data is intentionally not backed up. This PR adds defense-in-depth to prevent restoring the original PV identity (VolumeHandle) when there's no snapshot data to back it:

  • pv_restorer.go: When snapshotInfo == nil and the PV has a Delete reclaim policy, returns errPVNeedsReprovisioning instead of the PV object. This prevents restoring a PV whose underlying storage may no longer exist.
  • restore.go: Both the new (BackupVolumeInfo) and legacy snapshot paths in handlePVHasNativeSnapshot now catch errPVNeedsReprovisioning and trigger dynamic re-provisioning via pvsToProvision.
  • handleSkippedPVHasRetainPolicy: Added a warning log about cross-cluster restore risks when restoring a PV with its original volume identity. The Retain policy case preserves existing behavior (restoring as-is) since changing it would be a breaking change.

Why this is safe

  • The Delete reclaim policy + no snapshot case was already handled correctly in the main restore flow (restoreItem). This change adds a safety net in executePVAction for defense-in-depth.
  • The Retain policy behavior is unchanged - only a warning is added. Per maintainer feedback, this is acceptable for now.
  • Existing tests all pass, and new test cases cover both Delete and Retain reclaim policy scenarios.

Test plan

  • Added test: no snapshot and Delete reclaim policy: return errPVNeedsReprovisioning
  • Added test: no snapshot and Retain reclaim policy: return PV as-is
  • All existing TestExecutePVAction_NoSnapshotRestores tests pass
  • All existing TestExecutePVAction_SnapshotRestores tests pass
  • Package compiles without errors

@github-actions github-actions Bot requested review from reasonerjt and sseago March 8, 2026 18:03
@mateenali66 mateenali66 force-pushed the fix/skip-volume-policy-pv-restore branch from 1423c20 to 3e1a0b6 Compare March 8, 2026 18:13
@Lyndon-Li
Copy link
Copy Markdown
Contributor

From the fist glance of #9318, I regard it as an expected behavior:

  • action=skip means the volume backup won't go with either CSI snapshot or fs-backup
  • On the other hand, Velero has another backup method --- native snapshot which is actually the default backup method
  • So in this case, the backup fallbacks to native snapshot
  • And during restore, the result matches to the behavior of native snapshot

So the point is, action=skip is not skip backing up the volume.

@mateenali66
Copy link
Copy Markdown
Author

Thanks @Lyndon-Li — that's a helpful clarification. If action=skip is a fallback to native snapshot rather than a full "skip all backup methods" directive, then the restore behavior is working as designed and this PR is based on a misread of the semantics.

Closing this. The original issue (#9318) may be worth revisiting from a documentation or UX angle — it seems the reporters expected skip to mean "don't backup data at all", which suggests the distinction between skip-CSI/fs-backup vs. skip-all isn't obvious. But that's a separate conversation for the issue thread.

Thanks for the quick review.

@mateenali66 mateenali66 closed this Mar 9, 2026
@Lyndon-Li Lyndon-Li reopened this Mar 9, 2026
@Lyndon-Li
Copy link
Copy Markdown
Contributor

@mateenali66
Sorry, my mistake. The volume policy should include native snapshot, so for action=skip it should skip any volume backup, including volume snapshot.
And the current behavior is like this because volume policy only defines the behavior of volume backup, so even though no volume data is backed up, the PVC/PV resource are still backed up. As a result, during restore, they are restored as is.

I reopened your PR.

@mateenali66 mateenali66 reopened this Mar 14, 2026
@mateenali66
Copy link
Copy Markdown
Author

Hi, wanted to follow up on this. @Lyndon-Li confirmed in the comments that the current behavior is a bug and that action=skip should skip all volume backup methods including native snapshot. The fix addresses exactly that. Would appreciate a review when you get a chance. Happy to add tests or adjust the approach if needed.

@kaovilai
Copy link
Copy Markdown
Collaborator

Hi, fyi that some maintainers are at Kubecon EU this week.

@codecov
Copy link
Copy Markdown

codecov Bot commented Mar 26, 2026

Codecov Report

❌ Patch coverage is 77.77778% with 4 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
pkg/restore/restore.go 73.33% 4 Missing ⚠️

📢 Thoughts on this report? Let us know!

@blackpiglet
Copy link
Copy Markdown
Contributor

@mateenali66
Please create a changelog file.
https://velero.io/docs/main/code-standards/#adding-a-changelog

mateenali66 added a commit to mateenali66/velero that referenced this pull request Mar 26, 2026
…og entry

- Add TestRestorePVWithVolumeInfo case: NativeSnapshot PV with Delete
  reclaim policy and no snapshot data triggers dynamic re-provisioning
- Add TestHandlePVHasNativeSnapshot_PropagatesErrPVNeedsReprovisioning
  to cover the errPVNeedsReprovisioning propagation in handlePVHasNativeSnapshot
- Add changelog entry for PR velero-io#9588

Signed-off-by: Mateen Anjum <mateenali66@gmail.com>
@blackpiglet blackpiglet force-pushed the fix/skip-volume-policy-pv-restore branch from 61851eb to e0e4fc1 Compare April 1, 2026 13:51
blackpiglet pushed a commit to mateenali66/velero that referenced this pull request Apr 1, 2026
…og entry

- Add TestRestorePVWithVolumeInfo case: NativeSnapshot PV with Delete
  reclaim policy and no snapshot data triggers dynamic re-provisioning
- Add TestHandlePVHasNativeSnapshot_PropagatesErrPVNeedsReprovisioning
  to cover the errPVNeedsReprovisioning propagation in handlePVHasNativeSnapshot
- Add changelog entry for PR velero-io#9588

Signed-off-by: Mateen Anjum <mateenali66@gmail.com>
@blackpiglet
Copy link
Copy Markdown
Contributor

@mateenali66
Change looks good, but CI action had an error. Could you help take a look?

blackpiglet pushed a commit to mateenali66/velero that referenced this pull request Apr 2, 2026
…og entry

- Add TestRestorePVWithVolumeInfo case: NativeSnapshot PV with Delete
  reclaim policy and no snapshot data triggers dynamic re-provisioning
- Add TestHandlePVHasNativeSnapshot_PropagatesErrPVNeedsReprovisioning
  to cover the errPVNeedsReprovisioning propagation in handlePVHasNativeSnapshot
- Add changelog entry for PR velero-io#9588

Signed-off-by: Mateen Anjum <mateenali66@gmail.com>
@blackpiglet blackpiglet force-pushed the fix/skip-volume-policy-pv-restore branch from e0e4fc1 to 26eb929 Compare April 2, 2026 03:01
@mateenali66
Copy link
Copy Markdown
Author

@kaovilai @blackpiglet gentle reminder, CI is green now, this can be approved/merged whenever you feel easy

…mePolicy

When a VolumePolicy action=skip is used during backup, the volume data
is intentionally not backed up. During restore, Velero was restoring
the original PV identity (same VolumeHandle) which is dangerous:
- With Delete reclaim policy: underlying storage may no longer exist
- In cross-cluster restore: two clusters would share the same storage

This adds defense-in-depth in pv_restorer.go: when no snapshot is found
and the PV has a Delete reclaim policy, return errPVNeedsReprovisioning
to trigger dynamic re-provisioning instead of restoring the PV as-is.
The callers in restore.go (both new and legacy paths) handle this error
by inserting the PV into pvsToProvision.

For Retain reclaim policy PVs without snapshots, a warning is now logged
about the risks of cross-cluster restore scenarios.

Fixes velero-io#9318

Signed-off-by: Mateen Ali Anjum <mateenali66@gmail.com>
Signed-off-by: Mateen Anjum <mateenali66@gmail.com>
…og entry

- Add TestRestorePVWithVolumeInfo case: NativeSnapshot PV with Delete
  reclaim policy and no snapshot data triggers dynamic re-provisioning
- Add TestHandlePVHasNativeSnapshot_PropagatesErrPVNeedsReprovisioning
  to cover the errPVNeedsReprovisioning propagation in handlePVHasNativeSnapshot
- Add changelog entry for PR velero-io#9588

Signed-off-by: Mateen Anjum <mateenali66@gmail.com>
Signed-off-by: Mateen Anjum <mateenali66@gmail.com>
Signed-off-by: Mateen Anjum <mateenali66@gmail.com>
@blackpiglet blackpiglet force-pushed the fix/skip-volume-policy-pv-restore branch from 039038d to dd7dcc1 Compare April 14, 2026 08:11
@kaovilai kaovilai enabled auto-merge (squash) April 14, 2026 19:27
@kaovilai kaovilai disabled auto-merge April 14, 2026 19:27
@kaovilai kaovilai enabled auto-merge (squash) April 14, 2026 19:27
@kaovilai kaovilai disabled auto-merge April 14, 2026 19:27
@kaovilai kaovilai enabled auto-merge (squash) April 14, 2026 19:27
@kaovilai kaovilai disabled auto-merge April 14, 2026 19:27
@kaovilai kaovilai enabled auto-merge (squash) April 14, 2026 19:27
@kaovilai kaovilai disabled auto-merge April 14, 2026 19:27
@kaovilai kaovilai enabled auto-merge (squash) April 14, 2026 19:27
@kaovilai kaovilai disabled auto-merge April 14, 2026 19:27
@kaovilai kaovilai enabled auto-merge (squash) April 14, 2026 19:27
@mateenali66
Copy link
Copy Markdown
Author

@kaovilai when you get a moment, this one's been sitting since @blackpiglet approved on 4/1. would appreciate a second review if you're happy with the approach. CI green, coverage 77.8%, changelog in. fix addresses the PV restore-verbatim bug that @Lyndon-Li confirmed in #9588 (comment).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Behavior of skip volume policy is strange/problematic

4 participants