
Conversation

@PhantomInTheWire commented Jan 4, 2026

fixes: #18055

Approach

OpenSnapshotBackend now calls ReserveDBSnapshot to mark the snapshot index as "in-use" before attempting to access the file. Concurrent calls to ReleaseSnapDBs (triggered by incoming newer snapshots) check this reservation map and explicitly skip deletion for any reserved indices. This guarantees that the snapshot file currently being applied is protected from deletion, even if a newer snapshot arrives and triggers a cleanup during the apply process.

This PR implements an in-memory reservation mechanism rather than file locking because:

  1. Both the apply path (OpenSnapshotBackend) and the cleanup path (ReleaseSnapDBs) operate on the same *Snapshotter instance, allowing in-memory coordination.

  2. File locking (flock) behaves differently across platforms (Linux, Windows) and filesystems (NFS), while in-memory coordination is consistent everywhere. (contributing.md mentions only Linux is supported, but the Makefile does include Windows builds.)

  3. OpenSnapshotBackend renames the snapshot file; it is my understanding that file locks do not survive renames on many systems.

  4. The reservation is a simple map lookup protected by RWMutex, with Reserve/Release co-located in one function using defer.

The fix adds a reserved map to track snapshots currently being applied, and ReleaseSnapDBs skips deletion of any reserved snapshots.
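The mechanism described above can be sketched roughly as follows. This is a simplified illustration, not the PR's actual code: the field name `reserved`, the helper names, and the `releaseSnapDBs` signature shown here are assumptions for the sake of the example.

```go
package main

import (
	"fmt"
	"sync"
)

// Snapshotter is a hypothetical stand-in for etcd's *snap.Snapshotter;
// names here are illustrative, not the PR's exact identifiers.
type Snapshotter struct {
	mu       sync.RWMutex
	reserved map[uint64]struct{} // snapshot indices currently being applied
}

func NewSnapshotter() *Snapshotter {
	return &Snapshotter{reserved: make(map[uint64]struct{})}
}

// Reserve marks a snapshot index as in-use so cleanup will skip it.
func (s *Snapshotter) Reserve(index uint64) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.reserved[index] = struct{}{}
}

// Release removes the reservation once the apply is done.
func (s *Snapshotter) Release(index uint64) {
	s.mu.Lock()
	defer s.mu.Unlock()
	delete(s.reserved, index)
}

func (s *Snapshotter) isReserved(index uint64) bool {
	s.mu.RLock()
	defer s.mu.RUnlock()
	_, ok := s.reserved[index]
	return ok
}

// releaseSnapDBs models the cleanup path: it deletes stale snapshot
// indices but skips the newest snapshot and any reserved indices.
// It returns the indices it would actually delete.
func (s *Snapshotter) releaseSnapDBs(indices []uint64, newest uint64) []uint64 {
	var deleted []uint64
	for _, idx := range indices {
		if idx >= newest || s.isReserved(idx) {
			continue // keep the newest snapshot and any reserved ones
		}
		deleted = append(deleted, idx)
	}
	return deleted
}

func main() {
	s := NewSnapshotter()

	// Apply path: reserve index 100 before opening the snapshot file,
	// and release the reservation when the apply is done.
	s.Reserve(100)
	defer s.Release(100)

	// Cleanup path: a newer snapshot (index 200) arrives mid-apply and
	// triggers deletion of older snapshots; index 100 survives.
	deleted := s.releaseSnapDBs([]uint64{50, 100, 200}, 200)
	fmt.Println(deleted)
}
```

The Reserve call before the file access and the deferred Release mirror the "co-located in one function using defer" point above: even if a newer snapshot triggers cleanup while the apply is in flight, the reserved index is never deleted.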

@k8s-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: PhantomInTheWire
Once this PR has been reviewed and has the lgtm label, please assign jmhbnz for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details: Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot

Hi @PhantomInTheWire. Thanks for your PR.

I'm waiting for a github.com member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@serathius (Member) commented Jan 5, 2026

/ok-to-test

Can you provide a description of why you think this PR addresses the issue? The #18055 (comment) describes a different solution based on file locking, not an in-memory reservation. Not saying it's wrong, but it would be good to have some overview of how you assume the snapshotter works, how this problem occurs, and why your approach solves it.

If my understanding is correct, @ahrtr proposed using file locking to prevent deletion of a snapshot because there is no direct communication between the snapshotter's apply loop, which uses snapshots, and the mechanism that cleans up snapshots. The only way they communicate is via the filesystem, for example by locking a file.

@PhantomInTheWire (Author)

/retest

codecov bot commented Jan 5, 2026

Codecov Report

❌ Patch coverage is 77.77778% with 6 lines in your changes missing coverage. Please review.
✅ Project coverage is 68.59%. Comparing base (61d54cb) to head (008aeb2).

Files with missing lines Patch % Lines
server/etcdserver/api/snap/snapshotter.go 77.27% 3 Missing and 2 partials ⚠️
server/storage/backend.go 80.00% 1 Missing ⚠️
Additional details and impacted files
Files with missing lines Coverage Δ
server/storage/backend.go 81.03% <80.00%> (+1.03%) ⬆️
server/etcdserver/api/snap/snapshotter.go 71.34% <77.27%> (+0.99%) ⬆️

... and 24 files with indirect coverage changes

@@            Coverage Diff             @@
##             main   #21082      +/-   ##
==========================================
+ Coverage   68.39%   68.59%   +0.19%     
==========================================
  Files         429      429              
  Lines       35281    35303      +22     
==========================================
+ Hits        24132    24217      +85     
+ Misses       9742     9693      -49     
+ Partials     1407     1393      -14     


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 61d54cb...008aeb2.


@PhantomInTheWire (Author)

Hey @serathius, I've updated the PR description with what you asked for, including why I did not use file locking as suggested by @ahrtr.

@serathius (Member)

@ahrtr can you take a look as author of the proposal? #18055 (comment)

@PhantomInTheWire (Author)

rebased to main.


Development

Successfully merging this pull request may close these issues.

panic when two snapshots are received in short period
