You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
This PR fixes a storage-server recovery bug where an in-flight update actor could keep reading from a stale tlog peek cursor after dbInfoChange installed a new logCursor. Previously a storage server could remain alive but permanently behind, so later private mutations like serverKeys assignments were never applied and data moves could hang forever in FinishMoveShards.
The fix adds explicit cursor invalidation on dbInfoChange and teaches update to exit cleanly if its captured cursor is no longer current, so the next update run always binds to the new recovery-generation cursor.
This bug was detected in CI for an unrelated PR (#12799):
Commit: 86eafa66d3edc20684599d7f916c9c9b8b7c4f0d
Test: tests/slow/StorefrontTest.toml
Seed: 1933528853
Buggify: Enabled
Code-Reviewer Section
The general pull request guidelines can be found here.
Please check each of the following things and check all boxes before accepting a PR.
The PR has a description, explaining both the problem and the solution.
The description mentions which forms of testing were done and the testing seems reasonable.
Every function/class/actor that was touched is reasonably well documented.
For Release-Branches
If this PR is made against a release-branch, please also check the following:
This change/bugfix is a cherry-pick from the next younger branch (younger release-branch or main if this is the youngest branch)
There is a good reason why this PR needs to go into a release branch and this reason is documented (either in the description above or in a linked GitHub issue)
It looks like it's been present since the repo's inception, but maybe made more likely by recent changes. The bug is rare, it requires the following race:
update starts, finishing all ILogSystem::IPeekCursor::getMore calls
While this same update coroutine is running, a cluster recovery causes StorageServer::logCursor to be initialized at at v = version.get() + 1
update advances cloneCursor2's version to v2, which is greater than v
update sets version to v2 - 1, but this is after the new logCursor has already been created
data->logCursor->advanceTo(cloneCursor2->version()) advances the new logCursor from version v to v2, skipping over mutations from the new logCursor
The old comment's warning:
If update() is waiting for results from the tlog, it might never get them, so needs to be cancelled. But if it is waiting later, cancelling it could cause problems (e.g. fetchKeys that already committed to transitioning to waiting state)
is valid, but the old implementation was too coarse, because it allowed even the ILogSystem::IPeekCursor::advanceTo call to proceed as long as the update coroutine received any mutations (even if logCursor had been replaced).
Build Workspace zip file of the working directory (available for 30 days)
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
This PR fixes a storage-server recovery bug where an in-flight
updateactor could keep reading from a stale tlog peek cursor afterdbInfoChangeinstalled a newlogCursor. Previously a storage server could remain alive but permanently behind, so later private mutations like serverKeys assignments were never applied and data moves could hang forever inFinishMoveShards.The fix adds explicit cursor invalidation on
dbInfoChangeand teachesupdateto exit cleanly if its captured cursor is no longer current, so the nextupdaterun always binds to the new recovery-generation cursor.This bug was detected in CI for an unrelated PR (#12799):
86eafa66d3edc20684599d7f916c9c9b8b7c4f0dtests/slow/StorefrontTest.toml1933528853Code-Reviewer Section
The general pull request guidelines can be found here.
Please check each of the following things and check all boxes before accepting a PR.
For Release-Branches
If this PR is made against a release-branch, please also check the following:
release-branchormainif this is the youngest branch)