Explicitly invalidate stale tlog peek cursors #12812

Open
tclinkenbeard-oai wants to merge 2 commits into apple:main from tclinkenbeard-oai:dev/tclinkenbeard/invalidate-stale-cursor

Conversation

@tclinkenbeard-oai
Collaborator

This PR fixes a storage-server recovery bug where an in-flight update actor could keep reading from a stale tlog peek cursor after dbInfoChange installed a new logCursor. Previously, a storage server could remain alive but permanently behind, so later private mutations such as serverKeys assignments were never applied, and data moves could hang forever in FinishMoveShards.

The fix adds explicit cursor invalidation on dbInfoChange and teaches update to exit cleanly if its captured cursor is no longer current, so the next update run always binds to the new recovery-generation cursor.
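The pattern described above can be illustrated with a minimal, self-contained sketch. This is not the code in this PR and does not use FoundationDB's Flow or ILogSystem types; `ToyCursor`, `ToyStorageServer`, `onDbInfoChange`, and `finishUpdate` are hypothetical names chosen for illustration. It assumes the stale check amounts to "the cursor this run captured is either marked invalid or is no longer the server's current cursor".

```cpp
#include <cassert>
#include <cstdint>
#include <memory>

// Hypothetical stand-in for a tlog peek cursor (not ILogSystem::IPeekCursor).
struct ToyCursor {
    int64_t version;          // next version this cursor would deliver
    bool invalidated = false; // set when a recovery replaces this cursor
    explicit ToyCursor(int64_t v) : version(v) {}
};

// Hypothetical stand-in for the storage server state touched by update.
struct ToyStorageServer {
    int64_t version = 0;                  // last applied version
    std::shared_ptr<ToyCursor> logCursor; // cursor for the current generation

    // Models dbInfoChange: explicitly invalidate the old cursor, then
    // install a fresh one starting just past the last applied version.
    void onDbInfoChange() {
        if (logCursor) logCursor->invalidated = true;
        logCursor = std::make_shared<ToyCursor>(version + 1);
    }
};

// Models the tail of one update run. `captured` is the cursor the run bound
// to when it started. Returns false, leaving the server untouched, if that
// cursor is stale; otherwise applies the peeked versions and returns true.
bool finishUpdate(ToyStorageServer& ss,
                  const std::shared_ptr<ToyCursor>& captured,
                  int64_t peekedThrough) {
    assert(captured != nullptr);
    if (captured->invalidated || captured != ss.logCursor) {
        return false; // stale: exit so the next run binds to the new cursor
    }
    ss.version = peekedThrough;
    ss.logCursor->version = peekedThrough + 1;
    return true;
}
```

In this sketch, a run whose cursor was replaced mid-flight returns without advancing either the server version or the new cursor, so the following run starts from the new cursor's own position.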

This bug was detected in CI for an unrelated PR (#12799):

  • Commit: 86eafa66d3edc20684599d7f916c9c9b8b7c4f0d
  • Test: tests/slow/StorefrontTest.toml
  • Seed: 1933528853
  • Buggify: Enabled

Code-Reviewer Section

The general pull request guidelines can be found here.

Please check each of the following things and check all boxes before accepting a PR.

  • The PR has a description, explaining both the problem and the solution.
  • The description mentions which forms of testing were done and the testing seems reasonable.
  • Every function/class/actor that was touched is reasonably well documented.

For Release-Branches

If this PR is made against a release-branch, please also check the following:

  • This change/bugfix is a cherry-pick from the next younger branch (younger release-branch or main if this is the youngest branch)
  • There is a good reason why this PR needs to go into a release branch and this reason is documented (either in the description above or in a linked GitHub issue)

@foundationdb-ci
Contributor

Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x

  • Commit ID: 178ac8c
  • Duration 0:11:03
  • Result: ❌ FAILED
  • Error: Error while executing command: ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -i ${HOME}/.ssh_key ec2-user@${MAC_EC2_HOST} /opt/homebrew/bin/bash --login -c ./build_pr_macos.sh. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@gxglass
Contributor

gxglass commented Mar 19, 2026

Can you tell how long this bug has been present?

@foundationdb-ci
Contributor

Result of foundationdb-pr-macos on macOS Ventura 13.x

  • Commit ID: 178ac8c
  • Duration 0:13:04
  • Result: ❌ FAILED
  • Error: Error while executing command: ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -i ${HOME}/.ssh_key ec2-user@${MAC_EC2_HOST} /usr/local/bin/bash --login -c ./build_pr_macos.sh. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-clang-ide on Linux RHEL 9

  • Commit ID: 178ac8c
  • Duration 0:23:54
  • Result: ❌ FAILED
  • Error: Error while executing command: ninja -v -C build_output -j ${NPROC} all. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@gxglass
Contributor

gxglass commented Mar 19, 2026

@sbodagala could you review this PR?

@foundationdb-ci
Contributor

Result of foundationdb-pr-clang on Linux RHEL 9

  • Commit ID: 178ac8c
  • Duration 0:34:16
  • Result: ❌ FAILED
  • Error: Error while executing command: ninja -v -C build_output -j ${NPROC} all packages strip_targets. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-cluster-tests on Linux RHEL 9

  • Commit ID: 178ac8c
  • Duration 0:35:00
  • Result: ❌ FAILED
  • Error: Error while executing command: ninja -v -C build_output -j ${NPROC} all packages strip_targets. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)
  • Cluster Test Logs zip file of the test logs (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-clang-arm on Linux CentOS 7

  • Commit ID: 178ac8c
  • Duration 0:35:31
  • Result: ❌ FAILED
  • Error: Error while executing command: ninja -v -C build_output -j ${NPROC} all packages strip_targets. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr on Linux RHEL 9

  • Commit ID: 178ac8c
  • Duration 0:36:48
  • Result: ❌ FAILED
  • Error: Error while executing command: ninja -v -C build_output -j ${NPROC} all packages strip_targets. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-clang-ide on Linux RHEL 9

  • Commit ID: 3038daf
  • Duration 0:30:52
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@tclinkenbeard-oai
Collaborator Author

tclinkenbeard-oai commented Mar 20, 2026

> Can you tell how long this bug has been present?

It looks like it's been present since the repo's inception, but may have been made more likely by recent changes. The bug is rare; it requires the following race:

  • update starts, finishing all ILogSystem::IPeekCursor::getMore calls
  • While this same update coroutine is running, a cluster recovery causes StorageServer::logCursor to be initialized at v = version.get() + 1
  • update advances cloneCursor2's version to v2, which is greater than v
  • update sets version to v2 - 1, but this is after the new logCursor has already been created
  • data->logCursor->advanceTo(cloneCursor2->version()) advances the new logCursor from version v to v2, skipping over the new logCursor's mutations between v and v2
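The last step of the race can be shown with a toy model. This is only a sketch under stated assumptions: `ToyPeekCursor` and `skippedByStaleAdvance` are hypothetical names, the cursor is modeled as a sorted list of mutation versions, and `advanceTo(v)` is assumed to skip every version below `v`. It does not use FoundationDB's actual cursor types.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical cursor over an ascending list of mutation versions.
struct ToyPeekCursor {
    std::vector<int64_t> versions; // versions this cursor can deliver
    std::size_t pos = 0;           // index of the next version to deliver

    explicit ToyPeekCursor(std::vector<int64_t> vs) : versions(std::move(vs)) {
        assert(std::is_sorted(versions.begin(), versions.end()));
    }

    // Assumed semantics: skip every version strictly below `v`.
    void advanceTo(int64_t v) {
        while (pos < versions.size() && versions[pos] < v) ++pos;
    }
};

// Returns the versions the new cursor never delivers when a stale update run
// advances it to v2, the position reached by the old cursor's clone.
std::vector<int64_t> skippedByStaleAdvance(ToyPeekCursor newCursor, int64_t v2) {
    const auto before = static_cast<std::ptrdiff_t>(newCursor.pos);
    newCursor.advanceTo(v2);
    const auto after = static_cast<std::ptrdiff_t>(newCursor.pos);
    return std::vector<int64_t>(newCursor.versions.begin() + before,
                                newCursor.versions.begin() + after);
}
```

For example, if the storage server's version is 100 when recovery installs the new cursor (so v = 101), the new cursor holds mutations at versions 101, 103, 105, and 108, and the stale run reaches v2 = 106, then `skippedByStaleAdvance` reports 101, 103, and 105 as lost in this model.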

The old comment's warning:

> If update() is waiting for results from the tlog, it might never get them, so needs to be cancelled. But if it is waiting later, cancelling it could cause problems (e.g. fetchKeys that already committed to transitioning to waiting state)

is valid, but the old implementation was too coarse, because it allowed even the ILogSystem::IPeekCursor::advanceTo call to proceed as long as the update coroutine received any mutations (even if logCursor had been replaced).

@foundationdb-ci
Contributor

Result of foundationdb-pr-cluster-tests on Linux RHEL 9

  • Commit ID: 3038daf
  • Duration 1:36:16
  • Result: ❌ FAILED
  • Error: Error while executing command: TEST_USERNAME=fdb-pr-${CODEBUILD_BUILD_NUMBER} make -kj -C tests foundationdb-pr-tests. Reason: exit status 2
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)
  • Cluster Test Logs zip file of the test logs (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-clang-arm on Linux CentOS 7

  • Commit ID: 3038daf
  • Duration 3:00:39
  • Result: ❌ FAILED
  • Error: Error while executing command: ctest -j ${NPROC} --no-compress-output -T test --output-on-failure. Reason: exit status 8
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr on Linux RHEL 9

  • Commit ID: 3038daf
  • Duration 3:00:46
  • Result: ❌ FAILED
  • Error: Error while executing command: ctest -j ${NPROC} --no-compress-output -T test --output-on-failure. Reason: exit status 8
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-clang on Linux RHEL 9

  • Commit ID: 3038daf
  • Duration 3:01:01
  • Result: ❌ FAILED
  • Error: Error while executing command: ctest -j ${NPROC} --no-compress-output -T test --output-on-failure. Reason: exit status 8
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)
