Explicitly invalidate stale tlog peek cursors by tclinkenbeard-oai · Pull Request #12812 · apple/foundationdb

tclinkenbeard-oai · 2026-03-19T23:06:37Z

This PR fixes a storage-server recovery bug where an in-flight update actor could keep reading from a stale tlog peek cursor after dbInfoChange installed a new logCursor. Previously, a storage server could remain alive but permanently behind, so later private mutations such as serverKeys assignments were never applied and data moves could hang forever in FinishMoveShards.

The fix adds explicit cursor invalidation on dbInfoChange and teaches update to exit cleanly if its captured cursor is no longer current, so the next update run always binds to the new recovery-generation cursor.
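As a rough illustration of the guard described above, here is a minimal standalone sketch. It is not the code in this PR and does not use Flow or the real ILogSystem::IPeekCursor interface; PeekCursor, StorageServerModel, UpdateRun, cursorEpoch, beginUpdate and finishUpdate are all hypothetical names invented for the example. The assumption is that the server bumps a counter whenever it replaces logCursor, and that an update run compares what it captured against the current state before touching the cursor.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>

// Hypothetical stand-in for a tlog peek cursor (not ILogSystem::IPeekCursor).
struct PeekCursor {
    int64_t version;
    explicit PeekCursor(int64_t v) : version(v) {}
    // Move forward only; never rewinds.
    void advanceTo(int64_t v) {
        if (v > version) version = v;
    }
};

// Hypothetical, heavily simplified storage server state.
struct StorageServerModel {
    std::shared_ptr<PeekCursor> logCursor;
    uint64_t cursorEpoch = 0; // bumped every time logCursor is replaced

    // Models the dbInfoChange path: install a new cursor and, by bumping
    // the epoch, invalidate whatever an in-flight update captured earlier.
    void onDbInfoChange(int64_t startVersion) {
        logCursor = std::make_shared<PeekCursor>(startVersion);
        ++cursorEpoch;
    }
};

// What one update run remembers about the cursor it started with.
struct UpdateRun {
    std::shared_ptr<PeekCursor> captured;
    uint64_t capturedEpoch;
};

UpdateRun beginUpdate(const StorageServerModel& ss) {
    assert(ss.logCursor != nullptr);
    return UpdateRun{ ss.logCursor, ss.cursorEpoch };
}

// Returns true if the run applied its progress to the current cursor.
// Returns false if the captured cursor is stale; in that case the run
// exits without touching the newly installed cursor.
bool finishUpdate(StorageServerModel& ss, const UpdateRun& run, int64_t reachedVersion) {
    if (run.capturedEpoch != ss.cursorEpoch || run.captured != ss.logCursor) {
        return false;
    }
    ss.logCursor->advanceTo(reachedVersion);
    return true;
}
```

In this model, a run that began before onDbInfoChange returns false from finishUpdate and leaves the new cursor at its starting version, while the next run, which captures the new cursor, proceeds normally.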

This bug was detected in CI for an unrelated PR (#12799):

Commit: 86eafa66d3edc20684599d7f916c9c9b8b7c4f0d
Test: tests/slow/StorefrontTest.toml
Seed: 1933528853
Buggify: Enabled

Code-Reviewer Section

The general pull request guidelines can be found here.

Please check each of the following things and check all boxes before accepting a PR.

The PR has a description, explaining both the problem and the solution.
The description mentions which forms of testing were done and the testing seems reasonable.
Every function/class/actor that was touched is reasonably well documented.

For Release-Branches

If this PR is made against a release-branch, please also check the following:

This change/bugfix is a cherry-pick from the next younger branch (younger release-branch or main if this is the youngest branch)
There is a good reason why this PR needs to go into a release branch and this reason is documented (either in the description above or in a linked GitHub issue)

foundationdb-ci · 2026-03-19T23:17:54Z

Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x

Commit ID: 178ac8c
Duration 0:11:03
Result: ❌ FAILED
Error: Error while executing command: ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -i ${HOME}/.ssh_key ec2-user@${MAC_EC2_HOST} /opt/homebrew/bin/bash --login -c ./build_pr_macos.sh. Reason: exit status 1
Build Log terminal output (available for 30 days)
Build Workspace zip file of the working directory (available for 30 days)

gxglass · 2026-03-19T23:19:30Z

Can you tell how long this bug has been present?

foundationdb-ci · 2026-03-19T23:19:55Z

Result of foundationdb-pr-macos on macOS Ventura 13.x

Commit ID: 178ac8c
Duration 0:13:04
Result: ❌ FAILED
Error: Error while executing command: ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -i ${HOME}/.ssh_key ec2-user@${MAC_EC2_HOST} /usr/local/bin/bash --login -c ./build_pr_macos.sh. Reason: exit status 1
Build Log terminal output (available for 30 days)
Build Workspace zip file of the working directory (available for 30 days)

foundationdb-ci · 2026-03-19T23:30:45Z

Result of foundationdb-pr-clang-ide on Linux RHEL 9

Commit ID: 178ac8c
Duration 0:23:54
Result: ❌ FAILED
Error: Error while executing command: ninja -v -C build_output -j ${NPROC} all. Reason: exit status 1
Build Log terminal output (available for 30 days)
Build Workspace zip file of the working directory (available for 30 days)

gxglass · 2026-03-19T23:39:17Z

@sbodagala could you review this PR?

foundationdb-ci · 2026-03-19T23:41:04Z

Result of foundationdb-pr-clang on Linux RHEL 9

Commit ID: 178ac8c
Duration 0:34:16
Result: ❌ FAILED
Error: Error while executing command: ninja -v -C build_output -j ${NPROC} all packages strip_targets. Reason: exit status 1
Build Log terminal output (available for 30 days)
Build Workspace zip file of the working directory (available for 30 days)

foundationdb-ci · 2026-03-19T23:41:48Z

Result of foundationdb-pr-cluster-tests on Linux RHEL 9

Commit ID: 178ac8c
Duration 0:35:00
Result: ❌ FAILED
Error: Error while executing command: ninja -v -C build_output -j ${NPROC} all packages strip_targets. Reason: exit status 1
Build Log terminal output (available for 30 days)
Build Workspace zip file of the working directory (available for 30 days)
Cluster Test Logs zip file of the test logs (available for 30 days)

foundationdb-ci · 2026-03-19T23:42:20Z

Result of foundationdb-pr-clang-arm on Linux CentOS 7

Commit ID: 178ac8c
Duration 0:35:31
Result: ❌ FAILED
Error: Error while executing command: ninja -v -C build_output -j ${NPROC} all packages strip_targets. Reason: exit status 1
Build Log terminal output (available for 30 days)
Build Workspace zip file of the working directory (available for 30 days)

foundationdb-ci · 2026-03-19T23:43:40Z

Result of foundationdb-pr on Linux RHEL 9

Commit ID: 178ac8c
Duration 0:36:48
Result: ❌ FAILED
Error: Error while executing command: ninja -v -C build_output -j ${NPROC} all packages strip_targets. Reason: exit status 1
Build Log terminal output (available for 30 days)
Build Workspace zip file of the working directory (available for 30 days)

foundationdb-ci · 2026-03-20T00:39:09Z

Result of foundationdb-pr-clang-ide on Linux RHEL 9

Commit ID: 3038daf
Duration 0:30:52
Result: ✅ SUCCEEDED
Error: N/A
Build Log terminal output (available for 30 days)
Build Workspace zip file of the working directory (available for 30 days)

tclinkenbeard-oai · 2026-03-20T01:18:56Z

Can you tell how long this bug has been present?

It looks like it's been present since the repo's inception, but may have been made more likely by recent changes. The bug is rare; it requires the following race:

update starts, finishing all ILogSystem::IPeekCursor::getMore calls
While this same update coroutine is running, a cluster recovery causes StorageServer::logCursor to be initialized at v = version.get() + 1
update advances cloneCursor2's version to v2, which is greater than v
update sets version to v2 - 1, but this is after the new logCursor has already been created
data->logCursor->advanceTo(cloneCursor2->version()) advances the new logCursor from version v to v2, skipping over mutations from the new logCursor
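The version arithmetic in the steps above can be replayed with a small standalone model. This is only a sketch under stated assumptions: ModelCursor and replayRace are hypothetical names, the version numbers (99, 110, 120, 150, 160) are made up for illustration, and the integer cursor ids stand in for whatever identity check the real fix performs.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical model of a peek cursor over one log generation.
struct ModelCursor {
    std::vector<int64_t> mutationVersions; // versions that carry mutations, ascending
    int64_t position;                      // next version the cursor will deliver

    void advanceTo(int64_t v) {
        if (v > position) position = v;
    }

    // Versions this cursor can still deliver from its current position.
    std::vector<int64_t> remaining() const {
        std::vector<int64_t> out;
        for (int64_t mv : mutationVersions) {
            if (mv >= position) out.push_back(mv);
        }
        return out;
    }
};

// Replays the race and returns the mutation versions the new cursor can
// still deliver afterwards. All numbers are illustrative assumptions.
std::vector<int64_t> replayRace(bool checkCursorIsCurrent) {
    int64_t version = 99;

    // Step 1: update starts and finishes its getMore calls on the old cursor.
    int currentCursorId = 1;
    const int capturedCursorId = currentCursorId;

    // Step 2: recovery installs a new logCursor at v = version + 1.
    ModelCursor newCursor{ {110, 120, 160}, version + 1 };
    currentCursorId = 2;

    // Step 3: the cloned old cursor has advanced to v2, which is greater than v.
    const int64_t v2 = 150;
    assert(v2 > newCursor.position);

    // With the guard, a run holding a stale cursor exits here.
    if (checkCursorIsCurrent && capturedCursorId != currentCursorId) {
        return newCursor.remaining();
    }

    // Step 4: update sets version to v2 - 1, after the new cursor exists.
    version = v2 - 1;

    // Step 5: advancing the new cursor to v2 skips versions v through v2 - 1.
    newCursor.advanceTo(v2);
    return newCursor.remaining();
}
```

In this model, replayRace(false) leaves only version 160 deliverable, so the mutations at 110 and 120 are lost, while replayRace(true) keeps all three.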

The old comment's warning:

If update() is waiting for results from the tlog, it might never get them, so needs to be cancelled. But if it is waiting later, cancelling it could cause problems (e.g. fetchKeys that already committed to transitioning to waiting state)

is valid, but the old implementation was too coarse, because it allowed even the ILogSystem::IPeekCursor::advanceTo call to proceed as long as the update coroutine received any mutations (even if logCursor had been replaced).

foundationdb-ci · 2026-03-20T01:44:36Z

Result of foundationdb-pr-cluster-tests on Linux RHEL 9

Commit ID: 3038daf
Duration 1:36:16
Result: ❌ FAILED
Error: Error while executing command: TEST_USERNAME=fdb-pr-${CODEBUILD_BUILD_NUMBER} make -kj -C tests foundationdb-pr-tests. Reason: exit status 2
Build Log terminal output (available for 30 days)
Build Workspace zip file of the working directory (available for 30 days)
Cluster Test Logs zip file of the test logs (available for 30 days)

foundationdb-ci · 2026-03-20T03:08:57Z

Result of foundationdb-pr-clang-arm on Linux CentOS 7

Commit ID: 3038daf
Duration 3:00:39
Result: ❌ FAILED
Error: Error while executing command: ctest -j ${NPROC} --no-compress-output -T test --output-on-failure. Reason: exit status 8
Build Log terminal output (available for 30 days)
Build Workspace zip file of the working directory (available for 30 days)

foundationdb-ci · 2026-03-20T03:09:02Z

Result of foundationdb-pr on Linux RHEL 9

Commit ID: 3038daf
Duration 3:00:46
Result: ❌ FAILED
Error: Error while executing command: ctest -j ${NPROC} --no-compress-output -T test --output-on-failure. Reason: exit status 8
Build Log terminal output (available for 30 days)
Build Workspace zip file of the working directory (available for 30 days)

foundationdb-ci · 2026-03-20T03:09:17Z

Result of foundationdb-pr-clang on Linux RHEL 9

Commit ID: 3038daf
Duration 3:01:01
Result: ❌ FAILED
Error: Error while executing command: ctest -j ${NPROC} --no-compress-output -T test --output-on-failure. Reason: exit status 8
Build Log terminal output (available for 30 days)
Build Workspace zip file of the working directory (available for 30 days)

Explicitly invalidate stale tlog peek cursors

178ac8c

Fix compilation error from cherry-pick

3038daf


3 participants
