Mitigate rocksdb external timeout in AtomicBackupToDBCorrectness test #12187

Merged
merged 2 commits into apple:main
Jun 10, 2025

Conversation

kakaiu
Member

@kakaiu kakaiu commented Jun 9, 2025

100K correctness:
20250610-164440-zhewang-032e866a85878b19 compressed=True data_size=41255833 duration=5387342 ended=100000 fail_fast=10 max_runs=100000 pass=100000 priority=100 remaining=0 runtime=3:29:27 sanity=False started=100000 stopped=20250610-201407 submitted=20250610-164440 timeout=5400 username=zhewang

100K AtomicBackupToDBCorrectness:
20250610-212947-zhewang-6234e861a99c8ee9 compressed=True data_size=41296380 duration=265406 ended=2401 fail_fast=10 max_runs=100000 pass=2401 priority=100 remaining=1 day, 14:10:35 runtime=0:56:21 sanity=False started=2576 submitted=20250610-212947 timeout=5400 username=zhewang

Code-Reviewer Section

The general pull request guidelines can be found here.

Please check each of the following things and check all boxes before accepting a PR.

  • The PR has a description, explaining both the problem and the solution.
  • The description mentions which forms of testing were done and the testing seems reasonable.
  • Every function/class/actor that was touched is reasonably well documented.

For Release-Branches

If this PR is made against a release-branch, please also check the following:

  • This change/bugfix is a cherry-pick from the next younger branch (younger release-branch or main if this is the youngest branch)
  • There is a good reason why this PR needs to go into a release branch and this reason is documented (either in the description above or in a linked GitHub issue)

@foundationdb-ci
Contributor

Result of foundationdb-pr-clang-ide on Linux RHEL 9

  • Commit ID: eb10bcb
  • Duration 0:25:45
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x

  • Commit ID: eb10bcb
  • Duration 0:38:20
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-clang-arm on Linux CentOS 7

  • Commit ID: eb10bcb
  • Duration 0:48:13
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-macos on macOS Ventura 13.x

  • Commit ID: eb10bcb
  • Duration 0:59:22
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-cluster-tests on Linux RHEL 9

  • Commit ID: eb10bcb
  • Duration 1:02:27
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)
  • Cluster Test Logs zip file of the test logs (available for 30 days)

yao-xiao-github
yao-xiao-github previously approved these changes Jun 9, 2025
@kakaiu kakaiu requested a review from neethuhaneesha June 9, 2025 17:47
Comment on lines 13584 to 13585
- if ((!unlimitedCommitBytes && bytesLeft <= 0) || clearRangesLeft <= 0)
+ if (!unlimitedCommitBytes && (bytesLeft <= 0 || clearRangesLeft <= 0))
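The two conditions quoted above (the earlier one first, the merged one second) agree on every input except one case. The following is a minimal sketch that enumerates the inputs to make that case explicit. The helper names `shouldStopBefore`, `shouldStopAfter`, `Case`, and `disagreements` are hypothetical; only the two boolean expressions are taken from the diff.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper names; only the boolean expressions come from the diff.
bool shouldStopBefore(bool unlimitedCommitBytes, int64_t bytesLeft, int64_t clearRangesLeft) {
    return (!unlimitedCommitBytes && bytesLeft <= 0) || clearRangesLeft <= 0;
}

bool shouldStopAfter(bool unlimitedCommitBytes, int64_t bytesLeft, int64_t clearRangesLeft) {
    return !unlimitedCommitBytes && (bytesLeft <= 0 || clearRangesLeft <= 0);
}

struct Case {
    bool unlimitedCommitBytes;
    int64_t bytesLeft;
    int64_t clearRangesLeft;
};

// Try an exhausted (0) and an available (1) budget for both counters and
// collect every input on which the two conditions disagree.
std::vector<Case> disagreements() {
    std::vector<Case> out;
    for (bool unlimited : { false, true }) {
        for (int64_t bytes : { int64_t{ 0 }, int64_t{ 1 } }) {
            for (int64_t clears : { int64_t{ 0 }, int64_t{ 1 } }) {
                const bool before = shouldStopBefore(unlimited, bytes, clears);
                const bool after = shouldStopAfter(unlimited, bytes, clears);
                if (before != after) {
                    // The only way to disagree: unlimited commit bytes with no
                    // clear ranges left. The earlier condition stops, the merged one continues.
                    assert(unlimited && clears <= 0 && before && !after);
                    out.push_back({ unlimited, bytes, clears });
                }
            }
        }
    }
    return out;
}
```

Calling `disagreements()` returns only inputs where `unlimitedCommitBytes` is true and `clearRangesLeft` is exhausted, which is the case the rest of this thread discusses.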
Collaborator

Is this related to the timeout issue?

Member Author

@kakaiu kakaiu Jun 9, 2025

This is a bug fix. The bug surfaces once the timeout fix is applied.

Contributor

@neethuhaneesha neethuhaneesha Jun 9, 2025

Even when unlimitedCommitBytes is enabled, I did not want to commit if the commit had 0 clearRanges left. I do not understand the intention behind this change.

Member Author

@kakaiu kakaiu Jun 9, 2025

Without the fix, this assertion fails: ASSERT(newOldestVersion == data->pendingAddRanges.begin()->first); The failure looks ShardedRocksDB-specific.
@yao-xiao-github Can you help explain the intention of this fix? Thanks!

If I understand correctly, an SS has to run makeVersionMutationsDurable if it has pendingAddRanges and the version is at most desiredVersion, to make sure that the private mutations associated with the addRange are committed.

Contributor

@neethuhaneesha Maybe you want to always set unlimitedCommitBytes=false before commit?

Member Author

@kakaiu kakaiu Jun 9, 2025

I think clearRangesLeft is part of the commit budget, because a range-clear mutation also reduces bytesLeft. So, going by the variable name, unlimitedCommitBytes should disable the clearRangesLeft check as well.

unlimitedCommitBytes is set to true only when pendingAddRanges is not empty, which is ShardedRocksDB-specific. So unlimitedCommitBytes is always false for RocksDB. @neethuhaneesha

Is pendingAddRanges always empty for non-ShardedRocksDB engine? @yao-xiao-github

Contributor

Looks like unlimitedCommitBytes is true only if you have pendingAddRanges (ShardedRocksDB only). This should be fine. We discussed offline and decided to add some trace events.
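The assertion mentioned earlier in this thread can be illustrated with a toy model. This is only a sketch under stated assumptions, not the storage server code: `ToyStorage`, `Predicate`, `budgetExhausted`, and `advance` are hypothetical names, the mutation log is reduced to a count of range clears per version, and the per-clear byte cost is arbitrary. It shows how stopping on `clearRangesLeft <= 0` while `unlimitedCommitBytes` is true can leave the durable version short of the first pending add-range version.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

using Version = int64_t;

// Toy stand-in for a mutation log: version -> number of range-clear mutations.
struct ToyStorage {
    std::map<Version, int> clearsPerVersion;
};

enum class Predicate { Before, After };

// The two stop conditions from the diff, selected by `p`.
bool budgetExhausted(Predicate p, bool unlimitedCommitBytes, int64_t bytesLeft, int64_t clearRangesLeft) {
    if (p == Predicate::Before)
        return (!unlimitedCommitBytes && bytesLeft <= 0) || clearRangesLeft <= 0;
    return !unlimitedCommitBytes && (bytesLeft <= 0 || clearRangesLeft <= 0);
}

// Advance from `oldest` toward `desiredVersion`, one logged version at a time,
// stopping early when the chosen condition reports an exhausted budget.
Version advance(const ToyStorage& s, Predicate p, Version oldest, Version desiredVersion,
                bool unlimitedCommitBytes, int64_t bytesLeft, int64_t clearRangesLeft) {
    assert(oldest <= desiredVersion);
    while (oldest < desiredVersion) {
        if (budgetExhausted(p, unlimitedCommitBytes, bytesLeft, clearRangesLeft))
            break;
        auto it = s.clearsPerVersion.upper_bound(oldest);
        if (it == s.clearsPerVersion.end() || it->first > desiredVersion) {
            oldest = desiredVersion; // nothing left to apply up to the target
            break;
        }
        clearRangesLeft -= it->second;
        bytesLeft -= 100 * it->second; // arbitrary per-clear cost in this toy model
        oldest = it->first;
    }
    return oldest;
}
```

For example, with versions 10, 20, and 30 each carrying two range clears, a first pending add-range version of 30, `unlimitedCommitBytes` set to true, and `clearRangesLeft` starting at 3, `advance` with `Predicate::Before` stops at version 20, while `Predicate::After` reaches version 30. In this model, only the second outcome would satisfy a check that the new oldest version equals the first pending add-range version.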

@foundationdb-ci
Contributor

Result of foundationdb-pr on Linux RHEL 9

  • Commit ID: eb10bcb
  • Duration 1:40:42
  • Result: ❌ FAILED
  • Error: Error while executing command: if python3 -m joshua.joshua list --stopped | grep ${ENSEMBLE_ID} | grep -q 'pass=10[0-9][0-9][0-9]'; then echo PASS; else echo FAIL && exit 1; fi. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-clang on Linux RHEL 9

  • Commit ID: eb10bcb
  • Duration 1:43:36
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@kakaiu kakaiu requested a review from spraza June 9, 2025 18:39
@kakaiu kakaiu closed this Jun 9, 2025
@kakaiu kakaiu reopened this Jun 9, 2025
@foundationdb-ci
Contributor

Result of foundationdb-pr-clang-ide on Linux RHEL 9

  • Commit ID: eb10bcb
  • Duration 0:26:01
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x

  • Commit ID: eb10bcb
  • Duration 0:38:48
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-clang-arm on Linux CentOS 7

  • Commit ID: eb10bcb
  • Duration 0:48:43
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-clang on Linux RHEL 9

  • Commit ID: eb10bcb
  • Duration 0:50:37
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-macos on macOS Ventura 13.x

  • Commit ID: eb10bcb
  • Duration 0:59:42
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-cluster-tests on Linux RHEL 9

  • Commit ID: eb10bcb
  • Duration 1:00:52
  • Result: ❌ FAILED
  • Error: Error while executing command: TEST_USERNAME=fdb-pr-${CODEBUILD_BUILD_NUMBER} make -kj -C tests foundationdb-pr-tests. Reason: exit status 2
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)
  • Cluster Test Logs zip file of the test logs (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr on Linux RHEL 9

  • Commit ID: eb10bcb
  • Duration 1:09:00
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

if (!unlimitedCommitBytes && (bytesLeft <= 0 || clearRangesLeft <= 0))
    return true;

if (clearRangesLeft <= 0 && verbose) {
Member Author

@kakaiu kakaiu Jun 9, 2025

In general, we do not want to add trace-event overhead to this busy code path, so this verbose tracing is done for RocksDB only.

Contributor

Let's add a counter for the number of commit splits caused by insufficient clearRangesLeft. I suspect this number is really low in prod.

Member Author

@kakaiu kakaiu Jun 9, 2025

I will add this counter to the next PR.
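As a rough illustration of the two ideas in this thread (tracing only when `verbose` is set, and counting commits that stop because of `clearRangesLeft`), here is a minimal sketch. It does not use FoundationDB's own trace or counter types: `CommitBudgetStats`, `clearRangeSplits`, `traceLines`, and `shouldStop` are hypothetical stand-ins, and the counter is a deferred follow-up in the thread, so its real shape is unknown.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-ins for a metrics counter and emitted trace events.
struct CommitBudgetStats {
    int64_t clearRangeSplits = 0;        // stops where clearRangesLeft was exhausted
    std::vector<std::string> traceLines; // stand-in for trace events
};

// Returns true when the commit should stop, using the merged condition.
bool shouldStop(bool unlimitedCommitBytes, int64_t bytesLeft, int64_t clearRangesLeft,
                bool verbose, CommitBudgetStats& stats) {
    if (!unlimitedCommitBytes && (bytesLeft <= 0 || clearRangesLeft <= 0)) {
        if (clearRangesLeft <= 0)
            ++stats.clearRangeSplits; // cheap increment, safe on a hot path
        return true;
    }
    if (clearRangesLeft <= 0 && verbose) {
        // Reaching this point with no clear ranges left implies the
        // unlimited-bytes mode; otherwise the branch above would have returned.
        assert(unlimitedCommitBytes);
        stats.traceLines.push_back("clearRangesLeft exhausted with unlimitedCommitBytes");
    }
    return false;
}
```

In this sketch the counter is incremented whenever `clearRangesLeft` is exhausted at the stop, even if `bytesLeft` is exhausted too. A real counter might attribute the split differently.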

@foundationdb-ci
Contributor

Result of foundationdb-pr-clang-ide on Linux RHEL 9

  • Commit ID: 1c6bbe7
  • Duration 0:24:29
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x

  • Commit ID: 1c6bbe7
  • Duration 0:39:10
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-clang-arm on Linux CentOS 7

  • Commit ID: 1c6bbe7
  • Duration 0:47:56
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-cluster-tests on Linux RHEL 9

  • Commit ID: 1c6bbe7
  • Duration 0:59:37
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)
  • Cluster Test Logs zip file of the test logs (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-macos on macOS Ventura 13.x

  • Commit ID: 1c6bbe7
  • Duration 1:00:40
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr on Linux RHEL 9

  • Commit ID: 1c6bbe7
  • Duration 1:01:56
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-clang on Linux RHEL 9

  • Commit ID: 1c6bbe7
  • Duration 1:02:42
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

Contributor

@neethuhaneesha neethuhaneesha left a comment

LGTM! Thanks

Please cherry-pick to release-7.3 branch.

Collaborator

@spraza spraza left a comment

Thanks for fixing this

@kakaiu kakaiu merged commit 42dd9ef into apple:main Jun 10, 2025
7 checks passed
kakaiu added a commit to kakaiu/foundationdb that referenced this pull request Jun 10, 2025
6 participants