-
Notifications
You must be signed in to change notification settings - Fork 1.4k
Mitigate rocksdb external timeout in AtomicBackupToDBCorrectness test #12187
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Result of foundationdb-pr-clang-ide on Linux RHEL 9
|
Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x
|
Result of foundationdb-pr-clang-arm on Linux CentOS 7
|
Result of foundationdb-pr-macos on macOS Ventura 13.x
|
Result of foundationdb-pr-cluster-tests on Linux RHEL 9
|
fdbserver/storageserver.actor.cpp
Outdated
if ((!unlimitedCommitBytes && bytesLeft <= 0) || clearRangesLeft <= 0) | ||
if (!unlimitedCommitBytes && (bytesLeft <= 0 || clearRangesLeft <= 0)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this related to the timeout issue?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a bug fix, where the bug presents given the fix of timeout issue
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even in the case of unlimitedCommitBytes is enabled, I did not want to commit, if the commit had 0 clearRanges left. I did not get the intention here of changing this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Without the fix, this assertion is broken: ASSERT(newOldestVersion == data->pendingAddRanges.begin()->first);
Looks like this broken stuff is ShardedRocksDB specific.
@yao-xiao-github Can you help to explain a bit about intention of this fix? Thanks!
If I understand correctly, a SS has to do makeVersionMutationsDurable
if the SS has pendingAddRanges
and the version is at most desiredVersion to make sure the private mutation associated with the
addRange
are committed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@neethuhaneesha Maybe you want to always set unlimitedCommitBytes=false
before commit?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think clearRanges
is a commit because a range clear mutation is reducing the bytesLeft
. So, unlimitedCommitBytes
should disable clearRangesLeft
according to the variable name.
unlimitedCommitBytes is set to true only when pendingAddRanges is not empty, which is ShardedRocksDB specific. So, the unlimitedCommitBytes is always false for RocksDB. @neethuhaneesha
Is pendingAddRanges always empty for non-ShardedRocksDB engine? @yao-xiao-github
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like unlimitedCommitBytes is true only if you have pendingAddRanges(shared-roksdb only). This should be fine. Discussed offline and decided to add some trace events.
Result of foundationdb-pr on Linux RHEL 9
|
Result of foundationdb-pr-clang on Linux RHEL 9
|
Result of foundationdb-pr-clang-ide on Linux RHEL 9
|
Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x
|
Result of foundationdb-pr-clang-arm on Linux CentOS 7
|
Result of foundationdb-pr-clang on Linux RHEL 9
|
Result of foundationdb-pr-macos on macOS Ventura 13.x
|
Result of foundationdb-pr-cluster-tests on Linux RHEL 9
|
Result of foundationdb-pr on Linux RHEL 9
|
if (!unlimitedCommitBytes && (bytesLeft <= 0 || clearRangesLeft <= 0)) | ||
return true; | ||
|
||
if (clearRangesLeft <= 0 && verbose) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We do not want to add trace event overhead in this busy code in general. So, we do this verbose tracing for rocksdb only.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's add a counter to count # of commits split caused by insufficient clearRangesLeft. I suspect this number is really low in prod.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will add this counter to the next PR.
Result of foundationdb-pr-clang-ide on Linux RHEL 9
|
Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x
|
Result of foundationdb-pr-clang-arm on Linux CentOS 7
|
Result of foundationdb-pr-cluster-tests on Linux RHEL 9
|
Result of foundationdb-pr-macos on macOS Ventura 13.x
|
Result of foundationdb-pr on Linux RHEL 9
|
Result of foundationdb-pr-clang on Linux RHEL 9
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Thanks
Please cherry-pick to release-7.3 branch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for fixing this
…apple#12187) * fix rocksdb external timeout * address comments
100K correctness:
20250610-164440-zhewang-032e866a85878b19 compressed=True data_size=41255833 duration=5387342 ended=100000 fail_fast=10 max_runs=100000 pass=100000 priority=100 remaining=0 runtime=3:29:27 sanity=False started=100000 stopped=20250610-201407 submitted=20250610-164440 timeout=5400 username=zhewang
100K AtomicBackupToDBCorrectness:
20250610-212947-zhewang-6234e861a99c8ee9 compressed=True data_size=41296380 duration=265406 ended=2401 fail_fast=10 max_runs=100000 pass=2401 priority=100 remaining=1 day, 14:10:35 runtime=0:56:21 sanity=False started=2576 submitted=20250610-212947 timeout=5400 username=zhewang
Code-Reviewer Section
The general pull request guidelines can be found here.
Please check each of the following things and check all boxes before accepting a PR.
For Release-Branches
If this PR is made against a release-branch, please also check the following:
release-branch
ormain
if this is the youngest branch)