Re-recruit backup workers to avoid recoveries #12564
The idea is that backup workers are stateless roles that can reconstruct their state after failures, so we don't need to trigger recoveries unnecessarily.
This also removes an unused parameter.
Before the cluster is fully recovered, there are backup workers for the old generations. The APIs for tracking these old generations and updating them are not well defined. To simplify the work, I'll limit the monitoring to the current generation only. Before fully recovered, any backup worker failure still causes another recovery, i.e., the old behavior.
During normal operation, backup workers periodically call saveProgress() to save their progress to the database under the key backupProgressKeyFor(workerID), with a value containing {epoch, version, tag, totalTags}. So for the re-recruited worker, let the new worker take over the old worker's progress and use that as its start version.
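A minimal flow-style sketch of this takeover, assuming the backupProgressKeyFor() key mentioned above; decodeWorkerBackupStatus(), the status struct shape, and the epochBegin fallback are illustrative stand-ins rather than the actual implementation:

```cpp
// Hypothetical sketch: a re-recruited backup worker takes over the failed
// worker's saved progress as its own start version.
ACTOR Future<Version> takeOverStartVersion(Database cx, UID oldWorkerID, Version epochBegin) {
	state Transaction tr(cx);
	loop {
		try {
			Optional<Value> v = wait(tr.get(backupProgressKeyFor(oldWorkerID)));
			if (v.present()) {
				// The old worker had saved progress: resume right after the
				// last version it durably backed up. The decode helper and
				// struct fields here are illustrative stand-ins.
				WorkerBackupStatus status = decodeWorkerBackupStatus(v.get());
				return status.version + 1;
			}
			// No saved progress: start from the beginning of the epoch.
			return epochBegin;
		} catch (Error& e) {
			wait(tr.onError(e));
		}
	}
}
```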
20251119-040547-jzhou-78a15d307baeed3f
20251119-043027-jzhou-94321348b7ca1c02
It was running an infinite loop because there were no backup workers to monitor. Changed so that the function returns Never() in that case. 20251119-050212-jzhou-c3691b1ed4c330a3
When there is a new recovery, let the recovery process handle the initial backup worker recruitment. 20251119-054841-jzhou-1e367b6d83439c39
There could be a new recovery, in which case the worker is popping from a previous epoch, so this assertion doesn't make sense. The pop() is always safe to do because the mutations have already been saved. 20251119-173704-jzhou-31139c54035a3755
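A minimal sketch of that change; monitorWorkerFailures() and the exact signature are illustrative stand-ins, not the actual code:

```cpp
// When the current generation has no backup workers, there is nothing to wait
// on, so return a future that never fires instead of spinning in a loop.
Future<Void> monitorBackupWorkers(std::vector<BackupInterface> const& workers) {
	if (workers.empty()) {
		return Never(); // previously this path looped forever
	}
	return monitorWorkerFailures(workers); // hypothetical monitoring actor
}
```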
20251218-041558-jzhou-bbf8243bea1e71d0
A backup worker could hang around for too long and trigger assertions. 20251218-214722-jzhou-b9547780971cc150
20260105-220908-jzhou-57c53cd7ff2b8c1a compressed=True data_size=34250222 duration=5755330 ended=100000 fail_fast=10 max_runs=100000 pass=100000 priority=100 remaining=0 runtime=0:57:34 sanity=False started=100000 stopped=20260105-230642 submitted=20260105-220908 timeout=5400 username=jzhou
(CI note: one build reported "type has no linkage [-Werror=subobject-linkage]".)
double CC_RERECRUIT_BACKUP_WORKER_TIMEOUT;
bool CC_RERECRUIT_BACKUP_WORKER_ENABLED;
Can you add comments describing the semantics of these knobs as well as the knob above (BACKUP_WORKER_REPLACEMENT_GRACE_PERIOD)?
Will do.
init( CC_RERECRUIT_LOG_ROUTER_TIMEOUT, 5.0 );
init( CC_RERECRUIT_LOG_ROUTER_ENABLED, true ); if ( randomize && BUGGIFY ) CC_RERECRUIT_LOG_ROUTER_ENABLED = deterministicRandom()->coinflip();
init( CC_RERECRUIT_BACKUP_WORKER_TIMEOUT, 5.0 );
init( CC_RERECRUIT_BACKUP_WORKER_ENABLED, true ); if ( randomize && BUGGIFY ) CC_RERECRUIT_BACKUP_WORKER_ENABLED = deterministicRandom()->coinflip();
I think a default of true is OK, but I just wanted to make sure this is an intentional choice.
Yes, it's intentional.
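For reference, a hedged sketch of what the requested knob comments might look like; the wording here is illustrative, not the comments that actually landed:

```cpp
// Illustrative wording only.
double CC_RERECRUIT_BACKUP_WORKER_TIMEOUT; // Seconds the cluster controller waits after detecting a
                                           // backup worker failure before recruiting a replacement.
bool CC_RERECRUIT_BACKUP_WORKER_ENABLED;   // If true, failed backup workers of the current generation
                                           // are re-recruited after fully_recovered instead of
                                           // triggering a recovery.
```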
Converting to draft; will split this into multiple PRs.
Similar to #12558, this PR changes how backup workers are monitored and recruited to avoid recoveries. After the fully_recovered state, a new actor monitorAndRecruitBackupWorkers takes over the monitoring and re-recruitment of backup workers for the current generation. Before the fully_recovered state, the monitoring is still done in the original TagPartitionedLogSystem::onError_internal() actor. Monitoring backup workers for older generations is possible, but complicated by the fact that the number of log routers can be modified via a configuration change. However, this PR should be strictly better in terms of availability: in most cases backup workers are running in the fully_recovered state, and reducing recoveries improves availability.

Details of the implementation:
- During normal operation: backup workers periodically call saveProgress() to save their progress to the database under the key backupProgressKeyFor(workerID), with a value containing {epoch, version, tag, totalTags}.
- When a backup worker fails: the cluster controller re-recruits a replacement instead of triggering a recovery.
- New backup worker starts: it takes over the failed worker's saved progress and uses that as its start version.
- Refactor log router and backup worker re-recruitment so that they share the same monitorAndRecruitWorkerSet() logic for monitoring and re-recruiting after fully_recovered; see the sketch after the test results below.
- This feature is controlled by the knob CC_RERECRUIT_BACKUP_WORKER_ENABLED; the default is on for more test coverage.

20260108-005523-jzhou-2753e9ae122ef17e compressed=True data_size=34255231 duration=7047094 ended=100000 fail_fast=10 max_runs=100000 pass=100000 priority=100 remaining=0 runtime=0:59:39 sanity=False started=100000 stopped=20260108-015502 submitted=20260108-005523 timeout=5400 username=jzhou
20260108-033820-jzhou-0107822af36074ff compressed=True data_size=34255687 duration=5578744 ended=100000 fail_fast=10 max_runs=100000 pass=100000 priority=100 remaining=0 runtime=0:53:28 sanity=False started=100000 stopped=20260108-043148 submitted=20260108-033820 timeout=5400 username=jzho
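A hypothetical outline of the shared monitorAndRecruitWorkerSet() idea referenced above: watch the current-generation worker set, and when a member fails after fully_recovered, recruit a replacement (honoring the re-recruit timeout knob) rather than forcing a recovery. Every helper name here (anyWorkerFailed, recruitReplacement) is an assumption made for illustration, not the actual implementation:

```cpp
ACTOR template <class Interface>
Future<Void> monitorAndRecruitWorkerSet(ClusterControllerData* self,
                                        std::vector<Interface>* workers,
                                        double rerecruitTimeout) {
	loop {
		// Wait until some member of the current-generation set fails
		// (anyWorkerFailed is a hypothetical helper).
		state int failed = wait(anyWorkerFailed(*workers));
		// Give the worker a chance to come back before replacing it.
		wait(delay(rerecruitTimeout));
		// Recruit a replacement instead of triggering a recovery; the new
		// worker resumes from the failed worker's saved progress (see
		// saveProgress() above). recruitReplacement is also hypothetical.
		Interface replacement = wait(recruitReplacement(self, failed));
		(*workers)[failed] = replacement;
	}
}
```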
Code-Reviewer Section
The general pull request guidelines can be found here.
Please check each of the following things and check all boxes before accepting a PR.
For Release-Branches
If this PR is made against a release-branch, please also check the following:
(release-branch or main if this is the youngest branch)