lighthouse/quorum: avoid split brain and add shrink_only support #71

Merged
merged 1 commit into main from d4l3k/shrink_only on Jan 15, 2025

Conversation

d4l3k
Member

@d4l3k d4l3k commented Jan 14, 2025

This modifies the quorum behavior with a few major changes:

  1. the quorum won't complete unless more than half of the heartbeating replicas join the quorum (this is to avoid split-brain conditions), except in fast quorum cases when we're growing the quorum (see the sketch after this list)
    Rationale: in scale-up scenarios we could end up in split-brain cases where min_replicas=2 but total replicas = 10, and we end up flip-flopping the quorum between effectively two separate jobs
  2. the join_timeout now only applies if there's a heartbeating replica which hasn't joined the quorum
    Rationale: since heartbeating starts when the manager starts (often before model init), there's not much use in waiting out the full timeout since we probably won't have any more workers join and we already have at least min_replicas
    (TODO: should we remove fast quorum since we can now complete a quorum without waiting for the entire join timeout?)
  3. added shrink_only support, which only allows workers to drop out; no new ones can join
    Rationale: this is intended to support long step times such as those needed in LocalSGD
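A minimal sketch of the completion rule in points 1 and 2 (not the actual torchft code; all names here are hypothetical stand-ins, and the fast-quorum path is omitted):

use std::time::{Duration, Instant};

// Decide whether a quorum can complete right now.
// `joined` = heartbeating replicas that have joined the quorum,
// `heartbeating` = all currently alive replicas.
fn quorum_ready(
    joined: usize,
    heartbeating: usize,
    min_replicas: usize,
    started_at: Instant,
    join_timeout: Duration,
) -> bool {
    // Need the configured minimum number of replicas.
    if joined < min_replicas {
        return false;
    }
    // Split-brain guard: require a strict majority of the alive replicas,
    // so two disjoint halves can never both form a quorum.
    if joined <= heartbeating / 2 {
        return false;
    }
    // Only wait out the join timeout if some heartbeating replica hasn't
    // joined yet; if everyone alive has joined, complete immediately.
    let all_alive_joined = joined == heartbeating;
    all_alive_joined || started_at.elapsed() >= join_timeout
}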

Test plan:

Updated/added quorum test cases and added asserts on the explanations

cargo test

@d4l3k d4l3k requested review from H-Huang and Jackmin801 January 14, 2025 01:43
@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Meta Open Source bot. label Jan 14, 2025
healthy_participants.len(),
opt.min_replicas,
metadata
),
);
}

// Avoid split brain by requiring at least half of the known alive workers.
if healthy_participants.len() <= healthy_replicas.len() / 2 {
Member

Would you not get split brain if you have equal sizes of healthy workers? E.g. healthy_participants = 5, healthy_replicas = 10.

Member Author

@d4l3k d4l3k Jan 14, 2025

This check blocks the quorum, so in the case you describe we'd hit

if 5 <= 10 / 2 {
    return (None, ...)
}

so we would need a quorum of at least 6 out of 10 to bypass this check.

@Jackmin801
Collaborator

should we remove fast quorum as we can now complete quorum without waiting for the entire join timeout?

I would think it's OK. The fast quorum case is now covered by having no heartbeating replica that hasn't joined.

Member

@H-Huang H-Huang left a comment

I think it looks good, just had a few questions.

.filter(|p| prev_replica_ids.contains(&p.replica_id))
.collect();
}

// Fast quorum is when all previous participants are still in the quorum
// and we have enough participants to form a quorum.
let is_fast_quorum = prev_quorum
Member

Instead of prev_quorum, should this check that candidate_participants are all in prev_replica_ids?

Member Author

In fast quorum we do want to allow the quorum to grow, which is why we don't check that all candidate_participants are in prev_replica_ids; there could be new candidates.

I may also just delete this given the changes to the join timeout below.
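Roughly, the membership direction being discussed looks like this (a hedged sketch with hypothetical names, not the actual check, which also requires enough participants):

use std::collections::HashSet;

// Fast quorum only requires that every member of the previous quorum is
// still present; extra (new) candidates are allowed, so the quorum can grow.
fn prev_members_still_present(
    prev_replica_ids: &HashSet<String>,
    candidate_ids: &HashSet<String>,
) -> bool {
    prev_replica_ids.is_subset(candidate_ids)
}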

@@ -138,16 +141,35 @@ fn quorum_compute(
// Sort by replica ID to get a consistent ordering across runs.
candidate_participants.sort_by_key(|p| p.replica_id.clone());

let shrink_only = healthy_participants
Member

Do we expect to have only some members with shrink_only=True and some with shrink_only=False? I noticed we take any, but should all the members be consistent?

Member Author

That is an expected case -- in LocalSGD we could have members join at any time and be in the wrong spot in the training loop. Members in the right location should already be in the quorum. In that case, the members not participating will get an exception, and it's up to the caller to handle this correctly.
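As a hedged illustration of the `any` semantics being discussed (Member is a stand-in type, not the real one):

// Stand-in for the real participant type in the lighthouse.
struct Member {
    shrink_only: bool,
}

// A single member asking for shrink_only is enough to pin the quorum to the
// previous replica IDs; members outside the previous quorum are dropped and
// surface an error that the caller is expected to handle.
fn quorum_shrink_only(healthy_participants: &[Member]) -> bool {
    healthy_participants.iter().any(|p| p.shrink_only)
}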

@d4l3k d4l3k merged commit 79572e6 into main Jan 15, 2025
6 checks passed
@d4l3k d4l3k deleted the d4l3k/shrink_only branch January 15, 2025 00:01
@d4l3k
Member Author

d4l3k commented Jan 15, 2025

Changes for a follow-up PR:

  • automatic retries for the not-included-in-quorum case, up to the specified timeout
  • set per-RPC timeouts using tokio::time::timeout_at (see the sketch after this list)
  • enable TCP keepalives
  • delete fast quorum?
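A rough sketch of the per-RPC deadline idea from the second bullet; the RPC itself is faked and only the tokio::time::timeout_at pattern is real:

use std::time::Duration;
use tokio::time::{sleep, timeout_at, Instant};

// Stand-in for a real lighthouse RPC; only the timeout pattern matters here.
async fn fake_quorum_rpc() -> Result<u64, &'static str> {
    sleep(Duration::from_millis(10)).await;
    Ok(42)
}

#[tokio::main]
async fn main() {
    // One shared deadline for the whole quorum round rather than a fresh
    // timeout per call, so later calls get whatever budget is left.
    let deadline = Instant::now() + Duration::from_secs(30);
    match timeout_at(deadline, fake_quorum_rpc()).await {
        Ok(Ok(v)) => println!("rpc ok: {v}"),
        Ok(Err(e)) => println!("rpc error: {e}"),
        Err(_) => println!("rpc hit the deadline"),
    }
}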
