-
Notifications
You must be signed in to change notification settings - Fork 25
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
lighthouse/quorum: avoid split brain and add shrink_only support #71
Conversation
healthy_participants.len(), | ||
opt.min_replicas, | ||
metadata | ||
), | ||
); | ||
} | ||
|
||
// Avoid split brain by requiring at least half of the known alive workers. | ||
if healthy_participants.len() <= healthy_replicas.len() / 2 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would you not get split brain if you have equal sizes of healthy workers? E.g. healthy_participants= 5, healthy_replicas= 10.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This check blocks the quorum so for the positive case we'd end up with
if 5 <= 10 / 2 {
return (None, ...)
}
so we would need a quorum of minimum size 6/10 to bypass this check
I would think its ok. The fast quorum case is now covered by having no heartbeating replica that hasnt joined |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think it looks good, just had a few questions
.filter(|p| prev_replica_ids.contains(&p.replica_id)) | ||
.collect(); | ||
} | ||
|
||
// Fast quorum is when all previous participants are still in the quorum | ||
// and we have enough participants to form a quorum. | ||
let is_fast_quorum = prev_quorum |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
instead of prev_quorum, should this check candidate_participants
are all in prev_replica_ids?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In fast quorum we do want to allow it to grow which is why we don't check that all candidate_participants are in prev_replica_ids. There could be new candidates
I may also just delete this given the changes to join timeout below
@@ -138,16 +141,35 @@ fn quorum_compute( | |||
// Sort by replica ID to get a consistent ordering across runs. | |||
candidate_participants.sort_by_key(|p| p.replica_id.clone()); | |||
|
|||
let shrink_only = healthy_participants |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we expect to have only some members have shrink_only=True
and some shrink_only=False
? I noticed we take any
but should all the members be consistent?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is an expected case -- in LocalSGD we could have members join at any time and be in the wrong spot in the training loop. Members in the right location should already be in the quorum. In that case, the members not participating will get an exception and it's up to the caller to handle this correctly
Changes for a follow up PR:
|
This modifies the quorum behavior with a few major changes:
Rational: in scale up scenarios we could end up in split brain cases where min_replicas=2, but total replicas = 10 and we end up flip/flopping the quorum with effectively two separate jobs
Rational: since heartbeating starts when manager starts (often before model init), there's not much use in waiting out the full time since we probably won't have any workers join and we have at least min_replicas
(TODO: should we remove fast quorum as we can now complete quorum without waiting for the entire join timeout?)
shrink_only
support which will only allow workers to drop out but no new ones to joinRational: this is intended to support long step times such as needed in LocalSGD
Test plan:
Updated/added quorum test cases and added asserts on the explainations