[lighthouse] use heartbeat info to quickly drop down replicas #35
Open
Description
We currently fall back to the slow quorum path if any replica fails. We already have heartbeat information from all replicas, so we should instead use it to detect which replicas are healthy and stop waiting for the ones that are not.
https://github.com/pytorch-labs/torchft/blob/main/src/lighthouse.rs#L386
The heartbeat threshold should be configurable. Since we currently heartbeat every 100ms, a 1s timeout (10 missed heartbeats) seems like a fine default.
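A minimal sketch of the health check, assuming the lighthouse keeps a map from replica ID to the time of its last heartbeat. The names `healthy_replicas` and `DEFAULT_HEARTBEAT_TIMEOUT` are hypothetical and do not come from `lighthouse.rs`:

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical default: 1s, i.e. 10 missed heartbeats at the 100ms interval.
pub const DEFAULT_HEARTBEAT_TIMEOUT: Duration = Duration::from_secs(1);

/// Returns the sorted IDs of replicas whose last heartbeat is newer than
/// `timeout` relative to `now`. The map layout is an assumption.
pub fn healthy_replicas(
    heartbeats: &HashMap<String, Instant>,
    now: Instant,
    timeout: Duration,
) -> Vec<String> {
    let mut healthy: Vec<String> = heartbeats
        .iter()
        // saturating_duration_since yields zero if `last` is after `now`.
        .filter(|(_, last)| now.saturating_duration_since(**last) < timeout)
        .map(|(id, _)| id.clone())
        .collect();
    healthy.sort();
    healthy
}
```

Passing `now` and `timeout` as arguments keeps the check deterministic in tests and lets the threshold come from configuration.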
We may also want to extract the quorum algorithm into separate configurable/pluggable strategies so we can switch between the old and the new logic.
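One possible shape for the pluggable strategies is a trait with one implementation per algorithm. Everything below is a hypothetical sketch: `ReplicaState`, `QuorumStrategy`, `WaitForAll`, `HeartbeatAware`, and `strategy_from_name` are illustrative names, and `WaitForAll` is a simplified stand-in for the existing logic rather than a copy of it:

```rust
use std::time::Duration;

/// Hypothetical view of one replica as seen by the lighthouse.
pub struct ReplicaState {
    pub id: String,
    /// Whether the replica has asked to join the pending quorum.
    pub joined: bool,
    /// Time elapsed since its last heartbeat.
    pub since_heartbeat: Duration,
}

pub trait QuorumStrategy {
    /// Returns the participant IDs if a quorum can be formed now.
    fn try_quorum(&self, replicas: &[ReplicaState]) -> Option<Vec<String>>;
}

/// Simplified stand-in for the old behavior: wait for every known replica.
pub struct WaitForAll;

impl QuorumStrategy for WaitForAll {
    fn try_quorum(&self, replicas: &[ReplicaState]) -> Option<Vec<String>> {
        if !replicas.is_empty() && replicas.iter().all(|r| r.joined) {
            Some(replicas.iter().map(|r| r.id.clone()).collect())
        } else {
            None
        }
    }
}

/// Proposed behavior: only wait for replicas with a recent heartbeat.
pub struct HeartbeatAware {
    pub timeout: Duration,
}

impl QuorumStrategy for HeartbeatAware {
    fn try_quorum(&self, replicas: &[ReplicaState]) -> Option<Vec<String>> {
        let healthy: Vec<&ReplicaState> = replicas
            .iter()
            .filter(|r| r.since_heartbeat < self.timeout)
            .collect();
        if !healthy.is_empty() && healthy.iter().all(|r| r.joined) {
            Some(healthy.iter().map(|r| r.id.clone()).collect())
        } else {
            None
        }
    }
}

/// Picks a strategy from a config value; the names are made up for this sketch.
pub fn strategy_from_name(name: &str, timeout: Duration) -> Option<Box<dyn QuorumStrategy>> {
    match name {
        "wait_for_all" => Some(Box::new(WaitForAll)),
        "heartbeat_aware" => Some(Box::new(HeartbeatAware { timeout })),
        _ => None,
    }
}
```

Selecting the strategy by name would let existing tests keep running against the old logic while the new one is exercised separately.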
Relevant existing tests: