[lighthouse] use heartbeat info to quickly drop down replicas #35

Open
@d4l3k

Description

We currently fall back to the slow quorum path if any replica fails. Since we already have heartbeat information from all replicas, we should instead use it to detect which replicas are healthy and avoid waiting on the ones that are not.

https://github.com/pytorch-labs/torchft/blob/main/src/lighthouse.rs#L386

The heartbeat threshold should be configurable, though since we currently heartbeat every 100ms, a 1s timeout seems fine.
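
As a rough illustration of the idea, here is a minimal sketch of heartbeat-based health detection with a configurable timeout. The function name `healthy_replicas`, the `HashMap<String, Instant>` representation, and the strict `<` comparison are all assumptions made for this example and are not taken from `lighthouse.rs`.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical helper (not the lighthouse API): returns the IDs of replicas
/// whose most recent heartbeat arrived less than `timeout` before `now`.
pub fn healthy_replicas(
    heartbeats: &HashMap<String, Instant>,
    now: Instant,
    timeout: Duration,
) -> Vec<String> {
    let mut healthy: Vec<String> = heartbeats
        .iter()
        // saturating_duration_since yields zero if `last` is later than `now`,
        // so a heartbeat stamped slightly in the future still counts as fresh.
        .filter(|(_, last)| now.saturating_duration_since(**last) < timeout)
        .map(|(id, _)| id.clone())
        .collect();
    // Sort so the result is deterministic regardless of HashMap iteration order.
    healthy.sort();
    healthy
}
```

With a 100ms heartbeat interval, a 1s timeout would tolerate roughly nine consecutive missed heartbeats before a replica is treated as unhealthy.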

We may also want to extract the quorum algorithm into separate configurable/pluggable strategies so we can switch between the old and the new logic.
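
One possible shape for such strategies is a trait with one implementation per algorithm. The sketch below is only an assumption about how this could look: `QuorumStrategy`, `WaitForAll`, and `HeartbeatAware` are hypothetical names, and both implementations are simplified stand-ins rather than the actual old or new lighthouse logic.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical strategy interface: decides whether a quorum can be formed now.
pub trait QuorumStrategy {
    /// Returns the sorted participant IDs if a quorum can form, else `None`.
    fn try_quorum(
        &self,
        joined: &[String],
        heartbeats: &HashMap<String, Instant>,
        now: Instant,
    ) -> Option<Vec<String>>;
}

/// Simplified stand-in for the existing behavior: every known replica must join.
pub struct WaitForAll;

impl QuorumStrategy for WaitForAll {
    fn try_quorum(
        &self,
        joined: &[String],
        heartbeats: &HashMap<String, Instant>,
        _now: Instant,
    ) -> Option<Vec<String>> {
        let all_joined = heartbeats.keys().all(|id| joined.contains(id));
        if joined.is_empty() || !all_joined {
            return None;
        }
        let mut participants = joined.to_vec();
        participants.sort();
        Some(participants)
    }
}

/// Simplified stand-in for the proposed behavior: only replicas with a fresh
/// heartbeat must join; stale replicas are dropped instead of waited on.
pub struct HeartbeatAware {
    pub timeout: Duration,
}

impl QuorumStrategy for HeartbeatAware {
    fn try_quorum(
        &self,
        joined: &[String],
        heartbeats: &HashMap<String, Instant>,
        now: Instant,
    ) -> Option<Vec<String>> {
        let fresh = |last: &Instant| now.saturating_duration_since(*last) < self.timeout;
        // Every healthy replica must have joined before a quorum is formed.
        let healthy_joined = heartbeats
            .iter()
            .filter(|(_, last)| fresh(*last))
            .all(|(id, _)| joined.contains(id));
        // Participants are the joined replicas that are still healthy.
        let mut participants: Vec<String> = joined
            .iter()
            .filter(|id| heartbeats.get(*id).map_or(false, |last| fresh(last)))
            .cloned()
            .collect();
        if !healthy_joined || participants.is_empty() {
            return None;
        }
        participants.sort();
        Some(participants)
    }
}
```

Holding the chosen strategy behind a `Box<dyn QuorumStrategy>` would let it be selected from configuration at startup, which keeps the old path available while the new one is validated.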

Relevant existing tests:

Labels: enhancement, lighthouse
