Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

coordination: expose new low level torchft coordination API #84

Merged
merged 1 commit into from
Feb 18, 2025

Conversation

d4l3k
Copy link
Member

@d4l3k d4l3k commented Jan 29, 2025

This exposes LighthouseServer, ManagerServer and ManagerClient so power users can directly call them to implement custom fault tolerance strategies.

It also renames some modules (_torchft for Rust lib) and methods to indicate which ones are private and subject to change.

This does not expose a quorum algorithm currently. The plan is to add a new "simple quorum" interface that allows for custom data to be passed between members instead of the data required by the Manager.

We may also want to expose a pluggable quorum algorithm that can be passed into LighthouseServer to customize that as well.

Test plan:

cd docs && make livehtml
pytest
cargo test

localhost_8001_coordination html(Surface Pro 7)

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Meta Open Source bot. label Jan 29, 2025
@d4l3k d4l3k force-pushed the d4l3k/coordination branch from b4c8807 to 2cf6c4f Compare February 18, 2025 16:44
@d4l3k d4l3k force-pushed the d4l3k/coordination branch from 2cf6c4f to f2bd4d4 Compare February 18, 2025 16:47
@d4l3k d4l3k requested a review from H-Huang February 18, 2025 16:49
@d4l3k d4l3k added the lighthouse Lighthouse and quorum related label Feb 18, 2025
@d4l3k d4l3k marked this pull request as ready for review February 18, 2025 16:50
@d4l3k
Copy link
Member Author

d4l3k commented Feb 18, 2025

cc @b0noI

Copy link
Member

@H-Huang H-Huang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Neat! looks good

/// store_addr (str): The HTTP address of the store server.
/// world_size (int): The world size of the replica group.
/// heartbeat_interval (timedelta): The interval at which heartbeats are sent.
/// connect_timeout (timedelta): The timeout for connecting to the lighthouse server.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just curious, is it the standard in rust for the documentation to be above the struct definition rather than the new method since the parameters for the constructor aren't shown here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These comments aren't really standard rust. This format is specifically so pyo3 will pull in the comments so they show up in the Python documentation

Added a screenshot to the test plan so you can see how it renders in sphinx

Sphinx doesn't render __init__ separately by default so it all gets dumped into the class doc string

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cool, thanks for the SS. Looks great!

@d4l3k d4l3k merged commit bf27956 into main Feb 18, 2025
6 checks passed
@d4l3k d4l3k deleted the d4l3k/coordination branch February 18, 2025 19:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Meta Open Source bot. lighthouse Lighthouse and quorum related
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants