coordination: expose new low level torchft coordination API #84
cc @b0noI
Neat! looks good
/// store_addr (str): The HTTP address of the store server.
/// world_size (int): The world size of the replica group.
/// heartbeat_interval (timedelta): The interval at which heartbeats are sent.
/// connect_timeout (timedelta): The timeout for connecting to the lighthouse server.
Just curious, is it standard in Rust for the documentation to sit above the struct definition rather than the `new` method? The parameters for the constructor aren't shown here.
These comments aren't really standard Rust. This format is used specifically so that pyo3 pulls in the comments and they show up in the Python documentation.
Added a screenshot to the test plan so you can see how it renders in Sphinx.
Sphinx doesn't render `__init__` separately by default, so it all gets dumped into the class docstring.
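To illustrate the point about class docstrings, here is a minimal pure-Python sketch of what the exported class looks like from the Python side. `ExampleServer` is a hypothetical stand-in, not a torchft class; only the four argument descriptions are taken from the diff above. The constructor arguments are documented on the class itself, so the class docstring carries everything and `__init__` has no docstring of its own.

```python
from datetime import timedelta


class ExampleServer:
    """
    Hypothetical stand-in for a class exported from Rust via pyo3.

    Args:
        store_addr (str): The HTTP address of the store server.
        world_size (int): The world size of the replica group.
        heartbeat_interval (timedelta): The interval at which heartbeats are sent.
        connect_timeout (timedelta): The timeout for connecting to the lighthouse server.
    """

    # Deliberately no docstring here: the arguments are documented on the class,
    # mirroring doc comments placed above the Rust struct instead of `new`.
    def __init__(
        self,
        store_addr: str,
        world_size: int,
        heartbeat_interval: timedelta,
        connect_timeout: timedelta,
    ) -> None:
        self.store_addr = store_addr
        self.world_size = world_size
        self.heartbeat_interval = heartbeat_interval
        self.connect_timeout = connect_timeout


# The class docstring is the only place the arguments are described.
print(ExampleServer.__doc__)
```

With this layout, `help(ExampleServer)` and a default Sphinx `autoclass` directive both show the argument list as part of the class documentation.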
cool, thanks for the SS. Looks great!
This exposes LighthouseServer, ManagerServer and ManagerClient so power users can directly call them to implement custom fault tolerance strategies.
It also renames some modules (`_torchft` for the Rust lib) and methods to indicate which ones are private and subject to change.
This does not expose a quorum algorithm currently. The plan is to add a new "simple quorum" interface that allows for custom data to be passed between members instead of the data required by the Manager.
We may also want to expose a pluggable quorum algorithm that can be passed into LighthouseServer to customize that as well.
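As a rough sketch of what a pluggable quorum algorithm with custom member data could look like, the following pure-Python example models the idea without touching torchft. Everything here is an assumption made for illustration: `Member`, `QuorumFn`, `min_replicas_quorum`, and the `min_replicas` threshold are hypothetical names and are not part of this PR or of the torchft API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass(frozen=True)
class Member:
    """Hypothetical quorum participant carrying arbitrary custom data."""

    replica_id: str
    data: Dict[str, str] = field(default_factory=dict)


# A quorum algorithm maps the currently known members to the chosen quorum,
# or returns None when no valid quorum can be formed yet.
QuorumFn = Callable[[List[Member]], Optional[List[Member]]]


def min_replicas_quorum(min_replicas: int) -> QuorumFn:
    """Build a quorum function that requires at least `min_replicas` members."""

    def decide(members: List[Member]) -> Optional[List[Member]]:
        if len(members) < min_replicas:
            return None
        # Sort by replica ID so every caller sees the same member ordering.
        return sorted(members, key=lambda m: m.replica_id)

    return decide


decide = min_replicas_quorum(min_replicas=2)
members = [Member("b", {"step": "10"}), Member("a", {"step": "10"})]
quorum = decide(members)
print([m.replica_id for m in quorum] if quorum else None)
```

A server accepting such a callable could swap in a different policy without changing how members report their custom data.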
Test plan: