overhaul timeouts for Lighthouse, Manager, checkpoint server #73
Conversation
Force-pushed from e82057c to 663a292
```
timeout.
timeout: the default timeout for all operations
Included:
* collectives such as allreduce
```
Doesn't seem like it currently applies to allreduce
It does, since the future from allreduce is wrapped via wrap_future, which sets a timeout on the future.
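For context, a minimal sketch of how a wrap_future-style helper can impose a timeout on a torch future. The helper name matches the discussion, but the body is an assumption, not torchft's actual implementation:

```python
import threading

import torch.futures


def wrap_future(fut: torch.futures.Future, timeout: float) -> torch.futures.Future:
    """Illustrative only: return a future that errors if `fut` is too slow."""
    wrapped: torch.futures.Future = torch.futures.Future()

    def on_timeout() -> None:
        # Fires only if the collective has not completed in time.
        if not wrapped.done():
            wrapped.set_exception(
                TimeoutError(f"future did not complete within {timeout}s")
            )

    timer = threading.Timer(timeout, on_timeout)
    timer.start()

    def on_done(f: torch.futures.Future) -> None:
        timer.cancel()
        # NOTE: a real implementation must resolve the timer/completion
        # race atomically; this sketch only checks done() best-effort.
        if not wrapped.done():
            try:
                wrapped.set_result(f.value())
            except Exception as e:  # propagate the original failure
                wrapped.set_exception(e)

    fut.add_done_callback(on_done)
    return wrapped
```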
src/net.rs (Outdated)
```rust
max_backoff: Duration::from_secs(10),
timeout: connect_timeout,
factor: 1.5,
jitter: Duration::from_millis(100),
```
should jitter be random?
oh this is max jitter
renamed for clarity
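For readers skimming the thread, a hedged Python sketch of the retry behavior those fields imply: capped exponential backoff plus a random offset bounded by the (now renamed) max jitter. Parameter names mirror the Rust struct but the code is illustrative only.

```python
import random


def backoff_delays(
    base: float = 0.1,
    factor: float = 1.5,
    max_backoff: float = 10.0,
    max_jitter: float = 0.1,
):
    """Yield retry delays: capped exponential backoff plus random jitter."""
    delay = base
    while True:
        # The jitter itself is random, which is what the question above
        # was getting at; max_jitter only bounds the uniform offset.
        yield min(delay, max_backoff) + random.uniform(0.0, max_jitter)
        delay *= factor
```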
lgtm. try_parse_grpc_timeout still seems magic to me but all good if it works!
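For anyone puzzled by the same thing: gRPC encodes deadlines on the wire as a grpc-timeout header holding an integer plus a one-letter unit suffix, which is what try_parse_grpc_timeout decodes. A rough Python rendering of that format (the real code is Rust, and this is a sketch of the spec, not of torchft's function):

```python
# Unit suffixes defined by the gRPC HTTP/2 spec for the grpc-timeout header.
GRPC_TIMEOUT_UNITS = {
    "H": 3600.0,  # hours
    "M": 60.0,    # minutes
    "S": 1.0,     # seconds
    "m": 1e-3,    # milliseconds
    "u": 1e-6,    # microseconds
    "n": 1e-9,    # nanoseconds
}


def parse_grpc_timeout(value: str) -> float:
    """Parse a grpc-timeout header value, e.g. '5S' -> 5.0 seconds."""
    digits, unit = value[:-1], value[-1]
    return int(digits) * GRPC_TIMEOUT_UNITS[unit]
```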
src/lib.rs (Outdated)
```diff
py.allow_threads(move || {
    let runtime = Runtime::new()?;
    let client = runtime
-       .block_on(manager::manager_client_new(addr, timeout))
+       .block_on(manager::manager_client_new(addr, Duration::from_secs(60)))
```
Do you want to move these constants out, or parameterize them later?
Added a connect_timeout setting to Manager + ManagerClient; really should have done this a while back :)
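A small sketch of the design choice here: keep the connection-establishment deadline separate from the per-operation deadline, so a long quorum timeout never inflates how long we wait on an unreachable server. Names are illustrative, not the exact torchft API.

```python
import socket
from dataclasses import dataclass


@dataclass
class ClientTimeouts:
    connect_timeout: float  # how long to wait to establish the connection
    operation_timeout: float  # default deadline for individual requests


def connect(host: str, port: int, timeouts: ClientTimeouts) -> socket.socket:
    # Fail fast if the server is unreachable...
    sock = socket.create_connection((host, port), timeout=timeouts.connect_timeout)
    # ...without capping how long later operations (e.g. a slow quorum) may take.
    sock.settimeout(timeouts.operation_timeout)
    return sock
```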
torchft/checkpointing.py (Outdated)
```python
self.send_response(200)
self.send_header(
    "Content-type", "tensor"
```
application/octet-stream?
Just wrapped this in a try/except, but changed the content type as suggested.
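A hedged sketch of the shape the handler ends up with: serve the serialized checkpoint as application/octet-stream, with serialization guarded so a failure returns a 500 instead of tearing down the server. Details are illustrative rather than torchft's exact code.

```python
import io
from http.server import BaseHTTPRequestHandler

import torch


class CheckpointHandler(BaseHTTPRequestHandler):
    state_dict: dict = {}  # populated by the checkpoint server

    def do_GET(self) -> None:
        try:
            buf = io.BytesIO()
            torch.save(self.state_dict, buf)
            data = buf.getvalue()
        except Exception as e:
            # Report serialization failures instead of crashing the server.
            self.send_error(500, str(e))
            return
        self.send_response(200)
        self.send_header("Content-type", "application/octet-stream")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)
```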
Force-pushed from 663a292 to 4537a5a
This overhauls the timeouts for all network operations to allow for long quorum timeouts in a safer way.
Notable changes:
* added a quorum_timeout field to the Manager Python API so you can have a much longer quorum timeout
Test plan: