overhaul timeouts for Lighthouse, Manager, checkpoint server #73

d4l3k · 2025-01-15T19:06:39Z

This overhauls the timeouts for all network operations to allow for long quorum timeouts in a safer way.

Notable changes:

removes the timeout from the Endpoint/Channel/Client in favor of keep-alives + server-based timeouts
adds timeouts to CheckpointServer
requires timeouts to be passed for all Rust operations
adds a new quorum_timeout field to the Python Manager so you can have a much longer quorum timeout
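
To illustrate the split between the default timeout and the quorum timeout, here is a minimal sketch. `TimeoutConfig`, `for_operation`, and the default values are hypothetical and only for illustration; only the field names `timeout` and `quorum_timeout` come from this PR, and the real Manager constructor may look different.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class TimeoutConfig:
    """Hypothetical container; not the actual Manager signature."""

    # Default budget for regular operations (value is an assumption).
    timeout: timedelta = timedelta(seconds=60)
    # Much longer budget for quorum (value is an assumption).
    quorum_timeout: timedelta = timedelta(minutes=30)

    def for_operation(self, op: str) -> timedelta:
        # Quorum can legitimately block for a long time while waiting for
        # peers, so it gets its own budget; everything else uses the default.
        return self.quorum_timeout if op == "quorum" else self.timeout


cfg = TimeoutConfig()
```

Keeping the two values separate means a long quorum wait does not force every other network call to tolerate the same delay.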

Test plan:

pytest
cargo test

Jackmin801 · 2025-01-16T01:42:21Z

torchft/manager.py

-                timeout.
+            timeout: the default timeout for all operations
+                Included:
+                    * collectives such as allreduce


Doesn't seem like it currently applies to allreduce

It does since the future from allreduce is wrapped via wrap_future which sets a timeout on the future
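
For readers unfamiliar with the pattern, wrapping a future so that it fails after a deadline can be sketched as follows. This is not the `wrap_future` implementation from torchft; `with_timeout` is a hypothetical helper built on the standard library's `concurrent.futures.Future` as a stand-in for the futures used in the real code.

```python
import threading
from concurrent.futures import Future


def with_timeout(fut: Future, timeout_s: float) -> Future:
    """Return a new future that mirrors `fut` but fails with TimeoutError
    if `fut` has not completed within `timeout_s` seconds."""
    wrapped: Future = Future()
    lock = threading.Lock()

    def settle(action) -> None:
        # Only the first of (result, error, timeout) may complete `wrapped`.
        with lock:
            if not wrapped.done():
                action()

    def on_timeout() -> None:
        err = TimeoutError(f"future did not complete within {timeout_s}s")
        settle(lambda: wrapped.set_exception(err))

    timer = threading.Timer(timeout_s, on_timeout)
    timer.daemon = True  # do not keep the process alive for a pending timer
    timer.start()

    def on_done(f: Future) -> None:
        timer.cancel()
        if f.cancelled():
            settle(wrapped.cancel)
            return
        exc = f.exception()
        if exc is not None:
            settle(lambda: wrapped.set_exception(exc))
        else:
            settle(lambda: wrapped.set_result(f.result()))

    fut.add_done_callback(on_done)
    return wrapped
```

Note that the timeout only fails the wrapped future; the underlying operation is not cancelled by this sketch.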

Jackmin801 · 2025-01-16T01:45:24Z

src/net.rs

+            max_backoff: Duration::from_secs(10),
+            timeout: connect_timeout,
+            factor: 1.5,
+            jitter: Duration::from_millis(100),


should jitter be random?

oh this is max jitter

renamed for clarity
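
To make the "max jitter" semantics concrete, here is a rough sketch of an exponential backoff schedule using the values from the diff (`factor` 1.5, `max_backoff` 10 s, max jitter 100 ms). It is written in Python rather than Rust and is not the code in src/net.rs; the function name and the initial delay of 100 ms are assumptions.

```python
import random
from itertools import islice
from typing import Iterator, Optional


def backoff_delays(
    initial_s: float = 0.1,      # assumption: not shown in the diff
    factor: float = 1.5,
    max_backoff_s: float = 10.0,
    max_jitter_s: float = 0.1,
    rng: Optional[random.Random] = None,
) -> Iterator[float]:
    """Yield retry delays: exponential growth capped at max_backoff_s,
    plus a random jitter drawn uniformly from [0, max_jitter_s]."""
    rng = rng or random.Random()
    delay = initial_s
    while True:
        yield min(delay, max_backoff_s) + rng.uniform(0.0, max_jitter_s)
        delay = min(delay * factor, max_backoff_s)


# First three delays with jitter disabled: roughly 0.1, 0.15, 0.225 seconds.
first_delays = list(islice(backoff_delays(max_jitter_s=0.0), 3))
```

The jitter value is an upper bound, so the random part of each delay never exceeds it.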

Jackmin801

lgtm. try_parse_grpc_timeout still seems magic to me but all good if it works!
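
For context on what a function like `try_parse_grpc_timeout` has to handle: gRPC carries the client deadline in a `grpc-timeout` header consisting of up to 8 ASCII digits followed by a unit (`H`, `M`, `S`, `m`, `u`, `n`). The sketch below parses that wire format in Python. It is not the Rust function referenced above, and `parse_grpc_timeout` is a hypothetical name.

```python
import re
from datetime import timedelta
from typing import Optional

# Header value: 1-8 ASCII digits followed by a single unit character.
_GRPC_TIMEOUT = re.compile(r"([0-9]{1,8})([HMSmun])")


def parse_grpc_timeout(value: Optional[str]) -> Optional[timedelta]:
    """Parse a grpc-timeout header value such as "30S" or "500m".

    Returns None when the header is absent; raises ValueError when malformed.
    """
    if value is None:
        return None
    match = _GRPC_TIMEOUT.fullmatch(value)
    if match is None:
        raise ValueError(f"invalid grpc-timeout value: {value!r}")
    amount, unit = int(match.group(1)), match.group(2)
    if unit == "H":
        return timedelta(hours=amount)
    if unit == "M":
        return timedelta(minutes=amount)
    if unit == "S":
        return timedelta(seconds=amount)
    if unit == "m":
        return timedelta(milliseconds=amount)
    if unit == "u":
        return timedelta(microseconds=amount)
    # "n": nanoseconds; timedelta only has microsecond resolution,
    # so sub-microsecond precision is lost here.
    return timedelta(microseconds=amount / 1000)
```

A server can read this header to derive a per-request deadline, which fits the PR's move away from a fixed client-side channel timeout.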

c-p-i-o · 2025-01-16T04:09:38Z

src/lib.rs

        py.allow_threads(move || {
            let runtime = Runtime::new()?;
            let client = runtime
-                .block_on(manager::manager_client_new(addr, timeout))
+                .block_on(manager::manager_client_new(addr, Duration::from_secs(60)))


you want to move these constants out or parameterize later?

added a connect_timeout setting to Manager+ManagerClient, really should have done this a while back :)

c-p-i-o · 2025-01-16T04:15:33Z

torchft/checkpointing.py

+
+                        self.send_response(200)
+                        self.send_header(
+                            "Content-type", "tensor"


application/octet-stream?

just wrapped this in a try/catch, but changed it
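
A minimal sketch of the shape discussed here: an HTTP handler that serves raw bytes as `application/octet-stream` inside a try/except, with a socket timeout on the server side and an explicit timeout on the client side. This is not the code in torchft/checkpointing.py; `CheckpointHandler`, `PAYLOAD`, `serve_in_background`, `fetch`, and the 5 second timeout are all hypothetical.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Tuple

PAYLOAD = b"\x00\x01\x02\x03"  # stand-in for serialized checkpoint bytes


class CheckpointHandler(BaseHTTPRequestHandler):
    # Socket timeout applied to each connection when the handler is set up.
    timeout = 5.0

    def do_GET(self) -> None:
        try:
            self.send_response(200)
            self.send_header("Content-type", "application/octet-stream")
            self.send_header("Content-Length", str(len(PAYLOAD)))
            self.end_headers()
            self.wfile.write(PAYLOAD)
        except OSError:
            # Client went away or the socket timed out; nothing more to send.
            pass

    def log_message(self, format: str, *args) -> None:
        pass  # keep the example quiet


def serve_in_background() -> HTTPServer:
    """Start the server on an ephemeral localhost port in a daemon thread."""
    server = HTTPServer(("127.0.0.1", 0), CheckpointHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def fetch(host: str, port: int, timeout_s: float) -> Tuple[str, bytes]:
    """Fetch the payload with an explicit client-side timeout."""
    conn = http.client.HTTPConnection(host, port, timeout=timeout_s)
    try:
        conn.request("GET", "/")
        resp = conn.getresponse()
        return resp.getheader("Content-Type"), resp.read()
    finally:
        conn.close()
```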

facebook-github-bot added the CLA Signed label Jan 15, 2025

d4l3k force-pushed the d4l3k/timeout_overhaul branch from e82057c to 663a292 Compare January 15, 2025 23:31

d4l3k marked this pull request as ready for review January 16, 2025 00:07

d4l3k requested review from H-Huang, c-p-i-o and Jackmin801 January 16, 2025 00:07

Jackmin801 reviewed Jan 16, 2025

Jackmin801 approved these changes Jan 16, 2025

c-p-i-o reviewed Jan 16, 2025

c-p-i-o approved these changes Jan 16, 2025

overhaul timeouts for Lighthouse, Manager, checkpoint server

4537a5a

d4l3k force-pushed the d4l3k/timeout_overhaul branch from 663a292 to 4537a5a Compare January 16, 2025 18:54

d4l3k merged commit 3ee2360 into main Jan 16, 2025
6 checks passed

d4l3k deleted the d4l3k/timeout_overhaul branch January 16, 2025 19:05
