PGTransport: add inplace transport (3x faster) #119

d4l3k · 2025-02-24T21:29:13Z

This adds a new state_dict argument to PGTransport. When provided, it supplies a state_dict whose existing tensors are used as the destination for in-place tensor operations, instead of allocating new tensors on receive. This has been measured at ~3x faster.
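As a rough, torch-free illustration of the idea (not the torchft implementation): the receiver asks an optional callable for the current state_dict and overwrites matching buffers in place, falling back to a fresh allocation otherwise. The names `recv_all` and `payloads` are hypothetical, and `bytearray` stands in for a tensor's storage.

```python
from typing import Callable, Dict, Optional

Buffers = Dict[str, bytearray]


def recv_all(
    payloads: Dict[str, bytes],
    state_dict: Optional[Callable[[], Buffers]] = None,
) -> Buffers:
    """Receive every payload, reusing caller-owned buffers when available.

    Hypothetical sketch: `payloads` stands in for data arriving over the
    process group, and each bytearray stands in for a tensor.
    """
    # Same shape as the PR's diff: call the provider if one was given.
    existing = state_dict() if state_dict else {}
    out: Buffers = {}
    for name, data in payloads.items():
        buf = existing.get(name)
        if buf is not None and len(buf) == len(data):
            # In-place path: overwrite the existing storage, no new buffer.
            buf[:] = data
        else:
            # Fallback path: allocate a fresh buffer for this entry.
            buf = bytearray(data)
        out[name] = buf
    return out


model = {"w": bytearray(4), "b": bytearray(2)}
received = recv_all({"w": b"\x01\x02\x03\x04", "b": b"\x05\x06"}, lambda: model)
print(received["w"] is model["w"])  # the caller's buffer was reused
```

Passing a callable rather than a dict lets the receiver see whatever the state_dict looks like at receive time, which matches the `self._state_dict()` call in the diff under review.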

Test plan:

Correctness:

pytest torchft/checkpointing/pg_transport_test.py

The improvements have been measured via pg_transport_bench.

For in-place operation we see ~3x faster transfers (15s -> 4-5s) for 12GB with a 3MB tensor size. The remaining overhead comes primarily from torchft ProcessGroupBaby queue communication and is not proportional to the size of the tensors.

Reducing this overhead requires some careful consideration and will be addressed in a follow-up PR.

python torchft/checkpointing/pg_transport_bench.py --device cuda
python torchft/checkpointing/pg_transport_bench.py --device cuda --inplace

inplace

12GB/3MB (4k tensors)

INFO:torchft.checkpointing.pg_transport:send_checkpoint took 5.05398303642869s
INFO:torchft.checkpointing.pg_transport:recv_checkpoint took 5.637796577066183s

16KB/4B (4k tensors)

INFO:torchft.checkpointing.pg_transport:send_checkpoint took 4.909562937915325s
INFO:torchft.checkpointing.pg_transport:recv_checkpoint took 4.766054484993219s

48GB/12MB (4k tensors)

INFO:torchft.checkpointing.pg_transport:send_checkpoint took 18.53099210932851s
INFO:torchft.checkpointing.pg_transport:recv_checkpoint took 18.847779247909784s

not inplace

12GB/3MB (4k tensors)

INFO:torchft.checkpointing.pg_transport:send_checkpoint took 15.791493758559227s
INFO:torchft.checkpointing.pg_transport:recv_checkpoint took 17.16875096037984s
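The timings logged above can be turned into the headline numbers with a few lines of arithmetic. This is only a reading of the posted logs, not part of the benchmark script; the variable names are illustrative.

```python
# Timings in seconds, copied from the benchmark logs above: (send, recv).
inplace = {
    "12GB": (5.05398303642869, 5.637796577066183),
    "16KB": (4.909562937915325, 4.766054484993219),
    "48GB": (18.53099210932851, 18.847779247909784),
}
not_inplace_12gb = (15.791493758559227, 17.16875096037984)

# Speedup of the in-place path over the allocating path at 12GB.
send_speedup = not_inplace_12gb[0] / inplace["12GB"][0]
recv_speedup = not_inplace_12gb[1] / inplace["12GB"][1]

# Extra send time for 12GB compared with the near-empty 16KB payload.
# Both runs move 4k tensors, so the difference is what payload size adds.
size_cost_12gb = inplace["12GB"][0] - inplace["16KB"][0]

print(f"send speedup: {send_speedup:.2f}x")
print(f"recv speedup: {recv_speedup:.2f}x")
print(f"12GB send time attributable to payload size: {size_cost_12gb:.2f}s")
```

Both ratios come out just above 3x, and the 12GB send is only about 0.14s slower than the 16KB send, which is consistent with the statement that the remaining cost is a per-tensor overhead rather than a function of tensor size.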

torchft/checkpointing/pg_transport_bench.py

fegin

LGTM

fegin · 2025-02-24T23:45:03Z

torchft/checkpointing/pg_transport.py


            for w in work:
                w.wait(timeout)

    def recv_checkpoint(
        self, src_rank: int, metadata: str, step: int, timeout: timedelta
    ) -> T:
+        state_dict = self._state_dict() if self._state_dict else {}
+        state_dict_leaves, _ = tree_flatten_with_path(state_dict)


Is tree_flatten_with_path a new API? Will it give you the FQN?

It's been there for a few versions of torch -- it gives a path like:

(MappingKey(key='user'), MappingKey(key='optimizer'), MappingKey(key='state.layers.7.feed_forward.w2.weight.step'))
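To make the shape of that answer concrete, here is a small stdlib-only stand-in for path-aware flattening. It is not `torch.utils._pytree.tree_flatten_with_path` itself: the helper `flatten_with_path` is hypothetical, handles only nested dicts, and returns plain string keys rather than `MappingKey` objects. The leaf value `10` is made up for illustration.

```python
from typing import Any, Dict, List, Tuple

Path = Tuple[str, ...]


def flatten_with_path(
    tree: Dict[str, Any], prefix: Path = ()
) -> List[Tuple[Path, Any]]:
    """Return (path, leaf) pairs for a nested dict, depth first.

    Hypothetical stand-in: the real pytree API also handles lists,
    tuples and registered custom nodes.
    """
    leaves: List[Tuple[Path, Any]] = []
    for key, value in tree.items():
        path = prefix + (key,)
        if isinstance(value, dict):
            leaves.extend(flatten_with_path(value, path))
        else:
            leaves.append((path, value))
    return leaves


state_dict = {
    "user": {"optimizer": {"state.layers.7.feed_forward.w2.weight.step": 10}}
}
leaves = flatten_with_path(state_dict)
path, value = leaves[0]

# The path is a tuple of keys, not a single FQN string. Joining it yields an
# FQN-like name; "/" is used because the last key already contains dots.
fqn_like = "/".join(path)
print(path)
print(fqn_like)
```

So the path is not an FQN on its own, but it identifies each leaf uniquely, which is what the receiver needs in order to match incoming tensors with entries in the local state_dict.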

facebook-github-bot added the CLA Signed label Feb 24, 2025

fegin reviewed Feb 24, 2025

d4l3k force-pushed the d4l3k/pg_inplace branch from 841ace6 to c7f028f Compare February 24, 2025 21:51

PGTransport: add inplace transport

2f162a5

d4l3k force-pushed the d4l3k/pg_inplace branch from c7f028f to 2f162a5 Compare February 24, 2025 21:54

d4l3k marked this pull request as ready for review February 24, 2025 21:55

d4l3k requested review from allenwang28, H-Huang and fegin February 24, 2025 21:55

fegin approved these changes Feb 25, 2025

d4l3k merged commit 6fe4c8e into main Feb 25, 2025
6 checks passed

d4l3k deleted the d4l3k/pg_inplace branch February 25, 2025 00:21
