Description
We currently only support a subset of collective APIs in torchft.ProcessGroup. We want to support all of them in all of our ProcessGroup implementations.
This can be done piece by piece and we don't need to add all of them right now.
The API definition for ProcessGroup is located at https://github.com/pytorch/torchft/blob/main/torchft/process_group.py#L123
We need to support these for at least ProcessGroupWrapper, ManagedProcessGroup, ProcessGroupBaby*.
We want to support all of the collectives in PyProcessGroup: https://github.com/pytorch/pytorch/blob/11f69808c64a65c68a4452250ba7719dcff27c78/torch/csrc/distributed/c10d/PyProcessGroup.hpp
- allgather_into_tensor_coalesced
- allreduce_coalesced
- alltoall_base
- barrier
- reduce_scatter
- reduce_scatter_tensor_coalesced
- send/recv (trickier)
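One way to track progress on the list above is a small check that reports which collectives a given implementation still inherits from the base class instead of overriding. This is only a sketch: `BaseProcessGroup` and `BarrierOnlyWrapper` are hypothetical stand-ins defined here, not torchft classes, and the assumption that "supported" means "overridden in the subclass" may not match how every torchft implementation is structured.

```python
# Collective names taken from the list above; send/recv are tracked separately.
REQUIRED_COLLECTIVES = [
    "allgather_into_tensor_coalesced",
    "allreduce_coalesced",
    "alltoall_base",
    "barrier",
    "reduce_scatter",
    "reduce_scatter_tensor_coalesced",
    "send",
    "recv",
]


class BaseProcessGroup:
    """Hypothetical stand-in for the base class; every collective is unimplemented."""


def _unimplemented(name):
    def method(self, *args, **kwargs):
        raise NotImplementedError(name)
    return method


# Attach an unimplemented default for every required collective.
for _name in REQUIRED_COLLECTIVES:
    setattr(BaseProcessGroup, _name, _unimplemented(_name))


class BarrierOnlyWrapper(BaseProcessGroup):
    """Hypothetical implementation that has only added barrier so far."""

    def barrier(self, *args, **kwargs):
        return None


def missing_collectives(pg_cls, base_cls):
    """Return the required collectives that pg_cls does not override from base_cls."""
    missing = []
    for name in REQUIRED_COLLECTIVES:
        impl = getattr(pg_cls, name, None)
        if impl is None or impl is getattr(base_cls, name, None):
            missing.append(name)
    return missing


print(missing_collectives(BarrierOnlyWrapper, BaseProcessGroup))
```

A check like this could be run against each implementation as collectives are added piece by piece, so the remaining work stays visible.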
Testing:
We have an existing test suite for collectives. We should add tests for these new types to _test_pg and make sure all of our PG implementations support them.
https://github.com/pytorch/torchft/blob/main/torchft/process_group_test.py#L96
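As a rough illustration of the table-driven shape such tests could take, the sketch below calls each named collective on a process group, waits on the returned handle, and collects the names that are not supported. It does not reproduce the actual `_test_pg` helper: `FakeWork`, `FakePG`, and `run_collectives` are hypothetical names, the argument tuples are left empty, and it assumes unsupported collectives either are absent or raise `NotImplementedError`.

```python
class FakeWork:
    """Hypothetical stand-in for the handle a collective returns."""

    def wait(self):
        return True


class FakePG:
    """Hypothetical process group that implements barrier only."""

    def barrier(self, *args):
        return FakeWork()

    def alltoall_base(self, *args):
        raise NotImplementedError("alltoall_base")


def run_collectives(pg, cases):
    """Call each (name, args) case on pg and return the names that are unsupported."""
    unsupported = []
    for name, args in cases:
        try:
            work = getattr(pg, name)(*args)
            work.wait()
        except (AttributeError, NotImplementedError):
            unsupported.append(name)
    return unsupported


# Real tests would pass tensors and options here; empty tuples keep the sketch small.
cases = [("barrier", ()), ("alltoall_base", ()), ("reduce_scatter", ())]
unsupported = run_collectives(FakePG(), cases)
print(unsupported)
```

Running the same case table against every PG implementation gives one place to add a new collective and see which implementations still lack it.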