
Use torch.distributed as alternative communication backend for Heat #1772

Closed
@mrfh92

Description

Related
As a first step, one might separate the mpi4py wrappers more clearly from the Heat communication structures; see, e.g., #1243 and the draft PR #1265.

Feature functionality
Currently, communication in Heat is based on MPI via mpi4py; this is very much the standard in traditional HPC. In machine learning / deep learning HPC, NCCL/RCCL communication is more common and seems to be advantageous in particular for GPU-to-GPU communication. The easiest way to support this as well, and at the same time improve interoperability with PyTorch, would be to allow torch.distributed as an alternative backend for communication.
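
As a rough illustration (not Heat code, just a minimal sketch assuming a torchrun-style launch), torch.distributed already exposes MPI-style collectives behind interchangeable backends such as "nccl" for GPUs and "gloo" for CPUs:

```python
import torch
import torch.distributed as dist

def init_torch_backend():
    # torchrun sets RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT in the
    # environment; init_process_group picks them up via the default
    # "env://" init method.
    backend = "nccl" if torch.cuda.is_available() else "gloo"
    dist.init_process_group(backend=backend)
    return backend, dist.get_rank(), dist.get_world_size()

if __name__ == "__main__":
    backend, rank, world_size = init_torch_backend()
    device = (
        torch.device("cuda", rank % torch.cuda.device_count())
        if backend == "nccl"
        else torch.device("cpu")
    )
    # MPI-like collective: sum a tensor over all ranks (cf. MPI_Allreduce).
    t = torch.full((3,), float(rank), device=device)
    dist.all_reduce(t, op=dist.ReduceOp.SUM)
    print(f"rank {rank}/{world_size}: {t.tolist()}")
    dist.destroy_process_group()
```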

So far, a communicator in Heat is actually an mpi4py communicator, and Heat's communication routines are actually communication routines from mpi4py.
I would like to suggest keeping the API for Heat communication, but allowing for another backend. Fortunately, the communication in torch.distributed is quite MPI-inspired. The main difference is that the vector ("...v") collectives (Scatterv, Allgatherv, Alltoallv, ...) are not supported; to deal with them, workarounds need to be created, e.g. along the lines of the sketch below.
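
A minimal sketch of such a workaround (the name allgatherv_workaround and the padding strategy are purely illustrative, not an existing Heat or torch.distributed API): an Allgatherv-style collective can be emulated by first exchanging the per-rank counts and then padding every contribution to the maximum count, so that the fixed-size all_gather of torch.distributed can be used:

```python
import torch
import torch.distributed as dist

def allgatherv_workaround(local: torch.Tensor) -> list:
    """Gather tensors of differing lengths (along dim 0) from all ranks."""
    world_size = dist.get_world_size()
    # 1. Exchange the per-rank lengths (the "counts" of MPI_Allgatherv).
    count = torch.tensor([local.shape[0]], dtype=torch.int64, device=local.device)
    counts = [torch.zeros_like(count) for _ in range(world_size)]
    dist.all_gather(counts, count)
    counts = [int(c.item()) for c in counts]
    # 2. Pad every contribution to the maximum length so that the
    #    fixed-size all_gather can be used.
    max_count = max(counts)
    padded = torch.zeros((max_count,) + local.shape[1:],
                         dtype=local.dtype, device=local.device)
    padded[: local.shape[0]] = local
    gathered = [torch.empty_like(padded) for _ in range(world_size)]
    dist.all_gather(gathered, padded)
    # 3. Strip the padding again using the exchanged counts.
    return [g[:c] for g, c in zip(gathered, counts)]
```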

The overall idea would be that one can run a Heat script script.py either via mpirun -n 4 python script.py or via torchrun --nproc-per-node=4 script.py (or similar), and the required backend is chosen automatically; see the sketch below.
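
A possible detection sketch, assuming the environment variables the respective launchers are known to set (torchrun exports RANK, MASTER_ADDR and TORCHELASTIC_RUN_ID; Open MPI's mpirun exports OMPI_COMM_WORLD_SIZE; MPICH-based launchers set PMI_SIZE); the function name and return values are purely illustrative:

```python
import os

def detect_comm_backend() -> str:
    # torchrun / torch.distributed.run set these variables for every worker.
    if "TORCHELASTIC_RUN_ID" in os.environ or (
        "RANK" in os.environ and "MASTER_ADDR" in os.environ
    ):
        return "torch.distributed"
    # mpirun/mpiexec (Open MPI or MPICH-based) set these instead.
    if "OMPI_COMM_WORLD_SIZE" in os.environ or "PMI_SIZE" in os.environ:
        return "mpi4py"
    # Neither launcher detected: fall back to a trivial single-process communicator.
    return "single-process"
```

Heat could run such a check once at import time and construct either the mpi4py-based or the torch.distributed-based communicator accordingly, while the user-facing communication API stays the same.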
