Use torch.distributed as alternative communication backend for Heat #1772

Closed
@mrfh92

Description

Related
As a first step, one might separate the mpi4py wrappers more clearly from Heat's communication structures; see, e.g., #1243 and the draft PR #1265.

Feature functionality
Currently, communication in Heat is based on MPI via mpi4py; this is very much the standard in traditional HPC. In machine learning / deep learning HPC, NCCL/RCCL communication is more common and seems to be advantageous in particular for GPU-to-GPU communication. The easiest way to support this and, at the same time, improve interoperability with PyTorch would be to allow torch.distributed as an alternative backend for communication.

So far, a communicator in Heat is actually an mpi4py communicator, and Heat's communication routines are actually mpi4py communication routines.
I would like to suggest keeping the API for Heat communication, but allowing for another backend. Fortunately, communication in torch.distributed is quite MPI-inspired. The main difference is that no ...v-operations (i.e., the variable-count collectives such as Scatterv, Gatherv, or Alltoallv) are supported; to deal with them, workarounds need to be created, e.g., along the lines of the sketch below.
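One possible workaround is to exchange the per-rank counts first, pad every contribution to the largest count, and then use the fixed-size all_gather. A minimal sketch of an Allgatherv-style emulation, assuming torch.distributed has already been initialized (the helper name gatherv_workaround is purely illustrative and not part of Heat or torch.distributed):

```python
import torch
import torch.distributed as dist


def gatherv_workaround(local_tensor: torch.Tensor, group=None) -> list:
    """Emulate an MPI Allgatherv-style collective (variable counts per rank)
    on top of torch.distributed's fixed-size all_gather. Hypothetical sketch."""
    world_size = dist.get_world_size(group)
    device = local_tensor.device

    # 1) Exchange the per-rank counts (the "recvcounts" of MPI_Allgatherv).
    local_len = torch.tensor([local_tensor.shape[0]], dtype=torch.int64, device=device)
    lens = [torch.zeros(1, dtype=torch.int64, device=device) for _ in range(world_size)]
    dist.all_gather(lens, local_len, group=group)
    lens = [int(t.item()) for t in lens]
    max_len = max(lens)

    # 2) Pad every local contribution to the maximum count so that the
    #    fixed-size all_gather can be used.
    padded = torch.zeros((max_len, *local_tensor.shape[1:]),
                         dtype=local_tensor.dtype, device=device)
    padded[: local_tensor.shape[0]] = local_tensor
    gathered = [torch.empty_like(padded) for _ in range(world_size)]
    dist.all_gather(gathered, padded, group=group)

    # 3) Trim the padding off again.
    return [t[:n] for t, n in zip(gathered, lens)]
```

Alternatively, such workarounds could be built from point-to-point isend/irecv pairs; in either case the v-semantics would live in the Heat communicator layer rather than in the backend.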

The overall idea would be that one can run a Heat script script.py either via mpirun -n 4 python script.py or via torchrun --nproc-per-node=4 script.py (or similar), and the required backend is chosen automatically.
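A rough sketch of how such auto-detection could work, based on the environment variables typically set by the respective launchers (the helper detect_launcher and the exact set of variables checked are assumptions, not existing Heat API):

```python
import os


def detect_launcher() -> str:
    """Guess which launcher started the script and pick a backend accordingly.

    torchrun exports e.g. TORCHELASTIC_RUN_ID, RANK and WORLD_SIZE, while
    common MPI launchers export e.g. OMPI_COMM_WORLD_SIZE (Open MPI) or
    PMI_SIZE (Hydra/MPICH). Hypothetical sketch.
    """
    if "TORCHELASTIC_RUN_ID" in os.environ:
        return "torch"
    if "OMPI_COMM_WORLD_SIZE" in os.environ or "PMI_SIZE" in os.environ:
        return "mpi"
    # default for single-process runs (plain `python script.py`)
    return "mpi"


if detect_launcher() == "torch":
    import torch
    import torch.distributed as dist

    # torchrun already provides rank, world size and rendezvous info via env://
    dist.init_process_group(backend="nccl" if torch.cuda.is_available() else "gloo")
else:
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
```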
