Use torch.distributed as alternative communication backend for Heat #1772

Closed
@mrfh92

Description

Related
As a first step, one might separate the mpi4py wrappers more clearly from Heat's communication structures; see, e.g., #1243 and the draft PR #1265.

Feature functionality
Currently, communication in Heat is based on MPI via mpi4py; this is very much the standard in traditional HPC. In machine learning / deep learning HPC, NCCL/RCCL communication is more common and seems to be advantageous in particular for GPU-to-GPU communication. The easiest way to support this and, at the same time, improve interoperability with PyTorch would be to allow torch.distributed as an alternative backend for communication.

So far, a communicator in Heat is actually an mpi4py communicator, and Heat's communication routines are actually mpi4py communication routines.
I would like to suggest keeping the API for Heat communication, but allowing for another backend. Fortunately, communication in torch.distributed is quite MPI-inspired. The main difference is that no ...v-operations (i.e., the variable-count collectives such as Scatterv, Gatherv, or Alltoallv) are supported; to deal with them, workarounds need to be created, e.g., along the lines of the sketch below.
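One possible workaround is to exchange the per-rank counts first, pad every contribution to the largest count, and then use the fixed-size all_gather. A minimal sketch of an Allgatherv-style emulation, assuming torch.distributed has already been initialized (the helper name gatherv_workaround is purely illustrative and not part of Heat or torch.distributed):

```python
import torch
import torch.distributed as dist


def gatherv_workaround(local_tensor: torch.Tensor, group=None) -> list:
    """Emulate an MPI Allgatherv-style collective (variable counts per rank)
    on top of torch.distributed's fixed-size all_gather. Hypothetical sketch."""
    world_size = dist.get_world_size(group)
    device = local_tensor.device

    # 1) Exchange the per-rank counts (the "recvcounts" of MPI_Allgatherv).
    local_len = torch.tensor([local_tensor.shape[0]], dtype=torch.int64, device=device)
    lens = [torch.zeros(1, dtype=torch.int64, device=device) for _ in range(world_size)]
    dist.all_gather(lens, local_len, group=group)
    lens = [int(t.item()) for t in lens]
    max_len = max(lens)

    # 2) Pad every local contribution to the maximum count so that the
    #    fixed-size all_gather can be used.
    padded = torch.zeros((max_len, *local_tensor.shape[1:]),
                         dtype=local_tensor.dtype, device=device)
    padded[: local_tensor.shape[0]] = local_tensor
    gathered = [torch.empty_like(padded) for _ in range(world_size)]
    dist.all_gather(gathered, padded, group=group)

    # 3) Trim the padding off again.
    return [t[:n] for t, n in zip(gathered, lens)]
```

Alternatively, such workarounds could be built from point-to-point isend/irecv pairs; in either case the v-semantics would live in the Heat communicator layer rather than in the backend.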

The overall idea would be that one can run a Heat script script.py either via mpirun -n 4 python script.py or via torchrun --nproc-per-node=4 script.py (or similar), and the required backend is chosen automatically.
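A rough sketch of how such auto-detection could work, based on the environment variables typically set by the respective launchers (the helper detect_launcher and the exact set of variables checked are assumptions, not existing Heat API):

```python
import os


def detect_launcher() -> str:
    """Guess which launcher started the script and pick a backend accordingly.

    torchrun exports e.g. TORCHELASTIC_RUN_ID, RANK and WORLD_SIZE, while
    common MPI launchers export e.g. OMPI_COMM_WORLD_SIZE (Open MPI) or
    PMI_SIZE (Hydra/MPICH). Hypothetical sketch.
    """
    if "TORCHELASTIC_RUN_ID" in os.environ:
        return "torch"
    if "OMPI_COMM_WORLD_SIZE" in os.environ or "PMI_SIZE" in os.environ:
        return "mpi"
    # default for single-process runs (plain `python script.py`)
    return "mpi"


if detect_launcher() == "torch":
    import torch
    import torch.distributed as dist

    # torchrun already provides rank, world size and rendezvous info via env://
    dist.init_process_group(backend="nccl" if torch.cuda.is_available() else "gloo")
else:
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
```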
