Add a NCCL communication backend to accelerate GPU communication and improve strong scaling

# Background
Heat heavily relies on MPI for communication, but MPI has an issue on GPUs: It's not stream aware.
MPI is not a library, it's a standard. Its goal is to make sure users use a standardized interface that MPI implementations such as OpenMPI or MPICH translate into the expected behavior on all sorts of machines.
Now if some upstart company develops AI chips that run asynchronously and require slightly different programming, the MPI forum will not just put features catering to hardware from specific vendors into the standard. But now we're in a pickle: We all love the standardization of MPI, but this same standardization severely impairs performance on the hardware that we actually have, which is all NVIDIA GPUs at this point.

The performance issues arise from MPI not partaking in the asynchronous scheduling that GPUs use. When the interpreter, running on the CPU, encounters a GPU operation, it does not execute the operation itself. Instead, it schedules it in a "stream" on the GPU, and the GPU works on that operation whenever it has the resources to do so. The scheduling causes some overhead, so we always want to schedule the next operation while something else is running to keep the GPU busy and utilize the hardware efficiently.
Now when the interpreter on the CPU encounters an MPI communication, it performs it right away. It does not schedule the communication to run after the previous computations have finished, so we communicate undefined garbage unless we "synchronize" CPU and GPU. Synchronization means we tell the interpreter to wait until the GPU has finished all previous work.
So, when working with MPI and GPUs, we do computation, we wait for the GPU to finish it, then do communication, then continue scheduling computation on GPU.
How big of a problem is this? That depends on how long the compute kernels run: If they take on the order of seconds, then the micro- to milliseconds it takes to interrupt the scheduling during synchronization are negligible overhead. On the other hand, if we do many small computations with a lot of communication in between, then this is a problem.
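
To put rough numbers on this, here is a minimal sketch of the overhead as a fraction of wall time. The 100 µs synchronization cost and the kernel run times are assumed values for illustration, not measurements from Heat.

```python
def sync_overhead_fraction(compute_s, sync_s):
    """Fraction of wall time spent waiting when every kernel is followed by one sync."""
    return sync_s / (compute_s + sync_s)


SYNC_S = 100e-6  # assumed cost of one CPU-GPU synchronization (100 microseconds)

for compute_s in (1.0, 1e-3, 50e-6):  # assumed kernel run times in seconds
    fraction = sync_overhead_fraction(compute_s, SYNC_S)
    print(f"kernel time {compute_s:.0e} s: {100 * fraction:.2f} % of wall time lost")
```

Under these assumptions the loss is around 0.01 % for one-second kernels, but about two thirds of the wall time for 50 µs kernels.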

The scenario of tiny computations and many communications is what we run into when pushing strong scaling. Here, we take a fixed-size problem and increase the number of GPUs working on it. Eventually, every GPU will only have a few data points to do computation on, and here the synchronization overhead becomes a massive problem.
On the other hand, if we do weak scaling, we increase the size of data while increasing the number of GPUs to have a fixed computational workload. There, this is not so much of an issue.
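
The difference between the two scaling modes can be sketched with a toy model: each step costs the local compute time plus one fixed synchronization whenever more than one GPU is involved. The per-point compute time, the synchronization cost, and the function names are assumptions for illustration, and the time for the communication itself is ignored.

```python
def step_time(points_per_gpu, n_gpus, point_s, sync_s):
    """Time per step: compute on the local data plus one sync if we communicate."""
    sync = sync_s if n_gpus > 1 else 0.0  # a single GPU never has to communicate
    return points_per_gpu * point_s + sync


def strong_scaling_efficiency(total_points, n_gpus, point_s, sync_s):
    """Fixed total problem size, split evenly across the GPUs."""
    t_1 = step_time(total_points, 1, point_s, sync_s)
    t_p = step_time(total_points / n_gpus, n_gpus, point_s, sync_s)
    return t_1 / (n_gpus * t_p)


def weak_scaling_efficiency(points_per_gpu, n_gpus, point_s, sync_s):
    """Fixed problem size per GPU, so the total grows with the number of GPUs."""
    t_1 = step_time(points_per_gpu, 1, point_s, sync_s)
    t_p = step_time(points_per_gpu, n_gpus, point_s, sync_s)
    return t_1 / t_p


POINT_S = 1e-9  # assumed compute time per data point
SYNC_S = 100e-6  # assumed cost of one synchronization

for n_gpus in (1, 4, 16, 64):
    strong = strong_scaling_efficiency(10**6, n_gpus, POINT_S, SYNC_S)
    weak = weak_scaling_efficiency(10**6, n_gpus, POINT_S, SYNC_S)
    print(f"{n_gpus:3d} GPUs: strong {strong:.2f}, weak {weak:.2f}")
```

In this model the weak-scaling efficiency stays at about 0.91 for any number of GPUs above one, while the strong-scaling efficiency drops to roughly 0.14 on 64 GPUs.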

Side note on CUDA-aware MPI: This means the MPI implementation can communicate from GPU memory to GPU memory without copying to CPU in between. It can use all available bandwidth from NVLink and InfiniBand and is not impaired in this regard at all. But it, too, is not stream aware, so the previous discussion applies. In particular, reduce operations require a computation kernel and MPI cannot trigger computation on GPU. So for reduce operations, CUDA-aware MPI will actually copy the data back to CPU, perform the reduce operation, then copy the result to GPU.

# Possible solutions
There is a range of things we can do to address this issue. In the end it boils down to replacing MPI with NCCL in some calls. NCCL is an NVIDIA product which is similar to MPI but is stream aware, so it uses the same cables but solves all the issues discussed above. See [this blogpost](https://x-dev.pages.jsc.fz-juelich.de/2023/07/18/mpi-ucc-nccl.html) by @chelseajohn for some background and performance tests comparing MPI and NCCL.

Now, how do we access NCCL? Here are a few options:
 - [Chelsea's blogpost](https://x-dev.pages.jsc.fz-juelich.de/2023/07/18/mpi-ucc-nccl.html)
 - [nccl4py](https://github.com/NVIDIA/nccl/tree/master/bindings/nccl4py)
 - [torch.distributed](https://docs.pytorch.org/docs/stable/distributed.html)
 - [torch DTensor](https://docs.pytorch.org/docs/stable/distributed.tensor.html)

The first option would mean only setting environment variables and not touching the code. But it would also be limited to reduce operations. The second and third options would be the most general: Implement a communication backend using NCCL in the Heat communication module. The fourth option would be using some distributed array operations as implemented in torch, which is probably limited to basic operations. If this weren't the case, we wouldn't need Heat after all.
Note that torch.distributed is probably preferable over nccl4py because it can interface with communication libraries from other vendors as well, but pros and cons need to be checked.

# How to
Before doing any sort of work, it should be checked if we can match NCCL performance by purely setting environment variables for relevant use cases. After all, Heat's focus is on weak scaling, not strong scaling.
If we decide we should do some work, we should check which library to use for this and then try to incrementally implement the communication backend. It should inherit from the MPI communication module and overload some functions that can be better handled with NCCL. If you want to know what kind of stuff needs to be done, you can get some inspiration from how we handled this in [pySDC](https://github.com/Parallel-in-Time/pySDC/blob/5050dd89999c0b004b6cac186fab46108b8173a9/pySDC/helpers/NCCL_communicator.py), although that is a bit simpler than is needed here.
If we don't manage to implement the different communication backends, we can just pass select operations to torch DTensor to speed them up. This is the least preferable case because it will clutter the code and be a somewhat manual process for every operation we add this to. On the other hand, @ClaudiaComito has already implemented this for matrix multiplication in [this branch](features/matmul-dtensor-experiments) and it works great.
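
To illustrate the inherit-and-override idea, here is a minimal sketch using torch.distributed. All class and function names, and the set of operations routed to NCCL, are hypothetical and not Heat's actual API. The MPI path is left out, and the NCCL path assumes a process group was already initialized with `torch.distributed.init_process_group(backend="nccl")`.

```python
# Collectives this sketch assumes are worth routing to NCCL (illustrative choice).
NCCL_OPS = {"allreduce", "bcast", "allgather"}


def choose_backend(op, on_gpu, equal_sizes=True):
    """Hypothetical helper: pick "nccl" or "mpi" for one collective call."""
    if on_gpu and equal_sizes and op in NCCL_OPS:
        return "nccl"
    return "mpi"  # CPU data, unequal sizes, or unsupported operations stay with MPI


class MPICommSketch:
    """Hypothetical stand-in for the MPI communicator; MPI calls are omitted."""

    def allreduce(self, tensor):
        raise NotImplementedError("MPI path left out of this sketch")


class NCCLCommSketch(MPICommSketch):
    """Overrides only what NCCL handles better and defers the rest to MPI."""

    def allreduce(self, tensor):
        if choose_backend("allreduce", tensor.is_cuda) == "mpi":
            return super().allreduce(tensor)
        import torch.distributed as dist  # imported lazily, only on the NCCL path

        # In-place sum over all ranks. The call returns once the collective is
        # enqueued, so no torch.cuda.synchronize() is needed beforehand.
        dist.all_reduce(tensor, op=dist.ReduceOp.SUM)
        return tensor
```

With this structure, anything complicated falls through to the parent class unchanged, so the NCCL backend can grow one operation at a time.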

# Who should work on this
This issue is right for you if you're not afraid to peek under the hood of Heat's communication. In principle the task boils down to figuring out what can be passed to NCCL instead of MPI and then implementing this. This should be quite doable, but MPI can be a bit abstract. The good thing is that if it is actually complicated, such as collectives with different size data between tasks, we will probably just stick with MPI and you don't need to do anything.
If you want to learn more about MPI, NCCL, and GPUs, this is a great way to do so.

The scope of this issue is ideal for a student project. It's probably a manageable amount of work and you can do quite a bit of analysis on top of it. Please get in contact if you want to do this during an internship or a thesis. We would be happy to supervise you.
