Labels: GPU, dependencies, enhancement, help wanted, performance
Description
AMReX has a few central, latency-sensitive and time-intensive collective communication routines, e.g., for the halo exchange of fields and particles.
These are ideal places to use device-initiated collectives from NCCL/RCCL to improve performance; a baseline sketch of the communication pattern follows the example links below.
Examples:
- https://github.com/NVIDIA/nccl/tree/master/examples/06_device_api
- https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/usage/deviceapi.html
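For orientation, here is a minimal sketch of the halo-exchange pattern using NCCL's existing host-initiated point-to-point API; the device API linked above would move the same send/recv pairs into a kernel. The field layout, the `left`/`right` neighbor ranks, and the `exchange_halos` function are illustrative assumptions, not AMReX or NCCL device-API code.

```cpp
#include <nccl.h>
#include <cuda_runtime.h>

// Hypothetical 1D halo exchange for a device array laid out as
// [lower ghosts: halo_width][interior: n_interior][upper ghosts: halo_width].
// Neighbor ranks `left`/`right` are -1 at domain boundaries.
void exchange_halos(double* field, size_t n_interior, size_t halo_width,
                    int left, int right, ncclComm_t comm, cudaStream_t stream)
{
    double* send_lo = field + halo_width;               // first interior cells
    double* send_hi = field + n_interior;               // last interior cells
    double* recv_lo = field;                            // lower ghost region
    double* recv_hi = field + halo_width + n_interior;  // upper ghost region

    // Group the point-to-point calls so NCCL can schedule all four
    // transfers together without deadlocking on posting order.
    ncclGroupStart();
    if (left >= 0) {
        ncclSend(send_lo, halo_width, ncclDouble, left, comm, stream);
        ncclRecv(recv_lo, halo_width, ncclDouble, left, comm, stream);
    }
    if (right >= 0) {
        ncclSend(send_hi, halo_width, ncclDouble, right, comm, stream);
        ncclRecv(recv_hi, halo_width, ncclDouble, right, comm, stream);
    }
    ncclGroupEnd();
    // Kernels enqueued on `stream` after this point see updated ghost cells.
}
```

Even this host-initiated form keeps the exchange entirely on the GPU-ordered stream; the device-initiated API would additionally remove the host from the critical path by issuing the communication from within a kernel.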
MPI Note
NCCL/RCCL ship with the usual CUDA Toolkit / ROCm stacks, so they add no further hard-to-ship dependency.
MPI has been very stagnant on GPU support: as of MPI-5.0, the only section that mentions GPUs at all is the one covering GPU-aware pointers, and (as of SC25) there is no foreseeable mid-term path to device-initiated communication in MPI.
At the same time, collectives at massive scale are heavily optimized for ML workloads in NCCL/RCCL.