Labels: GPU, dependencies, enhancement, help wanted, performance
Description
AMReX has a few central, latency-sensitive and time-intensive collective communication routines, e.g., for the halo exchange of fields and particles.
These are ideal places to use device-initiated collectives from NCCL/RCCL to improve performance; a baseline sketch of the communication pattern follows the example links below.
Examples:
- https://github.com/NVIDIA/nccl/tree/master/examples/06_device_api
- https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/usage/deviceapi.html
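For orientation, here is a minimal sketch of the halo-exchange pattern using NCCL's existing host-initiated point-to-point API; the device API linked above would move the same send/recv pairs into a kernel. The field layout, the `left`/`right` neighbor ranks, and the `exchange_halos` function are illustrative assumptions, not AMReX or NCCL device-API code.

```cpp
#include <nccl.h>
#include <cuda_runtime.h>

// Hypothetical 1D halo exchange for a device array laid out as
// [lower ghosts: halo_width][interior: n_interior][upper ghosts: halo_width].
// Neighbor ranks `left`/`right` are -1 at domain boundaries.
void exchange_halos(double* field, size_t n_interior, size_t halo_width,
                    int left, int right, ncclComm_t comm, cudaStream_t stream)
{
    double* send_lo = field + halo_width;               // first interior cells
    double* send_hi = field + n_interior;               // last interior cells
    double* recv_lo = field;                            // lower ghost region
    double* recv_hi = field + halo_width + n_interior;  // upper ghost region

    // Group the point-to-point calls so NCCL can schedule all four
    // transfers together without deadlocking on posting order.
    ncclGroupStart();
    if (left >= 0) {
        ncclSend(send_lo, halo_width, ncclDouble, left, comm, stream);
        ncclRecv(recv_lo, halo_width, ncclDouble, left, comm, stream);
    }
    if (right >= 0) {
        ncclSend(send_hi, halo_width, ncclDouble, right, comm, stream);
        ncclRecv(recv_hi, halo_width, ncclDouble, right, comm, stream);
    }
    ncclGroupEnd();
    // Kernels enqueued on `stream` after this point see updated ghost cells.
}
```

Even this host-initiated form keeps the exchange entirely on the GPU-ordered stream; the device-initiated API would additionally remove the host from the critical path by issuing the communication from within a kernel.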
MPI Note
NCCL/RCCL ship with the usual CUDA Toolkit / ROCm stacks, so they add no further hard-to-ship dependency.
MPI has been very stagnant on GPU support: as of MPI-5.0, the only section that mentions GPUs at all is the one covering GPU-aware pointers, and (as of SC25) there is no foreseeable mid-term path to device-initiated communication in MPI.
At the same time, collectives at massive scale are heavily optimized for ML workloads in NCCL/RCCL.