
GPU Collectives via NCCL/RCCL #4821

@ax3l

Description


AMReX has a few central, latency-sensitive and time-intensive collective communication routines, e.g., for the halo exchange of fields and particles.

These are ideal places to use device-initiated collectives in NCCL/RCCL to improve performance.

Examples:
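For illustration only (this is not AMReX code): a halo exchange along one decomposition direction could be expressed with NCCL point-to-point calls, where all sends/receives inside a group are fused into a single communication launch on a stream. Buffer names, neighbor ranks, and counts below are placeholders; error checking is omitted, and RCCL exposes the same API.

```cpp
// Hypothetical halo exchange of `count` doubles with two neighbor
// ranks (1D decomposition assumed). All device buffers.
#include <nccl.h>
#include <cuda_runtime.h>

void exchange_halo(const double* send_lo, const double* send_hi,
                   double* recv_lo, double* recv_hi,
                   size_t count, int rank_lo, int rank_hi,
                   ncclComm_t comm, cudaStream_t stream)
{
    // Group the point-to-point calls so NCCL fuses them into one
    // launch and avoids send/recv ordering deadlocks.
    ncclGroupStart();
    ncclSend(send_lo, count, ncclDouble, rank_lo, comm, stream);
    ncclSend(send_hi, count, ncclDouble, rank_hi, comm, stream);
    ncclRecv(recv_lo, count, ncclDouble, rank_lo, comm, stream);
    ncclRecv(recv_hi, count, ncclDouble, rank_hi, comm, stream);
    ncclGroupEnd();
    // The exchange is ordered on `stream`: later kernels enqueued on
    // the same stream see the received ghost data without a host wait.
}
```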

MPI Note

NCCL/RCCL ship with the usual CUDA Toolkit / ROCm stacks, so they add no further hard-to-deploy dependency.

MPI has been very stagnant when it comes to GPU support (as of MPI-5.0, the only section that mentions GPUs at all is the one covering GPU-aware pointers), and there is (as of SC25) no device-initiated communication foreseeable in MPI in the mid term.
At the same time, collectives at massive scale are heavily optimized for ML workloads in NCCL/RCCL.
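Since AMReX already runs under MPI, NCCL/RCCL can piggyback on it: MPI is used once to broadcast the NCCL bootstrap id, after which all GPU traffic goes through NCCL. A minimal sketch (the function name is hypothetical, error checking omitted):

```cpp
// Bootstrap a NCCL communicator on top of an existing MPI communicator.
// Assumes each rank has already selected its GPU via cudaSetDevice().
#include <mpi.h>
#include <nccl.h>

ncclComm_t make_nccl_comm(MPI_Comm mpi_comm)
{
    int rank, nranks;
    MPI_Comm_rank(mpi_comm, &rank);
    MPI_Comm_size(mpi_comm, &nranks);

    // Rank 0 creates the NCCL unique id; MPI only broadcasts it once.
    ncclUniqueId id;
    if (rank == 0) ncclGetUniqueId(&id);
    MPI_Bcast(&id, sizeof(id), MPI_BYTE, 0, mpi_comm);

    ncclComm_t comm;
    ncclCommInitRank(&comm, nranks, id, rank);
    return comm;
}
```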
