@dssgabriel
Collaborator

Description

This PR enables running the NCCL backend unit tests on the SNL H100 CI workflow.

Some notes:

  • NCCL is built from source, packaged into an OS-agnostic tarball & installed from that tarball;
  • Kokkos version is explicitly set to 4.7.01 (equivalent to master, aka the latest stable release, at the time of this commit). In the future, we hope to have a container that installs the latest versions of Kokkos/Open MPI/MPICH/NCCL and have it cached by GitHub Actions for reuse in subsequent CI jobs. Submitted jobs will just pull down this container image with all dependencies set up, and only build KokkosComm;
  • For now, KokkosComm's MPI and NCCL backends are configured/built/tested one after the other. This may be improved later on, as both are independent and could theoretically run concurrently.

- NCCL is built from source, packaged into an OS-agnostic tarball &
installed from that tarball. For now, I am unsure if this will actually
work in the container.
- Kokkos version is explicitly set to 4.7.01 (equivalent to master at
the time of this commit). In the future, we hope to have a container
that installs the latest versions of Kokkos/Open MPI/MPICH/NCCL and have
it cached by GHA. Submitted CI jobs will just pull down this container
image with all dependencies set up, and only build KokkosComm.
- For now, KokkosComm's MPI and NCCL backends are
configured/built/tested one after the other. This may be improved later
on, as both are independent and could theoretically run concurrently.
@dssgabriel dssgabriel requested a review from cwpearson October 24, 2025 09:44
@dssgabriel dssgabriel self-assigned this Oct 24, 2025
@dssgabriel dssgabriel added A-ci Area: KokkosComm CI/CD setup A-unit-tests Area: KokkosComm unit tests A-nccl Area: KokkosComm NCCL backend implementation C-maintenance Category: a PR that cleans something up labels Oct 24, 2025
@cwpearson cwpearson added SNL-CI-APPROVAL Required to run SNL CI on non-SNL contributions SNL-CI-SPECIAL-APPROVAL Needed for changes to `.github` to run at SNL labels Nov 7, 2025
@cwpearson cwpearson removed the SNL-CI-APPROVAL Required to run SNL CI on non-SNL contributions label Nov 7, 2025
Collaborator

@cwpearson cwpearson left a comment

Fixed up a few things. I'm seeing an invalid datatype error for `Kokkos::complex<float>` in MPI, but I'm not sure this PR is responsible. @dssgabriel thoughts?

Signed-off-by: Carl Pearson <[email protected]>
Do not use MPI Sessions. Instead, rely on MPI already being initialized
(done by `test_main.cpp`) and use `MPI_COMM_WORLD`. This is because
Open MPI 5.0.8 does not support Sessions.
Also simplify NCCL initialization by following the NCCL examples.
Improve logging infrastructure.

Signed-off-by: Gabriel Dos Santos <[email protected]>
@dssgabriel
Collaborator Author

@cwpearson NCCL tests are passing on our system 🎉

Node configuration is:

  • 4x GH200 w/ quad-rail Eviden BXI v2 interconnect
  • GCC 12.3.0
  • CUDA 12.4
  • Open MPI 5.0.8
  • NCCL 2.21.5

I am still investigating the MPI test failures on the `complex<T>` datatypes, though.

@dssgabriel dssgabriel force-pushed the ci/run-nccl-tests branch 4 times, most recently from 1016eab to 81f8069 Compare November 18, 2025 16:29
@cwpearson cwpearson added SNL-CI-SPECIAL-APPROVAL Needed for changes to `.github` to run at SNL SNL-CI-APPROVAL Required to run SNL CI on non-SNL contributions and removed SNL-CI-SPECIAL-APPROVAL Needed for changes to `.github` to run at SNL labels Nov 21, 2025
@cwpearson
Collaborator

@dssgabriel and I identified a few more problems:

  • Tough to use a non-blocking communicator for a few reasons
  • NCCL packer was using the non-contiguous span rather than size to compute the count
  • We weren't using device-specific execution space instances
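To illustrate the second item, here is a minimal sketch of why span and size differ for a non-contiguous view. `StridedView1D` and `pack` are hypothetical stand-ins written for this example; they are not Kokkos or KokkosComm types, and their `size()`/`span()` members are only meant to be analogous to the corresponding `Kokkos::View` queries.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a 1D strided view (NOT a Kokkos type).
struct StridedView1D {
  const double* data;   // pointer to the first logical element
  std::size_t extent;   // number of logical elements
  std::size_t stride;   // distance, in elements, between consecutive logical elements

  // Number of logical elements, i.e. what a packed buffer holds.
  std::size_t size() const { return extent; }

  // Number of memory slots from the first to the last element, inclusive.
  // For stride > 1 this is larger than size().
  std::size_t span() const {
    return extent == 0 ? 0 : (extent - 1) * stride + 1;
  }
};

// Gather the strided elements into a contiguous buffer. The length of the
// returned buffer, v.size(), is the element count to pass on to the
// communication call; v.span() would over-count.
std::vector<double> pack(const StridedView1D& v) {
  assert(v.stride >= 1);
  std::vector<double> out;
  out.reserve(v.size());
  for (std::size_t i = 0; i < v.size(); ++i) {
    out.push_back(v.data[i * v.stride]);
  }
  return out;
}
```

For a view of 4 elements with stride 3, `size()` is 4 while `span()` is 10, so a count derived from the span would describe more elements than the packed buffer actually contains.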

Collaborator Author

@dssgabriel dssgabriel left a comment

Looks good to me!

We will need to properly check error codes returned by all NCCL calls in the backend. I will open an issue for tracking that.

We should also consider implementing this for the MPI backend and develop a more robust solution for proper error handling.
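One possible shape for such checking is a small macro that wraps every call and throws on failure. This is only a sketch of the general pattern, not what KokkosComm does or will do. To keep it compilable without NCCL, `fake_result_t`, `fake_success`, `fake_error_string`, and `fake_send` are hypothetical stand-ins; with NCCL they would presumably be replaced by `ncclResult_t`, `ncclSuccess`, `ncclGetErrorString`, and the real NCCL calls.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical stand-ins so the sketch builds without NCCL installed.
enum fake_result_t { fake_success = 0, fake_invalid_argument = 1 };

inline const char* fake_error_string(fake_result_t r) {
  return r == fake_success ? "no error" : "invalid argument";
}

// Throw a descriptive exception if the wrapped call did not succeed.
inline void check_result(fake_result_t r, const char* expr,
                         const char* file, int line) {
  assert(expr != nullptr && file != nullptr);
  if (r != fake_success) {
    throw std::runtime_error(std::string(file) + ":" + std::to_string(line) +
                             ": " + expr + " failed: " + fake_error_string(r));
  }
}

// Wrap a call so its return code is never silently dropped.
#define CHECK_RESULT(expr) check_result((expr), #expr, __FILE__, __LINE__)

// Hypothetical call that reports an error for a negative element count.
inline fake_result_t fake_send(int count) {
  return count < 0 ? fake_invalid_argument : fake_success;
}
```

With this in place, `CHECK_RESULT(fake_send(4));` returns normally, while `CHECK_RESULT(fake_send(-1));` throws a `std::runtime_error` whose message names the failing expression and its source location. The same wrapper idea could be applied to MPI return codes, though whether to throw, abort, or log is a design decision left open here.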

@cwpearson cwpearson merged commit 587dc7d into kokkos:develop Nov 21, 2025
10 of 11 checks passed
@dssgabriel dssgabriel deleted the ci/run-nccl-tests branch November 21, 2025 19:41
