-
Notifications
You must be signed in to change notification settings - Fork 12
ci: run NCCL unit tests on SNL CI #179
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
- NCCL is built from source, packaged into an OS-agnostic tarball & installed from that tarball. For now, I am unsure if this will actually work in the container. - Kokkos version is explicitly set to 4.7.01 (equivalent to master at the time of this commit). In the future, we hope to have a container that installs the latest versions of Kokkos/Open MPI/MPICH/NCCL and have it cached by GHA. Submitted CI jobs will just pull down this container image with all dependencies set up, and only build KokkosComm. - For now, KokkosComm's MPI and NCCL backends are configured/built/tested one after the other. This may be improved later on, as both are independent and could theoretically run concurrently.
Signed-off-by: Carl Pearson <[email protected]>
Signed-off-by: Carl Pearson <[email protected]>
Signed-off-by: Carl Pearson <[email protected]>
f97f811 to
cb44b75
Compare
cwpearson
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed up a few things. I'm seeing invalid datatype for Kokkos::complex<float> in MPI, but I'm not sure this PR is responsible. @dssgabriel thoughts?
Signed-off-by: Carl Pearson <[email protected]>
0be51f7 to
55dc946
Compare
Signed-off-by: Carl Pearson <[email protected]>
Signed-off-by: Carl Pearson <[email protected]>
Signed-off-by: Carl Pearson <[email protected]>
Signed-off-by: Carl Pearson <[email protected]>
55dc946 to
af5f435
Compare
Signed-off-by: Carl Pearson <[email protected]>
f71a381 to
2473c1f
Compare
Do not use MPI sessions, rely on MPI being initialized (done by `test_main.cpp`) and use `MPI_COMM_WORLD`. This is because OpenMPI 5.0.8 does not support Sessions. Also simplify NCCL init using their examples. Improve logging infrastructure. Signed-off-by: Gabriel Dos Santos <[email protected]>
2473c1f to
454db3c
Compare
|
@cwpearson NCCL tests are passing on our system 🎉 Node configuration is:
I am still investigating the MPI tests failing on the |
Signed-off-by: Gabriel Dos Santos <[email protected]>
Signed-off-by: Gabriel Dos Santos <[email protected]>
1016eab to
81f8069
Compare
Signed-off-by: Gabriel Dos Santos <[email protected]>
81f8069 to
5b18715
Compare
Signed-off-by: Carl Pearson <[email protected]>
Signed-off-by: Carl Pearson <[email protected]>
Signed-off-by: Carl Pearson <[email protected]>
Signed-off-by: Carl Pearson <[email protected]>
Signed-off-by: Carl Pearson <[email protected]>
Signed-off-by: Carl Pearson <[email protected]>
|
@dssgabriel and I identified a few more problems:
|
dssgabriel
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me!
We will need to properly check error codes returned by all NCCL calls in the backend. I will open an issue for tracking that.
We should also consider implementing this for the MPI backend and develop a more robust solution for proper error handling.
Description
This PR enables running the NCCL backend unit tests on the SNL H100 CI workflow.
Some notes: