TL/CUDA: Add INT32, INT64, UINT32, UINT64 support for NVLS #1259
Juee14Desai wants to merge 2 commits into openucx:master
Conversation
| Filename | Overview |
|---|---|
| .ci/pipeline/test_nvls_matrix.yaml | Splits the single MPI test step into allreduce and reduce_scatter steps, but the always: stop_slurm_allocation.sh cleanup block is only placed on the new reduce_scatter step — the allreduce step has no cleanup handler, risking a leaked Slurm allocation if it fails. |
| src/components/tl/cuda/kernels/nvls.cuh | Adds NvlsInt32Ops, NvlsUint32Ops, NvlsInt64Ops, and NvlsUint64Ops structs. PTX types chosen are valid: add.s32 is supported by multimem; s64 is not, so u64 is used for both 64-bit variants. NvlsInt64Ops and NvlsUint64Ops are functionally identical (noted in an existing thread). |
| src/components/tl/cuda/kernels/allreduce_kernel.cu | Adds allreduce_kernel_scalar32 (v4 unroll) and allreduce_kernel_scalar64 (v2 unroll) kernels and dispatches them for the new types. The loop bounds (idx < chunk_end while accessing idx+3 or idx+1) remain a potential out-of-bounds concern when chunk sizes are not aligned to the unroll factor (noted in prior review threads). |
| src/components/tl/cuda/kernels/reduce_scatter_kernel.cu | Adds reduce_scatter_kernel_scalar32 and reduce_scatter_kernel_scalar64 with proper idx+3 < and idx+1 < guards (fixing the OOB pattern from the vectorised kernels). Dispatch correctly converts u32-unit offset/count to u64 units for the 64-bit path. |
| src/components/tl/cuda/reduce_scatterv/reduce_scatterv_nvls.c | Extends validation and offset/count conversion to cover INT32, UINT32, INT64, and UINT64. Renames offset_u32/count_u32 to offset/count to reflect their polymorphic meaning. Alignment checks correctly enforce that INT64/UINT64 element counts are even (multiples of 2) before dispatch. |
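The unit conversion and alignment check described in the last two rows above can be sketched host-side. This is a hypothetical simplification (function and parameter names are illustrative, not the actual `reduce_scatterv_nvls.c` identifiers): offsets and counts are tracked in 32-bit units, the 64-bit kernels consume them in 64-bit units, so both values must be even before halving.

```cpp
#include <cstdint>

// Hypothetical sketch of the 64-bit dispatch math: convert offset/count
// from u32 units to u64 units, rejecting misaligned (odd) values the way
// the validation logic does before dispatching the 64-bit kernel path.
static bool convert_u32_units_to_u64(uint64_t offset_u32_units,
                                     uint64_t count_u32_units,
                                     uint64_t *offset_u64_units,
                                     uint64_t *count_u64_units)
{
    if ((offset_u32_units % 2) != 0 || (count_u32_units % 2) != 0) {
        return false; // misaligned for 64-bit access: reject, as validation does
    }
    *offset_u64_units = offset_u32_units / 2;
    *count_u64_units  = count_u32_units / 2;
    return true;
}
```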
Comments Outside Diff (1)
- .ci/pipeline/test_nvls_matrix.yaml, lines 83-90: Missing `always:` cleanup block on allreduce step.

  The original single test step had an `always:` handler to stop the Slurm allocation unconditionally. After the split into two steps, the `always: stop_slurm_allocation.sh` was moved exclusively to the reduce_scatter step (lines 100-101); the allreduce step has no cleanup handler of its own. If the allreduce step fails and the CI system stops executing subsequent steps (instead of continuing to the reduce_scatter step), stop_slurm_allocation.sh will never run and the Slurm job will remain allocated until the SLURM_JOB_TIMEOUT wall-clock limit is hit, consuming cluster resources unnecessarily.

  The allreduce step should also carry its own `always:` block (or an `onfail:` at minimum):

  ```yaml
  - name: Run UCC NVLS MPI tests (allreduce)
    containerSelector: "{name: 'build_helper'}"
    timeout: "${TEST_TIMEOUT_MINUTES}"
    run: |
      set -x
      export DOCKER_IMAGE_NAME="${registry_host}#torch-ucc/${UCC_URI_SUFFIX}:${DOCKER_IMAGE_TAG}"
      export SLURM_JOB_ID=$(cat ${WORKSPACE}/job_id.txt)
      sudo -E -u svcnbu-swx-hpcx ${WORKSPACE}/.ci/scripts/run_nvls_slurm.sh '/opt/nvidia/src/ucc/.ci/scripts/run_tests_ucc_nvls_mpi.sh' ${NVLS_MPI_PPN:-4}
    onfail: |
      sudo -E -u svcnbu-swx-hpcx ${WORKSPACE}/.ci/scripts/stop_slurm_allocation.sh
  ```
Last reviewed commit: fe24e1b
ikryukov left a comment:
Need to get nvls ci working.
Additional Comments (1)
- Add validation like
- ucc_test_mpi log for correctness check:
This PR adds support for the following additional integer data types in NVLS (NVLink SHARP) collective operations allreduce and reduce_scatter:

- INT32 (s32): 32-bit signed integer
- INT64 (s64): 64-bit signed integer
- UINT32 (u32): 32-bit unsigned integer
- UINT64 (u64): 64-bit unsigned integer

- Added PTX multimem.ld_reduce and multimem.st instructions for each type
- Created NvlsOps structs for type-specific operations
- Updated allreduce and reduce_scatter kernels with new data type handling
- Modified validation logic to accept new data types

Signed-off-by: Juee14Desai <jueehimalbha@nvidia.com>
This commit adds the new data types and enables the reduce_scatter MPI test in NVLS CI. Signed-off-by: Juee14Desai <jueehimalbha@nvidia.com>
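The datatype-to-PTX mapping the commits describe can be sketched as a small lookup. This is a hypothetical illustration (the function and keys are not the actual nvls.cuh identifiers): multimem.ld_reduce supports add.s32, but PTX has no s64 multimem variant, so both 64-bit integer types use u64, which is bit-identical for two's-complement addition.

```cpp
#include <map>
#include <string>

// Hypothetical sketch of the datatype -> PTX multimem type-suffix mapping
// implied by the new NvlsOps structs (illustrative names, not real API).
static std::string nvls_ptx_suffix(const std::string &dtype)
{
    static const std::map<std::string, std::string> suffix = {
        {"int32",  "s32"}, // NvlsInt32Ops  -> multimem.ld_reduce ... add.s32
        {"uint32", "u32"}, // NvlsUint32Ops -> ... add.u32
        {"int64",  "u64"}, // NvlsInt64Ops  -> ... add.u64 (no s64 in multimem)
        {"uint64", "u64"}, // NvlsUint64Ops -> ... add.u64
    };
    auto it = suffix.find(dtype);
    return it == suffix.end() ? "" : it->second;
}
```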
This PR adds support for the following additional integer data types in NVLS (NVLink SHARP) collective operations allreduce and reduce_scatter:
- INT32 (s32): 32-bit signed integer with v4 vectorization
- INT64 (s64): 64-bit signed integer with v2 vectorization
- UINT32 (u32): 32-bit unsigned integer with v4 vectorization
- UINT64 (u64): 64-bit unsigned integer with v2 vectorization

- Added PTX multimem.ld_reduce and multimem.st instructions for each type
- Created NvlsOps structs for type-specific operations
- Updated allreduce and reduce_scatter kernels with new data type handling
- Modified validation logic to accept new data types
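The v4/v2 unroll guard the reviewers flagged can be illustrated host-side. This is a minimal single-threaded sketch, not the actual CUDA kernel (which strides by gridDim.x * blockDim.x and uses multimem PTX loads/stores): the unrolled loop must check idx + 3 against the bound, with a scalar tail for the remainder, so unaligned chunk sizes never read out of bounds.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Host-side sketch of the guarded v4 unroll pattern (illustrative only).
// Copies src[chunk_start, chunk_end) to dst; the real kernel would apply
// a multimem reduction instead of a plain copy.
static void process_v4_guarded(const std::vector<uint32_t> &src,
                               std::vector<uint32_t> &dst,
                               size_t chunk_start, size_t chunk_end)
{
    size_t idx = chunk_start;
    // Guard on idx + 3, not idx: the last partial group of fewer than 4
    // elements falls through to the tail loop instead of reading past
    // chunk_end when the chunk size is not a multiple of 4.
    for (; idx + 3 < chunk_end; idx += 4) {
        dst[idx]     = src[idx];
        dst[idx + 1] = src[idx + 1];
        dst[idx + 2] = src[idx + 2];
        dst[idx + 3] = src[idx + 3];
    }
    // Scalar tail for the remaining 0-3 elements.
    for (; idx < chunk_end; ++idx) {
        dst[idx] = src[idx];
    }
}
```

The same shape applies to the v2 (64-bit) kernels with idx + 1 in the guard; an unguarded `idx < chunk_end` condition combined with `idx + 3` accesses is the out-of-bounds pattern noted in the review threads.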