
TL/CUDA: Add INT32, INT64, UINT32, UINT64 support for NVLS #1259

Open
Juee14Desai wants to merge 2 commits into openucx:master from Juee14Desai:master

Conversation

@Juee14Desai
Collaborator

This PR adds support for the following additional integer data types in the NVLS (NVLink SHARP) collective operations allreduce and reduce_scatter:

  • INT32 (s32): 32-bit signed integer with v4 vectorization

  • INT64 (s64): 64-bit signed integer with v2 vectorization

  • UINT32 (u32): 32-bit unsigned integer with v4 vectorization

  • UINT64 (u64): 64-bit unsigned integer with v2 vectorization

  • Added PTX multimem.ld_reduce and multimem.st instructions for each type

  • Created NvlsOps structs for type-specific operations

  • Updated allreduce and reduce_scatter kernels with new data type handling

  • Modified validation logic to accept new data types

@janjust janjust requested a review from ikryukov January 29, 2026 16:09
@janjust
Collaborator

janjust commented Feb 12, 2026

/build

@Juee14Desai Juee14Desai force-pushed the master branch 5 times, most recently from 31c35ba to bed3dc0 Compare February 20, 2026 00:50
@Juee14Desai Juee14Desai requested a review from ikryukov February 20, 2026 00:59
@ikryukov
Collaborator

/build

@greptile-apps
Contributor

greptile-apps bot commented Feb 20, 2026

Greptile Summary

This PR extends NVLS (NVLink SHARP) collective operations (allreduce and reduce_scatter) to support four additional integer types — INT32, UINT32 (v4 scalar unroll) and INT64, UINT64 (v2 scalar unroll) — in addition to the existing FP32 and BF16 support. New PTX multimem.ld_reduce / multimem.st ops, kernel templates, and dispatch logic are added consistently across both collective paths, and CI is updated with a dedicated reduce_scatter test script and expanded dtype coverage.

Key observations:

  • The PTX type selection is correct: multimem.ld_reduce.global.add.s32 is a valid instruction; s64 is not supported by the hardware, so u64 is used for both INT64 and UINT64 ops (two's complement addition produces identical bit patterns).
  • The reduce_scatter scalar kernels correctly adopt the idx + N < end loop-bound guard (instead of idx < end) to avoid out-of-bounds accesses when chunk sizes are not a multiple of the unroll factor.
  • The allreduce scalar kernels (lines 70 and 105 of allreduce_kernel.cu) still use idx < chunk_end while performing multi-element accesses — a pre-existing concern noted in the previous review threads.
  • The reduce_scatterv init path correctly validates that INT64/UINT64 element counts are even (enforcing 16-byte alignment) before dispatch.
  • The CI pipeline splits the single MPI test step into two (allreduce + reduce_scatter), but the always: stop_slurm_allocation.sh cleanup was moved entirely to the reduce_scatter step, leaving the allreduce step without a cleanup handler.

Confidence Score: 3/5

  • The core kernel logic is largely correct, but a CI cleanup regression and pre-existing loop-bound concerns in the new allreduce scalar kernels warrant attention before merging.
  • The new reduce_scatter kernels are well-guarded and the reduce_scatterv init path validates alignment correctly. However, the allreduce scalar64/scalar32 kernels inherit the idx < chunk_end loop bound while accessing idx+1/idx+3, which can silently mis-reduce the last element(s) for non-aligned counts — a concern already raised in the existing review threads. Additionally, moving the Slurm cleanup always: block exclusively to the reduce_scatter step creates a resource-leak risk when the allreduce step fails and CI skips subsequent steps.
  • src/components/tl/cuda/kernels/allreduce_kernel.cu (loop-bound safety for scalar kernels) and .ci/pipeline/test_nvls_matrix.yaml (missing cleanup handler on the allreduce step).

Important Files Changed

Filename Overview
.ci/pipeline/test_nvls_matrix.yaml Splits the single MPI test step into allreduce and reduce_scatter steps, but the always: stop_slurm_allocation.sh cleanup block is only placed on the new reduce_scatter step — the allreduce step has no cleanup handler, risking a leaked Slurm allocation if it fails.
src/components/tl/cuda/kernels/nvls.cuh Adds NvlsInt32Ops, NvlsUint32Ops, NvlsInt64Ops, and NvlsUint64Ops structs. PTX types chosen are valid: add.s32 is supported by multimem; s64 is not, so u64 is used for both 64-bit variants. NvlsInt64Ops and NvlsUint64Ops are functionally identical (noted in an existing thread).
src/components/tl/cuda/kernels/allreduce_kernel.cu Adds allreduce_kernel_scalar32 (v4 unroll) and allreduce_kernel_scalar64 (v2 unroll) kernels and dispatches them for the new types. The loop bounds (idx < chunk_end while accessing idx+3 or idx+1) remain a potential out-of-bounds concern when chunk sizes are not aligned to the unroll factor (noted in prior review threads).
src/components/tl/cuda/kernels/reduce_scatter_kernel.cu Adds reduce_scatter_kernel_scalar32 and reduce_scatter_kernel_scalar64 with proper idx+3 < and idx+1 < guards (fixing the OOB pattern from the vectorised kernels). Dispatch correctly converts u32-unit offset/count to u64 units for the 64-bit path.
src/components/tl/cuda/reduce_scatterv/reduce_scatterv_nvls.c Extends validation and offset/count conversion to cover INT32, UINT32, INT64, and UINT64. Renames offset_u32/count_u32 to offset/count to reflect their polymorphic meaning. Alignment checks correctly enforce that INT64/UINT64 element counts are even (multiples of 2) before dispatch.

Comments Outside Diff (1)

  1. .ci/pipeline/test_nvls_matrix.yaml, lines 83-90

    Missing always: cleanup block on allreduce step

    The original single test step had an always: handler to stop the Slurm allocation unconditionally. After splitting into two steps, the always: stop_slurm_allocation.sh was moved exclusively to the reduce_scatter step (line 100–101). The allreduce step has no cleanup handler of its own.

    If the allreduce step fails and the CI system stops executing subsequent steps (instead of continuing to the reduce_scatter step), stop_slurm_allocation.sh will never run and the Slurm job will remain allocated until the SLURM_JOB_TIMEOUT wall-clock limit is hit, consuming cluster resources unnecessarily.

    The allreduce step should also carry its own always: block (or an onfail: at minimum):

      - name: Run UCC NVLS MPI tests (allreduce)
        containerSelector: "{name: 'build_helper'}"
        timeout: "${TEST_TIMEOUT_MINUTES}"
        run: |
          set -x
          export DOCKER_IMAGE_NAME="${registry_host}#torch-ucc/${UCC_URI_SUFFIX}:${DOCKER_IMAGE_TAG}"
          export SLURM_JOB_ID=$(cat ${WORKSPACE}/job_id.txt)
          sudo -E -u svcnbu-swx-hpcx ${WORKSPACE}/.ci/scripts/run_nvls_slurm.sh '/opt/nvidia/src/ucc/.ci/scripts/run_tests_ucc_nvls_mpi.sh' ${NVLS_MPI_PPN:-4}
        onfail: |
          sudo -E -u svcnbu-swx-hpcx ${WORKSPACE}/.ci/scripts/stop_slurm_allocation.sh

Last reviewed commit: fe24e1b

@greptile-apps greptile-apps bot (Contributor) left a comment

7 files reviewed, no comments

@ikryukov ikryukov (Collaborator) left a comment

Need to get NVLS CI working.

@Juee14Desai Juee14Desai force-pushed the master branch 2 times, most recently from a66fcc6 to 94fb9da Compare February 24, 2026 06:55
@greptile-apps greptile-apps bot (Contributor) left a comment

6 files reviewed, 3 comments

@greptile-apps greptile-apps bot (Contributor) left a comment

6 files reviewed, no comments

@Juee14Desai Juee14Desai requested a review from ikryukov February 25, 2026 01:44
@ikryukov
Collaborator

/build

@Juee14Desai
Collaborator Author

/build


@ikryukov
Collaborator

/build

@greptile-apps
Contributor

greptile-apps bot commented Feb 26, 2026

Additional Comments (1)

src/components/tl/cuda/allreduce/allreduce_nvls.c, line 201
Missing alignment validation for integer types. The scalar32 kernel processes 4 elements per thread (lines 70-79 in allreduce_kernel.cu), and scalar64 processes 2 (lines 105-110). Without alignment checks, per-rank chunks may have unaligned sizes (e.g., count=1000 with 3 ranks gives chunks of 333, 333, 334 elements), causing tail elements to be skipped.

Add validation like reduce_scatterv_nvls.c:227:

    if (buf_size < 1024 || coll_args->args.op != UCC_OP_SUM ||
        (coll_args->args.dst.info.datatype != UCC_DT_FLOAT32 &&
         coll_args->args.dst.info.datatype != UCC_DT_BFLOAT16 &&
         coll_args->args.dst.info.datatype != UCC_DT_INT32 &&
         coll_args->args.dst.info.datatype != UCC_DT_UINT32 &&
         coll_args->args.dst.info.datatype != UCC_DT_INT64 &&
         coll_args->args.dst.info.datatype != UCC_DT_UINT64)) {
        tl_debug(
            UCC_TL_TEAM_LIB(team),
            "NVLS allreduce is supported only with SUM operation "
            "and float32, bfloat16, int32, uint32, int64, or uint64 datatype, with message size >= 1024 "
            "bytes");
        return UCC_ERR_NOT_SUPPORTED;
    }
    
    // Validate alignment for integer types (4-element processing for 32-bit, 2-element for 64-bit)
    size_t count = coll_args->args.dst.info.count;
    ucc_datatype_t dt = coll_args->args.dst.info.datatype;
    if ((dt == UCC_DT_INT32 || dt == UCC_DT_UINT32) && count % 4 != 0) {
        tl_debug(UCC_TL_TEAM_LIB(team),
                 "INT32/UINT32 count must be multiple of 4, got %zu", count);
        return UCC_ERR_NOT_SUPPORTED;
    }
    if ((dt == UCC_DT_INT64 || dt == UCC_DT_UINT64) && count % 2 != 0) {
        tl_debug(UCC_TL_TEAM_LIB(team),
                 "INT64/UINT64 count must be multiple of 2, got %zu", count);
        return UCC_ERR_NOT_SUPPORTED;
    }

@janjust janjust requested a review from QiaoK February 26, 2026 16:08
@Juee14Desai
Collaborator Author

ucc_test_mpi log for correctness check:
run-tests-53303.txt

This PR adds support for the following additional integer data types in NVLS (NVLink SHARP) collective operations allreduce and reduce_scatter:

- INT32 (s32): 32-bit signed integer

- INT64 (s64): 64-bit signed integer

- UINT32 (u32): 32-bit unsigned integer

- UINT64 (u64): 64-bit unsigned integer

Added PTX multimem.ld_reduce and multimem.st instructions for each type

Created NvlsOps structs for type-specific operations

Updated allreduce and reduce_scatter kernels with new data type handling

Modified validation logic to accept new data types

Signed-off-by: Juee14Desai <jueehimalbha@nvidia.com>
This commit adds new data types and enables reduce scatter
mpi test for nvls CI.

Signed-off-by: Juee14Desai <jueehimalbha@nvidia.com>
@Juee14Desai
Collaborator Author

/build

