
TL/CUDA: Add INT32, INT64, UINT32, UINT64 support for NVLS #1259

Open
Juee14Desai wants to merge 2 commits into openucx:master from Juee14Desai:master

Conversation

@Juee14Desai
Collaborator

This PR adds support for the following additional integer data types in the NVLS (NVLink SHARP) collective operations allreduce and reduce_scatter:

  • INT32 (s32): 32-bit signed integer with v4 vectorization

  • INT64 (s64): 64-bit signed integer with v2 vectorization

  • UINT32 (u32): 32-bit unsigned integer with v4 vectorization

  • UINT64 (u64): 64-bit unsigned integer with v2 vectorization

  • Added PTX multimem.ld_reduce and multimem.st instructions for each type

  • Created NvlsOps structs for type-specific operations

  • Updated allreduce and reduce_scatter kernels with new data type handling

  • Modified validation logic to accept new data types

@janjust janjust requested a review from ikryukov January 29, 2026 16:09
@janjust
Collaborator

janjust commented Feb 12, 2026

/build

@Juee14Desai Juee14Desai force-pushed the master branch 5 times, most recently from 31c35ba to bed3dc0 Compare February 20, 2026 00:50
@Juee14Desai Juee14Desai requested a review from ikryukov February 20, 2026 00:59
@ikryukov
Collaborator

/build

@greptile-apps
Contributor

greptile-apps bot commented Feb 20, 2026

Greptile Summary

This PR extends NVLS (NVLink SHARP) collective operations (allreduce and reduce_scatter) to support four additional integer types — INT32, UINT32 (v4 scalar unroll) and INT64, UINT64 (v2 scalar unroll) — in addition to the existing FP32 and BF16 support. New PTX multimem.ld_reduce / multimem.st ops, kernel templates, and dispatch logic are added consistently across both collective paths, and CI is updated with a dedicated reduce_scatter test script and expanded dtype coverage.

Key observations:

  • The PTX type selection is correct: multimem.ld_reduce.global.add.s32 is a valid instruction; s64 is not supported by the hardware, so u64 is used for both INT64 and UINT64 ops (two's complement addition produces identical bit patterns).
  • The reduce_scatter scalar kernels correctly adopt the idx + N < end loop-bound guard (instead of idx < end) to avoid out-of-bounds accesses when chunk sizes are not a multiple of the unroll factor.
  • The allreduce scalar kernels (lines 70 and 105 of allreduce_kernel.cu) still use idx < chunk_end while performing multi-element accesses — a pre-existing concern noted in the previous review threads.
  • The reduce_scatterv init path correctly validates that INT64/UINT64 element counts are even (enforcing 16-byte alignment) before dispatch.
  • The CI pipeline splits the single MPI test step into two (allreduce + reduce_scatter), but the always: stop_slurm_allocation.sh cleanup was moved entirely to the reduce_scatter step, leaving the allreduce step without a cleanup handler.

Confidence Score: 3/5

  • The core kernel logic is largely correct, but a CI cleanup regression and pre-existing loop-bound concerns in the new allreduce scalar kernels warrant attention before merging.
  • The new reduce_scatter kernels are well-guarded and the reduce_scatterv init path validates alignment correctly. However, the allreduce scalar64/scalar32 kernels inherit the idx < chunk_end loop bound while accessing idx+1/idx+3, which can silently mis-reduce the last element(s) for non-aligned counts — a concern already raised in the existing review threads. Additionally, moving the Slurm cleanup always: block exclusively to the reduce_scatter step creates a resource-leak risk when the allreduce step fails and CI skips subsequent steps.
  • src/components/tl/cuda/kernels/allreduce_kernel.cu (loop-bound safety for scalar kernels) and .ci/pipeline/test_nvls_matrix.yaml (missing cleanup handler on the allreduce step).

Important Files Changed

Filename Overview
.ci/pipeline/test_nvls_matrix.yaml Splits the single MPI test step into allreduce and reduce_scatter steps, but the always: stop_slurm_allocation.sh cleanup block is only placed on the new reduce_scatter step — the allreduce step has no cleanup handler, risking a leaked Slurm allocation if it fails.
src/components/tl/cuda/kernels/nvls.cuh Adds NvlsInt32Ops, NvlsUint32Ops, NvlsInt64Ops, and NvlsUint64Ops structs. PTX types chosen are valid: add.s32 is supported by multimem; s64 is not, so u64 is used for both 64-bit variants. NvlsInt64Ops and NvlsUint64Ops are functionally identical (noted in an existing thread).
src/components/tl/cuda/kernels/allreduce_kernel.cu Adds allreduce_kernel_scalar32 (v4 unroll) and allreduce_kernel_scalar64 (v2 unroll) kernels and dispatches them for the new types. The loop bounds (idx < chunk_end while accessing idx+3 or idx+1) remain a potential out-of-bounds concern when chunk sizes are not aligned to the unroll factor (noted in prior review threads).
src/components/tl/cuda/kernels/reduce_scatter_kernel.cu Adds reduce_scatter_kernel_scalar32 and reduce_scatter_kernel_scalar64 with proper idx+3 < and idx+1 < guards (fixing the OOB pattern from the vectorised kernels). Dispatch correctly converts u32-unit offset/count to u64 units for the 64-bit path.
src/components/tl/cuda/reduce_scatterv/reduce_scatterv_nvls.c Extends validation and offset/count conversion to cover INT32, UINT32, INT64, and UINT64. Renames offset_u32/count_u32 to offset/count to reflect their polymorphic meaning. Alignment checks correctly enforce that INT64/UINT64 element counts are even (multiples of 2) before dispatch.

Comments Outside Diff (1)

  1. .ci/pipeline/test_nvls_matrix.yaml, lines 83-90

    Missing always: cleanup block on allreduce step

    The original single test step had an always: handler to stop the Slurm allocation unconditionally. After splitting into two steps, the always: stop_slurm_allocation.sh was moved exclusively to the reduce_scatter step (line 100–101). The allreduce step has no cleanup handler of its own.

    If the allreduce step fails and the CI system stops executing subsequent steps (instead of continuing to the reduce_scatter step), stop_slurm_allocation.sh will never run and the Slurm job will remain allocated until the SLURM_JOB_TIMEOUT wall-clock limit is hit, consuming cluster resources unnecessarily.

    The allreduce step should also carry its own always: block (or an onfail: at minimum):

      - name: Run UCC NVLS MPI tests (allreduce)
        containerSelector: "{name: 'build_helper'}"
        timeout: "${TEST_TIMEOUT_MINUTES}"
        run: |
          set -x
          export DOCKER_IMAGE_NAME="${registry_host}#torch-ucc/${UCC_URI_SUFFIX}:${DOCKER_IMAGE_TAG}"
          export SLURM_JOB_ID=$(cat ${WORKSPACE}/job_id.txt)
          sudo -E -u svcnbu-swx-hpcx ${WORKSPACE}/.ci/scripts/run_nvls_slurm.sh '/opt/nvidia/src/ucc/.ci/scripts/run_tests_ucc_nvls_mpi.sh' ${NVLS_MPI_PPN:-4}
        onfail: |
          sudo -E -u svcnbu-swx-hpcx ${WORKSPACE}/.ci/scripts/stop_slurm_allocation.sh

Last reviewed commit: fe24e1b

@greptile-apps greptile-apps bot (Contributor) left a comment

7 files reviewed, no comments

@ikryukov ikryukov (Collaborator) left a comment

Need to get NVLS CI working.

@Juee14Desai Juee14Desai force-pushed the master branch 2 times, most recently from a66fcc6 to 94fb9da Compare February 24, 2026 06:55
@greptile-apps greptile-apps bot (Contributor) left a comment

6 files reviewed, 3 comments

@greptile-apps greptile-apps bot (Contributor) left a comment

6 files reviewed, no comments

@Juee14Desai Juee14Desai requested a review from ikryukov February 25, 2026 01:44
@ikryukov
Collaborator

/build

@Juee14Desai
Collaborator Author

/build


@ikryukov
Collaborator

/build

@greptile-apps
Contributor

greptile-apps bot commented Feb 26, 2026

Additional Comments (1)

src/components/tl/cuda/allreduce/allreduce_nvls.c, line 201
Missing alignment validation for integer types. The scalar32 kernel processes 4 elements per thread (lines 70-79 in allreduce_kernel.cu), and scalar64 processes 2 (lines 105-110). Without alignment checks, per-rank chunks may have unaligned sizes (e.g., count=1000 with 3 ranks gives chunks of 333, 333, 334 elements), causing tail elements to be skipped.

Add validation like reduce_scatterv_nvls.c:227:

    if (buf_size < 1024 || coll_args->args.op != UCC_OP_SUM ||
        (coll_args->args.dst.info.datatype != UCC_DT_FLOAT32 &&
         coll_args->args.dst.info.datatype != UCC_DT_BFLOAT16 &&
         coll_args->args.dst.info.datatype != UCC_DT_INT32 &&
         coll_args->args.dst.info.datatype != UCC_DT_UINT32 &&
         coll_args->args.dst.info.datatype != UCC_DT_INT64 &&
         coll_args->args.dst.info.datatype != UCC_DT_UINT64)) {
        tl_debug(
            UCC_TL_TEAM_LIB(team),
            "NVLS allreduce is supported only with SUM operation "
            "and float32, bfloat16, int32, uint32, int64, or uint64 datatype, with message size >= 1024 "
            "bytes");
        return UCC_ERR_NOT_SUPPORTED;
    }
    
    // Validate alignment for integer types (4-element processing for 32-bit, 2-element for 64-bit)
    size_t count = coll_args->args.dst.info.count;
    ucc_datatype_t dt = coll_args->args.dst.info.datatype;
    if ((dt == UCC_DT_INT32 || dt == UCC_DT_UINT32) && count % 4 != 0) {
        tl_debug(UCC_TL_TEAM_LIB(team),
                 "INT32/UINT32 count must be multiple of 4, got %zu", count);
        return UCC_ERR_NOT_SUPPORTED;
    }
    if ((dt == UCC_DT_INT64 || dt == UCC_DT_UINT64) && count % 2 != 0) {
        tl_debug(UCC_TL_TEAM_LIB(team),
                 "INT64/UINT64 count must be multiple of 2, got %zu", count);
        return UCC_ERR_NOT_SUPPORTED;
    }

@janjust janjust requested a review from QiaoK February 26, 2026 16:08
@Juee14Desai
Collaborator Author

ucc_test_mpi log for correctness check:
run-tests-53303.txt

This PR adds support for the following additional integer data types in NVLS (NVLink SHARP) collective operations allreduce and reduce_scatter:

- INT32 (s32): 32-bit signed integer

- INT64 (s64): 64-bit signed integer

- UINT32 (u32): 32-bit unsigned integer

- UINT64 (u64): 64-bit unsigned integer

Added PTX multimem.ld_reduce and multimem.st instructions for each type

Created NvlsOps structs for type-specific operations

Updated allreduce and reduce_scatter kernels with new data type handling

Modified validation logic to accept new data types

Signed-off-by: Juee14Desai <jueehimalbha@nvidia.com>
This commit adds new data types and enables reduce scatter
mpi test for nvls CI.

Signed-off-by: Juee14Desai <jueehimalbha@nvidia.com>
@Juee14Desai
Collaborator Author

/build

