
[None][infra] upgrade base 2602#12102

Open
EmmaQiaoCh wants to merge 42 commits into NVIDIA:main from EmmaQiaoCh:emma/upgrade_base_2602

Conversation

EmmaQiaoCh (Collaborator) commented Mar 11, 2026

Summary by CodeRabbit

Release Notes

  • New Features

    • Added distributed memory pooling and expert-parallel routing support for advanced model optimization
    • Introduced ragged tensor and sparse matrix constructs for improved data efficiency
    • Added roofline analysis utilities for performance benchmarking
  • Dependency Updates

    • Updated PyTorch from 2.9.1 to 2.10.0
    • Updated CUDA from 13.1.0 to 13.1.1
    • Updated TensorRT from 10.14.1 to 10.15.1
    • Updated Triton kernels from 3.5.1 to 3.6.0
    • Updated NCCL and other supporting libraries to compatible versions
  • Bug Fixes

    • Improved MXFP quantization precision and robustness
    • Enhanced bitmatrix operations and metadata handling
    • Fixed edge cases in reduction kernels
  • Documentation

    • Updated installation guides for PyTorch 2.10.0 compatibility

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

EmmaQiaoCh and others added 30 commits March 11, 2026 00:00
Signed-off-by: qqiao <qqiao@nvidia.com>
(cherry picked from commit 9a724c9)
Signed-off-by: qqiao <qqiao@nvidia.com>
(cherry picked from commit 48bf7b9)
Signed-off-by: qqiao <qqiao@nvidia.com>
(cherry picked from commit a6e10f8)
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
(cherry picked from commit e633ad8)
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
(cherry picked from commit 154b291)
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
(cherry picked from commit 30d1120)
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
(cherry picked from commit f08c6ac)
Signed-off-by: Emma Qiao <qqiao@nvidia.com>
(cherry picked from commit f6db4ec)
… 3.6.0

Fix the following issue:

```
fused_moe_triton.py:177, in __call__
    expt_scal, expt_indx, bitmatrix = topk(logits, ...)
TypeError: cannot unpack non-iterable SparseMatrix object
```
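The failure mode above is the classic tuple-to-object API change; a minimal sketch reproduces it (the `SparseMatrix` below is a hypothetical stand-in, not the real tensor-based `triton_kernels` class):

```python
from dataclasses import dataclass

# Hypothetical stand-in for the SparseMatrix now returned by topk; the real
# triton_kernels class has a different, tensor-based API.
@dataclass
class SparseMatrix:
    vals: list   # top-k scores per row
    indx: list   # top-k expert indices per row

def topk(logits, k):
    """Toy top-k: returns a single SparseMatrix object instead of a tuple."""
    vals, indx = [], []
    for row in logits:
        order = sorted(range(len(row)), key=lambda i: -row[i])[:k]
        indx.append(order)
        vals.append([row[i] for i in order])
    return SparseMatrix(vals, indx)

sm = topk([[0.1, 0.9, 0.4]], k=2)
# The old call site `expt_scal, expt_indx, bitmatrix = topk(...)` raises
# "TypeError: cannot unpack non-iterable SparseMatrix object"; callers now
# read attributes off the returned object instead:
expt_scal, expt_indx = sm.vals, sm.indx
```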

Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
(cherry picked from commit c17ec97)
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
(cherry picked from commit 3744bc2)
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
(cherry picked from commit 728b575)
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
(cherry picked from commit 5326bd7)
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
(cherry picked from commit 231a8dd)
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
(cherry picked from commit f2e6f9f)
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
(cherry picked from commit 90f0202)
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
(cherry picked from commit 10cf6ea)
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
(cherry picked from commit 0fd1666)
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
(cherry picked from commit b8c4588)
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
(cherry picked from commit 73d7d75)
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
(cherry picked from commit dc338b8)
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
(cherry picked from commit 4f7bcaf)
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
(cherry picked from commit 4468604)
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
(cherry picked from commit ec3012c)
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
(cherry picked from commit 9c1a369)
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
(cherry picked from commit 4302b6f)
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
(cherry picked from commit 4ef86da)
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
(cherry picked from commit 33a838e)
Signed-off-by: qqiao <qqiao@nvidia.com>
(cherry picked from commit 3d535a9)
Signed-off-by: qqiao <qqiao@nvidia.com>
(cherry picked from commit 6875d32)
Signed-off-by: qqiao <qqiao@nvidia.com>
(cherry picked from commit dff57cd)
EmmaQiaoCh and others added 8 commits March 11, 2026 00:19
Signed-off-by: qqiao <qqiao@nvidia.com>
(cherry picked from commit bb75bbb)
Signed-off-by: qqiao <qqiao@nvidia.com>
(cherry picked from commit 922d93b)
Signed-off-by: qqiao <qqiao@nvidia.com>
(cherry picked from commit 281cd69)
Signed-off-by: qqiao <qqiao@nvidia.com>
(cherry picked from commit 09e42c9)
Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com>
(cherry picked from commit 87521d1)
Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com>
(cherry picked from commit 1ba0367)
Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com>
(cherry picked from commit 150fc1c)
(cherry picked from commit 57aabe8)
Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com>
coderabbitai bot (Contributor) commented Mar 11, 2026

📝 Walkthrough

Walkthrough

This pull request updates various dependencies and infrastructure components, including version bumps for CUDA, PyTorch, Triton, and TensorRT, alongside significant refactoring of the triton_kernels module. The refactoring introduces new distributed memory pooling, metadata computation kernels, and restructures routing and matmul operation handling, while removing legacy routing implementations.

Changes

**Version and Dependency Updates**
Files: requirements.txt, requirements-dev.txt, ATTRIBUTIONS-Python.md, triton_kernels/VERSION, triton_kernels/__init__.py, triton_kernels/compaction.py, triton_kernels/README.md, triton_kernels/numerics_details/__init__.py, triton_kernels/proton_opts.py, triton_kernels/target_info.py, triton_kernels/topk_details/__init__.py, triton_kernels/topk_details/_topk_backward.py, triton_kernels/tensor_details/layout_details/opt_flags_amd.py, triton_kernels/tensor_details/layout_details/strided.py, triton_kernels/tensor_details/layout_details/layout.py
Version bumps for Triton (3.5.1 → 3.6.0), PyTorch (2.9.1 → 2.10.0), TensorRT (~10.14.1 → ~10.15.1), CUDA (13.1.0 → 13.1.1), NCCL, and other dependencies. Added new dependencies (pydantic, prometheus_fastapi_instrumentator, datasets, etc.). Updated source references and documentation links to reflect the new versions.

**Docker and Build Configuration**
Files: docker/Dockerfile.multi, docker/Makefile, docker/common/install_base.sh, docker/common/install_cuda_toolkit.sh, docker/common/install_pytorch.sh, docker/common/install_tensorrt.sh, jenkins/current_image_tags.properties, jenkins/Build.groovy, jenkins/L0_Test.groovy
Updated Docker base image tags (25.12 → 26.02), CUDA toolkit versions, PyTorch versions, TensorRT component versions, and image references in Jenkins CI/CD pipelines. Adjusted package cleanup logic and installation constraints. Refactored test dependency handling for the CUDA 13.0 stack (torch/torchvision) and added port allocation utilities via a GlobalState class.

**Documentation Updates**
Files: README.md, docs/source/installation/linux.md, docs/source/installation/build-from-source-linux.md, docs/source/legacy/reference/support-matrix.md, .pre-commit-config.yaml
Updated README badges and installation guides to reflect PyTorch 2.10.0 and CUDA 13.1.1. Added "indx" to the codespell ignore list. Updated container compatibility versions in the support matrix.

**Triton Kernels: Distributed Memory and Routing Restructuring**
Files: triton_kernels/distributed.py, triton_kernels/routing.py, triton_kernels/routing_details/__init__.py, triton_kernels/routing_details/_expt_data.py, triton_kernels/routing_details/_routing_compute.py
Removed the legacy Triton-based routing implementation (routing.py and routing_details/) and replaced it with new distributed memory pool infrastructure in distributed.py, adding ExptAssignment, SymmetricMemoryPool, and utilities for expert partitioning and cross-rank tensor routing (convert_dp_to_ep, convert_ep_to_dp).

**Triton Kernels: MatMul OGS and Specialization**
Files: triton_kernels/matmul_ogs.py, triton_kernels/matmul_ogs_details/_common.py, triton_kernels/matmul_ogs_details/_matmul_ogs.py, triton_kernels/matmul_ogs_details/_p_matmul_ogs.py, triton_kernels/matmul_ogs_details/_reduce_grouped.py, triton_kernels/matmul_ogs_details/opt_flags.py, triton_kernels/matmul_ogs_details/opt_flags_details/opt_flags_nvidia.py, triton_kernels/specialize.py
Restructured matmul_ogs to use the new GatherIndx, ScatterIndx, RoutingData, InnerRoutingData, and FusedComm data classes. Introduced SpecializationModule for kernel variant caching. Removed the legacy reduce_grouped kernel file and shifted to a SpecializationModule-based approach. Extended matmul kernels with support for output accumulation, inner routing, fused scatter/gather, and distributed post-communication. Updated opt_flags with batch_size, x_transpose, and has_y_acc_in parameters and removed the fused_scatter field.

**Triton Kernels: Numerics and Quantization**
Files: triton_kernels/numerics.py, triton_kernels/numerics_details/flexpoint.py, triton_kernels/numerics_details/mxfp.py, triton_kernels/numerics_details/mxfp_details/_downcast_to_mxfp.py, triton_kernels/numerics_details/mxfp_details/_upcast_from_mxfp.py
Added an is_per_batch property to OutFlexData. Enhanced mxfp quantization/dequantization with alignment padding, descriptor-based kernel dispatch, RTNE rounding, and hardware-capability-gated FP4/FP8 conversion via inline PTX. Updated the flexpoint clip to use tl.clamp. Added _get_max_power_of_2_quant_val and a cuda_capability_geq import.

**Triton Kernels: Tensor and Layout Restructuring**
Files: triton_kernels/tensor.py, triton_kernels/tensor_details/layout.py, triton_kernels/tensor_details/layout_details/base.py, triton_kernels/tensor_details/layout_details/hopper_scale.py, triton_kernels/tensor_details/layout_details/hopper_value.py, triton_kernels/tensor_details/layout_details/blackwell_value.py, triton_kernels/tensor_details/layout_details/blackwell_scale.py, triton_kernels/tensor_details/layout_details/cdna4_scale.py
Converted layout classes to dataclasses with fields (mx_axis, num_warps, shape). Added BlackwellMXValueLayout, BlackwellMXScaleLayout, and CDNA4MXScaleLayout with swizzle/unswizzle logic. Extended Tensor with dtype, shape, and shape_max fields. Added RaggedTensor and SparseMatrix data structures with metadata utilities. Introduced layout-aware padding and descriptor-based operations. Updated make_default_matmul_mxfp4_w_layout to use BlackwellMXValueLayout for CUDA capability ≥ 10.

**Triton Kernels: New Metadata and Reduction Modules**
Files: triton_kernels/tensor_details/bitmatrix.py, triton_kernels/tensor_details/bitmatrix_details/sum_bitmatrix_rows.py, triton_kernels/tensor_details/ragged_tensor.py, triton_kernels/reduce.py, triton_kernels/roofline.py, triton_kernels/topk.py, triton_kernels/topk_details/_topk_forward.py, triton_kernels/testing.py
Added new modules: bitmatrix.py with BitmatrixMetadata and stage-based kernel computation, ragged_tensor.py with RaggedTensorMetadata and remapping utilities, reduce.py with a PostprocessFn-based reduction kernel and SpecializationModule wiring, and roofline.py with performance analysis utilities. Updated topk.py to return SparseMatrix and support a distributed all_gather mode with per-rank buffering. Enhanced sum_bitmatrix_rows with new output naming and an updated wrapper API. Added a per_batch_scale parameter to testing utilities. Updated the _topk_forward signature to support multi-peer writes via PeerYvs, PeerYis, and PeerBits arrays.

**Triton Kernels: Auto-Deploy and Fused Operations**
Files: tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/mxfp4_moe.py, tensorrt_llm/_torch/modules/fused_moe/fused_moe_triton.py, triton_kernels/swiglu.py, triton_kernels/swiglu_details/_swiglu.py
Updated mxfp4_moe.py to import GatherIndx, ScatterIndx, and RoutingData from matmul_ogs and use TritonEPRouter instead of the routing function. Updated the RouteFn type alias to return a tuple of (RoutingData, GatherIndx, ScatterIndx). Modified fused_moe_triton.py to add a _routing_clear_bitmatrix kernel, use bitmatrix.mask for the routing workflow, support a reduction_n parameter in FnSpecs for swiglu activations, and rebuild routing structures from updated metadata. Updated the swiglu.py token data source from token_offs_raw to token_offs. Replaced the clip logic in swiglu_details with tl.clamp.

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes


coderabbitai bot (Contributor) left a comment

Actionable comments posted: 11

🧹 Nitpick comments (4)
.pre-commit-config.yaml (1)

1445-1445: Narrow the scope of the indx exception using a per-file or path-scoped mechanism.

Since indx only appears in kernel files (triton_kernels/ and tensorrt_llm/_torch/modules/fused_moe/), replace the global -L entry with either a .codespellignore file with path patterns or per-file # codespell:ignore=indx directives. This preserves codespell coverage for the rest of the codebase.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.pre-commit-config.yaml at line 1445, The current pre-commit config adds a
global codespell ignore entry "indx" in the args line, which suppresses checks
across the repo; instead narrow scope by removing "indx" from the args list and
add targeted ignores only where "indx" is legitimate: create a .codespellignore
file (or per-directory rules) that includes path patterns for triton_kernels/
and tensorrt_llm/_torch/modules/fused_moe/, or insert per-file comments `#
codespell:ignore=indx` in the specific kernel source files; update the args line
(the "-L" list) to remove "indx" so codespell runs normally elsewhere and
confirm codespell still skips only the intended kernel paths.
triton_kernels/matmul_ogs.py (2)

176-176: Use assert False instead of assert None.

Line 176 uses assert None, which always fails at runtime (None is falsy, so it is equivalent to assert False). However, the intent is clearer with assert False or an explicit exception.

Suggested fix
-            assert None
+            raise AssertionError("Unexpected data type for InnerRoutingData.make_kernel_args")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@triton_kernels/matmul_ogs.py` at line 176, Replace the ambiguous assertion
"assert None" with an explicit failing assertion or exception: change it to
"assert False, 'unexpected code path'" or raise an AssertionError/RuntimeError
with a descriptive message so the failure is clear; locate the statement "assert
None" in triton_kernels/matmul_ogs.py and update it to an explicit failure (use
assert False or raise AssertionError("...")).

377-378: Parenthesize for clarity in boolean expression.

The expression on line 377 mixes and and or operators. While Python's precedence handles this correctly (and binds tighter than or), adding parentheses improves readability.

Suggested clarification
-    if w_has_mx and (torch.cuda.get_device_capability()[0] < 10 or w.storage.layout is not None and not isinstance(w.storage.layout, StridedLayout)):
+    if w_has_mx and (torch.cuda.get_device_capability()[0] < 10 or (w.storage.layout is not None and not isinstance(w.storage.layout, StridedLayout))):
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@triton_kernels/matmul_ogs.py` around lines 377 - 378, The boolean expression
in the if-condition using w_has_mx mixes and/or and should be parenthesized for
clarity: update the condition around the device-capability vs storage-layout
check so the grouping is explicit (i.e. keep w_has_mx as the first conjunct and
wrap the entire second part in parentheses, with an inner pair around the
storage-layout check), referencing the same symbols (w_has_mx,
torch.cuda.get_device_capability()[0], w.storage.layout, StridedLayout, and
w.stride) so the intention is clear that w_has_mx AND (device cap < 10 OR
(storage.layout is not None AND not isinstance(storage.layout, StridedLayout)))
before asserting on w.stride(-2).
tensorrt_llm/_torch/modules/fused_moe/fused_moe_triton.py (1)

177-180: Avoid directly mutating shape metadata; create a new shape list instead.

Line 179 directly mutates bitmatrix.mask.shape[-1]. While shape is a mutable list (not a tuple), directly mutating shape metadata is poor practice and makes the code harder to reason about. After compaction, create a new shape list:

```python
new_shape = bitmatrix.mask.shape[:-1] + [n_expts_tot]
bitmatrix.mask.shape = new_shape
```

Or use a helper method rather than direct list mutation.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/modules/fused_moe/fused_moe_triton.py` around lines 177 -
180, The code mutates shape metadata by assigning to bitmatrix.mask.shape[-1];
instead, build a new shape list and assign it back to bitmatrix.mask.shape (or
use a helper) after compaction: compute n_expts_tot = slice_end - slice_start,
create new_shape = bitmatrix.mask.shape[:-1] + [n_expts_tot], then set
bitmatrix.mask.shape = new_shape (update this in the function that calls
compaction and returns expt_scal, expt_indx, bitmatrix).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docker/common/install_tensorrt.sh`:
- Around line 5-13: The release-note URL in the comment should be updated to the
matching PyTorch container release notes for the new base (the one that
corresponds to CUDA_VER="13.1") and the CUBLAS pin should be set to the exact
value shown there; update the comment URL that currently references rel-25-12
and adjust CUBLAS_VER to match the upstream value (alongside confirming TRT_VER,
CUDA_VER, CUDNN_VER, and NCCL_VER still match the referenced release notes).

In `@docker/Dockerfile.multi`:
- Line 94: The release-stage removal uses a hardcoded artifact name
`wheel-0.45.1.dist-info` while the devel stage uses the glob
`wheel-*.dist-info`; update the release-stage rm command to use the same glob
pattern `wheel-*.dist-info` so both stages remove any wheel dist-info version
consistently (locate the second `rm -rf ...wheel-0.45.1.dist-info` line and
replace the hardcoded filename with the glob).

In `@triton_kernels/numerics_details/mxfp_details/_downcast_to_mxfp.py`:
- Around line 192-195: The masks currently use floor division (quant_dim //
K_DIVISOR and quant_dim // MXFP_BLOCK_SIZE) which drops any partial tail blocks;
change those comparisons to use ceiling division so the runtime bounds include a
final partial fp4 byte or scale block. Concretely, update the right-hand side of
the comparisons in mask_mxt_quant and scale_mask_k to use ceil(quant_dim /
K_DIVISOR) and ceil(quant_dim / MXFP_BLOCK_SIZE) (e.g., (quant_dim + K_DIVISOR -
1) // K_DIVISOR and (quant_dim + MXFP_BLOCK_SIZE - 1) // MXFP_BLOCK_SIZE) so
start_mx_quant + offs_mxt_quant and start_mx_scale_quant + offs_scale_quant will
allow the last partial block to be stored; keep the rest of the logic for
full_mask_mxt unchanged.
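The ceiling-division idiom the comment asks for is a one-liner; a minimal sketch (the variable names are illustrative, not the kernel's actual symbols):

```python
def cdiv(a, b):
    """Ceiling division: number of blocks of size b needed to cover a elements."""
    return (a + b - 1) // b

def pad_to_multiple(n, pad_size):
    """Round n up to the next multiple of pad_size."""
    return cdiv(n, pad_size) * pad_size

# Floor division drops a partial tail block; ceiling division keeps it in bounds.
quant_dim, K_DIVISOR = 70, 32
floor_blocks = quant_dim // K_DIVISOR     # 2: the 6-element tail would be masked out
ceil_blocks = cdiv(quant_dim, K_DIVISOR)  # 3: the final partial block can be stored
```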

In `@triton_kernels/reduce.py`:
- Around line 199-214: The code currently treats a dimension as broadcasted only
when its stride is zero, but must also treat size-1 dimensions as broadcasted;
update the mask and scale handling so after reading mstr0,mstr1,mstr2 (and
sstr0,sstr1,sstr2) you also read the corresponding sizes (e.g., msz0,msz1,msz2
via mask.shape() and ssz0,ssz1,ssz2 via scale.shape()) and set
stride_mr/stride_m0/stride_m1 to 0 whenever the selected size == 1 (similarly
zero out stride_sr/stride_s0/stride_s1 when the selected scale size == 1),
otherwise keep the original stride values — apply this logic where stride_m* and
stride_s* are computed so broadcastable size-1 tensors are normalized the same
as stride==0.
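The normalization the comment describes (treat size-1 dimensions like stride-0 dimensions) can be sketched in plain Python; the function name is illustrative:

```python
def normalize_strides(shape, strides):
    """Zero the stride of any size-1 dimension so broadcastable size-1
    tensors are handled identically to stride==0 tensors."""
    return tuple(0 if size == 1 else stride
                 for size, stride in zip(shape, strides))

# A contiguously stored (4, 1, 8) tensor: dim 1 has size 1, so its stride is
# irrelevant to addressing and is normalized to 0 before index arithmetic.
norm = normalize_strides((4, 1, 8), (8, 8, 1))
```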
- Around line 262-288: The crash is caused by using default-initialized
InFlexData()/OutFlexData() whose .scale, .actual_scale and .expected_scale
tensors are None; update reduce_torch to accept x_flex:
Optional[InFlexData]=None and y_flex: Optional[OutFlexData]=None (or keep
defaults but treat them as absent) and guard uses: only multiply by x_flex.scale
when x_flex is not None and x_flex.scale is a tensor, and only call
y_flex.actual_scale.copy_ and divide by y_flex.expected_scale when y_flex is not
None and those tensors are initialized (or initialize them inside reduce_torch
before use using compute_actual_scale), referencing reduce_torch, x_flex.scale,
y_flex.actual_scale, y_flex.expected_scale and compute_actual_scale to locate
the fixes.
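The guard pattern the comment asks for looks roughly like this (the InFlexData here is a minimal stand-in for the real numerics class, and reduce_torch_sketch is a toy reduction):

```python
from typing import Optional

class InFlexData:
    """Minimal stand-in: the real class carries scale tensors, which may be None."""
    def __init__(self, scale=None):
        self.scale = scale

def reduce_torch_sketch(xs, x_flex: Optional[InFlexData] = None):
    # A default-constructed InFlexData has scale=None; only apply the input
    # scale when flex data is present AND its scale is initialized.
    if x_flex is not None and x_flex.scale is not None:
        xs = [v * x_flex.scale for v in xs]
    return sum(xs)
```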

In `@triton_kernels/roofline.py`:
- Around line 196-201: The current runtime checks in roofline.py use assert and
only accept int, which hides under python -O and rejects valid floats; replace
these with explicit type validation that accepts ints or floats (e.g.,
isinstance(max_tbps, (int, float)) and isinstance(max_tflops, (int, float))) and
raise a clear TypeError or ValueError when the value is neither numeric nor one
of the allowed sentinel strings ("memset" for max_tbps and "cublas" for
max_tflops); when the sentinel strings are provided, call get_memset_tbps() and
get_cublas_tflops(flops_dtype) as currently done (refer to max_tbps, max_tflops,
get_memset_tbps, get_cublas_tflops), but otherwise leave numeric values as-is to
avoid later cryptic arithmetic errors.
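A sketch of the numeric-or-sentinel validation for one of the two parameters (get_memset_tbps is stubbed with a placeholder value here; the real roofline helper queries the device):

```python
def get_memset_tbps():
    """Stub for the real device bandwidth query."""
    return 3.35  # placeholder peak bandwidth in TB/s

def resolve_max_tbps(max_tbps):
    """Accept an int/float peak bandwidth or the 'memset' sentinel; raise a
    clear TypeError otherwise, instead of an assert that vanishes under -O."""
    if max_tbps == "memset":
        return get_memset_tbps()
    if not isinstance(max_tbps, (int, float)):
        raise TypeError(
            f"max_tbps must be a number or 'memset', got {type(max_tbps).__name__}")
    return max_tbps
```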
- Around line 76-85: The loop consumes intensity_proxy_values but
write_csv(intensity_proxy_values, perfs, out_path) later reuses it; if
intensity_proxy_values is a generator it will be exhausted and CSV/plotting will
fail—materialize it first (e.g., at start of the function convert
intensity_proxy_values to a list) before iterating and calling
inject_proxy_and_call so that write_csv and subsequent code receive the full
sequence; update references to intensity_proxy_values (used with
inject_proxy_and_call, perfs accumulation, and write_csv) accordingly.
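The generator-exhaustion hazard is easy to demonstrate, and materializing once at the top fixes it (function names here are illustrative, not the actual roofline API):

```python
def run_sweep(intensity_proxy_values, measure):
    # A generator argument would be exhausted by the measurement loop,
    # leaving nothing for the later CSV write; materialize it once up front.
    intensity_proxy_values = list(intensity_proxy_values)
    perfs = [measure(v) for v in intensity_proxy_values]
    return intensity_proxy_values, perfs  # both fully reusable downstream

# Passing a generator now works: the list() call captures all values first.
xs, perfs = run_sweep((i * 2 for i in range(3)), lambda v: v + 1)
```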
- Around line 178-183: The validate_perfs function fails to check that each
series has the same length as xs_ref and can raise IndexError or miss
shorter-series mismatches; update validate_perfs (perfs, xs_ref, flops_ref,
bytes_ref) to first compare lengths (e.g., if len(xs) != len(xs_ref): raise
ValueError(f"length mismatch between series[0] and series[{series_idx}]"))
before iterating samples, and ensure the raised ValueError uses the enumerate
index (the series id from enumerate(perfs[1:], start=1)) in the message rather
than the sample index.
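A sketch of the length-first validation with the series index (not the sample index) in the error message:

```python
def validate_perfs(perfs):
    """Check that every series matches the length of series[0], raising a
    ValueError that names the offending series rather than hitting an
    IndexError while iterating samples."""
    xs_ref = perfs[0]
    for series_idx, xs in enumerate(perfs[1:], start=1):
        if len(xs) != len(xs_ref):
            raise ValueError(
                f"length mismatch between series[0] and series[{series_idx}]")
```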

In `@triton_kernels/tensor_details/bitmatrix.py`:
- Around line 161-167: The inline lambda named pad inside
make_bitmatrix_metadata_torch triggers Ruff E731 (assigning a lambda to a name);
replace the lambda with a small local function def pad(x, total_size): ...
within the same scope of make_bitmatrix_metadata_torch (or move it to
module-level) and keep the same behavior (concatenate x with torch.full of -1 on
x.device to reach total_size) so col_sorted_indx and row_sorted_indx logic
remains unchanged; ensure the new function signature and use sites (pad(...))
match the previous lambda usage.
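The lambda-to-def change is mechanical; a list-based sketch of the shape (the real helper pads torch tensors with torch.full(..., -1, device=x.device)):

```python
def pad(x, total_size, fill=-1):
    """Named function instead of an E731 lambda assignment: extend x with
    `fill` entries until it reaches total_size."""
    return x + [fill] * (total_size - len(x))
```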

In `@triton_kernels/tensor_details/ragged_tensor.py`:
- Around line 93-98: Replace the inline lambdas in empty_aligned with small
local helper functions to satisfy Ruff E731: define a local def cdiv(x, y):
return (x + y - 1) // y and def pad(x): return cdiv(x, pad_size) * pad_size,
then use these helpers when computing pad(shape[-1]) and remove the lambda
assignments; keep ret, ret_slices and the returned slicing logic unchanged.

In `@triton_kernels/tensor.py`:
- Around line 187-189: The assert in Tensor.__post_init__ checks self.dtype
before the superclass backfills it, causing Bitmatrix(mask_tensor) to fail; fix
it by calling super().__post_init__ first to populate self.dtype and then assert
that self.dtype == BIT (i.e., reorder the two lines in Tensor.__post_init__ so
super().__post_init__ runs before the dtype assertion), ensuring you reference
Tensor.__post_init__ and the BIT constant when making the change.
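The ordering fix can be seen with a minimal dataclass pair (BIT and the field names are stand-ins for the real tensor module's constants and fields):

```python
from dataclasses import dataclass

BIT = "bit"  # stand-in for the real BIT dtype constant

@dataclass
class Tensor:
    storage: list
    dtype: object = None

    def __post_init__(self):
        # The parent hook backfills dtype when the caller omitted it.
        if self.dtype is None:
            self.dtype = BIT

@dataclass
class Bitmatrix(Tensor):
    def __post_init__(self):
        # Run the parent hook first so dtype is populated, THEN validate it;
        # asserting before super().__post_init__() would see dtype=None.
        super().__post_init__()
        assert self.dtype == BIT

bm = Bitmatrix([1, 0, 1])  # constructs cleanly with the backfilled dtype
```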


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: ae89b124-e0d4-4950-87ed-1ee902707f34

📥 Commits

Reviewing files that changed from the base of the PR and between f7255e0 and 526017b.

📒 Files selected for processing (68)
  • .pre-commit-config.yaml
  • ATTRIBUTIONS-Python.md
  • README.md
  • docker/Dockerfile.multi
  • docker/Makefile
  • docker/common/install_base.sh
  • docker/common/install_cuda_toolkit.sh
  • docker/common/install_pytorch.sh
  • docker/common/install_tensorrt.sh
  • docs/source/installation/build-from-source-linux.md
  • docs/source/installation/linux.md
  • docs/source/legacy/reference/support-matrix.md
  • jenkins/Build.groovy
  • jenkins/L0_Test.groovy
  • jenkins/current_image_tags.properties
  • requirements-dev.txt
  • requirements.txt
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/mxfp4_moe.py
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_triton.py
  • triton_kernels/README.md
  • triton_kernels/VERSION
  • triton_kernels/__init__.py
  • triton_kernels/compaction.py
  • triton_kernels/compaction_details/_masked_compaction.py
  • triton_kernels/distributed.py
  • triton_kernels/matmul_ogs.py
  • triton_kernels/matmul_ogs_details/_common.py
  • triton_kernels/matmul_ogs_details/_matmul_ogs.py
  • triton_kernels/matmul_ogs_details/_p_matmul_ogs.py
  • triton_kernels/matmul_ogs_details/_reduce_grouped.py
  • triton_kernels/matmul_ogs_details/opt_flags.py
  • triton_kernels/matmul_ogs_details/opt_flags_details/opt_flags_amd.py
  • triton_kernels/matmul_ogs_details/opt_flags_details/opt_flags_nvidia.py
  • triton_kernels/numerics.py
  • triton_kernels/numerics_details/__init__.py
  • triton_kernels/numerics_details/flexpoint.py
  • triton_kernels/numerics_details/mxfp.py
  • triton_kernels/numerics_details/mxfp_details/_downcast_to_mxfp.py
  • triton_kernels/numerics_details/mxfp_details/_upcast_from_mxfp.py
  • triton_kernels/proton_opts.py
  • triton_kernels/reduce.py
  • triton_kernels/roofline.py
  • triton_kernels/routing.py
  • triton_kernels/routing_details/__init__.py
  • triton_kernels/routing_details/_expt_data.py
  • triton_kernels/routing_details/_routing_compute.py
  • triton_kernels/specialize.py
  • triton_kernels/swiglu.py
  • triton_kernels/swiglu_details/_swiglu.py
  • triton_kernels/target_info.py
  • triton_kernels/tensor.py
  • triton_kernels/tensor_details/bitmatrix.py
  • triton_kernels/tensor_details/bitmatrix_details/__init__.py
  • triton_kernels/tensor_details/bitmatrix_details/sum_bitmatrix_rows.py
  • triton_kernels/tensor_details/layout.py
  • triton_kernels/tensor_details/layout_details/base.py
  • triton_kernels/tensor_details/layout_details/blackwell_scale.py
  • triton_kernels/tensor_details/layout_details/blackwell_value.py
  • triton_kernels/tensor_details/layout_details/cdna4_scale.py
  • triton_kernels/tensor_details/layout_details/hopper_scale.py
  • triton_kernels/tensor_details/layout_details/hopper_value.py
  • triton_kernels/tensor_details/layout_details/strided.py
  • triton_kernels/tensor_details/ragged_tensor.py
  • triton_kernels/testing.py
  • triton_kernels/topk.py
  • triton_kernels/topk_details/__init__.py
  • triton_kernels/topk_details/_topk_backward.py
  • triton_kernels/topk_details/_topk_forward.py
💤 Files with no reviewable changes (5)
  • triton_kernels/routing_details/__init__.py
  • triton_kernels/matmul_ogs_details/_reduce_grouped.py
  • triton_kernels/routing_details/_expt_data.py
  • triton_kernels/routing_details/_routing_compute.py
  • triton_kernels/routing.py

Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com>
EmmaQiaoCh (Collaborator, Author) commented:

/bot run --post-merge --disable-fail-fast

tensorrt-cicd (Collaborator) commented:

PR_Github #38559 [ run ] triggered by Bot. Commit: 1f83bdf Link to invocation

```diff
 # https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html
-NVRTC_VER="13.1.80-1"
+NVRTC_VER="13.1.115-1"
 CUDA_RUNTIME="13.1.80-1"
```
A collaborator commented on this hunk:

Update CUDA_RUNTIME="13.1.115-1"

EmmaQiaoCh (Author) replied:

I checked the PyTorch 26.02 release notes (https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/rel-26-02.html#rel-26-02) and the CUDA Toolkit 13.1.1 release notes (https://docs.nvidia.com/cuda/archive/13.1.1/cuda-toolkit-release-notes/index.html):
the CUDA runtime version is 13.1.80.

```diff
-> **Note:** The TensorRT LLM wheel on PyPI is built with PyTorch 2.9.1. This version may be incompatible with the NVIDIA NGC PyTorch 25.12 container, which uses a more recent PyTorch build from the main branch. If you are using this container or a similar environment, please install the pre-built wheel located at `/app/tensorrt_llm` inside the TensorRT LLM NGC Release container instead.
+> **Note:** The TensorRT LLM wheel on PyPI is built with PyTorch 2.10.0. This version may be incompatible with the NVIDIA NGC PyTorch 25.12 container, which uses a more recent PyTorch build from the main branch. If you are using this container or a similar environment, please install the pre-built wheel located at `/app/tensorrt_llm` inside the TensorRT LLM NGC Release container instead.
```
A collaborator commented on this hunk:
Update to "NGC PyTorch 26.02 container"

EmmaQiaoCh (Author) replied Mar 11, 2026:

Thanks, let me leave this comment as a reminder!
We need to test with 26.02 to check whether this is still a problem.

tensorrt-cicd (Collaborator) commented:

PR_Github #38559 [ run ] completed with state SUCCESS. Commit: 1f83bdf
/LLM/main/L0_MergeRequest_PR pipeline #29901 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation
