Conversation
Signed-off-by: qqiao <qqiao@nvidia.com> (cherry picked from commit 9a724c9)
Signed-off-by: qqiao <qqiao@nvidia.com> (cherry picked from commit 48bf7b9)
Signed-off-by: qqiao <qqiao@nvidia.com> (cherry picked from commit a6e10f8)
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com> (cherry picked from commit e633ad8)
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com> (cherry picked from commit 154b291)
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com> (cherry picked from commit 30d1120)
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com> (cherry picked from commit f08c6ac)
Signed-off-by: Emma Qiao <qqiao@nvidia.com> (cherry picked from commit f6db4ec)
… 3.6.0
Fix the following issue:
fused_moe_triton.py:177, in __call__
expt_scal, expt_indx, bitmatrix = topk(logits, ...)
TypeError: cannot unpack non-iterable SparseMatrix object
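For context on the failure mode: newer `topk` returns a single object rather than a 3-tuple, and Python raises exactly this `TypeError` when a non-iterable is unpacked. A minimal self-contained sketch (the `SparseMatrix` stand-in and its attribute names are hypothetical, not the real triton_kernels API):

```python
class SparseMatrix:
    """Stand-in for the object newer topk() versions return.

    Attribute names below are hypothetical, for illustration only.
    """

    def __init__(self, vals, indx, bitmatrix):
        self.vals = vals
        self.indx = indx
        self.bitmatrix = bitmatrix


def unpack_topk(result):
    # Older releases returned a plain 3-tuple, so callers unpacked it
    # directly; a single object is not iterable, which produces
    # "TypeError: cannot unpack non-iterable SparseMatrix object".
    if isinstance(result, tuple):
        return result
    return result.vals, result.indx, result.bitmatrix
```

A compatibility shim like this lets the call site keep the tuple-unpacking form while supporting both return conventions.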
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
(cherry picked from commit c17ec97)
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com> (cherry picked from commit 3744bc2)
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com> (cherry picked from commit 728b575)
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com> (cherry picked from commit 5326bd7)
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com> (cherry picked from commit 231a8dd)
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com> (cherry picked from commit f2e6f9f)
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com> (cherry picked from commit 90f0202)
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com> (cherry picked from commit 10cf6ea)
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com> (cherry picked from commit 0fd1666)
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com> (cherry picked from commit b8c4588)
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com> (cherry picked from commit 73d7d75)
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com> (cherry picked from commit dc338b8)
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com> (cherry picked from commit 4f7bcaf)
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com> (cherry picked from commit 4468604)
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com> (cherry picked from commit ec3012c)
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com> (cherry picked from commit 9c1a369)
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com> (cherry picked from commit 4302b6f)
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com> (cherry picked from commit 4ef86da)
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com> (cherry picked from commit 33a838e)
Signed-off-by: qqiao <qqiao@nvidia.com> (cherry picked from commit 3d535a9)
Signed-off-by: qqiao <qqiao@nvidia.com> (cherry picked from commit 6875d32)
Signed-off-by: qqiao <qqiao@nvidia.com> (cherry picked from commit dff57cd)
Signed-off-by: qqiao <qqiao@nvidia.com> (cherry picked from commit bb75bbb)
Signed-off-by: qqiao <qqiao@nvidia.com> (cherry picked from commit 922d93b)
Signed-off-by: qqiao <qqiao@nvidia.com> (cherry picked from commit 281cd69)
Signed-off-by: qqiao <qqiao@nvidia.com> (cherry picked from commit 09e42c9)
Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com> (cherry picked from commit 87521d1)
Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com> (cherry picked from commit 1ba0367)
Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com> (cherry picked from commit 150fc1c)
(cherry picked from commit 57aabe8) Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com>
📝 Walkthrough
This pull request updates various dependencies and infrastructure components, including version bumps for CUDA, PyTorch, Triton, and TensorRT, alongside significant refactoring of the triton_kernels module. The refactoring introduces new distributed memory pooling and metadata computation kernels and restructures routing and matmul operation handling, while removing legacy routing implementations.
Estimated code review effort: 🎯 5 (Critical) | ⏱️ ~120 minutes
Actionable comments posted: 11
🧹 Nitpick comments (4)
.pre-commit-config.yaml (1)
1445-1445: Narrow the scope of the `indx` exception using a per-file or path-scoped mechanism. Since `indx` only appears in kernel files (triton_kernels/ and tensorrt_llm/_torch/modules/fused_moe/), replace the global `-L` entry with either a `.codespellignore` file with path patterns or per-file `# codespell:ignore=indx` directives. This preserves codespell coverage for the rest of the codebase.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.pre-commit-config.yaml at line 1445, the current pre-commit config adds a global codespell ignore entry "indx" in the args line, which suppresses checks across the repo; instead, narrow the scope by removing "indx" from the args list and adding targeted ignores only where "indx" is legitimate: create a .codespellignore file (or per-directory rules) that includes path patterns for triton_kernels/ and tensorrt_llm/_torch/modules/fused_moe/, or insert per-file comments `# codespell:ignore=indx` in the specific kernel source files; update the args line (the "-L" list) to remove "indx" so codespell runs normally elsewhere and confirm codespell still skips only the intended kernel paths.
triton_kernels/matmul_ogs.py (2)
176-176: Use `assert False` instead of `assert None`. Line 176 uses `assert None`, which always fails (since `None` is falsy, it is equivalent to `assert False`), but the intent is clearer with `assert False` or an explicit exception.
Suggested fix
- assert None
+ raise AssertionError("Unexpected data type for InnerRoutingData.make_kernel_args")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@triton_kernels/matmul_ogs.py` at line 176, Replace the ambiguous assertion "assert None" with an explicit failing assertion or exception: change it to "assert False, 'unexpected code path'" or raise an AssertionError/RuntimeError with a descriptive message so the failure is clear; locate the statement "assert None" in triton_kernels/matmul_ogs.py and update it to an explicit failure (use assert False or raise AssertionError("...")).
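To see the behavior difference concretely, a tiny runnable sketch (function names here are hypothetical):

```python
def old_style():
    assert None  # raises AssertionError, but with an empty message


def new_style():
    raise AssertionError("Unexpected data type for make_kernel_args")


def failure_message(fn):
    """Run fn and report the AssertionError text, if any."""
    try:
        fn()
    except AssertionError as e:
        return str(e)
    return "no error"
```

Note that `assert None` does fail at runtime, but it silently disappears under `python -O`, while `raise AssertionError(...)` always fires and carries a descriptive message.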
377-378: Parenthesize for clarity in boolean expression. The expression on line 377 mixes `and` and `or` operators. While Python's precedence handles this correctly (`and` binds tighter than `or`), adding parentheses improves readability.
Suggested clarification
- if w_has_mx and (torch.cuda.get_device_capability()[0] < 10 or w.storage.layout is not None and not isinstance(w.storage.layout, StridedLayout)):
+ if w_has_mx and (torch.cuda.get_device_capability()[0] < 10 or (w.storage.layout is not None and not isinstance(w.storage.layout, StridedLayout))):
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@triton_kernels/matmul_ogs.py` around lines 377-378, the boolean expression in the if-condition using w_has_mx mixes and/or and should be parenthesized for clarity: update the condition around the device-capability vs storage-layout check so the grouping is explicit (i.e. keep w_has_mx as the first conjunct and wrap the entire second part in parentheses, with an inner pair around the storage-layout check), referencing the same symbols (w_has_mx, torch.cuda.get_device_capability()[0], w.storage.layout, StridedLayout, and w.stride) so the intention is clear that w_has_mx AND (device cap < 10 OR (storage.layout is not None AND not isinstance(storage.layout, StridedLayout))) before asserting on w.stride(-2).
tensorrt_llm/_torch/modules/fused_moe/fused_moe_triton.py (1)
177-180: Avoid directly mutating shape metadata; create a new shape list instead. Line 179 directly mutates `bitmatrix.mask.shape[-1]`. While `shape` is a mutable list (not a tuple), directly mutating shape metadata is poor practice and makes the code harder to reason about. After compaction, create a new shape list:
new_shape = bitmatrix.mask.shape[:-1] + [n_expts_tot]
bitmatrix.mask.shape = new_shape
Or use a helper method rather than direct list mutation.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/modules/fused_moe/fused_moe_triton.py` around lines 177 - 180, The code mutates shape metadata by assigning to bitmatrix.mask.shape[-1]; instead, build a new shape list and assign it back to bitmatrix.mask.shape (or use a helper) after compaction: compute n_expts_tot = slice_end - slice_start, create new_shape = bitmatrix.mask.shape[:-1] + [n_expts_tot], then set bitmatrix.mask.shape = new_shape (update this in the function that calls compaction and returns expt_scal, expt_indx, bitmatrix).
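A list-based sketch of the suggested pattern (no torch dependency; `compacted_shape` is a hypothetical helper name):

```python
def compacted_shape(shape, slice_start, slice_end):
    """Build a new shape list rather than mutating shape[-1] in place."""
    n_expts_tot = slice_end - slice_start
    return shape[:-1] + [n_expts_tot]
```

This leaves the caller's shape list untouched, so any other reference to the old shape stays consistent.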
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docker/common/install_tensorrt.sh`:
- Around line 5-13: The release-note URL in the comment should be updated to the
matching PyTorch container release notes for the new base (the one that
corresponds to CUDA_VER="13.1") and the CUBLAS pin should be set to the exact
value shown there; update the comment URL that currently references rel-25-12
and adjust CUBLAS_VER to match the upstream value (alongside confirming TRT_VER,
CUDA_VER, CUDNN_VER, and NCCL_VER still match the referenced release notes).
In `@docker/Dockerfile.multi`:
- Line 94: The release-stage removal uses a hardcoded artifact name
`wheel-0.45.1.dist-info` while the devel stage uses the glob
`wheel-*.dist-info`; update the release-stage rm command to use the same glob
pattern `wheel-*.dist-info` so both stages remove any wheel dist-info version
consistently (locate the second `rm -rf ...wheel-0.45.1.dist-info` line and
replace the hardcoded filename with the glob).
In `@triton_kernels/numerics_details/mxfp_details/_downcast_to_mxfp.py`:
- Around line 192-195: The masks currently use floor division (quant_dim //
K_DIVISOR and quant_dim // MXFP_BLOCK_SIZE) which drops any partial tail blocks;
change those comparisons to use ceiling division so the runtime bounds include a
final partial fp4 byte or scale block. Concretely, update the right-hand side of
the comparisons in mask_mxt_quant and scale_mask_k to use ceil(quant_dim /
K_DIVISOR) and ceil(quant_dim / MXFP_BLOCK_SIZE) (e.g., (quant_dim + K_DIVISOR -
1) // K_DIVISOR and (quant_dim + MXFP_BLOCK_SIZE - 1) // MXFP_BLOCK_SIZE) so
start_mx_quant + offs_mxt_quant and start_mx_scale_quant + offs_scale_quant will
allow the last partial block to be stored; keep the rest of the logic for
full_mask_mxt unchanged.
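The floor-vs-ceiling distinction drives the off-by-one described above; a minimal sketch:

```python
def floor_div_blocks(quant_dim, block):
    # Floor division drops a partial tail block.
    return quant_dim // block


def ceil_div_blocks(quant_dim, block):
    # Ceiling division keeps the partial tail block.
    return (quant_dim + block - 1) // block
```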
In `@triton_kernels/reduce.py`:
- Around line 199-214: The code currently treats a dimension as broadcasted only
when its stride is zero, but must also treat size-1 dimensions as broadcasted;
update the mask and scale handling so after reading mstr0,mstr1,mstr2 (and
sstr0,sstr1,sstr2) you also read the corresponding sizes (e.g., msz0,msz1,msz2
via mask.shape() and ssz0,ssz1,ssz2 via scale.shape()) and set
stride_mr/stride_m0/stride_m1 to 0 whenever the selected size == 1 (similarly
zero out stride_sr/stride_s0/stride_s1 when the selected scale size == 1),
otherwise keep the original stride values — apply this logic where stride_m* and
stride_s* are computed so broadcastable size-1 tensors are normalized the same
as stride==0.
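The normalization this finding asks for can be sketched with plain tuples (helper name hypothetical; the real code works on Triton tensor strides):

```python
def normalize_strides(shape, strides):
    """Treat size-1 dims like stride-0 dims: both are broadcastable."""
    return tuple(0 if size == 1 else stride
                 for size, stride in zip(shape, strides))
```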
- Around line 262-288: The crash is caused by using default-initialized
InFlexData()/OutFlexData() whose .scale, .actual_scale and .expected_scale
tensors are None; update reduce_torch to accept x_flex:
Optional[InFlexData]=None and y_flex: Optional[OutFlexData]=None (or keep
defaults but treat them as absent) and guard uses: only multiply by x_flex.scale
when x_flex is not None and x_flex.scale is a tensor, and only call
y_flex.actual_scale.copy_ and divide by y_flex.expected_scale when y_flex is not
None and those tensors are initialized (or initialize them inside reduce_torch
before use using compute_actual_scale), referencing reduce_torch, x_flex.scale,
y_flex.actual_scale, y_flex.expected_scale and compute_actual_scale to locate
the fixes.
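The guard pattern suggested above, reduced to a self-contained sketch (`FlexData` and `reduce_ref` are hypothetical stand-ins for InFlexData and reduce_torch):

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FlexData:
    scale: Optional[float] = None  # None models an uninitialized scale tensor


def reduce_ref(x: List[float], x_flex: Optional[FlexData] = None) -> List[float]:
    # Only apply the scale when flex data is present AND initialized,
    # instead of dereferencing a default-constructed object.
    if x_flex is not None and x_flex.scale is not None:
        return [v * x_flex.scale for v in x]
    return list(x)
```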
In `@triton_kernels/roofline.py`:
- Around line 196-201: The current runtime checks in roofline.py use assert and
only accept int, which hides under python -O and rejects valid floats; replace
these with explicit type validation that accepts ints or floats (e.g.,
isinstance(max_tbps, (int, float)) and isinstance(max_tflops, (int, float))) and
raise a clear TypeError or ValueError when the value is neither numeric nor one
of the allowed sentinel strings ("memset" for max_tbps and "cublas" for
max_tflops); when the sentinel strings are provided, call get_memset_tbps() and
get_cublas_tflops(flops_dtype) as currently done (refer to max_tbps, max_tflops,
get_memset_tbps, get_cublas_tflops), but otherwise leave numeric values as-is to
avoid later cryptic arithmetic errors.
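A sketch of the numeric-or-sentinel validation (the `memset_tbps` default stands in for `get_memset_tbps()`, whose real value is hardware-specific):

```python
def resolve_max_tbps(max_tbps, memset_tbps=8.0):
    # Accept ints AND floats; `assert isinstance(x, int)` would both
    # reject valid floats and vanish under `python -O`.
    if isinstance(max_tbps, (int, float)):
        return float(max_tbps)
    if max_tbps == "memset":
        return memset_tbps
    raise TypeError(f"max_tbps must be a number or 'memset', got {max_tbps!r}")
```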
- Around line 76-85: The loop consumes intensity_proxy_values but
write_csv(intensity_proxy_values, perfs, out_path) later reuses it; if
intensity_proxy_values is a generator it will be exhausted and CSV/plotting will
fail—materialize it first (e.g., at start of the function convert
intensity_proxy_values to a list) before iterating and calling
inject_proxy_and_call so that write_csv and subsequent code receive the full
sequence; update references to intensity_proxy_values (used with
inject_proxy_and_call, perfs accumulation, and write_csv) accordingly.
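The generator-exhaustion hazard in one small sketch (function names hypothetical):

```python
def run_and_report(intensity_values, measure):
    # Materialize up front: if intensity_values were a generator, the
    # measurement loop would exhaust it and the later reporting step
    # (write_csv in the real code) would see an empty sequence.
    intensity_values = list(intensity_values)
    perfs = [measure(v) for v in intensity_values]
    return intensity_values, perfs
```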
- Around line 178-183: The validate_perfs function fails to check that each
series has the same length as xs_ref and can raise IndexError or miss
shorter-series mismatches; update validate_perfs (perfs, xs_ref, flops_ref,
bytes_ref) to first compare lengths (e.g., if len(xs) != len(xs_ref): raise
ValueError(f"length mismatch between series[0] and series[{series_idx}]"))
before iterating samples, and ensure the raised ValueError uses the enumerate
index (the series id from enumerate(perfs[1:], start=1)) in the message rather
than the sample index.
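A minimal version of the length check the finding describes (simplified to one list per series):

```python
def validate_series_lengths(perfs):
    """Raise if any series differs in length from the reference series."""
    xs_ref = perfs[0]
    for series_idx, xs in enumerate(perfs[1:], start=1):
        if len(xs) != len(xs_ref):
            raise ValueError(
                f"length mismatch between series[0] and series[{series_idx}]")
```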
In `@triton_kernels/tensor_details/bitmatrix.py`:
- Around line 161-167: The inline lambda named pad inside
make_bitmatrix_metadata_torch triggers Ruff E731 (assigning a lambda to a name);
replace the lambda with a small local function def pad(x, total_size): ...
within the same scope of make_bitmatrix_metadata_torch (or move it to
module-level) and keep the same behavior (concatenate x with torch.full of -1 on
x.device to reach total_size) so col_sorted_indx and row_sorted_indx logic
remains unchanged; ensure the new function signature and use sites (pad(...))
match the previous lambda usage.
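The E731 fix in miniature, using plain lists instead of torch tensors (the real helper pads a tensor with `torch.full` on `x.device`):

```python
def pad(x, total_size, fill=-1):
    # A named def instead of `pad = lambda ...` (Ruff E731); same
    # behavior: extend x with fill values up to total_size.
    return x + [fill] * (total_size - len(x))
```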
In `@triton_kernels/tensor_details/ragged_tensor.py`:
- Around line 93-98: Replace the inline lambdas in empty_aligned with small
local helper functions to satisfy Ruff E731: define a local def cdiv(x, y):
return (x + y - 1) // y and def pad(x): return cdiv(x, pad_size) * pad_size,
then use these helpers when computing pad(shape[-1]) and remove the lambda
assignments; keep ret, ret_slices and the returned slicing logic unchanged.
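The two helpers named in this finding are one-liners; sketched here as defs (with an explicit `pad_size` argument rather than a closure, as a stylistic variation):

```python
def cdiv(x, y):
    """Ceiling division."""
    return (x + y - 1) // y


def pad_to_multiple(x, pad_size):
    """Round x up to the next multiple of pad_size."""
    return cdiv(x, pad_size) * pad_size
```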
In `@triton_kernels/tensor.py`:
- Around line 187-189: The assert in Tensor.__post_init__ checks self.dtype
before the superclass backfills it, causing Bitmatrix(mask_tensor) to fail; fix
it by calling super().__post_init__ first to populate self.dtype and then assert
that self.dtype == BIT (i.e., reorder the two lines in Tensor.__post_init__ so
super().__post_init__ runs before the dtype assertion), ensuring you reference
Tensor.__post_init__ and the BIT constant when making the change.
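The init-ordering bug generalizes to any dataclass hierarchy where the parent backfills a field in `__post_init__`; a self-contained sketch (`BIT` here is a stand-in constant, not the real dtype object):

```python
from dataclasses import dataclass

BIT = "bit"  # stand-in for the real BIT dtype constant


@dataclass
class Tensor:
    data: object
    dtype: object = None

    def __post_init__(self):
        if self.dtype is None:
            self.dtype = BIT  # superclass backfills the dtype


@dataclass
class Bitmatrix(Tensor):
    def __post_init__(self):
        # Backfill first, then assert; asserting before the super() call
        # would make Bitmatrix(mask) with no explicit dtype fail.
        super().__post_init__()
        assert self.dtype == BIT
```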
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: ae89b124-e0d4-4950-87ed-1ee902707f34
📒 Files selected for processing (68)
- .pre-commit-config.yaml
- ATTRIBUTIONS-Python.md
- README.md
- docker/Dockerfile.multi
- docker/Makefile
- docker/common/install_base.sh
- docker/common/install_cuda_toolkit.sh
- docker/common/install_pytorch.sh
- docker/common/install_tensorrt.sh
- docs/source/installation/build-from-source-linux.md
- docs/source/installation/linux.md
- docs/source/legacy/reference/support-matrix.md
- jenkins/Build.groovy
- jenkins/L0_Test.groovy
- jenkins/current_image_tags.properties
- requirements-dev.txt
- requirements.txt
- tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/mxfp4_moe.py
- tensorrt_llm/_torch/modules/fused_moe/fused_moe_triton.py
- triton_kernels/README.md
- triton_kernels/VERSION
- triton_kernels/__init__.py
- triton_kernels/compaction.py
- triton_kernels/compaction_details/_masked_compaction.py
- triton_kernels/distributed.py
- triton_kernels/matmul_ogs.py
- triton_kernels/matmul_ogs_details/_common.py
- triton_kernels/matmul_ogs_details/_matmul_ogs.py
- triton_kernels/matmul_ogs_details/_p_matmul_ogs.py
- triton_kernels/matmul_ogs_details/_reduce_grouped.py
- triton_kernels/matmul_ogs_details/opt_flags.py
- triton_kernels/matmul_ogs_details/opt_flags_details/opt_flags_amd.py
- triton_kernels/matmul_ogs_details/opt_flags_details/opt_flags_nvidia.py
- triton_kernels/numerics.py
- triton_kernels/numerics_details/__init__.py
- triton_kernels/numerics_details/flexpoint.py
- triton_kernels/numerics_details/mxfp.py
- triton_kernels/numerics_details/mxfp_details/_downcast_to_mxfp.py
- triton_kernels/numerics_details/mxfp_details/_upcast_from_mxfp.py
- triton_kernels/proton_opts.py
- triton_kernels/reduce.py
- triton_kernels/roofline.py
- triton_kernels/routing.py
- triton_kernels/routing_details/__init__.py
- triton_kernels/routing_details/_expt_data.py
- triton_kernels/routing_details/_routing_compute.py
- triton_kernels/specialize.py
- triton_kernels/swiglu.py
- triton_kernels/swiglu_details/_swiglu.py
- triton_kernels/target_info.py
- triton_kernels/tensor.py
- triton_kernels/tensor_details/bitmatrix.py
- triton_kernels/tensor_details/bitmatrix_details/__init__.py
- triton_kernels/tensor_details/bitmatrix_details/sum_bitmatrix_rows.py
- triton_kernels/tensor_details/layout.py
- triton_kernels/tensor_details/layout_details/base.py
- triton_kernels/tensor_details/layout_details/blackwell_scale.py
- triton_kernels/tensor_details/layout_details/blackwell_value.py
- triton_kernels/tensor_details/layout_details/cdna4_scale.py
- triton_kernels/tensor_details/layout_details/hopper_scale.py
- triton_kernels/tensor_details/layout_details/hopper_value.py
- triton_kernels/tensor_details/layout_details/strided.py
- triton_kernels/tensor_details/ragged_tensor.py
- triton_kernels/testing.py
- triton_kernels/topk.py
- triton_kernels/topk_details/__init__.py
- triton_kernels/topk_details/_topk_backward.py
- triton_kernels/topk_details/_topk_forward.py
💤 Files with no reviewable changes (5)
- triton_kernels/routing_details/__init__.py
- triton_kernels/matmul_ogs_details/_reduce_grouped.py
- triton_kernels/routing_details/_expt_data.py
- triton_kernels/routing_details/_routing_compute.py
- triton_kernels/routing.py
Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com>
/bot run --post-merge --disable-fail-fast
PR_Github #38559 [ run ] triggered by Bot.
# https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html
- NVRTC_VER="13.1.80-1"
+ NVRTC_VER="13.1.115-1"
  CUDA_RUNTIME="13.1.80-1"
Update CUDA_RUNTIME="13.1.115-1"
I checked the PyTorch 26.02 release notes (https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/rel-26-02.html#rel-26-02) → CUDA Toolkit 13.1.1 release notes (https://docs.nvidia.com/cuda/archive/13.1.1/cuda-toolkit-release-notes/index.html):
The CUDA runtime version is 13.1.80
| ``` | ||
|
|
||
- > **Note:** The TensorRT LLM wheel on PyPI is built with PyTorch 2.9.1. This version may be incompatible with the NVIDIA NGC PyTorch 25.12 container, which uses a more recent PyTorch build from the main branch. If you are using this container or a similar environment, please install the pre-built wheel located at `/app/tensorrt_llm` inside the TensorRT LLM NGC Release container instead.
+ > **Note:** The TensorRT LLM wheel on PyPI is built with PyTorch 2.10.0. This version may be incompatible with the NVIDIA NGC PyTorch 25.12 container, which uses a more recent PyTorch build from the main branch. If you are using this container or a similar environment, please install the pre-built wheel located at `/app/tensorrt_llm` inside the TensorRT LLM NGC Release container instead.
Update to "NGC PyTorch 26.02 container"
Thanks, let me leave this comment as a reminder!
We need to test this for 26.02 to check if it is still a problem.
PR_Github #38559 [ run ] completed with state
Summary by CodeRabbit
Release Notes
New Features
Dependency Updates
Bug Fixes
Documentation
Description
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment `/bot help`.