
[None][feat] Add shared expert LoRA support for MoE models in PyTorch backend #11760

Merged
achartier merged 8 commits into NVIDIA:main from achartier:shared-expert-moe
Mar 12, 2026

Conversation

achartier (Collaborator) commented Feb 26, 2026

Summary by CodeRabbit

Release Notes

  • New Features

    • Added LoRA support for shared expert modules in Mixture of Experts (MoE) models, enabling fine-tuning of shared expert components in Llama and Qwen MoE variants.
  • Bug Fixes

    • Fixed typo in error message ("too few" instead of "to few").
  • Tests

    • Added test for Qwen MoE shared expert LoRA functionality.

Description

Adds dedicated shared_expert_* LoRA module types so that MoE models with shared experts (Qwen MoE, Qwen3Next, Llama4; only Qwen MoE is tested) can correctly apply LoRA adapters in the PyTorch backend. Previously, shared expert LoRA failed with a C++ dimension validation error: a single mlpHiddenSize was used for all MLP-type LoRA modules, but MoE models have up to three different intermediate sizes (dense MLP, shared expert, routed experts).

The shared expert intermediate size is inferred from the actual LoRA adapter weights (_infer_shared_expert_size_from_adapter), not the model config. This is necessary because some adapters (e.g., upcycled models) have a shared_expert_intermediate_size in the config that doesn't match the actual adapter weight dimensions. The function scans for shared_expert.down_proj.lora_A in the adapter, reads the LoRA rank from adapter_config.json, and extracts the intermediate size from the non-rank dimension of the weight tensor.
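The scan described above can be sketched as follows. This is a simplified illustration operating on tensor shapes rather than real adapter files; the helper name and the key layout are assumptions based on the description, not the actual implementation in `_util.py`:

```python
def infer_shared_expert_size(weight_shapes: dict, lora_rank: int) -> int:
    """Return the shared-expert intermediate size, or 0 if not found.

    weight_shapes maps adapter tensor names to their shapes; lora_rank
    comes from adapter_config.json. The size is read from the non-rank
    dimension of a shared_expert.down_proj.lora_A tensor.
    """
    for name, shape in weight_shapes.items():
        if "shared_expert.down_proj.lora_A" in name:
            non_rank = [d for d in shape if d != lora_rank]
            if non_rank:
                return non_rank[0]
    return 0

# A down_proj lora_A with rank 8 and intermediate size 1408:
shapes = {
    "model.layers.0.mlp.shared_expert.down_proj.lora_A.weight": (8, 1408),
}
print(infer_shared_expert_size(shapes, lora_rank=8))  # 1408
```

Reading the size from the adapter weights rather than the config is what makes upcycled adapters (whose config sizes disagree with their weights) load correctly.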

When a model has a shared expert (shared_expert_hidden_size > 0) and consists purely of MoE layers (no mlp_only_layers), user-specified mlp_* target modules are automatically replaced with their shared_expert_* equivalents. Supporting mixed (dense + MoE) models would require extending this substitution logic.
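A minimal sketch of that substitution — the mapping constants and function name here are hypothetical; the real logic lives in tensorrt_llm/_torch/pyexecutor/_util.py:

```python
# Hypothetical name mapping mirroring the description.
MLP_TO_SHARED_EXPERT = {
    "mlp_h_to_4h": "shared_expert_h_to_4h",
    "mlp_4h_to_h": "shared_expert_4h_to_h",
    "mlp_gate": "shared_expert_gate",
}

def resolve_targets(targets, shared_expert_hidden_size, mlp_only_layers):
    """Swap mlp_* targets for shared_expert_* on pure MoE models only."""
    if shared_expert_hidden_size > 0 and not mlp_only_layers:
        return [MLP_TO_SHARED_EXPERT.get(t, t) for t in targets]
    return targets

print(resolve_targets(["attn_qkv", "mlp_gate"], 1408, []))
# ['attn_qkv', 'shared_expert_gate']
```

Non-MLP targets pass through unchanged, and a model with mlp_only_layers keeps its mlp_* targets as-is.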

Test Coverage

tests/unittest/llmapi/test_llm_pytorch.py::test_qwen_moe_shared_expert_lora

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

achartier (Collaborator, Author) commented:

@CodeRabbit review


coderabbitai bot commented Feb 26, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.


coderabbitai bot commented Feb 26, 2026

📝 Walkthrough

Walkthrough

This PR adds LoRA (Low-Rank Adaptation) support for shared expert layers in Mixture of Experts models. It introduces three new module types for shared experts, extends the LoRA module creation API with sizing parameters (sharedExpertHiddenSize, moeHiddenSize), and threads lora_params through forward paths in PyTorch models (Llama, Qwen variants). It also adds inference logic to derive shared expert sizing from adapters and updates module mappings and type handling accordingly.

Changes

Cohort / File(s) Summary
C++ LoRA Module Core
cpp/include/tensorrt_llm/runtime/loraModule.h, cpp/tensorrt_llm/runtime/loraModule.cpp
Added three new ModuleType enum values (kSHARED_EXPERT_H_TO_4H, kSHARED_EXPERT_4H_TO_H, kSHARED_EXPERT_GATE) and extended createLoraModules signature with sharedExpertHiddenSize and moeHiddenSize parameters. Updated dimension calculations to use these new sizing parameters for shared and MoE layers.
Python Bindings
cpp/tensorrt_llm/nanobind/bindings.cpp
Updated Nanobind binding for create_lora_modules to accept two new optional parameters (shared_expert_hidden_size, moe_hidden_size) with default value 0.
LoRA String Mappings
tensorrt_llm/lora_helper.py
Added three mappings from TensorRT-LLM module names to HuggingFace equivalents for shared expert modules (shared_expert_h_to_4h, shared_expert_4h_to_h, shared_expert_gate).
LoRA Manager & Utilities
tensorrt_llm/lora_manager.py, cpp/tensorrt_llm/runtime/loraUtils.cpp
Extended LORA_MODULE_IDS with three new shared expert module IDs. Replaced strict HF module assertion with fallback logic that attempts suffix matching and logs warnings for unsupported modules. Fixed typo in error message.
PyTorch LoRA Layer Types
tensorrt_llm/_torch/peft/lora/layer.py
Added three new enum members to LoraModuleType: SHARED_EXPERT_H_TO_4H, SHARED_EXPERT_4H_TO_H, and SHARED_EXPERT_GATE.
PyTorch Gated MLP
tensorrt_llm/_torch/modules/gated_mlp.py
Added is_shared_expert parameter to GatedMLP.__init__. Conditional logic now selects between SHARED_EXPERT_* and MLP_* LoRA module types based on the is_shared_expert flag.
Llama Model Variants
tensorrt_llm/_torch/models/modeling_llama.py, tensorrt_llm/_torch/models/modeling_llama_min_latency.py
Added lora_params parameter threading through Llama4Attention, Llama4MoE, and corresponding min-latency variants. Extended shared_expert initialization with layer_idx and is_shared_expert flags; min-latency variant now accepts and propagates layer_idx through constructors.
Qwen Model Variants
tensorrt_llm/_torch/models/modeling_qwen3_next.py, tensorrt_llm/_torch/models/modeling_qwen_moe.py
Added lora_params parameter to forward signatures in Qwen3NextGatedDeltaNet, Qwen3NextSparseMoeBlock, Qwen3NextFullAttentionDecoderLayer, Qwen3NextModel, and QwenMoE/QwenMoeDecoderLayer/QwenMoeModel variants. Shared expert initialization configured with is_shared_expert=True and layer_idx.
PyTorch Executor Utilities
tensorrt_llm/_torch/pyexecutor/_util.py
Added _infer_shared_expert_size_from_adapter helper to derive shared expert sizing from adapter weights. Updated LoRA module creation logic to detect MoE/shared-expert configurations, replace MLP targets with shared expert targets, and pass shared_expert_hidden_size and moe_hidden_size parameters to LoraModule.create_lora_modules.
Tests
tests/unittest/llmapi/test_llm_pytorch.py
Added test_qwen_moe_shared_expert_lora test function to verify LoRA application on Qwen1.5-MoE shared expert layers, asserting token and log-probability differences when LoRA is enabled.
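The is_shared_expert selection described for gated_mlp.py follows a simple branching pattern. The sketch below uses a simplified stand-in enum, not the real LoraModuleType from tensorrt_llm/_torch/peft/lora/layer.py:

```python
from enum import IntEnum, auto

class LoraModuleType(IntEnum):
    # Simplified stand-in; the real enum has many more members.
    MLP_H_TO_4H = auto()
    MLP_4H_TO_H = auto()
    MLP_GATE = auto()
    SHARED_EXPERT_H_TO_4H = auto()
    SHARED_EXPERT_4H_TO_H = auto()
    SHARED_EXPERT_GATE = auto()

def gated_mlp_lora_types(is_shared_expert: bool):
    """Pick the LoRA module types a GatedMLP registers with."""
    if is_shared_expert:
        return (LoraModuleType.SHARED_EXPERT_H_TO_4H,
                LoraModuleType.SHARED_EXPERT_4H_TO_H,
                LoraModuleType.SHARED_EXPERT_GATE)
    return (LoraModuleType.MLP_H_TO_4H,
            LoraModuleType.MLP_4H_TO_H,
            LoraModuleType.MLP_GATE)

print(gated_mlp_lora_types(True)[0].name)  # SHARED_EXPERT_H_TO_4H
```

Keeping distinct module types lets the runtime look up the correct intermediate size per type instead of reusing a single mlpHiddenSize.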

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~35 minutes

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately describes the main feature: adding shared expert LoRA support for MoE models in PyTorch backend, which aligns with the core changes across the codebase.
Docstring Coverage ✅ Passed Docstring coverage is 88.89% which is sufficient. The required threshold is 80.00%.
Description check ✅ Passed The PR description comprehensively explains the problem, solution, test coverage, and includes completed checklist items.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


coderabbitai bot left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (5)
cpp/tensorrt_llm/nanobind/bindings.cpp (1)

199-220: ⚠️ Potential issue | 🟠 Major

Expose shared-expert module enums in nanobind too.

create_lora_modules now supports shared-expert sizing, but Python binding for LoraModuleType still does not include SHARED_EXPERT_*. That leaves the public API inconsistent with runtime types.

🔧 Proposed fix
     nb::enum_<tr::LoraModule::ModuleType>(m, "LoraModuleType")
         .value("INVALID", tr::LoraModule::ModuleType::kINVALID)
         .value("ATTN_QKV", tr::LoraModule::ModuleType::kATTN_QKV)
         .value("ATTN_Q", tr::LoraModule::ModuleType::kATTN_Q)
         .value("ATTN_K", tr::LoraModule::ModuleType::kATTN_K)
         .value("ATTN_V", tr::LoraModule::ModuleType::kATTN_V)
         .value("ATTN_DENSE", tr::LoraModule::ModuleType::kATTN_DENSE)
         .value("MLP_H_TO_4H", tr::LoraModule::ModuleType::kMLP_H_TO_4H)
         .value("MLP_4H_TO_H", tr::LoraModule::ModuleType::kMLP_4H_TO_H)
         .value("MLP_GATE", tr::LoraModule::ModuleType::kMLP_GATE)
         .value("CROSS_ATTN_QKV", tr::LoraModule::ModuleType::kCROSS_ATTN_QKV)
         .value("CROSS_ATTN_Q", tr::LoraModule::ModuleType::kCROSS_ATTN_Q)
         .value("CROSS_ATTN_K", tr::LoraModule::ModuleType::kCROSS_ATTN_K)
         .value("CROSS_ATTN_V", tr::LoraModule::ModuleType::kCROSS_ATTN_V)
         .value("CROSS_ATTN_DENSE", tr::LoraModule::ModuleType::kCROSS_ATTN_DENSE)
         .value("MOE_H_TO_4H", tr::LoraModule::ModuleType::kMOE_H_TO_4H)
         .value("MOE_4H_TO_H", tr::LoraModule::ModuleType::kMOE_4H_TO_H)
         .value("MOE_GATE", tr::LoraModule::ModuleType::kMOE_GATE)
         .value("MOE_ROUTER", tr::LoraModule::ModuleType::kMOE_ROUTER)
         .value("MLP_ROUTER", tr::LoraModule::ModuleType::kMLP_ROUTER)
-        .value("MLP_GATE_UP", tr::LoraModule::ModuleType::kMLP_GATE_UP);
+        .value("MLP_GATE_UP", tr::LoraModule::ModuleType::kMLP_GATE_UP)
+        .value("SHARED_EXPERT_H_TO_4H", tr::LoraModule::ModuleType::kSHARED_EXPERT_H_TO_4H)
+        .value("SHARED_EXPERT_4H_TO_H", tr::LoraModule::ModuleType::kSHARED_EXPERT_4H_TO_H)
+        .value("SHARED_EXPERT_GATE", tr::LoraModule::ModuleType::kSHARED_EXPERT_GATE);

Also applies to: 235-235

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/tensorrt_llm/nanobind/bindings.cpp` around lines 199 - 220, The Python
nanobind enum for tr::LoraModule::ModuleType is missing the SHARED_EXPERT_*
variants, causing the public API to differ from runtime types; update the
nb::enum_ binding in bindings.cpp (the block that defines "LoraModuleType") to
add .value(...) entries for each SHARED_EXPERT variant present in
tr::LoraModule::ModuleType (use the exact enum identifiers from
tr::LoraModule::ModuleType, e.g., kSHARED_EXPERT_... as defined in the C++ enum)
so they map to names like "SHARED_EXPERT_H_TO_4H", "SHARED_EXPERT_4H_TO_H",
"SHARED_EXPERT_GATE", etc., and make the same additions in the second enum
declaration mentioned (around the other occurrence) so create_lora_modules and
the nanobind enum stay consistent.
tensorrt_llm/_torch/models/modeling_llama_min_latency.py (2)

609-624: ⚠️ Potential issue | 🟠 Major

lora_params support is added in MoE but not wired from min-latency decoder

Llama4MinLatencyMoE.forward now accepts lora_params (Line 617), but Llama4MinLatencyDecoderLayer.forward does not pass it to self.feed_forward(...). In min-latency mode, LoRA can be dropped silently.

💡 Suggested fix
 class Llama4MinLatencyDecoderLayer(Llama4DecoderLayer):
@@
     def forward(
         self,
         position_ids: torch.LongTensor,
         hidden_states: Union[torch.Tensor, Fp4QuantizedTensor],
         attn_metadata: AttentionMetadata,
         residual: Optional[torch.Tensor],
         spec_metadata: Optional[SpecMetadata] = None,
         **kwargs,
     ) -> torch.Tensor:
+        lora_params = kwargs.get("lora_params")
@@
         if self.is_mlp_layer:
             hidden_states = self.feed_forward(
                 hidden_states,
                 all_rank_num_tokens=attn_metadata.all_rank_num_tokens,
                 final_all_reduce_params=AllReduceParams(enable_allreduce=not (
                     self.fusion_config.POST_MLP_FUSION
                     or self.mapping.tp_size == 1 or self.enable_attention_dp)),
+                lora_params=lora_params,
             )
         else:
             hidden_states = self.feed_forward(
                 hidden_states,
                 all_rank_num_tokens=attn_metadata.all_rank_num_tokens,
                 final_all_reduce_params=AllReduceParams(enable_allreduce=not (
                     self.fusion_config.POST_MOE_FUSION
                     or self.mapping.tp_size == 1 or self.enable_attention_dp)),
                 hidden_states_high=hidden_states_high,
+                lora_params=lora_params,
             )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/models/modeling_llama_min_latency.py` around lines 609 -
624, Llama4MinLatencyMoE.forward accepts lora_params but
Llama4MinLatencyDecoderLayer.forward doesn't forward it into self.feed_forward,
causing LoRA to be ignored; update Llama4MinLatencyDecoderLayer.forward to
accept an optional lora_params argument (or propagate an existing one) and pass
it into the call to self.feed_forward (e.g., self.feed_forward(...,
lora_params=lora_params)), and ensure callers that invoke
Llama4MinLatencyDecoderLayer.forward (including wherever the decoder is invoked
from the min-latency decoder/decoder stack) propagate the lora_params from
Llama4MinLatencyMoE.forward so the MoE path and regular feed_forward both
receive the LoRA params.

513-555: ⚠️ Potential issue | 🟠 Major

layer_idx was added but is not consistently propagated in min-latency construction path

Llama4MinLatencyMoE now takes layer_idx and forwards it into shared-expert internals, but its creation path in Llama4MinLatencyDecoderLayer omits layer_idx, leaving the new per-layer context unset.

💡 Suggested fix
         else:
             self.feed_forward = Llama4MinLatencyMoE(
                 num_experts=config.num_local_experts,
                 top_k=config.num_experts_per_tok,
                 hidden_size=config.hidden_size,
                 intermediate_size=config.intermediate_size,
                 shared_expert_intermediate_size=config.intermediate_size,
                 model_config=model_config,
                 aux_stream=aux_stream,
-                dtype=config.torch_dtype)
+                dtype=config.torch_dtype,
+                layer_idx=layer_idx)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/models/modeling_llama_min_latency.py` around lines 513 -
555, The decoder-layer construction path must pass the per-layer index into the
min-latency MoE so the shared-expert gets correct context: update the place
where Llama4MinLatencyDecoderLayer instantiates Llama4MinLatencyMoE to forward
the layer index (e.g., pass layer_idx=self.layer_idx or layer_idx=layer_idx)
into the Llama4MinLatencyMoE constructor; ensure any other internal min-latency
submodule creations in Llama4MinLatencyDecoderLayer (such as shared_expert or
calls that create Llama4MinLatencyGatedMLP) also receive the same layer_idx so
the per-layer context is consistently set.
tensorrt_llm/_torch/models/modeling_llama.py (1)

477-549: ⚠️ Potential issue | 🔴 Critical

Llama4DecoderLayer.forward uses lora_params without declaring it

At line 548, lora_params is referenced but not defined in the function scope. This will cause a NameError at runtime. Add lora_params: Optional[dict] = None to the function signature, consistent with the signatures of Llama4Attention.forward and Llama4MoE.forward in the same file.

Suggested fix
     def forward(
         self,
         position_ids: torch.IntTensor,
         hidden_states: Union[torch.Tensor, Fp4QuantizedTensor],
         attn_metadata: AttentionMetadata,
         residual: Optional[torch.Tensor],
         spec_metadata: Optional[SpecMetadata] = None,
+        lora_params: Optional[dict] = None,
         **kwargs,
     ) -> torch.Tensor:
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/models/modeling_llama.py` around lines 477 - 549, The
forward method Llama4DecoderLayer.forward references lora_params but doesn't
declare it; add lora_params: Optional[dict] = None to the forward signature
(matching Llama4Attention.forward and Llama4MoE.forward) so the symbol is
defined, update the type imports if needed, and ensure any internal calls that
forward lora_params (e.g., the final self.feed_forward(...) call) keep passing
this parameter through; also scan and update callers of
Llama4DecoderLayer.forward to supply lora_params where required.
tensorrt_llm/_torch/models/modeling_qwen3_next.py (1)

904-960: ⚠️ Potential issue | 🔴 Critical

Add lora_params to the method signature of Qwen3NextLinearDecoderLayer.forward

Line 959 uses lora_params but it is not declared as a parameter. Add lora_params: Optional[dict] = None, to the method signature (line 904-911) before **kwargs, matching the pattern used in Qwen3NextFullAttentionDecoderLayer.

Suggested fix
     def forward(
         self,
         position_ids: torch.IntTensor,
         hidden_states: torch.Tensor,
         attn_metadata: AttentionMetadata,
         residual: Optional[torch.Tensor],
         spec_metadata: Optional[SpecMetadata] = None,
+        lora_params: Optional[dict] = None,
         **kwargs,
     ) -> torch.Tensor:
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/models/modeling_qwen3_next.py` around lines 904 - 960,
The forward method of Qwen3NextLinearDecoderLayer references lora_params but
doesn't declare it; update the Qwen3NextLinearDecoderLayer.forward signature to
accept lora_params: Optional[dict] = None before **kwargs (matching the pattern
used in Qwen3NextFullAttentionDecoderLayer) so the lora_params passed into
self.mlp is defined and available inside the method.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cpp/tensorrt_llm/runtime/loraModule.cpp`:
- Around line 29-30: The C++ multiplications (sharedExpertHidden =
sharedExpertHiddenSize * tpSize and moeHidden = moeHiddenSize * tpSize) assume
both inputs are per-TP, but shared_expert_hidden_size is currently returned as a
global size; fix _infer_shared_expert_size_from_adapter()/call site to convert
the adapter global dimension into a per-TP value by dividing
shared_expert_hidden_size by mapping.tp_size (same convention used for
moe_hidden_size computed as moe_intermediate // mapping.tp_size) before passing
it into LoraModule::createLoraModules(), so sharedExpertHiddenSize arrives as a
per-TP value and the C++ multiplication by tpSize remains correct.

In `@tensorrt_llm/_torch/pyexecutor/_util.py`:
- Around line 1258-1286: The _infer_shared_expert_size_from_adapter function
currently ignores tp_size, swallows all exceptions, and redundantly imports os;
update it to (1) remove the local "import os" and use the module-level os, (2)
when you compute the inferred size from the adapter tensor (the return value
inside the loop that uses tensor.shape[...] and rank from adapter_config.json),
divide that full-model size by tp_size using integer division (ensure tp_size >
0 before dividing), and (3) replace the bare "except Exception: pass" with a
specific handler like "except Exception as e:" that logs the exception (e.g.,
logging.getLogger(__name__).debug(...) or similar with exc_info) and then
returns 0 so failures are visible but non-fatal.
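The per-TP convention these comments ask for can be illustrated with a small sketch; the division mirrors the moe_hidden_size = moe_intermediate // tp_size convention quoted above:

```python
def to_per_tp_size(global_size: int, tp_size: int) -> int:
    """Convert a global intermediate size to its per-TP shard size.

    The C++ side recovers the global dimension by multiplying by tpSize,
    so the Python side must hand over per-TP values.
    """
    assert tp_size > 0, "tp_size must be positive"
    assert global_size % tp_size == 0, "size must shard evenly across TP ranks"
    return global_size // tp_size

print(to_per_tp_size(1408, 2))  # 704
```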


ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 80aa8ca and 7af771a.

📒 Files selected for processing (14)
  • cpp/include/tensorrt_llm/runtime/loraModule.h
  • cpp/tensorrt_llm/nanobind/bindings.cpp
  • cpp/tensorrt_llm/runtime/loraModule.cpp
  • cpp/tensorrt_llm/runtime/loraUtils.cpp
  • tensorrt_llm/_torch/models/modeling_llama.py
  • tensorrt_llm/_torch/models/modeling_llama_min_latency.py
  • tensorrt_llm/_torch/models/modeling_qwen3_next.py
  • tensorrt_llm/_torch/models/modeling_qwen_moe.py
  • tensorrt_llm/_torch/modules/gated_mlp.py
  • tensorrt_llm/_torch/peft/lora/layer.py
  • tensorrt_llm/_torch/pyexecutor/_util.py
  • tensorrt_llm/lora_helper.py
  • tensorrt_llm/lora_manager.py
  • tests/unittest/llmapi/test_llm_pytorch.py


coderabbitai bot commented Feb 26, 2026

📝 Walkthrough

Walkthrough

This pull request extends LoRA support to shared expert modules in mixture-of-experts (MoE) models. Three new LoRA module types are introduced across C++ and Python layers, lora_params is threaded through model forward paths, and configuration utilities are updated to handle shared expert and MoE-specific sizing.

Changes

Cohort / File(s) Summary
C++ LoRA Core Infrastructure
cpp/include/tensorrt_llm/runtime/loraModule.h, cpp/tensorrt_llm/runtime/loraModule.cpp
Added three new ModuleType enum values (kSHARED_EXPERT_H_TO_4H, kSHARED_EXPERT_4H_TO_H, kSHARED_EXPERT_GATE). Extended createLoraModules signature to accept sharedExpertHiddenSize and moeHiddenSize parameters. Updated module name-to-type and type-to-name conversions. Implemented conditional hidden size computation for shared expert and MOE-specific module instantiation.
C++ Bindings & Minor Fix
cpp/tensorrt_llm/nanobind/bindings.cpp, cpp/tensorrt_llm/runtime/loraUtils.cpp
Extended create_lora_modules Python binding to accept shared_expert_hidden_size and moe_hidden_size parameters. Fixed typo in error message ("to few" → "too few").
Python LoRA Module Types & GatedMLP
tensorrt_llm/_torch/peft/lora/layer.py, tensorrt_llm/_torch/modules/gated_mlp.py
Added three new LoraModuleType enum entries for shared expert variants. Updated GatedMLP.__init__ to accept an is_shared_expert parameter, enabling dynamic selection of LoRA module types based on shared expert mode.
Python Model Implementations
tensorrt_llm/_torch/models/modeling_llama.py, tensorrt_llm/_torch/models/modeling_llama_min_latency.py, tensorrt_llm/_torch/models/modeling_qwen3_next.py, tensorrt_llm/_torch/models/modeling_qwen_moe.py
Propagated lora_params through forward paths across Llama, Qwen MoE, and related decoder layers. Updated shared expert initialization with layer_idx and is_shared_expert flags. Threaded lora_params into shared expert and MLP invocations.
Python LoRA Management & Utilities
tensorrt_llm/lora_helper.py, tensorrt_llm/lora_manager.py, tensorrt_llm/_torch/pyexecutor/_util.py
Added shared expert module mappings to the HF-to-TRTLLM module dictionary. Extended LoraManager.LORA_MODULE_IDS with three new shared expert entries. Implemented _infer_shared_expert_size_from_adapter and enhanced LoRA target module resolution to conditionally substitute mlp_* targets with shared_expert_* variants for MoE models. Added moe_hidden_size and shared_expert_hidden_size parameter propagation to LoraModule.create_lora_modules.
Error Resilience Enhancement
tensorrt_llm/lora_manager.py (iterate_hf_lora)
Modified unsupported HF module handling to emit warnings and skip missing modules instead of raising assertion errors, improving robustness when encountering additional module types.
Test Additions
tests/unittest/llmapi/test_llm_pytorch.py
Added test_qwen_moe_shared_expert_lora function to verify that LoRA parameters affect output generation on Qwen MoE models with shared experts. Test appears duplicated in diff.
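The warn-and-skip fallback described for iterate_hf_lora (exact match, then suffix match, then warning instead of assertion) might look like this sketch — the function name and mapping contents are illustrative, not the actual implementation in tensorrt_llm/lora_manager.py:

```python
import warnings
from typing import Optional

def map_hf_module(hf_name: str, known: dict) -> Optional[str]:
    """Map an HF module name to a TRT-LLM module id, with fallback.

    Exact match first, then suffix match against known keys; unknown
    modules emit a warning and are skipped (None) rather than asserting.
    """
    if hf_name in known:
        return known[hf_name]
    for key, module_id in known.items():
        if hf_name.endswith(key):
            return module_id
    warnings.warn(f"Unsupported LoRA module '{hf_name}', skipping")
    return None

known = {"down_proj": "mlp_4h_to_h"}
print(map_hf_module("base.down_proj", known))  # mlp_4h_to_h (suffix match)
```

This keeps loading robust when an adapter contains module types the runtime does not support, at the cost of silently dropping those weights (after a logged warning).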

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 27.78% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The pull request title clearly describes the main change: adding shared expert LoRA support for MoE models in the PyTorch backend.
Description check ✅ Passed The PR description clearly explains the problem (dimension validation error with shared experts), the solution (dedicated module types), implementation details (size inference from adapters), and test coverage. It addresses all major sections of the template.


coderabbitai bot left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
tensorrt_llm/_torch/models/modeling_llama.py (1)

477-549: ⚠️ Potential issue | 🔴 Critical

lora_params is undefined in Llama4DecoderLayer.forward (runtime NameError).

Line 548 references lora_params, but this variable is never extracted from **kwargs. This will fail at runtime on this code path.

Proposed fix
     def forward(
         self,
@@
         spec_metadata: Optional[SpecMetadata] = None,
         **kwargs,
     ) -> torch.Tensor:
+        lora_params = kwargs.get("lora_params")
@@
         hidden_states = self.feed_forward(
             hidden_states,
             all_rank_num_tokens=attn_metadata.all_rank_num_tokens,
             final_all_reduce_params=AllReduceParams(
                 enable_allreduce=not self.disable_feed_forward_allreduce),
             cutlass_min_latency_mode=cutlass_min_latency_mode,
             lora_params=lora_params,
         )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/models/modeling_llama.py` around lines 477 - 549, The
forward method Llama4DecoderLayer.forward references lora_params but never
extracts it from kwargs, causing a NameError; fix by pulling lora_params =
kwargs.pop("lora_params", None) (or kwargs.get("lora_params") if you want to
leave kwargs intact) near the start of the method (after the
cutlass_min_latency_mode/residual handling) and then pass that variable into
self.feed_forward as currently done; ensure any callers relying on kwargs still
work and remove lora_params from kwargs if you used pop to avoid unexpected
extra arguments downstream.
tensorrt_llm/_torch/models/modeling_llama_min_latency.py (1)

525-555: ⚠️ Potential issue | 🟠 Major

Shared-expert LoRA is still non-functional in the min-latency Llama4 path.

Llama4MinLatencyMoE now accepts lora_params/layer_idx, but this file still constructs Llama4MinLatencyMoE without layer_idx and calls self.feed_forward(...) without lora_params. That means shared-expert LoRA is effectively disabled, and if wired later it can hit the layer_idx assertion in LoRA execution.

🛠️ Proposed fix
diff --git a/tensorrt_llm/_torch/models/modeling_llama_min_latency.py b/tensorrt_llm/_torch/models/modeling_llama_min_latency.py
@@ class Llama4MinLatencyDecoderLayer(Llama4DecoderLayer):
             self.feed_forward = Llama4MinLatencyMoE(
                 num_experts=config.num_local_experts,
                 top_k=config.num_experts_per_tok,
                 hidden_size=config.hidden_size,
                 intermediate_size=config.intermediate_size,
                 shared_expert_intermediate_size=config.intermediate_size,
                 model_config=model_config,
                 aux_stream=aux_stream,
-                dtype=config.torch_dtype)
+                dtype=config.torch_dtype,
+                layer_idx=layer_idx)
@@ class Llama4MinLatencyDecoderLayer(Llama4DecoderLayer):
         else:
             hidden_states = self.feed_forward(
                 hidden_states,
                 all_rank_num_tokens=attn_metadata.all_rank_num_tokens,
                 final_all_reduce_params=AllReduceParams(enable_allreduce=not (
                     self.fusion_config.POST_MOE_FUSION
                     or self.mapping.tp_size == 1 or self.enable_attention_dp)),
                 hidden_states_high=hidden_states_high,
+                lora_params=kwargs.get("lora_params"),
             )

Also applies to: 617-620

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/models/modeling_llama_min_latency.py` around lines 525 -
555, The min-latency Llama4 MoE path is constructing/using layers without
passing layer_idx and lora_params, disabling shared-expert LoRA and risking the
LoRA layer_idx assertion; fix by wiring the missing arguments: when you
instantiate Llama4MinLatencyMoE and Llama4MinLatencyGatedMLP (e.g., the
shared_expert) pass the current layer_idx and any lora_params through, and
update calls to self.feed_forward(...) to forward lora_params (and ensure
layer_idx is set on the layer instance if None). Specifically, locate
Llama4MinLatencyMoE construction and self.feed_forward usages and add the
layer_idx and lora_params parameters so shared-expert LoRA receives both
identifiers and params.
♻️ Duplicate comments (1)
tensorrt_llm/_torch/pyexecutor/_util.py (1)

1258-1286: ⚠️ Potential issue | 🟠 Major

Use per-TP shared-expert sizing and stop swallowing all inference errors.

Line 1259 takes tp_size, but Line 1282 returns the full inferred dimension. For TP>1, this can mis-size shared-expert LoRA module creation relative to the per-TP sizing used at Line 982. Also, Line 1284 silently drops all exceptions.

Proposed fix
 def _infer_shared_expert_size_from_adapter(adapter_dir: str,
                                            tp_size: int) -> int:
@@
-    import json
-    import os
+    import json
 
     try:
         from tensorrt_llm.lora_manager import get_model_path, load_state_dict
@@
                 with open(adapter_config_path) as f:
                     rank = json.load(f).get("r", 0)
                 if rank > 0:
-                    return tensor.shape[1] if tensor.shape[
-                        0] == rank else tensor.shape[0]
-    except Exception:
-        pass
+                    full_size = tensor.shape[1] if tensor.shape[0] == rank else tensor.shape[0]
+                    if tp_size <= 0 or full_size % tp_size != 0:
+                        logger.warning(
+                            "Invalid TP sizing while inferring shared expert size: full_size=%s, tp_size=%s",
+                            full_size, tp_size)
+                        return 0
+                    return full_size // tp_size
+    except (FileNotFoundError, KeyError, OSError, json.JSONDecodeError) as err:
+        logger.debug("Failed to infer shared expert size from %s: %s", adapter_dir, err)
+    except Exception as err:
+        logger.warning("Unexpected error inferring shared expert size from %s: %s", adapter_dir, err)
     return 0

As per coding guidelines: "Catch specific exceptions, not bare except:."
Based on learnings: "In tensorrt_llm/_torch/pyexecutor/_util.py, when calling model_engine.set_lora_model_config(), pass model_binding_config.mlp_hidden_size directly without multiplying by mapping.tp_size, as the mlp_hidden_size from get_bindings_model_config() is already the per-TP rank value needed for LoRA weight packaging."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/pyexecutor/_util.py` around lines 1258 - 1286,
_infer_shared_expert_size_from_adapter currently returns the full shared-expert
dimension and swallows all errors; change it to compute the per-TP size by
dividing the inferred dimension by the tp_size parameter (use integer division)
and replace the bare except: with specific exception handling (e.g.,
ImportError, FileNotFoundError, json.JSONDecodeError, KeyError, AttributeError)
and log the exception instead of silently passing; update the return path to
return 0 on handled errors. Ensure you update the logic that computes
inferred_dim (from tensor.shape) to assign inferred = ... then return inferred
// tp_size.
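The per-TP sizing rule can be sketched as a standalone helper. The function name and shapes below are illustrative assumptions; the real code reads the shape from the adapter's `shared_expert.down_proj.lora_A` tensor and the rank from `adapter_config.json`:

```python
def infer_per_tp_shared_expert_size(lora_a_shape, rank, tp_size):
    """Infer the per-TP shared-expert intermediate size from a LoRA-A weight shape."""
    if rank <= 0:
        return 0
    # The non-rank dimension of lora_A carries the intermediate size.
    full_size = lora_a_shape[1] if lora_a_shape[0] == rank else lora_a_shape[0]
    # Guard invalid TP sizing instead of returning a mis-sized value.
    if tp_size <= 0 or full_size % tp_size != 0:
        return 0
    return full_size // tp_size
```

For example, a `(16, 5632)` LoRA-A weight with rank 16 under TP=2 yields a per-rank size of 2816, while a non-divisible size falls back to 0.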
🧹 Nitpick comments (1)
tests/unittest/llmapi/test_llm_pytorch.py (1)

973-977: Improve logprob comparison robustness and use idiomatic dict value extraction.

The code uses zip() without length validation, which silently truncates if the logprob sequences differ in length—potentially masking test failures. Additionally, next(iter(...)) is more idiomatic than list(...)[0] for single-value extraction.

♻️ Suggested improvements
         if logprobs_with and logprobs_without:
+            if len(logprobs_with) != len(logprobs_without):
+                logprobs_differ = True
+            else:
-            for lp_w, lp_wo in zip(logprobs_with, logprobs_without):
-                lp_val_w = list(lp_w.values())[0].logprob
-                lp_val_wo = list(lp_wo.values())[0].logprob
-                if abs(lp_val_w - lp_val_wo) > 1e-6:
-                    logprobs_differ = True
-                    break
+                for lp_w, lp_wo in zip(logprobs_with, logprobs_without, strict=True):
+                    lp_val_w = next(iter(lp_w.values())).logprob
+                    lp_val_wo = next(iter(lp_wo.values())).logprob
+                    if abs(lp_val_w - lp_val_wo) > 1e-6:
+                        logprobs_differ = True
+                        break

Note: strict=True requires Python 3.10+, which matches the project requirement (requires-python = ">=3.10,<3.13").

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unittest/llmapi/test_llm_pytorch.py` around lines 973 - 977, The test
currently zips logprobs_with and logprobs_without without validating equal
length and extracts the single dict value via list(...)[0]; update the loop to
use zip(logprobs_with, logprobs_without, strict=True) so differing lengths raise
immediately, and replace list(lp_w.values())[0] and list(lp_wo.values())[0] with
the idiomatic next(iter(lp_w.values())) and next(iter(lp_wo.values())) when
extracting lp_val_w and lp_val_wo in the test that compares log probabilities.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tensorrt_llm/_torch/models/modeling_llama.py`:
- Line 340: The lambda assignments to fn0 and fn1 in modeling_llama.py
(currently: fn0 = lambda: self.shared_expert(hidden_states,
lora_params=lora_params) and the similar fn1) violate Ruff E731; replace each
lambda with a proper local def function (e.g., def fn0(): return
self.shared_expert(hidden_states, lora_params=lora_params)) and likewise for fn1
so they capture hidden_states and lora_params the same way but as named local
functions; update any references to fn0/fn1 accordingly.
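The E731 rewrite amounts to the following pattern (names here are illustrative stand-ins; the real `fn0`/`fn1` call `self.shared_expert` inside the model):

```python
def make_shared_expert_call(shared_expert, hidden_states, lora_params):
    # Instead of: fn0 = lambda: shared_expert(hidden_states, lora_params=lora_params)
    def fn0():
        return shared_expert(hidden_states, lora_params=lora_params)
    return fn0


# A trivial stand-in for the shared-expert module.
def fake_shared_expert(hidden_states, lora_params=None):
    return (hidden_states, lora_params)


fn0 = make_shared_expert_call(fake_shared_expert, [2.0], {"r": 8})
```

The named `def` captures `hidden_states` and `lora_params` by closure exactly as the lambda did, but satisfies Ruff E731 and gives the callable a debuggable name.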

In `@tensorrt_llm/_torch/models/modeling_qwen3_next.py`:
- Line 959: Qwen3NextLinearDecoderLayer.forward is forwarding unknown kwargs
(including lora_params) into self.linear_attn, causing
Qwen3NextGatedDeltaNet.forward to raise TypeError; fix by extracting lora_params
from kwargs (e.g. lora_params = kwargs.pop("lora_params", None)) inside
Qwen3NextLinearDecoderLayer.forward and then pass it explicitly to
self.linear_attn as a named argument (self.linear_attn(...,
lora_params=lora_params)); ensure no unexpected kwargs are forwarded to
Qwen3NextGatedDeltaNet.forward and, if desired, update
Qwen3NextGatedDeltaNet.forward to accept a lora_params keyword (defaulting to
None) so the value is consumed safely.

---

Outside diff comments:
In `@tensorrt_llm/_torch/models/modeling_llama_min_latency.py`:
- Around line 525-555: The min-latency Llama4 MoE path is constructing/using
layers without passing layer_idx and lora_params, disabling shared-expert LoRA
and risking the LoRA layer_idx assertion; fix by wiring the missing arguments:
when you instantiate Llama4MinLatencyMoE and Llama4MinLatencyGatedMLP (e.g., the
shared_expert) pass the current layer_idx and any lora_params through, and
update calls to self.feed_forward(...) to forward lora_params (and ensure
layer_idx is set on the layer instance if None). Specifically, locate
Llama4MinLatencyMoE construction and self.feed_forward usages and add the
layer_idx and lora_params parameters so shared-expert LoRA receives both
identifiers and params.

In `@tensorrt_llm/_torch/models/modeling_llama.py`:
- Around line 477-549: The forward method Llama4DecoderLayer.forward references
lora_params but never extracts it from kwargs, causing a NameError; fix by
pulling lora_params = kwargs.pop("lora_params", None) (or
kwargs.get("lora_params") if you want to leave kwargs intact) near the start of
the method (after the cutlass_min_latency_mode/residual handling) and then pass
that variable into self.feed_forward as currently done; ensure any callers
relying on kwargs still work and remove lora_params from kwargs if you used pop
to avoid unexpected extra arguments downstream.

---

Duplicate comments:
In `@tensorrt_llm/_torch/pyexecutor/_util.py`:
- Around line 1258-1286: _infer_shared_expert_size_from_adapter currently
returns the full shared-expert dimension and swallows all errors; change it to
compute the per-TP size by dividing the inferred dimension by the tp_size
parameter (use integer division) and replace the bare except: with specific
exception handling (e.g., ImportError, FileNotFoundError, json.JSONDecodeError,
KeyError, AttributeError) and log the exception instead of silently passing;
update the return path to return 0 on handled errors. Ensure you update the
logic that computes inferred_dim (from tensor.shape) to assign inferred = ...
then return inferred // tp_size.

---

Nitpick comments:
In `@tests/unittest/llmapi/test_llm_pytorch.py`:
- Around line 973-977: The test currently zips logprobs_with and
logprobs_without without validating equal length and extracts the single dict
value via list(...)[0]; update the loop to use zip(logprobs_with,
logprobs_without, strict=True) so differing lengths raise immediately, and
replace list(lp_w.values())[0] and list(lp_wo.values())[0] with the idiomatic
next(iter(lp_w.values())) and next(iter(lp_wo.values())) when extracting
lp_val_w and lp_val_wo in the test that compares log probabilities.

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 80aa8ca and 7af771a.

📒 Files selected for processing (14)
  • cpp/include/tensorrt_llm/runtime/loraModule.h
  • cpp/tensorrt_llm/nanobind/bindings.cpp
  • cpp/tensorrt_llm/runtime/loraModule.cpp
  • cpp/tensorrt_llm/runtime/loraUtils.cpp
  • tensorrt_llm/_torch/models/modeling_llama.py
  • tensorrt_llm/_torch/models/modeling_llama_min_latency.py
  • tensorrt_llm/_torch/models/modeling_qwen3_next.py
  • tensorrt_llm/_torch/models/modeling_qwen_moe.py
  • tensorrt_llm/_torch/modules/gated_mlp.py
  • tensorrt_llm/_torch/peft/lora/layer.py
  • tensorrt_llm/_torch/pyexecutor/_util.py
  • tensorrt_llm/lora_helper.py
  • tensorrt_llm/lora_manager.py
  • tests/unittest/llmapi/test_llm_pytorch.py

@achartier
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #36977 [ run ] triggered by Bot. Commit: f16ced0 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #36977 [ run ] completed with state SUCCESS. Commit: f16ced0
/LLM/main/L0_MergeRequest_PR pipeline #28630 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@achartier
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #37014 [ run ] triggered by Bot. Commit: b5e4d90 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #37014 [ run ] completed with state SUCCESS. Commit: b5e4d90
/LLM/main/L0_MergeRequest_PR pipeline #28658 completed with status: 'FAILURE'


Link to invocation

@achartier
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #37097 [ run ] triggered by Bot. Commit: 7f04e21 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #37097 [ run ] completed with state SUCCESS. Commit: 7f04e21
/LLM/main/L0_MergeRequest_PR pipeline #28723 completed with status: 'FAILURE'


Link to invocation

@achartier
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #37108 [ run ] triggered by Bot. Commit: 7f04e21 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #37108 [ run ] completed with state SUCCESS. Commit: 7f04e21
/LLM/main/L0_MergeRequest_PR pipeline #28728 completed with status: 'FAILURE'


Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #37566 [ run ] triggered by Bot. Commit: c02eaae Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #37566 [ run ] completed with state SUCCESS. Commit: c02eaae
/LLM/main/L0_MergeRequest_PR pipeline #29067 completed with status: 'SUCCESS'

Link to invocation

@venkywonka
Collaborator

qq i'm not sure about this - but are Qwen MoE, Qwen3Next, Llama4 verified to be the entire list of shared MoE models supported today? the PR description says it enables for all shared moe models but see changes only in the above models.

@achartier
Collaborator Author

qq i'm not sure about this - but are Qwen MoE, Qwen3Next, Llama4 verified to be the entire list of shared MoE models supported today? the PR description says it enables for all shared moe models but see changes only in the above models.

I've only tested Qwen since it has publicly available LoRA weights, I can update the description. The llama changes are just mirroring the Qwen ones for consistency. I can test it (and other MoE models) later on.

@achartier achartier force-pushed the shared-expert-moe branch from 21b3d8b to 450bc60 Compare March 9, 2026 18:54
Collaborator

@xxi-nv xxi-nv left a comment


Approve the change of the Models.

Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>
Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>
Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>
Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>
Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>
a known prefix like shared_expert

Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>
Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>
Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>
@achartier
Collaborator Author

/bot skip "ci passed"

@github-actions

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

Details

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental) --high-priority]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

--high-priority (OPTIONAL) : Run the pipeline with high priority. This option is restricted to authorized users only and will route the job to a high-priority queue.

kill

kill

Kill all running builds associated with pull request.

skip

skip --comment COMMENT

Skip testing for latest commit on pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

@achartier
Collaborator Author

/bot skip

@github-actions


@achartier
Collaborator Author

/bot reuse-pipeline

@tensorrt-cicd
Collaborator

PR_Github #38664 [ reuse-pipeline ] triggered by Bot. Commit: 7ff08a7 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #38664 [ reuse-pipeline ] completed with state SUCCESS. Commit: 7ff08a7
Can't reuse PR_Github #0 with status: UNKNOWN

Link to invocation

@achartier
Collaborator Author

/bot skip --comment "ci passed"

@tensorrt-cicd
Collaborator

PR_Github #38665 [ skip ] triggered by Bot. Commit: 7ff08a7 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #38665 [ skip ] completed with state SUCCESS. Commit: 7ff08a7
Skipping testing for commit 7ff08a7

Link to invocation

@achartier achartier merged commit 8de01ac into NVIDIA:main Mar 12, 2026
5 of 6 checks passed

5 participants