[None][feat] Add shared expert LoRA support for MoE models in PyTorch backend #11760
achartier merged 8 commits into NVIDIA:main
Conversation
|
@CodeRabbit review
✅ Actions performed: Review triggered.
|
📝 Walkthrough: This PR adds LoRA (Low-Rank Adaptation) support for shared expert layers in Mixture of Experts models. It introduces three new module types for shared experts, extends the LoRA module creation API with sizing parameters (sharedExpertHiddenSize, moeHiddenSize), and threads lora_params through forward paths in PyTorch models (Llama, Qwen variants). It also adds inference logic to derive shared expert sizing from adapters and updates module mappings and type handling accordingly.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~35 minutes
🚥 Pre-merge checks: ✅ Passed checks (3 passed)
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (5)
cpp/tensorrt_llm/nanobind/bindings.cpp (1)
Lines 199-220: ⚠️ Potential issue | 🟠 Major — Expose shared-expert module enums in nanobind too.

`create_lora_modules` now supports shared-expert sizing, but the Python binding for `LoraModuleType` still does not include `SHARED_EXPERT_*`. That leaves the public API inconsistent with runtime types.

🔧 Proposed fix
```diff
 nb::enum_<tr::LoraModule::ModuleType>(m, "LoraModuleType")
     .value("INVALID", tr::LoraModule::ModuleType::kINVALID)
     .value("ATTN_QKV", tr::LoraModule::ModuleType::kATTN_QKV)
     .value("ATTN_Q", tr::LoraModule::ModuleType::kATTN_Q)
     .value("ATTN_K", tr::LoraModule::ModuleType::kATTN_K)
     .value("ATTN_V", tr::LoraModule::ModuleType::kATTN_V)
     .value("ATTN_DENSE", tr::LoraModule::ModuleType::kATTN_DENSE)
     .value("MLP_H_TO_4H", tr::LoraModule::ModuleType::kMLP_H_TO_4H)
     .value("MLP_4H_TO_H", tr::LoraModule::ModuleType::kMLP_4H_TO_H)
     .value("MLP_GATE", tr::LoraModule::ModuleType::kMLP_GATE)
     .value("CROSS_ATTN_QKV", tr::LoraModule::ModuleType::kCROSS_ATTN_QKV)
     .value("CROSS_ATTN_Q", tr::LoraModule::ModuleType::kCROSS_ATTN_Q)
     .value("CROSS_ATTN_K", tr::LoraModule::ModuleType::kCROSS_ATTN_K)
     .value("CROSS_ATTN_V", tr::LoraModule::ModuleType::kCROSS_ATTN_V)
     .value("CROSS_ATTN_DENSE", tr::LoraModule::ModuleType::kCROSS_ATTN_DENSE)
     .value("MOE_H_TO_4H", tr::LoraModule::ModuleType::kMOE_H_TO_4H)
     .value("MOE_4H_TO_H", tr::LoraModule::ModuleType::kMOE_4H_TO_H)
     .value("MOE_GATE", tr::LoraModule::ModuleType::kMOE_GATE)
     .value("MOE_ROUTER", tr::LoraModule::ModuleType::kMOE_ROUTER)
     .value("MLP_ROUTER", tr::LoraModule::ModuleType::kMLP_ROUTER)
-    .value("MLP_GATE_UP", tr::LoraModule::ModuleType::kMLP_GATE_UP);
+    .value("MLP_GATE_UP", tr::LoraModule::ModuleType::kMLP_GATE_UP)
+    .value("SHARED_EXPERT_H_TO_4H", tr::LoraModule::ModuleType::kSHARED_EXPERT_H_TO_4H)
+    .value("SHARED_EXPERT_4H_TO_H", tr::LoraModule::ModuleType::kSHARED_EXPERT_4H_TO_H)
+    .value("SHARED_EXPERT_GATE", tr::LoraModule::ModuleType::kSHARED_EXPERT_GATE);
```

Also applies to: 235-235
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/tensorrt_llm/nanobind/bindings.cpp` around lines 199 - 220: The Python nanobind enum for tr::LoraModule::ModuleType is missing the SHARED_EXPERT_* variants, causing the public API to differ from runtime types; update the nb::enum_ binding in bindings.cpp (the block that defines "LoraModuleType") to add .value(...) entries for each SHARED_EXPERT variant present in tr::LoraModule::ModuleType (use the exact enum identifiers from tr::LoraModule::ModuleType, e.g., kSHARED_EXPERT_... as defined in the C++ enum) so they map to names like "SHARED_EXPERT_H_TO_4H", "SHARED_EXPERT_4H_TO_H", "SHARED_EXPERT_GATE", etc., and make the same additions in the second enum declaration mentioned (around the other occurrence) so create_lora_modules and the nanobind enum stay consistent.

tensorrt_llm/_torch/models/modeling_llama_min_latency.py (2)
Lines 609-624: ⚠️ Potential issue | 🟠 Major — `lora_params` support is added in MoE but not wired from the min-latency decoder.

`Llama4MinLatencyMoE.forward` now accepts `lora_params` (Line 617), but `Llama4MinLatencyDecoderLayer.forward` does not pass it to `self.feed_forward(...)`. In min-latency mode, LoRA can be dropped silently.

💡 Suggested fix
```diff
 class Llama4MinLatencyDecoderLayer(Llama4DecoderLayer):
 @@
     def forward(
         self,
         position_ids: torch.LongTensor,
         hidden_states: Union[torch.Tensor, Fp4QuantizedTensor],
         attn_metadata: AttentionMetadata,
         residual: Optional[torch.Tensor],
         spec_metadata: Optional[SpecMetadata] = None,
         **kwargs,
     ) -> torch.Tensor:
+        lora_params = kwargs.get("lora_params")
 @@
         if self.is_mlp_layer:
             hidden_states = self.feed_forward(
                 hidden_states,
                 all_rank_num_tokens=attn_metadata.all_rank_num_tokens,
                 final_all_reduce_params=AllReduceParams(enable_allreduce=not (
                     self.fusion_config.POST_MLP_FUSION
                     or self.mapping.tp_size == 1
                     or self.enable_attention_dp)),
+                lora_params=lora_params,
             )
         else:
             hidden_states = self.feed_forward(
                 hidden_states,
                 all_rank_num_tokens=attn_metadata.all_rank_num_tokens,
                 final_all_reduce_params=AllReduceParams(enable_allreduce=not (
                     self.fusion_config.POST_MOE_FUSION
                     or self.mapping.tp_size == 1
                     or self.enable_attention_dp)),
                 hidden_states_high=hidden_states_high,
+                lora_params=lora_params,
             )
```
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/models/modeling_llama_min_latency.py` around lines 609 - 624, Llama4MinLatencyMoE.forward accepts lora_params but Llama4MinLatencyDecoderLayer.forward doesn't forward it into self.feed_forward, causing LoRA to be ignored; update Llama4MinLatencyDecoderLayer.forward to accept an optional lora_params argument (or propagate an existing one) and pass it into the call to self.feed_forward (e.g., self.feed_forward(..., lora_params=lora_params)), and ensure callers that invoke Llama4MinLatencyDecoderLayer.forward (including wherever the decoder is invoked from the min-latency decoder/decoder stack) propagate the lora_params from Llama4MinLatencyMoE.forward so the MoE path and regular feed_forward both receive the LoRA params.
Lines 513-555: ⚠️ Potential issue | 🟠 Major — `layer_idx` was added but is not consistently propagated in the min-latency construction path.

`Llama4MinLatencyMoE` now takes `layer_idx` and forwards it into shared-expert internals, but its creation path in `Llama4MinLatencyDecoderLayer` omits `layer_idx`, leaving the new per-layer context unset.

💡 Suggested fix
```diff
         else:
             self.feed_forward = Llama4MinLatencyMoE(
                 num_experts=config.num_local_experts,
                 top_k=config.num_experts_per_tok,
                 hidden_size=config.hidden_size,
                 intermediate_size=config.intermediate_size,
                 shared_expert_intermediate_size=config.intermediate_size,
                 model_config=model_config,
                 aux_stream=aux_stream,
-                dtype=config.torch_dtype)
+                dtype=config.torch_dtype,
+                layer_idx=layer_idx)
```
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/models/modeling_llama_min_latency.py` around lines 513 - 555: The decoder-layer construction path must pass the per-layer index into the min-latency MoE so the shared-expert gets correct context: update the place where Llama4MinLatencyDecoderLayer instantiates Llama4MinLatencyMoE to forward the layer index (e.g., pass layer_idx=self.layer_idx or layer_idx=layer_idx) into the Llama4MinLatencyMoE constructor; ensure any other internal min-latency submodule creations in Llama4MinLatencyDecoderLayer (such as shared_expert or calls that create Llama4MinLatencyGatedMLP) also receive the same layer_idx so the per-layer context is consistently set.

tensorrt_llm/_torch/models/modeling_llama.py (1)
Lines 477-549: ⚠️ Potential issue | 🔴 Critical — `Llama4DecoderLayer.forward` uses `lora_params` without declaring it.

At line 548, `lora_params` is referenced but not defined in the function scope. This will cause a `NameError` at runtime. Add `lora_params: Optional[dict] = None` to the function signature, consistent with the signatures of `Llama4Attention.forward` and `Llama4MoE.forward` in the same file.

Suggested fix
```diff
     def forward(
         self,
         position_ids: torch.IntTensor,
         hidden_states: Union[torch.Tensor, Fp4QuantizedTensor],
         attn_metadata: AttentionMetadata,
         residual: Optional[torch.Tensor],
         spec_metadata: Optional[SpecMetadata] = None,
+        lora_params: Optional[dict] = None,
         **kwargs,
     ) -> torch.Tensor:
```
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/models/modeling_llama.py` around lines 477 - 549: The forward method Llama4DecoderLayer.forward references lora_params but doesn't declare it; add lora_params: Optional[dict] = None to the forward signature (matching Llama4Attention.forward and Llama4MoE.forward) so the symbol is defined, update the type imports if needed, and ensure any internal calls that forward lora_params (e.g., the final self.feed_forward(...) call) keep passing this parameter through; also scan and update callers of Llama4DecoderLayer.forward to supply lora_params where required.

tensorrt_llm/_torch/models/modeling_qwen3_next.py (1)
Lines 904-960: ⚠️ Potential issue | 🔴 Critical — Add `lora_params` to the method signature of `Qwen3NextLinearDecoderLayer.forward`.

Line 959 uses `lora_params` but it is not declared as a parameter. Add `lora_params: Optional[dict] = None,` to the method signature (lines 904-911) before `**kwargs`, matching the pattern used in `Qwen3NextFullAttentionDecoderLayer`.

Suggested fix
```diff
     def forward(
         self,
         position_ids: torch.IntTensor,
         hidden_states: torch.Tensor,
         attn_metadata: AttentionMetadata,
         residual: Optional[torch.Tensor],
         spec_metadata: Optional[SpecMetadata] = None,
+        lora_params: Optional[dict] = None,
         **kwargs,
     ) -> torch.Tensor:
```
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/models/modeling_qwen3_next.py` around lines 904 - 960, The forward method of Qwen3NextLinearDecoderLayer references lora_params but doesn't declare it; update the Qwen3NextLinearDecoderLayer.forward signature to accept lora_params: Optional[dict] = None before **kwargs (matching the pattern used in Qwen3NextFullAttentionDecoderLayer) so the lora_params passed into self.mlp is defined and available inside the method.
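The pattern behind these signature fixes is the same in every flagged layer: declare `lora_params` explicitly instead of letting it hide in `**kwargs`, so later references cannot raise `NameError`. A minimal, framework-free sketch of the idea (the class and method names are illustrative stand-ins, not the real TRT-LLM classes):

```python
from typing import Optional


class LayerSketch:
    """Stand-in decoder layer; `mlp` just records whether LoRA params arrived."""

    def mlp(self, x, lora_params: Optional[dict] = None):
        return x, lora_params is not None

    def forward(self, x, lora_params: Optional[dict] = None, **kwargs):
        # Because lora_params is declared in the signature, referencing it
        # below can never raise NameError, whether or not the caller passed it.
        return self.mlp(x, lora_params=lora_params)


out, lora_applied = LayerSketch().forward([1.0], lora_params={"rank": 8})
```

Callers that already pass `lora_params` as a keyword keep working unchanged; callers that omit it get the `None` default instead of a runtime crash.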
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cpp/tensorrt_llm/runtime/loraModule.cpp`:
- Around line 29-30: The C++ multiplications (sharedExpertHidden =
sharedExpertHiddenSize * tpSize and moeHidden = moeHiddenSize * tpSize) assume
both inputs are per-TP, but shared_expert_hidden_size is currently returned as a
global size; fix _infer_shared_expert_size_from_adapter()/call site to convert
the adapter global dimension into a per-TP value by dividing
shared_expert_hidden_size by mapping.tp_size (same convention used for
moe_hidden_size computed as moe_intermediate // mapping.tp_size) before passing
it into LoraModule::createLoraModules(), so sharedExpertHiddenSize arrives as a
per-TP value and the C++ multiplication by tpSize remains correct.
In `@tensorrt_llm/_torch/pyexecutor/_util.py`:
- Around line 1258-1286: The _infer_shared_expert_size_from_adapter function
currently ignores tp_size, swallows all exceptions, and redundantly imports os;
update it to (1) remove the local "import os" and use the module-level os, (2)
when you compute the inferred size from the adapter tensor (the return value
inside the loop that uses tensor.shape[...] and rank from adapter_config.json),
divide that full-model size by tp_size using integer division (ensure tp_size >
0 before dividing), and (3) replace the bare "except Exception: pass" with a
specific handler like "except Exception as e:" that logs the exception (e.g.,
logging.getLogger(__name__).debug(...) or similar with exc_info) and then
returns 0 so failures are visible but non-fatal.
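The inference step that this prompt refers to can be sketched against an in-memory state dict (the real helper loads adapter weights from disk; the function name and key layout below are illustrative, following the `shared_expert.down_proj.lora_A` convention the PR describes):

```python
import json


def infer_shared_expert_size(state_dict: dict, adapter_config_json: str) -> int:
    """Scan a LoRA state dict (name -> weight shape) for a shared_expert
    down_proj lora_A weight and return the intermediate size, i.e. the
    non-rank dimension; return 0 if no such weight or rank is found."""
    rank = json.loads(adapter_config_json).get("r", 0)
    if rank <= 0:
        return 0
    for name, shape in state_dict.items():
        if "shared_expert" in name and "down_proj" in name and "lora_A" in name:
            # lora_A for down_proj is (rank, intermediate_size)
            return shape[1] if shape[0] == rank else shape[0]
    return 0


# Example: a Qwen-MoE-style adapter with rank 8 and intermediate size 2816.
sd = {"model.layers.0.mlp.shared_expert.down_proj.lora_A.weight": (8, 2816)}
cfg = json.dumps({"r": 8})
print(infer_shared_expert_size(sd, cfg))  # 2816
```

Returning 0 on any miss keeps the failure non-fatal, matching the behavior the prompt asks to preserve.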
---
Outside diff comments:
In `@cpp/tensorrt_llm/nanobind/bindings.cpp`:
- Around line 199-220: The Python nanobind enum for tr::LoraModule::ModuleType
is missing the SHARED_EXPERT_* variants, causing the public API to differ from
runtime types; update the nb::enum_ binding in bindings.cpp (the block that
defines "LoraModuleType") to add .value(...) entries for each SHARED_EXPERT
variant present in tr::LoraModule::ModuleType (use the exact enum identifiers
from tr::LoraModule::ModuleType, e.g., kSHARED_EXPERT_... as defined in the C++
enum) so they map to names like "SHARED_EXPERT_H_TO_4H",
"SHARED_EXPERT_4H_TO_H", "SHARED_EXPERT_GATE", etc., and make the same additions
in the second enum declaration mentioned (around the other occurrence) so
create_lora_modules and the nanobind enum stay consistent.
In `@tensorrt_llm/_torch/models/modeling_llama_min_latency.py`:
- Around line 609-624: Llama4MinLatencyMoE.forward accepts lora_params but
Llama4MinLatencyDecoderLayer.forward doesn't forward it into self.feed_forward,
causing LoRA to be ignored; update Llama4MinLatencyDecoderLayer.forward to
accept an optional lora_params argument (or propagate an existing one) and pass
it into the call to self.feed_forward (e.g., self.feed_forward(...,
lora_params=lora_params)), and ensure callers that invoke
Llama4MinLatencyDecoderLayer.forward (including wherever the decoder is invoked
from the min-latency decoder/decoder stack) propagate the lora_params from
Llama4MinLatencyMoE.forward so the MoE path and regular feed_forward both
receive the LoRA params.
- Around line 513-555: The decoder-layer construction path must pass the
per-layer index into the min-latency MoE so the shared-expert gets correct
context: update the place where Llama4MinLatencyDecoderLayer instantiates
Llama4MinLatencyMoE to forward the layer index (e.g., pass
layer_idx=self.layer_idx or layer_idx=layer_idx) into the Llama4MinLatencyMoE
constructor; ensure any other internal min-latency submodule creations in
Llama4MinLatencyDecoderLayer (such as shared_expert or calls that create
Llama4MinLatencyGatedMLP) also receive the same layer_idx so the per-layer
context is consistently set.
In `@tensorrt_llm/_torch/models/modeling_llama.py`:
- Around line 477-549: The forward method Llama4DecoderLayer.forward references
lora_params but doesn't declare it; add lora_params: Optional[dict] = None to
the forward signature (matching Llama4Attention.forward and Llama4MoE.forward)
so the symbol is defined, update the type imports if needed, and ensure any
internal calls that forward lora_params (e.g., the final self.feed_forward(...)
call) keep passing this parameter through; also scan and update callers of
Llama4DecoderLayer.forward to supply lora_params where required.
In `@tensorrt_llm/_torch/models/modeling_qwen3_next.py`:
- Around line 904-960: The forward method of Qwen3NextLinearDecoderLayer
references lora_params but doesn't declare it; update the
Qwen3NextLinearDecoderLayer.forward signature to accept lora_params:
Optional[dict] = None before **kwargs (matching the pattern used in
Qwen3NextFullAttentionDecoderLayer) so the lora_params passed into self.mlp is
defined and available inside the method.
ℹ️ Review info
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (14)

- cpp/include/tensorrt_llm/runtime/loraModule.h
- cpp/tensorrt_llm/nanobind/bindings.cpp
- cpp/tensorrt_llm/runtime/loraModule.cpp
- cpp/tensorrt_llm/runtime/loraUtils.cpp
- tensorrt_llm/_torch/models/modeling_llama.py
- tensorrt_llm/_torch/models/modeling_llama_min_latency.py
- tensorrt_llm/_torch/models/modeling_qwen3_next.py
- tensorrt_llm/_torch/models/modeling_qwen_moe.py
- tensorrt_llm/_torch/modules/gated_mlp.py
- tensorrt_llm/_torch/peft/lora/layer.py
- tensorrt_llm/_torch/pyexecutor/_util.py
- tensorrt_llm/lora_helper.py
- tensorrt_llm/lora_manager.py
- tests/unittest/llmapi/test_llm_pytorch.py
📝 Walkthrough: This pull request extends LoRA support to shared expert modules in mixture-of-experts (MoE) models. Three new LoRA module types are introduced across C++ and Python layers, lora_params is threaded through model forward paths, and configuration utilities are updated to handle shared expert and MoE-specific sizing.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ❌ Failed checks (1 warning) | ✅ Passed checks (2 passed)
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
tensorrt_llm/_torch/models/modeling_llama.py (1)
Lines 477-549: ⚠️ Potential issue | 🔴 Critical — `lora_params` is undefined in `Llama4DecoderLayer.forward` (runtime NameError).

Line 548 references `lora_params`, but this variable is never extracted from `**kwargs`. This will fail at runtime on this code path.

Proposed fix
```diff
     def forward(
         self,
 @@
         spec_metadata: Optional[SpecMetadata] = None,
         **kwargs,
     ) -> torch.Tensor:
+        lora_params = kwargs.get("lora_params")
 @@
         hidden_states = self.feed_forward(
             hidden_states,
             all_rank_num_tokens=attn_metadata.all_rank_num_tokens,
             final_all_reduce_params=AllReduceParams(
                 enable_allreduce=not self.disable_feed_forward_allreduce),
             cutlass_min_latency_mode=cutlass_min_latency_mode,
             lora_params=lora_params,
         )
```
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/models/modeling_llama.py` around lines 477 - 549, The forward method Llama4DecoderLayer.forward references lora_params but never extracts it from kwargs, causing a NameError; fix by pulling lora_params = kwargs.pop("lora_params", None) (or kwargs.get("lora_params") if you want to leave kwargs intact) near the start of the method (after the cutlass_min_latency_mode/residual handling) and then pass that variable into self.feed_forward as currently done; ensure any callers relying on kwargs still work and remove lora_params from kwargs if you used pop to avoid unexpected extra arguments downstream.tensorrt_llm/_torch/models/modeling_llama_min_latency.py (1)
Lines 525-555: ⚠️ Potential issue | 🟠 Major — Shared-expert LoRA is still non-functional in the min-latency Llama4 path.

`Llama4MinLatencyMoE` now accepts `lora_params`/`layer_idx`, but this file still constructs `Llama4MinLatencyMoE` without `layer_idx` and calls `self.feed_forward(...)` without `lora_params`. That means shared-expert LoRA is effectively disabled, and if wired later it can hit the `layer_idx` assertion in LoRA execution.

🛠️ Proposed fix
```diff
diff --git a/tensorrt_llm/_torch/models/modeling_llama_min_latency.py b/tensorrt_llm/_torch/models/modeling_llama_min_latency.py
@@ class Llama4MinLatencyDecoderLayer(Llama4DecoderLayer):
             self.feed_forward = Llama4MinLatencyMoE(
                 num_experts=config.num_local_experts,
                 top_k=config.num_experts_per_tok,
                 hidden_size=config.hidden_size,
                 intermediate_size=config.intermediate_size,
                 shared_expert_intermediate_size=config.intermediate_size,
                 model_config=model_config,
                 aux_stream=aux_stream,
-                dtype=config.torch_dtype)
+                dtype=config.torch_dtype,
+                layer_idx=layer_idx)
@@ class Llama4MinLatencyDecoderLayer(Llama4DecoderLayer):
         else:
             hidden_states = self.feed_forward(
                 hidden_states,
                 all_rank_num_tokens=attn_metadata.all_rank_num_tokens,
                 final_all_reduce_params=AllReduceParams(enable_allreduce=not (
                     self.fusion_config.POST_MOE_FUSION
                     or self.mapping.tp_size == 1
                     or self.enable_attention_dp)),
                 hidden_states_high=hidden_states_high,
+                lora_params=kwargs.get("lora_params"),
             )
```

Also applies to: 617-620
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/models/modeling_llama_min_latency.py` around lines 525 - 555, The min-latency Llama4 MoE path is constructing/using layers without passing layer_idx and lora_params, disabling shared-expert LoRA and risking the LoRA layer_idx assertion; fix by wiring the missing arguments: when you instantiate Llama4MinLatencyMoE and Llama4MinLatencyGatedMLP (e.g., the shared_expert) pass the current layer_idx and any lora_params through, and update calls to self.feed_forward(...) to forward lora_params (and ensure layer_idx is set on the layer instance if None). Specifically, locate Llama4MinLatencyMoE construction and self.feed_forward usages and add the layer_idx and lora_params parameters so shared-expert LoRA receives both identifiers and params.
♻️ Duplicate comments (1)
tensorrt_llm/_torch/pyexecutor/_util.py (1)
Lines 1258-1286: ⚠️ Potential issue | 🟠 Major — Use per-TP shared-expert sizing and stop swallowing all inference errors.

Line 1259 takes `tp_size`, but Line 1282 returns the full inferred dimension. For TP>1, this can mis-size shared-expert LoRA module creation relative to the per-TP sizing used at Line 982. Also, Line 1284 silently drops all exceptions.

Proposed fix
```diff
 def _infer_shared_expert_size_from_adapter(adapter_dir: str, tp_size: int) -> int:
 @@
-    import json
-    import os
+    import json
     try:
         from tensorrt_llm.lora_manager import get_model_path, load_state_dict
 @@
         with open(adapter_config_path) as f:
             rank = json.load(f).get("r", 0)
         if rank > 0:
-            return tensor.shape[1] if tensor.shape[
-                0] == rank else tensor.shape[0]
-    except Exception:
-        pass
+            full_size = tensor.shape[1] if tensor.shape[0] == rank else tensor.shape[0]
+            if tp_size <= 0 or full_size % tp_size != 0:
+                logger.warning(
+                    "Invalid TP sizing while inferring shared expert size: full_size=%s, tp_size=%s",
+                    full_size, tp_size)
+                return 0
+            return full_size // tp_size
+    except (FileNotFoundError, KeyError, OSError, json.JSONDecodeError) as err:
+        logger.debug("Failed to infer shared expert size from %s: %s", adapter_dir, err)
+    except Exception as err:
+        logger.warning("Unexpected error inferring shared expert size from %s: %s", adapter_dir, err)
     return 0
```

As per coding guidelines: "Catch specific exceptions, not bare `except:`."
Based on learnings: "In tensorrt_llm/_torch/pyexecutor/_util.py, when calling model_engine.set_lora_model_config(), pass model_binding_config.mlp_hidden_size directly without multiplying by mapping.tp_size, as the mlp_hidden_size from get_bindings_model_config() is already the per-TP rank value needed for LoRA weight packaging."🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/pyexecutor/_util.py` around lines 1258 - 1286, _infer_shared_expert_size_from_adapter currently returns the full shared-expert dimension and swallows all errors; change it to compute the per-TP size by dividing the inferred dimension by the tp_size parameter (use integer division) and replace the bare except: with specific exception handling (e.g., ImportError, FileNotFoundError, json.JSONDecodeError, KeyError, AttributeError) and log the exception instead of silently passing; update the return path to return 0 on handled errors. Ensure you update the logic that computes inferred_dim (from tensor.shape) to assign inferred = ... then return inferred // tp_size.
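The sizing rule this comment asks for can be checked in isolation. Below is a self-contained sketch (the real helper reads adapter files; here the tensor shape, rank, and tp_size are passed in directly, and the function name is illustrative):

```python
def per_tp_shared_expert_size(shape: tuple, rank: int, tp_size: int) -> int:
    """Derive the per-TP shared-expert intermediate size from a LoRA weight shape.

    A lora_A weight is (rank, full_size) and a lora_B weight is (full_size, rank);
    whichever dimension is not the rank is the full (global) intermediate size,
    which must then be split evenly across tensor-parallel ranks.
    """
    if rank <= 0 or tp_size <= 0:
        return 0
    full_size = shape[1] if shape[0] == rank else shape[0]
    if full_size % tp_size != 0:
        return 0  # cannot shard evenly; treat as "no shared expert"
    return full_size // tp_size


# A rank-8 lora_A weight of shape (8, 2816) on tp_size=2 yields 1408 per rank.
print(per_tp_shared_expert_size((8, 2816), rank=8, tp_size=2))  # 1408
```

This mirrors the convention the review cites for `moe_hidden_size` (`moe_intermediate // mapping.tp_size`), so the C++ side's multiplication by `tpSize` round-trips back to the global size.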
🧹 Nitpick comments (1)
tests/unittest/llmapi/test_llm_pytorch.py (1)
Lines 973-977: Improve logprob comparison robustness and use idiomatic dict value extraction.

The code uses `zip()` without length validation, which silently truncates if the logprob sequences differ in length, potentially masking test failures. Additionally, `next(iter(...))` is more idiomatic than `list(...)[0]` for single-value extraction.

♻️ Suggested improvements
```diff
     if logprobs_with and logprobs_without:
+        if len(logprobs_with) != len(logprobs_without):
+            logprobs_differ = True
+        else:
-        for lp_w, lp_wo in zip(logprobs_with, logprobs_without):
-            lp_val_w = list(lp_w.values())[0].logprob
-            lp_val_wo = list(lp_wo.values())[0].logprob
+            for lp_w, lp_wo in zip(logprobs_with, logprobs_without, strict=True):
+                lp_val_w = next(iter(lp_w.values())).logprob
+                lp_val_wo = next(iter(lp_wo.values())).logprob
             if abs(lp_val_w - lp_val_wo) > 1e-6:
                 logprobs_differ = True
                 break
```

Note: `strict=True` requires Python 3.10+, which matches the project requirement (requires-python = ">=3.10,<3.13").

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unittest/llmapi/test_llm_pytorch.py` around lines 973 - 977, The test currently zips logprobs_with and logprobs_without without validating equal length and extracts the single dict value via list(...)[0]; update the loop to use zip(logprobs_with, logprobs_without, strict=True) so differing lengths raise immediately, and replace list(lp_w.values())[0] and list(lp_wo.values())[0] with the idiomatic next(iter(lp_w.values())) and next(iter(lp_wo.values())) when extracting lp_val_w and lp_val_wo in the test that compares log probabilities.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tensorrt_llm/_torch/models/modeling_llama.py`:
- Line 340: The lambda assignments to fn0 and fn1 in modeling_llama.py
(currently: fn0 = lambda: self.shared_expert(hidden_states,
lora_params=lora_params) and the similar fn1) violate Ruff E731; replace each
lambda with a proper local def function (e.g., def fn0(): return
self.shared_expert(hidden_states, lora_params=lora_params)) and likewise for fn1
so they capture hidden_states and lora_params the same way but as named local
functions; update any references to fn0/fn1 accordingly.
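Ruff E731 asks for named local functions instead of lambda assignments; the rewrite keeps the same closure semantics. A small sketch (the callables passed in are trivial stand-ins for the real `shared_expert`/routed-expert modules):

```python
# Before (E731): fn0 = lambda: shared_expert(hidden_states, lora_params=lora_params)
# After: a named local function captures the same variables by closure.


def make_branches(shared_expert, routed_experts, hidden_states, lora_params):
    def fn0():
        return shared_expert(hidden_states, lora_params=lora_params)

    def fn1():
        return routed_experts(hidden_states)

    return fn0, fn1


fn0, fn1 = make_branches(
    shared_expert=lambda x, lora_params=None: ("shared", lora_params),
    routed_experts=lambda x: "routed",
    hidden_states=[0.0],
    lora_params={"rank": 8},
)
print(fn0())  # ('shared', {'rank': 8})
print(fn1())  # routed
```

A `def` behaves identically to the lambda for late-bound closure variables, but gets a real `__name__` for tracebacks and satisfies the linter.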
In `@tensorrt_llm/_torch/models/modeling_qwen3_next.py`:
- Line 959: Qwen3NextLinearDecoderLayer.forward is forwarding unknown kwargs
(including lora_params) into self.linear_attn, causing
Qwen3NextGatedDeltaNet.forward to raise TypeError; fix by extracting lora_params
from kwargs (e.g. lora_params = kwargs.pop("lora_params", None)) inside
Qwen3NextLinearDecoderLayer.forward and then pass it explicitly to
self.linear_attn as a named argument (self.linear_attn(...,
lora_params=lora_params)); ensure no unexpected kwargs are forwarded to
Qwen3NextGatedDeltaNet.forward and, if desired, update
Qwen3NextGatedDeltaNet.forward to accept a lora_params keyword (defaulting to
None) so the value is consumed safely.
---
Outside diff comments:
In `@tensorrt_llm/_torch/models/modeling_llama_min_latency.py`:
- Around line 525-555: The min-latency Llama4 MoE path is constructing/using
layers without passing layer_idx and lora_params, disabling shared-expert LoRA
and risking the LoRA layer_idx assertion; fix by wiring the missing arguments:
when you instantiate Llama4MinLatencyMoE and Llama4MinLatencyGatedMLP (e.g., the
shared_expert) pass the current layer_idx and any lora_params through, and
update calls to self.feed_forward(...) to forward lora_params (and ensure
layer_idx is set on the layer instance if None). Specifically, locate
Llama4MinLatencyMoE construction and self.feed_forward usages and add the
layer_idx and lora_params parameters so shared-expert LoRA receives both
identifiers and params.
In `@tensorrt_llm/_torch/models/modeling_llama.py`:
- Around line 477-549: The forward method Llama4DecoderLayer.forward references
lora_params but never extracts it from kwargs, causing a NameError; fix by
pulling lora_params = kwargs.pop("lora_params", None) (or
kwargs.get("lora_params") if you want to leave kwargs intact) near the start of
the method (after the cutlass_min_latency_mode/residual handling) and then pass
that variable into self.feed_forward as currently done; ensure any callers
relying on kwargs still work and remove lora_params from kwargs if you used pop
to avoid unexpected extra arguments downstream.
---
Duplicate comments:
In `@tensorrt_llm/_torch/pyexecutor/_util.py`:
- Around line 1258-1286: _infer_shared_expert_size_from_adapter currently
returns the full shared-expert dimension and swallows all errors; change it to
compute the per-TP size by dividing the inferred dimension by the tp_size
parameter (use integer division) and replace the bare except: with specific
exception handling (e.g., ImportError, FileNotFoundError, json.JSONDecodeError,
KeyError, AttributeError) and log the exception instead of silently passing;
update the return path to return 0 on handled errors. Ensure you update the
logic that computes inferred_dim (from tensor.shape) to assign inferred = ...
then return inferred // tp_size.
---
Nitpick comments:
In `@tests/unittest/llmapi/test_llm_pytorch.py`:
- Around line 973-977: The test currently zips logprobs_with and
logprobs_without without validating equal length and extracts the single dict
value via list(...)[0]; update the loop to use zip(logprobs_with,
logprobs_without, strict=True) so differing lengths raise immediately, and
replace list(lp_w.values())[0] and list(lp_wo.values())[0] with the idiomatic
next(iter(lp_w.values())) and next(iter(lp_wo.values())) when extracting
lp_val_w and lp_val_wo in the test that compares log probabilities.
|
/bot run |
|
PR_Github #36977 [ run ] triggered by Bot. Commit: |
|
PR_Github #36977 [ run ] completed with state
|
f16ced0 to
b5e4d90
Compare
|
/bot run |
|
PR_Github #37014 [ run ] triggered by Bot. Commit: |
|
PR_Github #37014 [ run ] completed with state
|
|
/bot run |
|
PR_Github #37097 [ run ] triggered by Bot. Commit: |
|
PR_Github #37097 [ run ] completed with state
|
|
/bot run |
|
PR_Github #37108 [ run ] triggered by Bot. Commit: |
|
PR_Github #37108 [ run ] completed with state
|
|
PR_Github #37566 [ run ] triggered by Bot. Commit: |
|
PR_Github #37566 [ run ] completed with state |
|
qq, I'm not sure about this — but are Qwen MoE, Qwen3Next, and Llama4 verified to be the entire list of shared-expert MoE models supported today? The PR description says it enables this for all shared MoE models, but I see changes only in the above models. |
I've only tested Qwen since it has publicly available LoRA weights, I can update the description. The llama changes are just mirroring the Qwen ones for consistency. I can test it (and other MoE models) later on. |
21b3d8b to
450bc60
Compare
xxi-nv
left a comment
Approve the change of the Models.
Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>
a known prefix like shared_expert Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>
35ea6a6 to
7ff08a7
Compare
|
/bot skip "ci passed" |
GitHub Bot Help

Provide a user friendly way for developers to interact with a Jenkins server. See details below for each supported subcommand.

run
Launch build/test pipelines. All previously running jobs will be killed.

kill
Kill all running builds associated with the pull request.

skip
Skip testing for the latest commit on the pull request.

reuse-pipeline
Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break. |
|
/bot skip |
|
/bot reuse-pipeline |
|
PR_Github #38664 [ reuse-pipeline ] triggered by Bot. Commit: |
|
PR_Github #38664 [ reuse-pipeline ] completed with state |
|
/bot skip --comment "ci passed" |
|
PR_Github #38665 [ skip ] triggered by Bot. Commit: |
|
PR_Github #38665 [ skip ] completed with state |
Summary by CodeRabbit
Release Notes
New Features
Bug Fixes
Tests
Description
Adds dedicated shared_expert_* LoRA module types so that MoE models with shared experts (Qwen MoE, Qwen3Next, Llama4; only Qwen MoE is tested) can correctly apply LoRA adapters in the PyTorch backend. Previously, shared expert LoRA failed with a C++ dimension validation error because a single mlpHiddenSize was used for all MLP-type LoRA modules, but MoE models have up to three different intermediate sizes (dense MLP, shared expert, routed experts).

The shared expert intermediate size is inferred from the actual LoRA adapter weights (_infer_shared_expert_size_from_adapter), not the model config. This is necessary because some adapters (e.g., upcycled models) have a shared_expert_intermediate_size in the config that doesn't match the actual adapter weight dimensions. The function scans for a shared_expert.down_proj.lora_A weight in the adapter, reads the LoRA rank from adapter_config.json, and extracts the intermediate size from the non-rank dimension of the weight tensor.

When a model has a shared expert (shared_expert_hidden_size > 0), user-specified mlp_* target modules are automatically replaced with shared_expert_* equivalents for pure MoE models (no mlp_only_layers). This would need to be updated to support mixed (dense+MoE) models.

Test Coverage

tests/unittest/llmapi/test_llm_pytorch.py::test_qwen_moe_shared_expert_lora

PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment `/bot help`.