[TRTLLM-10232][feat] Support LoRA adapter for nemotron-h models #12154
Wanli-Jiang wants to merge 3 commits into NVIDIA:main from
Conversation
📝 Walkthrough
This PR extends LoRA support to Mamba and MoE architectures by introducing four new module types (Mamba in/out projections and MoE latent up/down projections), updating the model configuration to track LoRA-applicable layers separately from attention layers, and integrating LoRA parameters throughout the Nemotron-H and Mamba2 forward paths.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks | ✅ 1 | ❌ 2
❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (1 passed)
Actionable comments posted: 8
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
tensorrt_llm/_torch/models/modeling_deepseekv3.py (1)
855-904: ⚠️ Potential issue | 🟠 Major: The new router-LoRA path is never wired in.
Deepseekv3MoE still instantiates gate_cls(... moe_backend=model_config.moe_backend) on lines 960-969 without lora_config, and compute_routed_output() still calls self.gate(hidden_states) on line 1154 without lora_params or layer_idx. As written, self.gate_lora stays None, so this new code never applies a router delta.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/models/modeling_deepseekv3.py` around lines 855-904: the router LoRA is never used because Deepseekv3MoE constructs the gate without passing lora_config, and compute_routed_output calls the gate without LoRA params. Fix by (1) passing lora_config into the gate when instantiating gate_cls (update the call that currently does gate_cls(... moe_backend=model_config.moe_backend) to include lora_config=lora_config) and (2) updating compute_routed_output to call the gate with the LoRA args (replace self.gate(hidden_states) with self.gate(hidden_states, lora_params, layer_idx)) so DeepseekV3Gate.gate_lora can be created and applied.
tensorrt_llm/_torch/models/modeling_nemotron_h.py (1)
926-940: ⚠️ Potential issue | 🟠 Major: MTP LoRA lookup is ambiguous for repeated sublayer types.
These calls now thread lora_params into every MTP sublayer, but each NemotronHMTPDecoderLayer in this block is constructed with the same layer_idx=self.layer_idx, and LoraLayer lookup is keyed by layer_idx. If mtp_hybrid_override_pattern contains the same LoRA-capable type more than once, those sublayers will alias the same adapter slot instead of getting distinct weights. Please give each MTP sublayer a unique LoRA key/index, or keep MTP out of the LoRA path until that addressing exists.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/models/modeling_nemotron_h.py` around lines 926-940: the MTP sublayers are all constructed/queried with the same layer_idx (self.layer_idx), causing LoraLayer lookup collisions when mtp_hybrid_override_pattern repeats types. Update the loop in the forward path that iterates self.pattern_len so each MTP sublayer gets a unique LoRA key, e.g., compute a unique_layer_idx using the loop index (like unique_layer_idx = self.layer_idx * self.pattern_len + i, or (self.layer_idx, i) flattened) and pass that into the NemotronHMTPDecoderLayer call instead of self.layer_idx (or alternatively omit passing lora_params into MTP sublayers until unique addressing exists); change references to layer(..., lora_params=lora_params, ...) to use the computed unique_layer_idx so LoraLayer lookups are distinct.
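The suggested disambiguation can be sketched as follows; unique_layer_idx is a hypothetical helper, not an existing symbol in the codebase:

```python
# Derive a distinct LoRA lookup key per MTP sublayer from the block's
# layer_idx and the sublayer's position in the override pattern.
def unique_layer_idx(layer_idx: int, pattern_len: int, i: int) -> int:
    # Flatten (layer_idx, i) into one int so LoraLayer lookups stay keyed
    # by a plain index while remaining unique per sublayer.
    return layer_idx * pattern_len + i

keys = [unique_layer_idx(3, 4, i) for i in range(4)]
assert keys == [12, 13, 14, 15]
assert len(set(keys)) == len(keys)  # no adapter-slot aliasing
```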
🧹 Nitpick comments (1)
tensorrt_llm/_torch/models/modeling_nemotron_h.py (1)
45-45: Use a module import for the LoRA symbols. This new direct symbol import goes against the repo's Python import rule. Please import the module and reference layer.LoraLayer / layer.LoraModuleType instead.
As per coding guidelines, "When importing in Python, always maintain the namespace. Import the module, not individual classes or functions."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/models/modeling_nemotron_h.py` at line 45, Replace the direct symbol import from ..peft.lora.layer with a module import and update usages to keep the namespace; specifically, change the import so you import the module (e.g., import ..peft.lora.layer as layer or from ..peft.lora import layer) and then reference layer.LoraLayer and layer.LoraModuleType in modeling_nemotron_h.py wherever those symbols are used.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cpp/include/tensorrt_llm/runtime/modelConfig.h`:
- Around line 234-243: getNbLoraLayers currently floor-divides mNbLoraLayers by
pipelineParallelism and ignores pipelineParallelismRank, silently dropping tail
layers on uneven splits; change getNbLoraLayers to be rank-aware: when
mNbLoraLayers > 0 compute base = mNbLoraLayers / pipelineParallelism and rem =
mNbLoraLayers % pipelineParallelism and return base + (pipelineParallelismRank <
rem ? 1 : 0); ensure you still validate pipelineParallelism>0 with
TLLM_CHECK_WITH_INFO and use pipelineParallelismRank to distribute the remainder
(alternatively, if you prefer failing fast instead of distributing, add a
TLLM_CHECK_WITH_INFO that mNbLoraLayers % pipelineParallelism == 0 and error
out), referencing getNbLoraLayers, mNbLoraLayers and pipelineParallelismRank in
your change so loraCache/loraManager receive correct per-rank counts.
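The rank-aware split described above can be sketched in Python for clarity (the actual fix lives in the C++ getNbLoraLayers):

```python
# Rank-aware per-pipeline-rank LoRA layer count: distribute the remainder
# of an uneven split to the lowest ranks instead of silently dropping it.
def nb_lora_layers(total: int, pp_size: int, pp_rank: int) -> int:
    assert pp_size > 0, "pipelineParallelism must be positive"
    base, rem = divmod(total, pp_size)
    return base + (1 if pp_rank < rem else 0)

# 7 layers over 3 ranks -> 3 + 2 + 2; nothing is lost on uneven splits.
assert [nb_lora_layers(7, 3, r) for r in range(3)] == [3, 2, 2]
assert sum(nb_lora_layers(88, 8, r) for r in range(8)) == 88
```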
In `@cpp/tensorrt_llm/runtime/loraModule.cpp`:
- Around line 68-70: The Mamba projections are using the single mlpHidden value
for both in_proj and out_proj shapes which is incorrect; create and use two
separate dimensions (e.g., mambaInProjDim for d_in_proj and mambaOutProjDim for
d_inner) instead of mlpHidden and pass them to the ModuleType constructors: for
ModuleType::kMAMBA_IN_PROJ call modules.emplace_back(t, hidden, mambaInProjDim,
false, true, -1, 0) and for ModuleType::kMAMBA_OUT_PROJ call
modules.emplace_back(t, mambaOutProjDim, hidden, false, true, 1, -1); populate
mambaInProjDim and mambaOutProjDim from the Mamba layer descriptor/config (the
same source that builds d_in_proj and d_inner in mamba2_mixer.py) so packed LoRA
shapes match the runtime descriptors.
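A minimal sketch of the suggested separation, with illustrative dimension values (the real ones come from the Mamba layer descriptor that also builds d_in_proj and d_inner):

```python
# Two independent dimensions instead of one shared mlpHidden value.
def mamba_lora_shapes(hidden: int, d_in_proj: int, d_inner: int):
    return {
        "mamba_in_proj": (hidden, d_in_proj),  # in_proj: hidden -> d_in_proj
        "mamba_out_proj": (d_inner, hidden),   # out_proj: d_inner -> hidden
    }

shapes = mamba_lora_shapes(hidden=4096, d_in_proj=8448, d_inner=4096)
assert shapes["mamba_in_proj"] == (4096, 8448)
assert shapes["mamba_out_proj"] == (4096, 4096)
```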
In `@examples/llm-api/quickstart_advanced.py`:
- Around line 303-304: The code currently constructs LoraConfig with only
lora_dir (lora_config = LoraConfig(lora_dir=[args.lora_dir]) if args.lora_dir
else None) which relies on the default max_lora_rank=64 and will fail for
adapters whose r > 64; update the logic that builds LoraConfig to determine the
adapter rank automatically (by reading adapter_config.json inside args.lora_dir
and extracting the "r" value) or expose a new CLI flag (e.g., --max_lora_rank)
and pass it as max_lora_rank to LoraConfig; locate the LoraConfig instantiation
and replace it with code that reads adapter_config.json when args.lora_dir is
present, parses the rank, and sets LoraConfig(lora_dir=[args.lora_dir],
max_lora_rank=parsed_rank) (falling back to the flag or default if parsing
fails).
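A sketch of the rank auto-detection, assuming a standard PEFT adapter layout with an adapter_config.json containing an "r" field; resolve_max_lora_rank is a hypothetical helper name:

```python
# Read the adapter rank from adapter_config.json, falling back to the
# current default of 64 when the file is missing or malformed.
import json
import os

def resolve_max_lora_rank(lora_dir: str, default: int = 64) -> int:
    cfg_path = os.path.join(lora_dir, "adapter_config.json")
    try:
        with open(cfg_path) as f:
            return int(json.load(f).get("r", default))
    except (OSError, ValueError):
        # missing/unreadable file or bad JSON: use the default
        return default
```

The call site would then pass max_lora_rank=resolve_max_lora_rank(args.lora_dir) into LoraConfig.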
In `@tensorrt_llm/_torch/models/modeling_deepseekv3.py`:
- Line 855: Add a type-only import for LoraConfig so static checkers stop
flagging the annotation; import TYPE_CHECKING from typing (or typing_extensions)
at the top and inside an if TYPE_CHECKING: block add "from peft import
LoraConfig" (or the correct module that defines LoraConfig), leaving the
parameter annotation lora_config: Optional["LoraConfig"] unchanged at runtime.
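The type-only import pattern looks like this; the import path for LoraConfig is an assumption and should match wherever it is actually defined:

```python
# Type-only import: resolved by static checkers, skipped at runtime.
from typing import TYPE_CHECKING, Optional

if TYPE_CHECKING:
    # assumed location; adjust to the module that defines LoraConfig
    from tensorrt_llm.lora_helper import LoraConfig

def build_moe(lora_config: "Optional[LoraConfig]" = None) -> bool:
    # the string annotation needs no runtime import
    return lora_config is not None

assert build_moe() is False
```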
In `@tensorrt_llm/_torch/modules/mlp.py`:
- Around line 55-57: The LoRA width uses the static config TP size instead of
the effective/sharded TP size when TP is overridden; change the up_lora
construction so its width uses the effective TP size (i.e., the overridden TP if
present) rather than config.mapping.tp_size—compute an effective_tp (e.g.,
self.overridden_tp_size or self.mapping.tp_size/config.mapping.tp_size) and pass
[self.intermediate_size // effective_tp] to LoraLayer for self.up_lora so
x_up_lora matches up_proj’s sharding.
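The effective-TP fix can be sketched like so (lora_up_width is a hypothetical helper; in the real code the width is passed to LoraLayer):

```python
# The LoRA width must track the effective (possibly overridden) TP size
# used to shard up_proj, not the static config value.
from typing import Optional

def lora_up_width(intermediate_size: int, config_tp: int,
                  overridden_tp: Optional[int] = None) -> int:
    effective_tp = overridden_tp if overridden_tp is not None else config_tp
    return intermediate_size // effective_tp

assert lora_up_width(16384, config_tp=4) == 4096
assert lora_up_width(16384, config_tp=4, overridden_tp=2) == 8192  # override wins
```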
In `@tensorrt_llm/_torch/peft/lora/layer.py`:
- Around line 80-81: The is_moe helper currently omits the new MOE_LATENT_UP and
MOE_LATENT_DOWN constants, so add those two constants to the set/check within
is_moe (the function named is_moe and the enum/constants MOE_LATENT_UP and
MOE_LATENT_DOWN and existing MOE_... members) so that it returns True for those
projection types; update any conditional or membership test inside is_moe to
include MOE_LATENT_UP and MOE_LATENT_DOWN alongside the other MOE_* values to
ensure latent projections are treated as MoE.
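A sketch of the membership fix using an illustrative mirror of the module-type enum (the numeric values and member names of the non-latent entries are placeholders; only the shape of the fix matters):

```python
from enum import IntEnum

class LoraModuleType(IntEnum):
    ATTN_Q = 0
    MOE_H_TO_4H = 10
    MOE_4H_TO_H = 11
    MOE_GATE = 12
    MOE_LATENT_UP = 24
    MOE_LATENT_DOWN = 25

_MOE_TYPES = frozenset({
    LoraModuleType.MOE_H_TO_4H,
    LoraModuleType.MOE_4H_TO_H,
    LoraModuleType.MOE_GATE,
    LoraModuleType.MOE_LATENT_UP,    # newly added
    LoraModuleType.MOE_LATENT_DOWN,  # newly added
})

def is_moe(module_type: LoraModuleType) -> bool:
    return module_type in _MOE_TYPES

assert is_moe(LoraModuleType.MOE_LATENT_UP)
assert is_moe(LoraModuleType.MOE_LATENT_DOWN)
assert not is_moe(LoraModuleType.ATTN_Q)
```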
In `@tensorrt_llm/lora_helper.py`:
- Around line 56-57: The annotations are incorrect: update any type hints
declared as Dict[str, str] for trtllm->hf mappings to allow lists (e.g.,
Dict[str, Union[str, List[str]]]) so they match
get_default_trtllm_modules_to_hf_modules() which returns values that can be str
or List[str]; specifically change the return/type of use_lora(), the
LoraConfig.trtllm_modules_to_hf_modules attribute, and any related signatures or
type aliases referencing that mapping to use Union[str, List[str]] (or a shared
alias like DictStrOrList) so static checkers reflect the actual data shape.
In `@tests/unittest/_torch/modules/tests_lora_modules/test_nemotron_h_lora_sanity.py`:
- Around line 53-75: The test currently writes all lora_B.weight tensors as
zeros so LoRA deltas are always zero and the test cannot detect broken LoRA
application; update the test in
tests_lora_modules/test_nemotron_h_lora_sanity.py to (1) set at least some
lora_B.weight tensors to non-zero values (e.g., random small values) instead of
torch.zeros for keys built from proj_dims and attn_layers, and (2) expand the
targeted keys to include the new Mamba/shared-expert/latent LoRA paths (in
addition to "q_proj","k_proj","v_proj","o_proj") so the code paths that apply
those adapter types are exercised; adjust the saved adapter_model.safetensors
generation accordingly and keep the existing assertions.
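The non-zero lora_B point can be illustrated without torch; make_lora_b is a stand-in for whatever helper the test uses, not the test's real API:

```python
# Small non-zero lora_B values (instead of torch.zeros) so a broken LoRA
# application path produces a detectable delta.
import random

def make_lora_b(out_dim: int, rank: int, scale: float = 0.01):
    rng = random.Random(0)  # seeded for reproducibility
    return [[rng.uniform(-scale, scale) for _ in range(rank)]
            for _ in range(out_dim)]

lora_b = make_lora_b(out_dim=8, rank=4)
# An all-zeros lora_B makes every LoRA delta vacuously zero; this cannot.
assert any(abs(v) > 0 for row in lora_b for v in row)
```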
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 604792d5-9e00-495d-8bba-f9819862a862
📒 Files selected for processing (17)
cpp/include/tensorrt_llm/runtime/loraModule.h
cpp/include/tensorrt_llm/runtime/modelConfig.h
cpp/tensorrt_llm/nanobind/bindings.cpp
cpp/tensorrt_llm/runtime/loraCache.cpp
cpp/tensorrt_llm/runtime/loraManager.cpp
cpp/tensorrt_llm/runtime/loraModule.cpp
cpp/tensorrt_llm/runtime/loraUtils.cpp
examples/llm-api/quickstart_advanced.py
tensorrt_llm/_torch/model_config.py
tensorrt_llm/_torch/models/modeling_deepseekv3.py
tensorrt_llm/_torch/models/modeling_nemotron_h.py
tensorrt_llm/_torch/modules/mamba/mamba2_mixer.py
tensorrt_llm/_torch/modules/mlp.py
tensorrt_llm/_torch/peft/lora/layer.py
tensorrt_llm/lora_helper.py
tensorrt_llm/lora_manager.py
tests/unittest/_torch/modules/tests_lora_modules/test_nemotron_h_lora_sanity.py
Force-pushed from e734e9b to a2bbc7f
Force-pushed from db63c94 to adb407e
/bot run --disable-fail-fast
PR_Github #38840 [ run ] triggered by Bot. Commit:
PR_Github #38840 [ run ] completed with state
Force-pushed from adb407e to 3a699fe
/bot run --disable-fail-fast
PR_Github #39404 [ run ] triggered by Bot. Commit:
PR_Github #39404 [ run ] completed with state
Force-pushed from 1a38698 to 37f6ddf
/bot run --disable-fail-fast
PR_Github #39540 [ run ] triggered by Bot. Commit:
PR_Github #39540 [ run ] completed with state
/bot run --disable-fail-fast
PR_Github #39597 [ run ] triggered by Bot. Commit:
PR_Github #39597 [ run ] completed with state
MAMBA_OUT_PROJ = 23  # Mamba output projection
MOE_LATENT_UP = 24  # MoE latent up projection (fc1_latent_proj)
MOE_LATENT_DOWN = 25  # MoE latent down projection (fc2_latent_proj)
Are we sure that FC1 is up and FC2 is down? Maybe it's the opposite?
AFAIU -
The down projection maps from a higher-dimensional space to a lower-dimensional one.
The up projection maps from a lower-dimensional space back to a higher-dimensional one.
The latent MoE is in the "lower-dimensional space".
I renamed them to MOE_LATENT_FC1/FC2, which avoids confusion and matches the naming in the model weights.
// Mamba dimensions: in_proj outputs d_in_proj, out_proj inputs d_inner
// Fall back to mlpHidden if not specified (for backward compatibility)
auto const mambaInProj = mambaInProjSize > 0 ? mambaInProjSize * tpSize : mlpHidden;
auto const mambaInner = mambaInnerSize > 0 ? mambaInnerSize * tpSize : mlpHidden;
Is it possible for only mambaInProjSize or mambaInnerSize to be non-zero? I assume that either both are non-zero (Nemotron model) or both are zero (non-Nemotron model). I am not sure whether an error should be raised for invalid input (or, alternatively, mambaInProjSize and mambaInnerSize could be passed as a tuple/struct). I see that existing inputs like hiddenSize and tpSize are also not validated here (they can be zero, AFAIU), so validation is probably done at a higher level of the code.
I added a check and set mambaInProj = mambaInProjSize * tpSize directly; there is no need to use mlpHidden as a fallback value.
(7, 3),   # 3 + 2 + 2
(1, 2),   # Edge: 1 + 0
(32, 1),  # Single rank
The sizes/dims of Nemotron Nano/Super could be used as test cases (for the other tests too).
They are actually included; see the 88-layer case and the other values.
if moe_intermediate is not None and moe_intermediate > 0:
    moe_hidden_size = moe_intermediate // mapping.tp_size

# Mamba dimensions for hybrid models (e.g., Nemotron-H)
nit: The dimension-resolution block here is getting unwieldy; each new model type adds another ad-hoc getattr + TP-division block. Worth a follow-up to extract a LoraModuleDimensions dataclass or similar. Not blocking this PR since you're following the existing pattern, but could you add a TODO to refactor this more cleanly?
I added a TODO comment at the beginning of this code part.
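The suggested follow-up refactor might look like this; LoraModuleDimensions and the config attribute names are hypothetical, not the real model_config fields:

```python
# Collect per-model LoRA dimension resolution in one dataclass instead of
# scattering getattr + TP-division blocks per model type.
from dataclasses import dataclass
from typing import Optional

@dataclass
class LoraModuleDimensions:
    hidden_size: int
    mlp_hidden_size: int
    moe_hidden_size: Optional[int] = None
    mamba_in_proj_size: Optional[int] = None
    mamba_inner_size: Optional[int] = None

    @classmethod
    def from_config(cls, config, tp_size: int) -> "LoraModuleDimensions":
        # One place for every getattr + TP division.
        moe = getattr(config, "moe_intermediate_size", None)
        return cls(
            hidden_size=config.hidden_size,
            mlp_hidden_size=config.intermediate_size // tp_size,
            moe_hidden_size=(moe // tp_size) if moe else None,
            mamba_in_proj_size=getattr(config, "mamba_d_in_proj", None),
            mamba_inner_size=getattr(config, "mamba_d_inner", None),
        )
```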
Called from post_load_weights (weights don't exist during __init__).
"""
# Don't use fused NVFP4 output if LoRA is enabled on out_proj.
I don't know anything about Mamba, but I saw that Devin pointed towards a bug here: https://app.devin.ai/review/NVIDIA/TensorRT-LLM/pull/12154#bug-BUG_pr-review-job-d3cde4cb72cb4f4889031597155a32c4_0001
It says that out_proj's LoRA has been guarded against getting NVFP4 tensors from the fused norm, but in_proj's LoRA hasn't been appropriately guarded when the layernorm outputs NVFP4 tensors. It looks like the layernorm will currently never output an NVFP4 tensor because the hidden size is not supported, but tomorrow this could expose itself as a bug. Worth adding a TODO at least?
The bug is there. I added a strong guard: if LoRA is enabled, we fall back to the legacy rms_norm kernel, which produces bf16 input for the LoRA layers.
Force-pushed from 37f6ddf to eb20fca
Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>
Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>
Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>
Force-pushed from eb20fca to 6b7f23e
/bot run --disable-fail-fast
PR_Github #40254 [ run ] triggered by Bot. Commit:
PR_Github #40254 [ run ] completed with state
/bot run --disable-fail-fast
PR_Github #40473 [ run ] triggered by Bot. Commit:
PR_Github #40473 [ run ] completed with state
Features
Support for the following module LoRAs:
- Attention Q/K/V/O proj
- MoE shared_expert
- Mamba in/out proj
- MoE latent_proj
Not supported yet:
- MoE route_experts
- MoE gate_proj
Summary by CodeRabbit
Added a --lora_dir CLI flag to the LLM API for simplified LoRA configuration.
Description
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment /bot help.