[TRTLLM-11210][feat] Allow different TP config for draft/target models #11838
mikeiovine wants to merge 4 commits into NVIDIA:main
Conversation
89e4073 to fa60f99
Signed-off-by: Mike Iovine <miovine@nvidia.com>
fa60f99 to 1b9e4ba
Signed-off-by: Mike Iovine <miovine@nvidia.com>
/bot run
📝 Walkthrough
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 5
🧹 Nitpick comments (2)
tensorrt_llm/_torch/models/modeling_speculative.py (1)

10-10: Keep the `mapping` namespace on this import. Please import the module and call `mapping.create_draft_mapping(...)` instead of importing the helper directly. As per coding guidelines, "When importing in Python, always maintain the namespace. Import the module, not individual classes or functions (e.g., use `from package.subpackage import foo` then `foo.SomeClass()` instead of `from package.subpackage.foo import SomeClass`)."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `tensorrt_llm/_torch/models/modeling_speculative.py` at line 10: the file currently imports the helper directly (`from tensorrt_llm.mapping import create_draft_mapping`); change the import to keep the mapping namespace (e.g., `import tensorrt_llm.mapping as mapping` or `from tensorrt_llm import mapping`) and update all call sites to use `mapping.create_draft_mapping(...)`, ensuring `create_draft_mapping` is only accessed through the mapping module.

tensorrt_llm/_torch/pyexecutor/_util.py (1)
575-584: Extract this `ValueError` message into a constant/helper. Ruff is already flagging this block with TRY003, so this validation will keep the file lint-red as written.

♻️ Minimal cleanup

```diff
+DRAFT_TP_ATTENTION_DP_UNSUPPORTED_MSG = (
+    "draft_tp_size ({draft_tp}) < tp_size ({tp_size}) requires separate "
+    "draft/target KV caches, which are not supported with attention data "
+    "parallelism."
+)
+
 if (draft_tp is not None and draft_tp < self._mapping.tp_size):
     raise ValueError(
-        f"draft_tp_size ({draft_tp}) < tp_size "
-        f"({self._mapping.tp_size}) requires separate "
-        "draft/target KV caches, which are not supported "
-        "with attention data parallelism."
+        DRAFT_TP_ATTENTION_DP_UNSUPPORTED_MSG.format(
+            draft_tp=draft_tp, tp_size=self._mapping.tp_size)
     )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/pyexecutor/_util.py` around lines 575 - 584, Extract the long ValueError message into a reusable module-level constant or a small helper formatter and use that instead of an inline f-string; for example create DRAFT_TP_SIZE_ERROR = "draft_tp_size ({}) < tp_size ({}) requires separate draft/target KV caches, which are not supported with attention data parallelism." or a helper format_draft_tp_error(draft, tp) and then raise ValueError(DRAFT_TP_SIZE_ERROR.format(draft_tp, self._mapping.tp_size)) (or raise ValueError(format_draft_tp_error(draft_tp, self._mapping.tp_size))) where draft_tp is computed from self._speculative_config and compared to self._mapping.tp_size.
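The constant-extraction the prompt describes can be exercised standalone. A minimal sketch follows; the names `DRAFT_TP_SIZE_ERROR` and `check_draft_tp` are illustrative stand-ins, not the actual TRT-LLM symbols:

```python
# Illustrative sketch of the TRY003 fix: the long message lives in a
# module-level constant and the raise site only formats it.
DRAFT_TP_SIZE_ERROR = (
    "draft_tp_size ({draft_tp}) < tp_size ({tp_size}) requires separate "
    "draft/target KV caches, which are not supported with attention data "
    "parallelism."
)


def check_draft_tp(draft_tp, tp_size, attention_dp_enabled):
    """Hypothetical stand-in for the validation in _util.py."""
    if attention_dp_enabled and draft_tp is not None and draft_tp < tp_size:
        raise ValueError(
            DRAFT_TP_SIZE_ERROR.format(draft_tp=draft_tp, tp_size=tp_size))
```

Keeping the message in one constant also makes it trivial to assert on in unit tests.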
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 7b0d0c92-6b21-457c-89dc-954018e521e7
📒 Files selected for processing (10)
- tensorrt_llm/_torch/models/modeling_deepseekv3.py
- tensorrt_llm/_torch/models/modeling_speculative.py
- tensorrt_llm/_torch/modules/attention.py
- tensorrt_llm/_torch/pyexecutor/_util.py
- tensorrt_llm/_torch/pyexecutor/model_loader.py
- tensorrt_llm/_torch/pyexecutor/py_executor_creator.py
- tensorrt_llm/llmapi/llm_args.py
- tensorrt_llm/mapping.py
- tests/integration/defs/accuracy/test_llm_api_pytorch.py
- tests/integration/test_lists/test-db/l0_dgx_b200.yml
```diff
 draft_model_config = None
 if draft_mapping is not model_config.mapping:
     draft_model_config = replace(model_config,
                                  mapping=draft_mapping)
     if self.draft_config is None:
         self.draft_config = draft_model_config

 self.draft_model = get_draft_model(model_config,
-                                   self.draft_config,
-                                   self.lm_head, self.model)
+                                   self.lm_head, self.model,
+                                   draft_model_config)
```
The synthetic MTP draft config still carries the target layer count.

`replace(model_config, mapping=draft_mapping)` only swaps the mapping. The cloned config still has `pretrained_config.num_hidden_layers == target_num_layers`, but `tensorrt_llm/_torch/pyexecutor/_util.py` now sizes the separate draft KV cache from `effective_draft_config`. In MTP one-model mode, that makes KV estimation count all target layers instead of just the draft layers, which can drastically shrink `max_tokens` or raise false OOMs for the new `draft_tp_size` path. Please either synthesize a draft config with the draft layer count, or keep the layer count explicit in the KV-size path.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tensorrt_llm/_torch/models/modeling_speculative.py` around lines 1074 - 1084,
The synthetic draft config created with replace(model_config,
mapping=draft_mapping) still carries the original
pretrained_config.num_hidden_layers (target layer count), causing KV cache
sizing to use the wrong layer count; update the draft config creation in
modeling_speculative.py (the draft_model_config variable passed into
get_draft_model) to also set the draft layer count (e.g., clone/replace
pretrained_config.num_hidden_layers or set an explicit attribute like
draft_model_config.pretrained_config.num_hidden_layers = draft_layer_count) so
effective_draft_config used by tensorrt_llm/_torch/pyexecutor/_util.py reflects
the actual draft layers, or alternatively propagate the draft layer count
explicitly into the KV-sizing path (ensure functions using
effective_draft_config or draft_tp_size read the draft layer count instead of
the original num_hidden_layers).
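One way to implement the suggested fix, sketched with toy stand-in dataclasses (`make_draft_config` and both config classes here are simplifications for illustration; the real `ModelConfig` carries many more fields):

```python
from copy import deepcopy
from dataclasses import dataclass, replace


@dataclass
class PretrainedConfig:  # toy stand-in for the HF-style config
    num_hidden_layers: int


@dataclass
class ModelConfig:  # toy stand-in for TRT-LLM's ModelConfig
    pretrained_config: PretrainedConfig
    mapping: object


def make_draft_config(model_config, draft_mapping, draft_layer_count):
    # Deep-copy the nested pretrained_config so mutating the draft's
    # layer count cannot leak back into the target config.
    draft_pretrained = deepcopy(model_config.pretrained_config)
    draft_pretrained.num_hidden_layers = draft_layer_count
    return replace(model_config,
                   mapping=draft_mapping,
                   pretrained_config=draft_pretrained)
```

With this shape, any KV-cache sizing that reads `pretrained_config.num_hidden_layers` from the draft config sees the draft depth rather than the target's.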
```python
if dp_size == 1 and cp_size == 1:
    mapping = self.mapping
else:
    mapping = Mapping(
        world_size=dp_size * tp_size * pp_size * cp_size,
        tp_size=tp_size,
        pp_size=pp_size * dp_size,
        cp_size=cp_size,
        cp_config=self.mapping.cp_config,
        rank=self.mapping.rank,
        gpus_per_node=self.mapping.gpus_per_node,
        enable_attention_dp=self.mapping.enable_attention_dp,
    )
```
Preserve the draft mapping in the CP/attention-DP branches.

The fast path now keeps `self.mapping` intact, but the fallback still rebuilds a plain `Mapping`. That drops the draft-specific topology this PR is introducing, so any draft run that also enables Helix CP or attention-DP can fall back to raw-rank sharding here and feed the wrong local slices into `qkv_proj` / `o_proj`.

Also applies to: 501-512, 1108-1120, 1216-1227
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tensorrt_llm/_torch/modules/attention.py` around lines 441 - 453, The
fallback branch that constructs a plain Mapping loses draft-specific topology
and drops the draft mapping used by Helix CP/attention-DP; instead of always
instantiating Mapping(...) when cp_size or dp_size != 1, preserve and reuse the
draft-aware self.mapping (or copy/clone it while adjusting only the necessary
fields) so the draft topology remains intact for qkv_proj/o_proj sharding;
update the logic in the branches around the Mapping construction (the block that
currently assigns mapping = Mapping(...), referenced by self.mapping, Mapping,
cp_size, dp_size, tp_size, pp_size, and mapping.cp_config) to maintain the draft
mapping information when building per-branch mappings (apply same fix to the
other occurrences noted at the other ranges).
```python
def __init__(self, target_mapping: "Mapping", draft_tp_size: int):
    if draft_tp_size == target_mapping.tp_size:
        raise ValueError(
            "draft_tp_size equals target tp_size; use the target mapping directly."
        )
    if target_mapping.tp_size % draft_tp_size != 0:
        raise ValueError(
            f"target tp_size ({target_mapping.tp_size}) must be divisible by "
            f"draft_tp_size ({draft_tp_size}).")
```
Validate `draft_tp_size` before the modulo.

Line 728 throws `ZeroDivisionError` for `draft_tp_size=0`, and other non-positive values fail later with much less useful errors. Since `DraftMapping` / `create_draft_mapping()` are now public entry points, reject invalid sizes up front.

Suggested guard

```diff
 def __init__(self, target_mapping: "Mapping", draft_tp_size: int):
+    if not isinstance(draft_tp_size, int) or draft_tp_size < 1:
+        raise ValueError(
+            f"draft_tp_size must be a positive integer, got {draft_tp_size!r}."
+        )
     if draft_tp_size == target_mapping.tp_size:
         raise ValueError(
             "draft_tp_size equals target tp_size; use the target mapping directly."
         )
```

🧰 Tools
🪛 Ruff (0.15.4)
[warning] 725-727: Avoid specifying long messages outside the exception class (TRY003)
[warning] 729-731: Avoid specifying long messages outside the exception class (TRY003)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tensorrt_llm/mapping.py` around lines 723 - 731, Add an explicit validation
at the start of DraftMapping.__init__ (and by extension create_draft_mapping
entry paths) to reject non-positive draft_tp_size values: if draft_tp_size <= 0
raise ValueError("draft_tp_size must be a positive integer") before any
comparisons or the modulo check; then keep the existing equality and
divisibility checks (the equality check with target_mapping.tp_size and the
target_mapping.tp_size % draft_tp_size divisibility check) so ZeroDivisionError
cannot occur and invalid sizes are rejected with a clear ValueError.
```python
self.tp_size = draft_tp_size
self.pp_size = target_mapping.pp_size
self.pp_rank = target_mapping.pp_rank
self.cp_size = 1
self.cp_rank = 0
self.gpus_per_node = target_mapping.gpus_per_node

self.world_size = draft_tp_size * self.pp_size
self.enable_attention_dp = target_mapping.enable_attention_dp
self.enable_lm_head_tp_in_adp = target_mapping.enable_lm_head_tp_in_adp
# rank setter (inherited from MappingBase) validates against
# world_size and enable_attention_dp, so those must be set first.
self.rank = (self.pp_rank * draft_tp_size +
             target_mapping.tp_rank % draft_tp_size)

self.moe_tp_size = min(draft_tp_size, target_mapping.moe_tp_size)
self.moe_ep_size = 1
self.moe_cluster_size = 1
self.moe_tp_cluster_ep_size = self.moe_tp_size

self.attn_tp_size = draft_tp_size
self.attn_cp_size = 1
```
Fail fast for CP / attention-DP targets.

These lines force `cp_size`/`cp_rank`/`attn_cp_size` to 1 while preserving the target's PP and ADP-related state. That silently changes the topology instead of rejecting an unsupported combination.

Suggested guard

```diff
 def __init__(self, target_mapping: "Mapping", draft_tp_size: int):
+    if target_mapping.cp_size != 1 or target_mapping.enable_attention_dp:
+        raise NotImplementedError(
+            "draft_tp_size is not supported with context parallelism or attention DP yet."
+        )
     if draft_tp_size == target_mapping.tp_size:
         raise ValueError(
             "draft_tp_size equals target tp_size; use the target mapping directly."
         )
```
Verify each finding against the current code and only fix it if needed.
In `@tensorrt_llm/mapping.py` around lines 736 - 757, The code currently forces
self.cp_size/self.cp_rank and self.attn_cp_size to 1 which can silently change a
target mapping that expects CP or attention-CP layout; add a fail-fast guard
before those assignments that checks target_mapping.cp_size and
target_mapping.attn_cp_size (e.g. if target_mapping.cp_size != 1 or
target_mapping.attn_cp_size != 1) and raise a clear ValueError mentioning the
unsupported combination (including draft_tp_size, target_mapping.cp_size and
target_mapping.attn_cp_size) so the caller knows the topology mismatch; keep the
existing rank/world_size/enable_attention_dp logic but do not override
CP-related fields when the target requires CP or attention-DP.
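The `tp_rank`/`tp_group` arithmetic in `DraftMapping` (modulo for the local rank, integer division to pick the slice of the target TP group) can be exercised standalone. A dependency-free sketch; `draft_tp_slice` is an illustrative helper, not part of the PR:

```python
def draft_tp_slice(target_tp_group, target_tp_rank, draft_tp_size):
    """Mirror DraftMapping's rank math: which draft TP group does this
    target rank fall into, and what is its rank within that group?"""
    draft_tp_rank = target_tp_rank % draft_tp_size
    group_index = target_tp_rank // draft_tp_size
    start = group_index * draft_tp_size
    return draft_tp_rank, target_tp_group[start:start + draft_tp_size]


# A TP-8 target split into TP-4 draft groups: target rank 5 becomes
# local rank 1 inside the second slice [4, 5, 6, 7].
print(draft_tp_slice(list(range(8)), 5, 4))
```

Every target rank lands in exactly one contiguous slice, which is why the divisibility check above is a hard requirement.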
```python
self.moe_tp_size = min(draft_tp_size, target_mapping.moe_tp_size)
self.moe_ep_size = 1
self.moe_cluster_size = 1
self.moe_tp_cluster_ep_size = self.moe_tp_size

self.attn_tp_size = draft_tp_size
self.attn_cp_size = 1

target_tp_rank = target_mapping.tp_rank
target_tp_group = list(target_mapping.tp_group)
target_tp_size = target_mapping.tp_size

self._tp_rank = target_tp_rank % draft_tp_size

group_index = target_tp_rank // draft_tp_size
start = group_index * draft_tp_size
self._tp_group = target_tp_group[start:start + draft_tp_size]

self._tp_group_pg = None
self._moe_tp_group = self._tp_group[:self.moe_tp_size]

if mpi_disabled() and draft_tp_size > 1 and torch.distributed.is_initialized():
    all_tp_groups = _get_all_tp_groups(target_mapping)
    num_sub = target_tp_size // draft_tp_size
    for tp_grp in all_tp_groups:
        for i in range(num_sub):
            sub_ranks = tp_grp[i * draft_tp_size:(i + 1) * draft_tp_size]
            pg = torch.distributed.new_group(sub_ranks)
            if target_mapping.rank in sub_ranks:
                self._tp_group_pg = pg
elif mpi_disabled() and draft_tp_size == 1:
    from tensorrt_llm._torch.device_mesh import SingleProcessGroup
    self._tp_group_pg = SingleProcessGroup.get_group()

self.pp_groups = target_mapping.pp_groups
self.pp_partition = getattr(target_mapping, 'pp_partition', None)
self.cp_config = {}
self.tp_groups = []
self.cp_groups = []
self.moe_cluster_groups = []
self.moe_tp_groups = []
self.moe_ep_groups = []

def __eq__(self, other):
    if not isinstance(other, DraftMapping):
        return NotImplemented
    return (self.tp_size == other.tp_size and self.rank == other.rank
            and self._tp_group == other._tp_group)

def __hash__(self):
    return hash((self.tp_size, self.rank, tuple(self._tp_group)))

@property
def tp_rank(self):
    return self._tp_rank

@property
def tp_group(self):
    return self._tp_group

@property
def tp_group_pg(self):
    if self._tp_group_pg is None:
        from tensorrt_llm._torch.device_mesh import SingleProcessGroup
        return SingleProcessGroup.get_group()
    return self._tp_group_pg

@property
def moe_tp_rank(self):
    return self._tp_rank % self.moe_tp_size

@property
def moe_tp_group(self):
    return self._moe_tp_group

@property
def moe_tp_group_pg(self):
    return self.tp_group_pg
```
Partition MoE TP groups from the caller's slice, not always the prefix.

When `draft_tp_size > target_mapping.moe_tp_size`, `_moe_tp_group` is always built from the first `moe_tp_size` ranks of the draft TP group, and `moe_tp_group_pg` still returns the full draft TP PG. Ranks in later slices then advertise a MoE group/PG that does not contain them, which is enough to misroute or hang MoE collectives.
One way to derive the MoE subgroup from the current draft slice

```diff
 self.moe_tp_size = min(draft_tp_size, target_mapping.moe_tp_size)
+if draft_tp_size % self.moe_tp_size != 0:
+    raise ValueError(
+        f"draft_tp_size ({draft_tp_size}) must be divisible by "
+        f"moe_tp_size ({self.moe_tp_size})."
+    )
 self.moe_ep_size = 1
 self.moe_cluster_size = 1
 self.moe_tp_cluster_ep_size = self.moe_tp_size
@@
-self._tp_group_pg = None
-self._moe_tp_group = self._tp_group[:self.moe_tp_size]
+self._tp_group_pg = None
+moe_group_start = (self._tp_rank // self.moe_tp_size) * self.moe_tp_size
+self._moe_tp_group = self._tp_group[
+    moe_group_start:moe_group_start + self.moe_tp_size
+]
+self._moe_tp_group_pg = None
@@
 pg = torch.distributed.new_group(sub_ranks)
 if target_mapping.rank in sub_ranks:
     self._tp_group_pg = pg
+for j in range(0, draft_tp_size, self.moe_tp_size):
+    moe_sub_ranks = sub_ranks[j:j + self.moe_tp_size]
+    moe_pg = torch.distributed.new_group(moe_sub_ranks)
+    if target_mapping.rank in moe_sub_ranks:
+        self._moe_tp_group_pg = moe_pg
@@
 def moe_tp_group_pg(self):
-    return self.tp_group_pg
+    return self._moe_tp_group_pg or self.tp_group_pg
```

```yaml
- accuracy/test_llm_api_pytorch.py::TestDeepSeekR1::test_nvfp4_multi_gpus[latency_adp_lmtp_tp4]
- accuracy/test_llm_api_pytorch.py::TestMiniMaxM2::test_4gpus[attention_dp=False-cuda_graph=True-overlap_scheduler=True-tp_size=4-ep_size=4] TIMEOUT (60)
# ------------- draft_tp_size tests ---------------
- accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_mtp_draft_tp_size_4gpus
```
Please add these tests into the QA test list too.
Closing as we cannot demonstrate a perf improvement on any hardware/model/dataset.
Summary by CodeRabbit

Release Notes

New Features
- New `draft_tp_size` configuration parameter for speculative decoding to independently configure draft model tensor parallelism.

Bug Fixes
- Enforced `draft_tp_size` consistency with target tensor parallelism constraints.

Tests
Description
Allow using a smaller TP size for draft models. Previously draft models had to use the same parallelism configuration as the target model.
This only works with the separated draft/target KV cache feature.
Note that this is not faster for eagle3/MTP in most use cases. The motivation for this PR is to put the basic building blocks in place for algorithms that genuinely need different draft/target TP configurations (e.g. PEARL/PEARL-like).
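As a rough illustration only: since this PR was closed unmerged, the configuration surface is not final, but a mixed draft/target TP setup might be expressed along these lines (field names and placement are assumptions based on this PR's description, not a released API):

```yaml
# Hypothetical sketch: TP-4 target model with a TP-2 MTP draft.
tensor_parallel_size: 4
speculative_config:
  decoding_type: MTP
  num_nextn_predict_layers: 3
  draft_tp_size: 2   # assumed field; must evenly divide the target tp_size
```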
Test Coverage
New tests.
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help

To see a list of available CI bot commands, please comment `/bot help`.