[TRTLLM-11210][feat] Allow different TP config for draft/target models #11838
mikeiovine wants to merge 4 commits into NVIDIA:main
Conversation
89e4073 to fa60f99
Signed-off-by: Mike Iovine <miovine@nvidia.com>
fa60f99 to 1b9e4ba
Signed-off-by: Mike Iovine <miovine@nvidia.com>
/bot run
📝 Walkthrough
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 5
🧹 Nitpick comments (2)
tensorrt_llm/_torch/models/modeling_speculative.py (1)

10-10: Keep the `mapping` namespace on this import. Please import the module and call `mapping.create_draft_mapping(...)` instead of importing the helper directly. As per coding guidelines, "When importing in Python, always maintain the namespace. Import the module, not individual classes or functions (e.g., use `from package.subpackage import foo` then `foo.SomeClass()` instead of `from package.subpackage.foo import SomeClass`)."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `tensorrt_llm/_torch/models/modeling_speculative.py` at line 10: the file currently imports the helper directly (`from tensorrt_llm.mapping import create_draft_mapping`); change the import to keep the mapping namespace (e.g., `import tensorrt_llm.mapping as mapping` or `from tensorrt_llm import mapping`) and update all call sites to use `mapping.create_draft_mapping(...)`, ensuring `create_draft_mapping` is only accessed through the mapping module.

tensorrt_llm/_torch/pyexecutor/_util.py (1)
575-584: Extract this `ValueError` message into a constant/helper. Ruff is already flagging this block with TRY003, so this validation will keep the file lint-red as written.

♻️ Minimal cleanup

```diff
+DRAFT_TP_ATTENTION_DP_UNSUPPORTED_MSG = (
+    "draft_tp_size ({draft_tp}) < tp_size ({tp_size}) requires separate "
+    "draft/target KV caches, which are not supported with attention data "
+    "parallelism."
+)
+
 if (draft_tp is not None and draft_tp < self._mapping.tp_size):
     raise ValueError(
-        f"draft_tp_size ({draft_tp}) < tp_size "
-        f"({self._mapping.tp_size}) requires separate "
-        "draft/target KV caches, which are not supported "
-        "with attention data parallelism."
+        DRAFT_TP_ATTENTION_DP_UNSUPPORTED_MSG.format(
+            draft_tp=draft_tp, tp_size=self._mapping.tp_size)
     )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/pyexecutor/_util.py` around lines 575 - 584, Extract the long ValueError message into a reusable module-level constant or a small helper formatter and use that instead of an inline f-string; for example create DRAFT_TP_SIZE_ERROR = "draft_tp_size ({}) < tp_size ({}) requires separate draft/target KV caches, which are not supported with attention data parallelism." or a helper format_draft_tp_error(draft, tp) and then raise ValueError(DRAFT_TP_SIZE_ERROR.format(draft_tp, self._mapping.tp_size)) (or raise ValueError(format_draft_tp_error(draft_tp, self._mapping.tp_size))) where draft_tp is computed from self._speculative_config and compared to self._mapping.tp_size.
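The constant-extraction the prompt describes can be exercised standalone. A minimal sketch follows; the names `DRAFT_TP_SIZE_ERROR` and `check_draft_tp` are illustrative stand-ins, not the actual TRT-LLM symbols:

```python
# Illustrative sketch of the TRY003 fix: the long message lives in a
# module-level constant and the raise site only formats it.
DRAFT_TP_SIZE_ERROR = (
    "draft_tp_size ({draft_tp}) < tp_size ({tp_size}) requires separate "
    "draft/target KV caches, which are not supported with attention data "
    "parallelism."
)


def check_draft_tp(draft_tp, tp_size, attention_dp_enabled):
    """Hypothetical stand-in for the validation in _util.py."""
    if attention_dp_enabled and draft_tp is not None and draft_tp < tp_size:
        raise ValueError(
            DRAFT_TP_SIZE_ERROR.format(draft_tp=draft_tp, tp_size=tp_size))
```

Keeping the message in one constant also makes it trivial to assert on in unit tests.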
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 7b0d0c92-6b21-457c-89dc-954018e521e7
📒 Files selected for processing (10)
- tensorrt_llm/_torch/models/modeling_deepseekv3.py
- tensorrt_llm/_torch/models/modeling_speculative.py
- tensorrt_llm/_torch/modules/attention.py
- tensorrt_llm/_torch/pyexecutor/_util.py
- tensorrt_llm/_torch/pyexecutor/model_loader.py
- tensorrt_llm/_torch/pyexecutor/py_executor_creator.py
- tensorrt_llm/llmapi/llm_args.py
- tensorrt_llm/mapping.py
- tests/integration/defs/accuracy/test_llm_api_pytorch.py
- tests/integration/test_lists/test-db/l0_dgx_b200.yml
```diff
 draft_model_config = None
 if draft_mapping is not model_config.mapping:
     draft_model_config = replace(model_config,
                                  mapping=draft_mapping)
     if self.draft_config is None:
         self.draft_config = draft_model_config

 self.draft_model = get_draft_model(model_config,
-                                   self.draft_config,
-                                   self.lm_head, self.model)
+                                   self.lm_head, self.model,
+                                   draft_model_config)
```
The synthetic MTP draft config still carries the target layer count.

`replace(model_config, mapping=draft_mapping)` only swaps the mapping. The cloned config still has `pretrained_config.num_hidden_layers == target_num_layers`, but `tensorrt_llm/_torch/pyexecutor/_util.py` now sizes the separate draft KV cache from `effective_draft_config`. In MTP one-model mode, that makes KV estimation count all target layers instead of just the draft layers, which can drastically shrink `max_tokens` or raise false OOMs for the new `draft_tp_size` path. Please either synthesize a draft config with the draft layer count, or keep the layer count explicit in the KV-size path.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tensorrt_llm/_torch/models/modeling_speculative.py` around lines 1074 - 1084,
The synthetic draft config created with replace(model_config,
mapping=draft_mapping) still carries the original
pretrained_config.num_hidden_layers (target layer count), causing KV cache
sizing to use the wrong layer count; update the draft config creation in
modeling_speculative.py (the draft_model_config variable passed into
get_draft_model) to also set the draft layer count (e.g., clone/replace
pretrained_config.num_hidden_layers or set an explicit attribute like
draft_model_config.pretrained_config.num_hidden_layers = draft_layer_count) so
effective_draft_config used by tensorrt_llm/_torch/pyexecutor/_util.py reflects
the actual draft layers, or alternatively propagate the draft layer count
explicitly into the KV-sizing path (ensure functions using
effective_draft_config or draft_tp_size read the draft layer count instead of
the original num_hidden_layers).
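One way to implement the suggested fix, sketched with toy stand-in dataclasses (`make_draft_config` and both config classes here are simplifications for illustration; the real `ModelConfig` carries many more fields):

```python
from copy import deepcopy
from dataclasses import dataclass, replace


@dataclass
class PretrainedConfig:  # toy stand-in for the HF-style config
    num_hidden_layers: int


@dataclass
class ModelConfig:  # toy stand-in for TRT-LLM's ModelConfig
    pretrained_config: PretrainedConfig
    mapping: object


def make_draft_config(model_config, draft_mapping, draft_layer_count):
    # Deep-copy the nested pretrained_config so mutating the draft's
    # layer count cannot leak back into the target config.
    draft_pretrained = deepcopy(model_config.pretrained_config)
    draft_pretrained.num_hidden_layers = draft_layer_count
    return replace(model_config,
                   mapping=draft_mapping,
                   pretrained_config=draft_pretrained)
```

With this shape, any KV-cache sizing that reads `pretrained_config.num_hidden_layers` from the draft config sees the draft depth rather than the target's.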
```python
if dp_size == 1 and cp_size == 1:
    mapping = self.mapping
else:
    mapping = Mapping(
        world_size=dp_size * tp_size * pp_size * cp_size,
        tp_size=tp_size,
        pp_size=pp_size * dp_size,
        cp_size=cp_size,
        cp_config=self.mapping.cp_config,
        rank=self.mapping.rank,
        gpus_per_node=self.mapping.gpus_per_node,
        enable_attention_dp=self.mapping.enable_attention_dp,
    )
```
Preserve the draft mapping in the CP/attention-DP branches.

The fast path now keeps `self.mapping` intact, but the fallback still rebuilds a plain `Mapping`. That drops the draft-specific topology this PR is introducing, so any draft run that also enables Helix CP or attention-DP can fall back to raw-rank sharding here and feed the wrong local slices into `qkv_proj` / `o_proj`.

Also applies to: 501-512, 1108-1120, 1216-1227
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tensorrt_llm/_torch/modules/attention.py` around lines 441 - 453, The
fallback branch that constructs a plain Mapping loses draft-specific topology
and drops the draft mapping used by Helix CP/attention-DP; instead of always
instantiating Mapping(...) when cp_size or dp_size != 1, preserve and reuse the
draft-aware self.mapping (or copy/clone it while adjusting only the necessary
fields) so the draft topology remains intact for qkv_proj/o_proj sharding;
update the logic in the branches around the Mapping construction (the block that
currently assigns mapping = Mapping(...), referenced by self.mapping, Mapping,
cp_size, dp_size, tp_size, pp_size, and mapping.cp_config) to maintain the draft
mapping information when building per-branch mappings (apply same fix to the
other occurrences noted at the other ranges).
```python
def __init__(self, target_mapping: "Mapping", draft_tp_size: int):
    if draft_tp_size == target_mapping.tp_size:
        raise ValueError(
            "draft_tp_size equals target tp_size; use the target mapping directly."
        )
    if target_mapping.tp_size % draft_tp_size != 0:
        raise ValueError(
            f"target tp_size ({target_mapping.tp_size}) must be divisible by "
            f"draft_tp_size ({draft_tp_size}).")
```
Validate `draft_tp_size` before the modulo.

Line 728 throws `ZeroDivisionError` for `draft_tp_size=0`, and other non-positive values fail later with much less useful errors. Since `DraftMapping` / `create_draft_mapping()` are now public entry points, reject invalid sizes up front.

Suggested guard

```diff
 def __init__(self, target_mapping: "Mapping", draft_tp_size: int):
+    if not isinstance(draft_tp_size, int) or draft_tp_size < 1:
+        raise ValueError(
+            f"draft_tp_size must be a positive integer, got {draft_tp_size!r}."
+        )
     if draft_tp_size == target_mapping.tp_size:
         raise ValueError(
             "draft_tp_size equals target tp_size; use the target mapping directly."
         )
```

🧰 Tools
🪛 Ruff (0.15.4)
[warning] 725-727: Avoid specifying long messages outside the exception class (TRY003)
[warning] 729-731: Avoid specifying long messages outside the exception class (TRY003)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tensorrt_llm/mapping.py` around lines 723 - 731, Add an explicit validation
at the start of DraftMapping.__init__ (and by extension create_draft_mapping
entry paths) to reject non-positive draft_tp_size values: if draft_tp_size <= 0
raise ValueError("draft_tp_size must be a positive integer") before any
comparisons or the modulo check; then keep the existing equality and
divisibility checks (the equality check with target_mapping.tp_size and the
target_mapping.tp_size % draft_tp_size divisibility check) so ZeroDivisionError
cannot occur and invalid sizes are rejected with a clear ValueError.
```python
self.tp_size = draft_tp_size
self.pp_size = target_mapping.pp_size
self.pp_rank = target_mapping.pp_rank
self.cp_size = 1
self.cp_rank = 0
self.gpus_per_node = target_mapping.gpus_per_node

self.world_size = draft_tp_size * self.pp_size
self.enable_attention_dp = target_mapping.enable_attention_dp
self.enable_lm_head_tp_in_adp = target_mapping.enable_lm_head_tp_in_adp
# rank setter (inherited from MappingBase) validates against
# world_size and enable_attention_dp, so those must be set first.
self.rank = (self.pp_rank * draft_tp_size +
             target_mapping.tp_rank % draft_tp_size)

self.moe_tp_size = min(draft_tp_size, target_mapping.moe_tp_size)
self.moe_ep_size = 1
self.moe_cluster_size = 1
self.moe_tp_cluster_ep_size = self.moe_tp_size

self.attn_tp_size = draft_tp_size
self.attn_cp_size = 1
```
Fail fast for CP / attention-DP targets.

These lines force `cp_size`/`cp_rank`/`attn_cp_size` to 1 while preserving the target's PP and ADP-related state. That silently changes the topology instead of rejecting an unsupported combination.

Suggested guard

```diff
 def __init__(self, target_mapping: "Mapping", draft_tp_size: int):
+    if target_mapping.cp_size != 1 or target_mapping.enable_attention_dp:
+        raise NotImplementedError(
+            "draft_tp_size is not supported with context parallelism or attention DP yet."
+        )
     if draft_tp_size == target_mapping.tp_size:
         raise ValueError(
             "draft_tp_size equals target tp_size; use the target mapping directly."
         )
```
Verify each finding against the current code and only fix it if needed.
In `@tensorrt_llm/mapping.py` around lines 736 - 757, The code currently forces
self.cp_size/self.cp_rank and self.attn_cp_size to 1 which can silently change a
target mapping that expects CP or attention-CP layout; add a fail-fast guard
before those assignments that checks target_mapping.cp_size and
target_mapping.attn_cp_size (e.g. if target_mapping.cp_size != 1 or
target_mapping.attn_cp_size != 1) and raise a clear ValueError mentioning the
unsupported combination (including draft_tp_size, target_mapping.cp_size and
target_mapping.attn_cp_size) so the caller knows the topology mismatch; keep the
existing rank/world_size/enable_attention_dp logic but do not override
CP-related fields when the target requires CP or attention-DP.
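The `tp_rank`/`tp_group` arithmetic in `DraftMapping` (modulo for the local rank, integer division to pick the slice of the target TP group) can be exercised standalone. A dependency-free sketch; `draft_tp_slice` is an illustrative helper, not part of the PR:

```python
def draft_tp_slice(target_tp_group, target_tp_rank, draft_tp_size):
    """Mirror DraftMapping's rank math: which draft TP group does this
    target rank fall into, and what is its rank within that group?"""
    draft_tp_rank = target_tp_rank % draft_tp_size
    group_index = target_tp_rank // draft_tp_size
    start = group_index * draft_tp_size
    return draft_tp_rank, target_tp_group[start:start + draft_tp_size]


# A TP-8 target split into TP-4 draft groups: target rank 5 becomes
# local rank 1 inside the second slice [4, 5, 6, 7].
print(draft_tp_slice(list(range(8)), 5, 4))
```

Every target rank lands in exactly one contiguous slice, which is why the divisibility check above is a hard requirement.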
```python
self.moe_tp_size = min(draft_tp_size, target_mapping.moe_tp_size)
self.moe_ep_size = 1
self.moe_cluster_size = 1
self.moe_tp_cluster_ep_size = self.moe_tp_size

self.attn_tp_size = draft_tp_size
self.attn_cp_size = 1

target_tp_rank = target_mapping.tp_rank
target_tp_group = list(target_mapping.tp_group)
target_tp_size = target_mapping.tp_size

self._tp_rank = target_tp_rank % draft_tp_size

group_index = target_tp_rank // draft_tp_size
start = group_index * draft_tp_size
self._tp_group = target_tp_group[start:start + draft_tp_size]

self._tp_group_pg = None
self._moe_tp_group = self._tp_group[:self.moe_tp_size]

if mpi_disabled() and draft_tp_size > 1 and torch.distributed.is_initialized():
    all_tp_groups = _get_all_tp_groups(target_mapping)
    num_sub = target_tp_size // draft_tp_size
    for tp_grp in all_tp_groups:
        for i in range(num_sub):
            sub_ranks = tp_grp[i * draft_tp_size:(i + 1) * draft_tp_size]
            pg = torch.distributed.new_group(sub_ranks)
            if target_mapping.rank in sub_ranks:
                self._tp_group_pg = pg
elif mpi_disabled() and draft_tp_size == 1:
    from tensorrt_llm._torch.device_mesh import SingleProcessGroup
    self._tp_group_pg = SingleProcessGroup.get_group()

self.pp_groups = target_mapping.pp_groups
self.pp_partition = getattr(target_mapping, 'pp_partition', None)
self.cp_config = {}
self.tp_groups = []
self.cp_groups = []
self.moe_cluster_groups = []
self.moe_tp_groups = []
self.moe_ep_groups = []

def __eq__(self, other):
    if not isinstance(other, DraftMapping):
        return NotImplemented
    return (self.tp_size == other.tp_size and self.rank == other.rank
            and self._tp_group == other._tp_group)

def __hash__(self):
    return hash((self.tp_size, self.rank, tuple(self._tp_group)))

@property
def tp_rank(self):
    return self._tp_rank

@property
def tp_group(self):
    return self._tp_group

@property
def tp_group_pg(self):
    if self._tp_group_pg is None:
        from tensorrt_llm._torch.device_mesh import SingleProcessGroup
        return SingleProcessGroup.get_group()
    return self._tp_group_pg

@property
def moe_tp_rank(self):
    return self._tp_rank % self.moe_tp_size

@property
def moe_tp_group(self):
    return self._moe_tp_group

@property
def moe_tp_group_pg(self):
    return self.tp_group_pg
```
Partition MoE TP groups from the caller's slice, not always the prefix.

When `draft_tp_size > target_mapping.moe_tp_size`, `_moe_tp_group` is always built from the first `moe_tp_size` ranks of the draft TP group, and `moe_tp_group_pg` still returns the full draft TP PG. Ranks in later slices then advertise a MoE group/PG that does not contain them, which is enough to misroute or hang MoE collectives.
One way to derive the MoE subgroup from the current draft slice

```diff
 self.moe_tp_size = min(draft_tp_size, target_mapping.moe_tp_size)
+if draft_tp_size % self.moe_tp_size != 0:
+    raise ValueError(
+        f"draft_tp_size ({draft_tp_size}) must be divisible by "
+        f"moe_tp_size ({self.moe_tp_size})."
+    )
 self.moe_ep_size = 1
 self.moe_cluster_size = 1
 self.moe_tp_cluster_ep_size = self.moe_tp_size
@@
-self._tp_group_pg = None
-self._moe_tp_group = self._tp_group[:self.moe_tp_size]
+self._tp_group_pg = None
+moe_group_start = (self._tp_rank // self.moe_tp_size) * self.moe_tp_size
+self._moe_tp_group = self._tp_group[
+    moe_group_start:moe_group_start + self.moe_tp_size
+]
+self._moe_tp_group_pg = None
@@
 pg = torch.distributed.new_group(sub_ranks)
 if target_mapping.rank in sub_ranks:
     self._tp_group_pg = pg
+for j in range(0, draft_tp_size, self.moe_tp_size):
+    moe_sub_ranks = sub_ranks[j:j + self.moe_tp_size]
+    moe_pg = torch.distributed.new_group(moe_sub_ranks)
+    if target_mapping.rank in moe_sub_ranks:
+        self._moe_tp_group_pg = moe_pg
@@
 def moe_tp_group_pg(self):
-    return self.tp_group_pg
+    return self._moe_tp_group_pg or self.tp_group_pg
```

```yaml
- accuracy/test_llm_api_pytorch.py::TestDeepSeekR1::test_nvfp4_multi_gpus[latency_adp_lmtp_tp4]
- accuracy/test_llm_api_pytorch.py::TestMiniMaxM2::test_4gpus[attention_dp=False-cuda_graph=True-overlap_scheduler=True-tp_size=4-ep_size=4] TIMEOUT (60)
# ------------- draft_tp_size tests ---------------
- accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_mtp_draft_tp_size_4gpus
```
Please add these tests into the QA test list too.
Closing as we cannot demonstrate a perf improvement on any hardware/model/dataset.
Summary by CodeRabbit

Release Notes

New Features
- New `draft_tp_size` configuration parameter for speculative decoding to independently configure draft model tensor parallelism.

Bug Fixes
- Enforced `draft_tp_size` consistency with target tensor parallelism constraints.

Tests
Description
Allow using a smaller TP size for draft models. Previously draft models had to use the same parallelism configuration as the target model.
This only works with the separated draft/target KV cache feature.
Note that this is not faster for eagle3/MTP in most use cases. The motivation for this PR is to put the basic building blocks in place for algorithms that genuinely need different draft/target TP configurations (e.g. PEARL/PEARL-like).
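As a rough illustration only: since this PR was closed unmerged, the configuration surface is not final, but a mixed draft/target TP setup might be expressed along these lines (field names and placement are assumptions based on this PR's description, not a released API):

```yaml
# Hypothetical sketch: TP-4 target model with a TP-2 MTP draft.
tensor_parallel_size: 4
speculative_config:
  decoding_type: MTP
  num_nextn_predict_layers: 3
  draft_tp_size: 2   # assumed field; must evenly divide the target tp_size
```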
Test Coverage
New tests.
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help

To see a list of available CI bot commands, please comment `/bot help`.