[TRTLLM-11210][feat] Allow different TP config for draft/target models#11838

Closed
mikeiovine wants to merge 4 commits into NVIDIA:main from mikeiovine:draft-tp-config

Conversation


@mikeiovine mikeiovine commented Mar 2, 2026

Summary by CodeRabbit

Release Notes

  • New Features

    • Added draft_tp_size configuration parameter for speculative decoding to independently configure draft model tensor parallelism.
    • Implemented draft mapping infrastructure across model components to support differentiated tensor parallelism between draft and target models.
  • Bug Fixes

    • Added validation to enforce draft_tp_size consistency with target tensor parallelism constraints.
  • Tests

    • Added test coverage for draft tensor parallelism configurations.

Description

Allow using a smaller TP size for draft models. Previously draft models had to use the same parallelism configuration as the target model.

This only works with the separated draft/target KV cache feature.

Note that this is not faster on eagle3/MTP for most use cases. The motivation for this PR is to put the basic building blocks in place for algorithms that genuinely need different draft/target TP configurations (e.g. PEARL or PEARL-like approaches).

Test Coverage

New tests.

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

@mikeiovine mikeiovine force-pushed the draft-tp-config branch 3 times, most recently from 89e4073 to fa60f99 on March 5, 2026 18:50
Signed-off-by: Mike Iovine <miovine@nvidia.com>
@mikeiovine mikeiovine marked this pull request as ready for review March 6, 2026 20:31
@mikeiovine mikeiovine requested review from a team as code owners March 6, 2026 20:31
@mikeiovine (Collaborator, Author):

/bot run


coderabbitai bot commented Mar 6, 2026

📝 Walkthrough


The PR introduces draft_tp_size support for speculative decoding, allowing draft models to use smaller tensor-parallel sizes than target models. It adds a DraftMapping class for managing draft model sub-topology, propagates draft-specific mappings throughout initialization paths, and enforces validation constraints on the new configuration option.

Changes

  • Configuration & Mapping Infrastructure — tensorrt_llm/llmapi/llm_args.py, tensorrt_llm/mapping.py
    Adds a draft_tp_size field to DecodingBaseConfig with divisibility validation. Introduces a DraftMapping class to represent the draft sub-topology with a reduced TP size, a helper function _get_all_tp_groups, and a factory function create_draft_mapping that constructs the appropriate mapping based on the draft/target TP size ratio.
  • Model Architecture Updates — tensorrt_llm/_torch/models/modeling_deepseekv3.py, tensorrt_llm/_torch/models/modeling_speculative.py
    Adds draft mapping creation and propagation in speculative decoding. Introduces separate target TP variables and conditionally overrides TP rank/size using draft_mapping for MTP layers. Extends get_draft_model to accept an optional draft_model_config parameter and wires draft_mapping through the Mistral, Llama3, and external-drafter initialization paths.
  • Attention & Context Parallel — tensorrt_llm/_torch/modules/attention.py
    Updates mapping construction for CP (context parallel) and DP (data parallel) scenarios. Conditionally reuses the existing mapping or constructs a new one with adjusted world_size and TP/PP dimensions for both the Attention and MLA output projections.
  • Runtime Integration — tensorrt_llm/_torch/pyexecutor/_util.py, tensorrt_llm/_torch/pyexecutor/model_loader.py, tensorrt_llm/_torch/pyexecutor/py_executor_creator.py
    Uses draft_mapping for KV cache size calculation and weight loading. Enforces that draft_tp_size must be ≥ the target tp_size when attention DP is enabled, and validates draft_tp_size constraints in disaggregated serving mode.
  • Testing — tests/integration/defs/accuracy/test_llm_api_pytorch.py, tests/integration/test_lists/test-db/l0_dgx_b200.yml
    Adds DraftTargetDecodingConfig to the public exports and three new test methods (test_eagle3_draft_tp_size, test_draft_target_draft_tp_size, test_mtp_draft_tp_size_4gpus) validating draft_tp_size across speculative decoding configurations.
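Conceptually, create_draft_mapping carves each target TP group into contiguous sub-groups of draft_tp_size ranks. A minimal pure-Python sketch of that partitioning (the function name and signature here are illustrative, not the PR's actual API):

```python
def draft_tp_slice(target_tp_group, tp_rank, draft_tp_size):
    """Return (draft_tp_rank, draft_tp_group) for one rank of the target TP group.

    Each target TP group of size N is split into N // draft_tp_size
    contiguous sub-groups; a rank keeps its position modulo draft_tp_size.
    """
    if draft_tp_size <= 0 or len(target_tp_group) % draft_tp_size != 0:
        raise ValueError("draft_tp_size must be positive and divide the target tp_size")
    draft_rank = tp_rank % draft_tp_size
    start = (tp_rank // draft_tp_size) * draft_tp_size
    return draft_rank, target_tp_group[start:start + draft_tp_size]

# With a target TP group of 4 ranks and draft_tp_size=2, ranks 0-1 form one
# draft TP group and ranks 2-3 the other:
print(draft_tp_slice([0, 1, 2, 3], tp_rank=3, draft_tp_size=2))  # (1, [2, 3])
```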

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested reviewers

  • Superjomn
  • syuoni
  • yuxianq
🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Docstring Coverage — ⚠️ Warning: docstring coverage is 20.45%, below the required threshold of 80.00%. Resolution: write docstrings for the functions that are missing them.
✅ Passed checks (2 passed)
Title check — ✅ Passed: the title clearly and concisely summarizes the main change, allowing different TP (tensor-parallel) configurations for draft versus target models.
Description check — ✅ Passed: the description explains what the change does (allow a smaller TP size for draft models), acknowledges the limitation (only works with separated draft/target KV caches), explains the motivation (building blocks for algorithms like PEARL), mentions test coverage, and includes a completed checklist.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 5

🧹 Nitpick comments (2)
tensorrt_llm/_torch/models/modeling_speculative.py (1)

10-10: Keep the mapping namespace on this import.

Please import the module and call mapping.create_draft_mapping(...) instead of importing the helper directly.

As per coding guidelines, "When importing in Python, always maintain the namespace. Import the module, not individual classes or functions (e.g., use from package.subpackage import foo then foo.SomeClass() instead of from package.subpackage.foo import SomeClass)."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/models/modeling_speculative.py` at line 10, The file
currently imports the helper directly (from tensorrt_llm.mapping import
create_draft_mapping); change the import to keep the mapping namespace (e.g.,
import tensorrt_llm.mapping as mapping or from tensorrt_llm import mapping) and
update all call sites to use mapping.create_draft_mapping(...) instead of
create_draft_mapping(...), ensuring the referenced symbol create_draft_mapping
is only accessed through the mapping module.
tensorrt_llm/_torch/pyexecutor/_util.py (1)

575-584: Extract this ValueError message into a constant/helper.

Ruff is already flagging this block with TRY003, so this validation will keep the file lint-red as written.

♻️ Minimal cleanup
+DRAFT_TP_ATTENTION_DP_UNSUPPORTED_MSG = (
+    "draft_tp_size ({draft_tp}) < tp_size ({tp_size}) requires separate "
+    "draft/target KV caches, which are not supported with attention data "
+    "parallelism."
+)
+
         if (draft_tp is not None
                 and draft_tp < self._mapping.tp_size):
             raise ValueError(
-                f"draft_tp_size ({draft_tp}) < tp_size "
-                f"({self._mapping.tp_size}) requires separate "
-                "draft/target KV caches, which are not supported "
-                "with attention data parallelism."
+                DRAFT_TP_ATTENTION_DP_UNSUPPORTED_MSG.format(
+                    draft_tp=draft_tp, tp_size=self._mapping.tp_size)
             )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/pyexecutor/_util.py` around lines 575 - 584, Extract the
long ValueError message into a reusable module-level constant or a small helper
formatter and use that instead of an inline f-string; for example create
DRAFT_TP_SIZE_ERROR = "draft_tp_size ({}) < tp_size ({}) requires separate
draft/target KV caches, which are not supported with attention data
parallelism." or a helper format_draft_tp_error(draft, tp) and then raise
ValueError(DRAFT_TP_SIZE_ERROR.format(draft_tp, self._mapping.tp_size)) (or
raise ValueError(format_draft_tp_error(draft_tp, self._mapping.tp_size))) where
draft_tp is computed from self._speculative_config and compared to
self._mapping.tp_size.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 7b0d0c92-6b21-457c-89dc-954018e521e7

📥 Commits

Reviewing files that changed from the base of the PR and between 5918348 and 588035c.

📒 Files selected for processing (10)
  • tensorrt_llm/_torch/models/modeling_deepseekv3.py
  • tensorrt_llm/_torch/models/modeling_speculative.py
  • tensorrt_llm/_torch/modules/attention.py
  • tensorrt_llm/_torch/pyexecutor/_util.py
  • tensorrt_llm/_torch/pyexecutor/model_loader.py
  • tensorrt_llm/_torch/pyexecutor/py_executor_creator.py
  • tensorrt_llm/llmapi/llm_args.py
  • tensorrt_llm/mapping.py
  • tests/integration/defs/accuracy/test_llm_api_pytorch.py
  • tests/integration/test_lists/test-db/l0_dgx_b200.yml

Comment on lines +1074 to +1084
+        draft_model_config = None
+        if draft_mapping is not model_config.mapping:
+            draft_model_config = replace(model_config,
+                                         mapping=draft_mapping)
+            if self.draft_config is None:
+                self.draft_config = draft_model_config
+
         self.draft_model = get_draft_model(model_config,
                                            self.draft_config,
-                                           self.lm_head, self.model)
+                                           self.lm_head, self.model,
+                                           draft_model_config)

⚠️ Potential issue | 🟠 Major

The synthetic MTP draft config still carries the target layer count.

replace(model_config, mapping=draft_mapping) only swaps the mapping. The cloned config still has pretrained_config.num_hidden_layers == target_num_layers, but tensorrt_llm/_torch/pyexecutor/_util.py now sizes the separate draft KV cache from effective_draft_config. In MTP one-model mode that makes KV estimation count all target layers instead of just the draft layers, which can drastically shrink max_tokens or raise false OOMs for the new draft_tp_size path. Please either synthesize a draft config with the draft layer count, or keep the layer count explicit in the KV-size path.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/models/modeling_speculative.py` around lines 1074 - 1084,
The synthetic draft config created with replace(model_config,
mapping=draft_mapping) still carries the original
pretrained_config.num_hidden_layers (target layer count), causing KV cache
sizing to use the wrong layer count; update the draft config creation in
modeling_speculative.py (the draft_model_config variable passed into
get_draft_model) to also set the draft layer count (e.g., clone/replace
pretrained_config.num_hidden_layers or set an explicit attribute like
draft_model_config.pretrained_config.num_hidden_layers = draft_layer_count) so
effective_draft_config used by tensorrt_llm/_torch/pyexecutor/_util.py reflects
the actual draft layers, or alternatively propagate the draft layer count
explicitly into the KV-sizing path (ensure functions using
effective_draft_config or draft_tp_size read the draft layer count instead of
the original num_hidden_layers).
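A rough sense of why the stale layer count matters for KV sizing (the layer and head counts below are made up for illustration, not taken from any real model config):

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    # K and V each hold num_kv_heads * head_dim elements per layer per token.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

# Sizing a one-layer MTP draft cache with a 30-layer target config
# overestimates the per-token footprint by 30x:
target_layers, draft_layers = 30, 1
overestimate = (kv_cache_bytes_per_token(target_layers, 8, 128) //
                kv_cache_bytes_per_token(draft_layers, 8, 128))
print(overestimate)  # 30
```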

Comment on lines +441 to +453
        if dp_size == 1 and cp_size == 1:
            mapping = self.mapping
        else:
            mapping = Mapping(
                world_size=dp_size * tp_size * pp_size * cp_size,
                tp_size=tp_size,
                pp_size=pp_size * dp_size,
                cp_size=cp_size,
                cp_config=self.mapping.cp_config,
                rank=self.mapping.rank,
                gpus_per_node=self.mapping.gpus_per_node,
                enable_attention_dp=self.mapping.enable_attention_dp,
            )

⚠️ Potential issue | 🟠 Major

Preserve the draft mapping in the CP/attention-DP branches.

The fast path now keeps self.mapping intact, but the fallback still rebuilds a plain Mapping. That drops the draft-specific topology this PR is introducing, so any draft run that also enables Helix CP or attention-DP can fall back to raw-rank sharding here and feed the wrong local slices into qkv_proj / o_proj.

Also applies to: 501-512, 1108-1120, 1216-1227

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/modules/attention.py` around lines 441 - 453, The
fallback branch that constructs a plain Mapping loses draft-specific topology
and drops the draft mapping used by Helix CP/attention-DP; instead of always
instantiating Mapping(...) when cp_size or dp_size != 1, preserve and reuse the
draft-aware self.mapping (or copy/clone it while adjusting only the necessary
fields) so the draft topology remains intact for qkv_proj/o_proj sharding;
update the logic in the branches around the Mapping construction (the block that
currently assigns mapping = Mapping(...), referenced by self.mapping, Mapping,
cp_size, dp_size, tp_size, pp_size, and mapping.cp_config) to maintain the draft
mapping information when building per-branch mappings (apply same fix to the
other occurrences noted at the other ranges).

Comment on lines +723 to +731
    def __init__(self, target_mapping: "Mapping", draft_tp_size: int):
        if draft_tp_size == target_mapping.tp_size:
            raise ValueError(
                "draft_tp_size equals target tp_size; use the target mapping directly."
            )
        if target_mapping.tp_size % draft_tp_size != 0:
            raise ValueError(
                f"target tp_size ({target_mapping.tp_size}) must be divisible by "
                f"draft_tp_size ({draft_tp_size}).")

⚠️ Potential issue | 🟡 Minor

Validate draft_tp_size before the modulo.

Line 728 throws ZeroDivisionError for draft_tp_size=0, and other non-positive values fail later with much less useful errors. Since DraftMapping / create_draft_mapping() are public entry points now, reject invalid sizes up front.

Suggested guard
     def __init__(self, target_mapping: "Mapping", draft_tp_size: int):
+        if not isinstance(draft_tp_size, int) or draft_tp_size < 1:
+            raise ValueError(
+                f"draft_tp_size must be a positive integer, got {draft_tp_size!r}."
+            )
         if draft_tp_size == target_mapping.tp_size:
             raise ValueError(
                 "draft_tp_size equals target tp_size; use the target mapping directly."
             )
🧰 Tools
🪛 Ruff (0.15.4)

[warning] 725-727: Avoid specifying long messages outside the exception class

(TRY003)


[warning] 729-731: Avoid specifying long messages outside the exception class

(TRY003)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/mapping.py` around lines 723 - 731, Add an explicit validation
at the start of DraftMapping.__init__ (and by extension create_draft_mapping
entry paths) to reject non-positive draft_tp_size values: if draft_tp_size <= 0
raise ValueError("draft_tp_size must be a positive integer") before any
comparisons or the modulo check; then keep the existing equality and
divisibility checks (the equality check with target_mapping.tp_size and the
target_mapping.tp_size % draft_tp_size divisibility check) so ZeroDivisionError
cannot occur and invalid sizes are rejected with a clear ValueError.

Comment on lines +736 to +757
        self.tp_size = draft_tp_size
        self.pp_size = target_mapping.pp_size
        self.pp_rank = target_mapping.pp_rank
        self.cp_size = 1
        self.cp_rank = 0
        self.gpus_per_node = target_mapping.gpus_per_node

        self.world_size = draft_tp_size * self.pp_size
        self.enable_attention_dp = target_mapping.enable_attention_dp
        self.enable_lm_head_tp_in_adp = target_mapping.enable_lm_head_tp_in_adp
        # rank setter (inherited from MappingBase) validates against
        # world_size and enable_attention_dp, so those must be set first.
        self.rank = (self.pp_rank * draft_tp_size +
                     target_mapping.tp_rank % draft_tp_size)

        self.moe_tp_size = min(draft_tp_size, target_mapping.moe_tp_size)
        self.moe_ep_size = 1
        self.moe_cluster_size = 1
        self.moe_tp_cluster_ep_size = self.moe_tp_size

        self.attn_tp_size = draft_tp_size
        self.attn_cp_size = 1

⚠️ Potential issue | 🟠 Major

Fail fast for CP / attention-DP targets.

These lines force cp_size/cp_rank/attn_cp_size to 1 while preserving the target's PP and ADP-related state. That silently changes the topology instead of rejecting an unsupported combination.

Suggested guard
     def __init__(self, target_mapping: "Mapping", draft_tp_size: int):
+        if target_mapping.cp_size != 1 or target_mapping.enable_attention_dp:
+            raise NotImplementedError(
+                "draft_tp_size is not supported with context parallelism or attention DP yet."
+            )
         if draft_tp_size == target_mapping.tp_size:
             raise ValueError(
                 "draft_tp_size equals target tp_size; use the target mapping directly."
             )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/mapping.py` around lines 736 - 757, The code currently forces
self.cp_size/self.cp_rank and self.attn_cp_size to 1 which can silently change a
target mapping that expects CP or attention-CP layout; add a fail-fast guard
before those assignments that checks target_mapping.cp_size and
target_mapping.attn_cp_size (e.g. if target_mapping.cp_size != 1 or
target_mapping.attn_cp_size != 1) and raise a clear ValueError mentioning the
unsupported combination (including draft_tp_size, target_mapping.cp_size and
target_mapping.attn_cp_size) so the caller knows the topology mismatch; keep the
existing rank/world_size/enable_attention_dp logic but do not override
CP-related fields when the target requires CP or attention-DP.

Comment on lines +751 to +829
        self.moe_tp_size = min(draft_tp_size, target_mapping.moe_tp_size)
        self.moe_ep_size = 1
        self.moe_cluster_size = 1
        self.moe_tp_cluster_ep_size = self.moe_tp_size

        self.attn_tp_size = draft_tp_size
        self.attn_cp_size = 1

        target_tp_rank = target_mapping.tp_rank
        target_tp_group = list(target_mapping.tp_group)
        target_tp_size = target_mapping.tp_size

        self._tp_rank = target_tp_rank % draft_tp_size

        group_index = target_tp_rank // draft_tp_size
        start = group_index * draft_tp_size
        self._tp_group = target_tp_group[start:start + draft_tp_size]

        self._tp_group_pg = None
        self._moe_tp_group = self._tp_group[:self.moe_tp_size]

        if mpi_disabled() and draft_tp_size > 1 and torch.distributed.is_initialized():
            all_tp_groups = _get_all_tp_groups(target_mapping)
            num_sub = target_tp_size // draft_tp_size
            for tp_grp in all_tp_groups:
                for i in range(num_sub):
                    sub_ranks = tp_grp[i * draft_tp_size:(i + 1) * draft_tp_size]
                    pg = torch.distributed.new_group(sub_ranks)
                    if target_mapping.rank in sub_ranks:
                        self._tp_group_pg = pg
        elif mpi_disabled() and draft_tp_size == 1:
            from tensorrt_llm._torch.device_mesh import SingleProcessGroup
            self._tp_group_pg = SingleProcessGroup.get_group()

        self.pp_groups = target_mapping.pp_groups
        self.pp_partition = getattr(target_mapping, 'pp_partition', None)
        self.cp_config = {}
        self.tp_groups = []
        self.cp_groups = []
        self.moe_cluster_groups = []
        self.moe_tp_groups = []
        self.moe_ep_groups = []

    def __eq__(self, other):
        if not isinstance(other, DraftMapping):
            return NotImplemented
        return (self.tp_size == other.tp_size and self.rank == other.rank
                and self._tp_group == other._tp_group)

    def __hash__(self):
        return hash((self.tp_size, self.rank, tuple(self._tp_group)))

    @property
    def tp_rank(self):
        return self._tp_rank

    @property
    def tp_group(self):
        return self._tp_group

    @property
    def tp_group_pg(self):
        if self._tp_group_pg is None:
            from tensorrt_llm._torch.device_mesh import SingleProcessGroup
            return SingleProcessGroup.get_group()
        return self._tp_group_pg

    @property
    def moe_tp_rank(self):
        return self._tp_rank % self.moe_tp_size

    @property
    def moe_tp_group(self):
        return self._moe_tp_group

    @property
    def moe_tp_group_pg(self):
        return self.tp_group_pg

⚠️ Potential issue | 🔴 Critical

Partition MoE TP groups from the caller's slice, not always the prefix.

When draft_tp_size > target_mapping.moe_tp_size, _moe_tp_group is always built from the first moe_tp_size ranks of the draft TP group, and moe_tp_group_pg still returns the full draft TP PG. Ranks in later slices then advertise a MoE group/PG that does not contain them, which is enough to misroute or hang MoE collectives.

One way to derive the MoE subgroup from the current draft slice
         self.moe_tp_size = min(draft_tp_size, target_mapping.moe_tp_size)
+        if draft_tp_size % self.moe_tp_size != 0:
+            raise ValueError(
+                f"draft_tp_size ({draft_tp_size}) must be divisible by "
+                f"moe_tp_size ({self.moe_tp_size})."
+            )
         self.moe_ep_size = 1
         self.moe_cluster_size = 1
         self.moe_tp_cluster_ep_size = self.moe_tp_size
@@
-        self._tp_group_pg = None
-        self._moe_tp_group = self._tp_group[:self.moe_tp_size]
+        self._tp_group_pg = None
+        moe_group_start = (self._tp_rank // self.moe_tp_size) * self.moe_tp_size
+        self._moe_tp_group = self._tp_group[
+            moe_group_start:moe_group_start + self.moe_tp_size
+        ]
+        self._moe_tp_group_pg = None
@@
                     pg = torch.distributed.new_group(sub_ranks)
                     if target_mapping.rank in sub_ranks:
                         self._tp_group_pg = pg
+                    for j in range(0, draft_tp_size, self.moe_tp_size):
+                        moe_sub_ranks = sub_ranks[j:j + self.moe_tp_size]
+                        moe_pg = torch.distributed.new_group(moe_sub_ranks)
+                        if target_mapping.rank in moe_sub_ranks:
+                            self._moe_tp_group_pg = moe_pg
@@
     def moe_tp_group_pg(self):
-        return self.tp_group_pg
+        return self._moe_tp_group_pg or self.tp_group_pg
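The slice arithmetic behind that fix can be checked in isolation; the helper below is a hypothetical standalone sketch (ours, not code from the PR):

```python
def moe_tp_subgroup(draft_tp_group, draft_tp_rank, moe_tp_size):
    # Derive the MoE TP subgroup that actually contains the caller,
    # rather than always taking the prefix of the draft TP group.
    if moe_tp_size <= 0 or len(draft_tp_group) % moe_tp_size != 0:
        raise ValueError("moe_tp_size must divide the draft TP group size")
    start = (draft_tp_rank // moe_tp_size) * moe_tp_size
    return draft_tp_group[start:start + moe_tp_size]

# Draft TP group [4, 5, 6, 7] with moe_tp_size=2: local rank 3 (global rank 7)
# belongs to subgroup [6, 7], not the prefix [4, 5].
print(moe_tp_subgroup([4, 5, 6, 7], draft_tp_rank=3, moe_tp_size=2))  # [6, 7]
```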

- accuracy/test_llm_api_pytorch.py::TestDeepSeekR1::test_nvfp4_multi_gpus[latency_adp_lmtp_tp4]
- accuracy/test_llm_api_pytorch.py::TestMiniMaxM2::test_4gpus[attention_dp=False-cuda_graph=True-overlap_scheduler=True-tp_size=4-ep_size=4] TIMEOUT (60)
# ------------- draft_tp_size tests ---------------
- accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_mtp_draft_tp_size_4gpus
A collaborator commented:

please add these tests into QA test list too.

@mikeiovine (Collaborator, Author):

Closing, as I cannot demonstrate a perf improvement on any hardware/model/dataset.

@mikeiovine mikeiovine closed this Mar 17, 2026