[TRTLLM-10319][feat] Expand dynamic speculation to MTP and PARD.#12262
zheyuf wants to merge 3 commits into NVIDIA:main
Conversation
Signed-off-by: Zheyu Fu <zheyuf@NVIDIA.com>
/bot run --disable-fail-fast
📝 Walkthrough

This PR introduces dynamic, per-iteration draft-length handling for speculative decoding by propagating a new

Changes
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks: ✅ Passed checks (3 passed)
Actionable comments posted: 5
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tensorrt_llm/_torch/pyexecutor/model_engine.py (1)
2366-2412: ⚠️ Potential issue | 🔴 Critical — The first tree-decoding step still assumes the configured max width.
This branch still requires `num_draft_tokens == spec_tree_manager.max_total_draft_tokens` and appends the full `spec_dec_position_offsets[0]`. Dynamic PARD batches below the configured maximum will fail here on the first generation step or warmup; if assertions are stripped, `position_ids` becomes longer than `1 + num_draft_tokens`.

Possible fix
```diff
 if not self.is_draft_model and not spec_config.is_linear_tree:
     assert spec_tree_manager is not None
-    assert num_draft_tokens == spec_tree_manager.max_total_draft_tokens
+    assert num_draft_tokens <= spec_tree_manager.max_total_draft_tokens
     position_ids.extend(
         past_seen_token_num +
-        spec_tree_manager.spec_dec_position_offsets[
-            0]  # [max_total_draft_tokens + 1]
+        spec_tree_manager.spec_dec_position_offsets[0][
+            :1 + num_draft_tokens]
     )
```

The same runtime slice should be applied anywhere else that consumes `spec_dec_position_offsets[0]`.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/pyexecutor/model_engine.py` around lines 2366 - 2412, The code in the branch that handles tree decoding (inside model_engine where spec_tree_manager is used) assumes num_draft_tokens equals spec_tree_manager.max_total_draft_tokens and appends the entire spec_dec_position_offsets[0], which breaks dynamic PARD smaller-than-configured batches; remove the strict equality/assertion and instead extend position_ids with only the runtime slice of spec_tree_manager.spec_dec_position_offsets that corresponds to the actual tokens (e.g. use spec_tree_manager.spec_dec_position_offsets[0:1 + num_draft_tokens] or otherwise index up to 1 + num_draft_tokens) so position_ids length matches 1 + num_draft_tokens; apply the same runtime slicing anywhere else spec_dec_position_offsets[0] is consumed.
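To make the proposed slice concrete, here is a minimal self-contained sketch with hypothetical values (`max_total_draft_tokens`, the offset contents, and `past_seen_token_num` are all invented for illustration; a real tree's offsets need not be a simple range):

```python
# Hypothetical illustration of the suggested fix: slice the precomputed
# position offsets to the runtime draft length instead of appending them all.
max_total_draft_tokens = 7
# spec_dec_position_offsets[0] holds max_total_draft_tokens + 1 offsets;
# a linear tree would look like [0, 1, 2, ..., 7].
spec_dec_position_offsets = [list(range(max_total_draft_tokens + 1))]

past_seen_token_num = 100
num_draft_tokens = 3  # dynamic PARD batch below the configured maximum

position_ids = [past_seen_token_num + off
                for off in spec_dec_position_offsets[0][:1 + num_draft_tokens]]
assert len(position_ids) == 1 + num_draft_tokens
```

With the slice, `position_ids` always has exactly `1 + num_draft_tokens` entries regardless of the configured maximum.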
🧹 Nitpick comments (1)
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)
1652: Consider moving the constant definition outside the loop.

`DRAFT_BUFFER_PAD` is redefined on each iteration of the for loop. While the performance impact is negligible, moving it before the loop (around line 1651) would be slightly cleaner.

♻️ Suggested refactor
```diff
 runtime_draft_len = get_draft_len_for_batch_size(
     self.model_engine.spec_config.draft_len_schedule,
     scheduled_batch.batch_size, self.model_engine.max_draft_len)

 # 2. Pad or truncate draft tokens to the resolved length
+DRAFT_BUFFER_PAD = 0  # Buffer sentinel, not PARD mask_token_id.
 for request in scheduled_batch.generation_requests:
-    DRAFT_BUFFER_PAD = 0  # Buffer sentinel, not PARD mask_token_id.
     current_num_draft_tokens = len(request.py_draft_tokens)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/pyexecutor/py_executor.py` at line 1652, DRAFT_BUFFER_PAD is being set inside the loop each iteration; pull the constant definition out of the loop by declaring DRAFT_BUFFER_PAD = 0 just once immediately before the enclosing for loop (so the loop body uses the already-defined symbol), ensuring any references inside the loop continue to use the same constant and no other logic changes are needed.
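The pad-or-truncate step that this loop performs can be sketched as follows (a minimal illustration under assumed names; `DRAFT_BUFFER_PAD` and `runtime_draft_len` come from the snippet above, the draft-token lists are invented):

```python
# Sketch of the pad-or-truncate step, with the constant hoisted out of the loop.
DRAFT_BUFFER_PAD = 0  # buffer sentinel, not the PARD mask_token_id
runtime_draft_len = 4

drafts = [[1, 2], [5, 6, 7, 8, 9]]  # hypothetical per-request draft tokens
resized = [(d + [DRAFT_BUFFER_PAD] * runtime_draft_len)[:runtime_draft_len]
           for d in drafts]
# Short drafts are padded with the sentinel, long drafts are truncated.
```

Every request ends up with exactly `runtime_draft_len` draft slots, which is what lets the attention buffers use a single per-iteration width.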
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tensorrt_llm/_torch/attention_backend/trtllm.py`:
- Around line 1602-1609: The code dereferences
spec_metadata.runtime_tokens_per_gen_step without guarding spec_metadata; update
the branch around runtime_draft_token_buffer_width calculation to first check
spec_metadata is not None (or explicitly enforce the precondition) and either
use a defined fixed-width fallback when spec_metadata is None or raise a clear
ValueError. Specifically, protect access to
spec_metadata.runtime_tokens_per_gen_step before computing
runtime_draft_token_buffer_width, then call
generate_spec_decoding_generation_length(runtime_draft_len=...), and compute
spec_decoding_position_offsets and spec_decoding_packed_mask only after
determining runtime_draft_token_buffer_width; reference spec_metadata,
runtime_tokens_per_gen_step, runtime_draft_token_buffer_width,
generate_spec_decoding_generation_length,
generate_spec_decoding_position_offsets, generate_spec_decoding_packed_mask, and
max_num_requests when making the guard or fallback change.
In `@tensorrt_llm/_torch/pyexecutor/model_engine.py`:
- Around line 1202-1204: The warmup-sizing uses
get_runtime_tokens_per_gen_step(draft_len) with a value that may already be a
buffer width (e.g. self.max_total_draft_tokens for non-dynamic path), inflating
sizes; change the call sites so _get_graphs_to_capture / warmup sizing use the
logical draft length (K) not the buffer width (2K-1). Concretely, compute a
logical runtime_draft_len from draft_len or from self.max_total_draft_tokens by
converting buffer-width to K when needed, then pass that logical value into
get_runtime_tokens_per_gen_step and use it to compute
runtime_draft_token_buffer_width, update any places that set
self.runtime_draft_len, the warmup request, and KV budgeting to use this logical
runtime_draft_len (symbols to adjust: get_runtime_tokens_per_gen_step,
runtime_tokens_per_gen_step, runtime_draft_token_buffer_width,
_get_graphs_to_capture, self.max_total_draft_tokens, self.runtime_draft_len).
In `@tensorrt_llm/_torch/speculative/interface.py`:
- Around line 285-287: Update the comment above the runtime_tokens_per_gen_step
variable to clarify the PARD edge case: explain that normally
runtime_tokens_per_gen_step equals 1 + runtime_draft_len, and for PARD it equals
2 * runtime_draft_len except when K=0 (in which case runtime_tokens_per_gen_step
is 1), referencing the PARD mode and the runtime_draft_len and K variables so
readers understand the K=0 special-case behavior for
runtime_tokens_per_gen_step.
In `@tensorrt_llm/_torch/speculative/mtp.py`:
- Around line 601-609: The THOP branch calling
torch.ops.trtllm.mtp_update_hidden_states_op currently passes runtime_draft_len
which causes THOP to only retain a shortened MTP history; change the argument to
max_draft_len (self.spec_config.num_nextn_predict_layers) so THOP refreshes the
full MTP history window the same way the eager path does, ensuring both branches
update the same number of draft entries (compare the call in the is_thop block
and the eager update that uses max_draft_len).
In `@tests/integration/defs/accuracy/test_llm_api_pytorch.py`:
- Around line 439-445: The test function test_pard_dynamic_draft_len is missing
the Hopper-gating decorator; add the `@skip_pre_hopper` decorator immediately
above the function definition so it matches other PARD tests and will be skipped
on pre-Hopper runners; ensure the decorator is imported/available where other
tests use skip_pre_hopper so the new annotation compiles and is applied to
test_pard_dynamic_draft_len.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: d13b632c-a973-4679-97c2-b9e441e96b1d
📒 Files selected for processing (10)

- tensorrt_llm/_torch/attention_backend/trtllm.py
- tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py
- tensorrt_llm/_torch/pyexecutor/model_engine.py
- tensorrt_llm/_torch/pyexecutor/py_executor.py
- tensorrt_llm/_torch/speculative/interface.py
- tensorrt_llm/_torch/speculative/mtp.py
- tensorrt_llm/_torch/speculative/pard.py
- tensorrt_llm/llmapi/llm_args.py
- tests/integration/defs/accuracy/test_llm_api_pytorch.py
- tests/integration/test_lists/qa/llm_function_core.txt
```python
# Total runtime tokens per generation request for the current iteration,
# Normally, it equals 1 + runtime_draft_len. But for PARD, it equals 2 * runtime_draft_len.
runtime_tokens_per_gen_step: int = 1
```
Update the PARD runtime-token comment for K=0.
Line 286 states PARD uses `2 * runtime_draft_len`, but the runtime behavior for K=0 is 1 token. The comment should reflect that edge case to avoid confusion.
Suggested doc fix
```diff
-# Normally, it equals 1 + runtime_draft_len. But for PARD, it equals 2 * runtime_draft_len.
+# Normally, it equals 1 + runtime_draft_len.
+# For PARD, it equals 2 * runtime_draft_len when runtime_draft_len > 0, else 1.
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tensorrt_llm/_torch/speculative/interface.py` around lines 285 - 287, Update
the comment above the runtime_tokens_per_gen_step variable to clarify the PARD
edge case: explain that normally runtime_tokens_per_gen_step equals 1 +
runtime_draft_len, and for PARD it equals 2 * runtime_draft_len except when K=0
(in which case runtime_tokens_per_gen_step is 1), referencing the PARD mode and
the runtime_draft_len and K variables so readers understand the K=0 special-case
behavior for runtime_tokens_per_gen_step.
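The rule described above, including the K=0 special case, can be captured in a small helper (a hypothetical sketch mirroring the comment, not a function that exists in the codebase):

```python
# Hypothetical helper mirroring the documented rule for PARD's K=0 edge case.
def runtime_tokens_per_gen_step(runtime_draft_len, is_pard):
    if is_pard:
        # PARD verifies 2K tokens per step, except K=0 degenerates to
        # plain decoding of a single token.
        return 2 * runtime_draft_len if runtime_draft_len > 0 else 1
    # Non-PARD speculation: one target token plus K draft tokens.
    return 1 + runtime_draft_len

assert runtime_tokens_per_gen_step(3, is_pard=False) == 4
assert runtime_tokens_per_gen_step(3, is_pard=True) == 6
assert runtime_tokens_per_gen_step(0, is_pard=True) == 1
```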
```diff
 runtime_draft_len = spec_metadata.runtime_draft_len
 max_draft_len = self.spec_config.num_nextn_predict_layers

 if self.is_thop:
     _, _ = torch.ops.trtllm.mtp_update_hidden_states_op(
         input_ids, seq_lens, hidden_states,
         spec_metadata.mtp_hidden_states_ptrs,
         spec_metadata.mtp_past_tokens_ptrs, num_accepted_tokens,
-        mtp_num_modules, batch_size, num_contexts, hidden_size)
+        runtime_draft_len, batch_size, num_contexts, hidden_size)
```
🧩 Analysis chain
🏁 Script executed:

```shell
#!/bin/bash
set -euo pipefail
echo "== Python call sites =="
rg -n -C4 'mtp_update_hidden_states_op' tensorrt_llm
echo
echo "== Native op registration / implementation =="
rg -n -C6 'mtp_update_hidden_states(_op)?' . -g '!**/build/**' -g '!**/dist/**'
```

Repository: NVIDIA/TensorRT-LLM
Length of output: 6303

🏁 Script executed:

```shell
sed -n '150,250p' cpp/tensorrt_llm/thop/specDecOp.cpp
```

Repository: NVIDIA/TensorRT-LLM
Length of output: 4543

🏁 Script executed:

```shell
sed -n '610,660p' tensorrt_llm/_torch/speculative/mtp.py
```

Repository: NVIDIA/TensorRT-LLM
Length of output: 2889

🏁 Script executed:

```shell
sed -n '660,710p' tensorrt_llm/_torch/speculative/mtp.py
```

Repository: NVIDIA/TensorRT-LLM
Length of output: 2659

🏁 Script executed:

```shell
grep -rn "numMTPModules" cpp/tensorrt_llm --include="*.h" --include="*.cpp" -A2 -B2 | head -100
```

Repository: NVIDIA/TensorRT-LLM
Length of output: 8028

🏁 Script executed:

```shell
grep -rn "invokeMTPUpdateHiddenStates" cpp/tensorrt_llm --include="*.cu" -A30 | head -80
```

Repository: NVIDIA/TensorRT-LLM
Length of output: 5513

🏁 Script executed:

```shell
grep -n "mtpUpdateHiddenStatesKernel" cpp/tensorrt_llm/kernels/speculativeDecoding/mtpKernels.cu -A60 | head -100
```

Repository: NVIDIA/TensorRT-LLM
Length of output: 4929

🏁 Script executed:

```shell
grep -n "mtp_hidden_states_pool\|mtp_past_tokens_pool" tensorrt_llm/_torch/speculative/mtp.py -B2 -A2 | head -80
```

Repository: NVIDIA/TensorRT-LLM
Length of output: 1966
Verify THOP refreshes the full MTP history window.
The kernel code in `mtpKernels.cu` (lines 386 and 395) uses `numMTPModules` directly to control the history window size: `for (int ii = 0; ii < numMTPModules; ii++)`. Since line 609 now passes `runtime_draft_len` as this parameter, THOP mode will maintain a history window of only `runtime_draft_len` entries. Meanwhile, the eager path at lines 674-676 always updates `max_draft_len` entries in the pools. When `runtime_draft_len` shrinks and later grows, THOP mode will have discarded history that eager mode preserved, causing a divergence.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tensorrt_llm/_torch/speculative/mtp.py` around lines 601 - 609, The THOP
branch calling torch.ops.trtllm.mtp_update_hidden_states_op currently passes
runtime_draft_len which causes THOP to only retain a shortened MTP history;
change the argument to max_draft_len (self.spec_config.num_nextn_predict_layers)
so THOP refreshes the full MTP history window the same way the eager path does,
ensuring both branches update the same number of draft entries (compare the call
in the is_thop block and the eager update that uses max_draft_len).
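The divergence described above can be shown with a toy model of the history pool (hypothetical values; the real pools hold hidden states and tokens per request, but the window arithmetic is the same):

```python
# Toy illustration: the MTP pools keep a window of max_draft_len history
# entries, so both the THOP and eager branches should refresh all
# max_draft_len slots, not just runtime_draft_len of them.
max_draft_len = 4
past_tokens = [10, 11, 12, 13]   # full history window before this step
new_tokens = [20, 21]            # tokens produced this step

# Full-window refresh (eager-path behavior): keep the newest max_draft_len.
full = (past_tokens + new_tokens)[-max_draft_len:]
assert full == [12, 13, 20, 21]

# A shortened window of runtime_draft_len entries silently drops history
# that a later, larger draft length would have needed.
runtime_draft_len = 2
short = (past_tokens + new_tokens)[-runtime_draft_len:]
assert short == [20, 21]
```

When the schedule later raises the draft length back to 4, the THOP-style buffer has only 2 valid entries while the eager buffer still has 4, so the two paths produce different drafts.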
```python
@pytest.mark.skip_less_device_memory(60000)
@parametrize_with_ids("enable_max_concurrency,enable_draft_len_schedule", [
    (False, True),
    (True, False),
])
def test_pard_dynamic_draft_len(self, enable_max_concurrency,
                                enable_draft_len_schedule):
```
Add Hopper gating for the new PARD dynamic-draft test.
`test_pard_dynamic_draft_len` is missing `@skip_pre_hopper`, unlike other PARD tests in this class. This can fail on unsupported pre-Hopper runners.
🔧 Suggested patch

```diff
+@skip_pre_hopper
 @pytest.mark.skip_less_device_memory(60000)
 @parametrize_with_ids("enable_max_concurrency,enable_draft_len_schedule", [
     (False, True),
     (True, False),
 ])
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/integration/defs/accuracy/test_llm_api_pytorch.py` around lines 439 -
445, The test function test_pard_dynamic_draft_len is missing the Hopper-gating
decorator; add the `@skip_pre_hopper` decorator immediately above the function
definition so it matches other PARD tests and will be skipped on pre-Hopper
runners; ensure the decorator is imported/available where other tests use
skip_pre_hopper so the new annotation compiles and is applied to
test_pard_dynamic_draft_len.
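The gating rule behind `@skip_pre_hopper` can be sketched as a compute-capability check (a hypothetical sketch assuming the usual convention that Hopper is SM 9.0 / SM90; the helper name and shape are invented, not the repo's actual decorator):

```python
# Hypothetical predicate: Hopper GPUs report compute capability 9.0 (SM90),
# so anything below (9, 0) would be skipped by a skip_pre_hopper-style gate.
def is_pre_hopper(sm_major, sm_minor=0):
    return (sm_major, sm_minor) < (9, 0)

assert is_pre_hopper(8, 9)       # Ada-class SM89: test would be skipped
assert not is_pre_hopper(9, 0)   # Hopper SM90: test runs
```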
PR_Github #39151 [ run ] triggered by Bot. Commit:

PR_Github #39151 [ run ] completed with state

/bot run --disable-fail-fast

PR_Github #39310 [ run ] triggered by Bot. Commit:
```python
return self.is_mtp_one_model() or self.is_eagle3_one_model(
) or self.is_pard()
```
We should add draft/target support too
```python
task.evaluate(llm, extra_acc_spec="use_sa_spec")


@pytest.mark.skip_less_device_memory(60000)
@parametrize_with_ids("enable_max_concurrency,enable_draft_len_schedule", [
```
Is there a constraint on the SM version here?
PR_Github #39310 [ run ] completed with state
Summary by CodeRabbit
New Features
Tests
Description
This PR does two things:
Test Coverage
Added tests for MTP, MTP-Eagle, and PARD covering dynamic draft length and max concurrency control in tests/integration/defs/accuracy/test_llm_api_pytorch.py.

PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment
/bot help.