
[TRTLLM-10319][feat] Expand dynamic speculation to MTP and PARD.#12262

Open
zheyuf wants to merge 3 commits into NVIDIA:main from zheyuf:MTP_PARD_0315

Conversation

@zheyuf
Collaborator

@zheyuf zheyuf commented Mar 17, 2026

Summary by CodeRabbit

  • New Features

    • Added dynamic draft length support for PARD and MTP (including MTP-Eagle) decoding modes, enabling adaptive token generation scheduling.
    • Introduced new public methods for calculating runtime tokens per generation step in decoding configurations.
  • Tests

    • Added comprehensive test coverage for dynamic draft length configurations across different decoding paths.

Description

This PR does two things:

  1. Expands dynamic speculation to MTP, MTP-Eagle, and PARD. (Eagle support was added in a previous PR.)
  2. Cleans up draft-length-related parameter naming, since dynamic draft length on PARD introduces some complexity:
  • runtime_draft_len: the logical draft length K for the current iteration. For PARD this is still the algorithmic draft length, not the carried runtime width.
  • runtime_tokens_per_gen_step: the total number of tokens processed per generation request in the current iteration, including the golden token. For normal linear modes this is K + 1; for PARD it is 2K.
  • runtime_draft_token_buffer_width: the carried draft-token buffer width for the current iteration, excluding the golden token. For normal linear modes this is K; for PARD it is 2K - 1.
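Taken together, these definitions can be sketched as a pair of helpers (illustrative only; the function names and the `is_pard` flag are assumptions, not the PR's actual code; the K = 0 degenerate case for PARD follows the config behavior described later in this thread):

```python
def tokens_per_gen_step(runtime_draft_len: int, is_pard: bool) -> int:
    """Tokens processed per generation request, including the golden token."""
    if is_pard:
        # PARD processes 2K tokens per step; K = 0 degenerates to 1 token.
        return 2 * runtime_draft_len if runtime_draft_len > 0 else 1
    # Normal linear modes: K draft tokens plus the golden token.
    return 1 + runtime_draft_len


def draft_token_buffer_width(runtime_draft_len: int, is_pard: bool) -> int:
    """Carried draft-token buffer width, excluding the golden token."""
    return tokens_per_gen_step(runtime_draft_len, is_pard) - 1
```

For K = 3, a linear mode processes 4 tokens with a buffer width of 3, while PARD processes 6 tokens with a buffer width of 5.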

Test Coverage

Added tests for MTP, MTP-Eagle, and PARD covering dynamic draft length and max concurrency control in tests/integration/defs/accuracy/test_llm_api_pytorch.py.

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

zheyuf added 2 commits March 16, 2026 23:36
Signed-off-by: Zheyu Fu <zheyuf@NVIDIA.com>
Signed-off-by: Zheyu Fu <zheyuf@NVIDIA.com>
@zheyuf zheyuf requested a review from mikeiovine March 17, 2026 01:12
@zheyuf zheyuf marked this pull request as ready for review March 17, 2026 01:12
@zheyuf zheyuf requested review from a team as code owners March 17, 2026 01:12
@zheyuf zheyuf requested review from brb-nv and syuoni March 17, 2026 01:12
@zheyuf
Collaborator Author

zheyuf commented Mar 17, 2026

/bot run --disable-fail-fast

@zheyuf zheyuf enabled auto-merge (squash) March 17, 2026 01:21
@coderabbitai
Contributor

coderabbitai bot commented Mar 17, 2026

📝 Walkthrough

Walkthrough

This PR introduces dynamic, per-iteration draft-length handling for speculative decoding by propagating a new runtime_tokens_per_gen_step concept through the execution and sampling pipelines. Instead of using fixed maximum draft lengths, the system now computes runtime_draft_token_buffer_width at runtime to adjust buffer dimensions, token counts, and generation metadata on a per-generation-step basis, with specialized behavior for PARD decoding.

Changes

Cohort / File(s) Summary
Speculative Decoding Core
tensorrt_llm/_torch/speculative/interface.py, tensorrt_llm/_torch/speculative/mtp.py, tensorrt_llm/_torch/speculative/pard.py
Added runtime_tokens_per_gen_step field to SpecMetadata; expanded support_dynamic_draft_len() to include PARD and MTP modes. Updated MTP and PARD workers to replace fixed draft-length properties with dynamic runtime_draft_len-based calculations, including tensor reshaping, indexing, and token-offset computations. Removed _draft_tokens_per_req property from PARD worker.
Execution Engine
tensorrt_llm/_torch/pyexecutor/model_engine.py, tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py, tensorrt_llm/_torch/pyexecutor/py_executor.py
Introduced get_runtime_tokens_per_gen_step() callable and replaced hard-coded draft-length references with runtime_tokens_per_gen_step and runtime_draft_token_buffer_width computations. Updated buffer allocation, KV-cache budgeting, token padding, and CUDA graph preparation to use dynamic token-buffer widths. Added spec-decoding-mode-driven branching for PARD-specific token buffering with DRAFT_BUFFER_PAD.
Attention Backend
tensorrt_llm/_torch/attention_backend/trtllm.py
Changed runtime-draft-length calculation in linear-tree spec-decoding path to derive runtime_draft_token_buffer_width from spec_metadata.runtime_tokens_per_gen_step instead of fixed max-draft-length, affecting generation-length and packed-mask buffer dimensions.
Configuration & API
tensorrt_llm/llmapi/llm_args.py
Added public methods get_runtime_tokens_per_gen_step() to DecodingBaseConfig (returning 1 + runtime_draft_len) and PARDDecodingConfig (returning 1 if runtime_draft_len == 0, else 2 * runtime_draft_len).
Tests
tests/integration/defs/accuracy/test_llm_api_pytorch.py, tests/integration/test_lists/qa/llm_function_core.txt
Added three parameterized test methods for dynamic draft-length configurations: test_pard_dynamic_draft_len, test_bfloat16_mtp_dynamic_draft_len, and test_bfloat16_mtp_eagle_dynamic_draft_len, covering max-concurrency and draft-length-schedule toggles with GSM8K evaluation.
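The config-level dispatch summarized in the table above can be sketched as follows (simplified class shapes; whether runtime_draft_len arrives as a parameter or an attribute, and the exact class contents, are assumptions here):

```python
class DecodingBaseConfig:
    def get_runtime_tokens_per_gen_step(self, runtime_draft_len: int) -> int:
        # Default for linear speculation modes: K drafts + 1 golden token.
        return 1 + runtime_draft_len


class PARDDecodingConfig(DecodingBaseConfig):
    def get_runtime_tokens_per_gen_step(self, runtime_draft_len: int) -> int:
        # PARD: 2K tokens per step, degenerating to 1 when K == 0.
        return 1 if runtime_draft_len == 0 else 2 * runtime_draft_len
```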

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related PRs

  • #10860: Implements parallel dynamic draft-length handling across CUDA-graph padding and per-draft-length buffer/cache management in the same executor and speculative-decoding code paths.
  • #6300: Modifies speculative-decoding pipeline and pyexecutor/model-drafter integration, overlapping in guided-decoding and draft-token bookkeeping logic.
  • #11878: Modifies PARD/MTP workers and spec-metadata in the same files, affecting speculative-decoding internals.

Suggested reviewers

  • mikeiovine
  • QiJune
  • StanleySun639
  • Shixiaowei02
  • Tabrizian
🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
  • Docstring Coverage ✅ Passed: No functions found in the changed files to evaluate docstring coverage; skipping the docstring coverage check.
  • Title check ✅ Passed: The title clearly and concisely describes the primary change: expanding dynamic speculation (draft-length) support to MTP and PARD decoding modes. It directly relates to the main objectives outlined in the PR.
  • Description check ✅ Passed: The PR description provides clear details about the changes (expanding dynamic draft length to MTP/PARD and clarifying parameter naming), includes test coverage information, and confirms the PR checklist was reviewed. All major template sections are addressed substantively.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tensorrt_llm/_torch/pyexecutor/model_engine.py (1)

2366-2412: ⚠️ Potential issue | 🔴 Critical

The first tree-decoding step still assumes the configured max width.

This branch still requires num_draft_tokens == spec_tree_manager.max_total_draft_tokens and appends the full spec_dec_position_offsets[0]. Dynamic PARD batches below the configured maximum will fail here on the first generation step or warmup; if assertions are stripped, position_ids becomes longer than 1 + num_draft_tokens.

Possible fix
                 if not self.is_draft_model and not spec_config.is_linear_tree:
                     assert spec_tree_manager is not None
-                    assert num_draft_tokens == spec_tree_manager.max_total_draft_tokens
+                    assert num_draft_tokens <= spec_tree_manager.max_total_draft_tokens
                     position_ids.extend(
                         past_seen_token_num +
-                        spec_tree_manager.spec_dec_position_offsets[
-                            0]  # [max_total_draft_tokens + 1]
+                        spec_tree_manager.spec_dec_position_offsets[0][
+                            :1 + num_draft_tokens]
                     )

The same runtime slice should be applied anywhere else that consumes spec_dec_position_offsets[0].

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/pyexecutor/model_engine.py` around lines 2366 - 2412, The
code in the branch that handles tree decoding (inside model_engine where
spec_tree_manager is used) assumes num_draft_tokens equals
spec_tree_manager.max_total_draft_tokens and appends the entire
spec_dec_position_offsets[0], which breaks dynamic PARD smaller-than-configured
batches; remove the strict equality/assertion and instead extend position_ids
with only the runtime slice of spec_tree_manager.spec_dec_position_offsets that
corresponds to the actual tokens (e.g. use
spec_tree_manager.spec_dec_position_offsets[0:1 + num_draft_tokens] or otherwise
index up to 1 + num_draft_tokens) so position_ids length matches 1 +
num_draft_tokens; apply the same runtime slicing anywhere else
spec_dec_position_offsets[0] is consumed.
🧹 Nitpick comments (1)
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)

1652-1652: Consider moving constant definition outside the loop.

DRAFT_BUFFER_PAD is redefined on each iteration of the for loop. While the performance impact is negligible, moving it before the loop (around line 1651) would be slightly cleaner.

♻️ Suggested refactor
             runtime_draft_len = get_draft_len_for_batch_size(
                 self.model_engine.spec_config.draft_len_schedule,
                 scheduled_batch.batch_size, self.model_engine.max_draft_len)
             # 2. Pad or truncate draft tokens to the resolved length
+            DRAFT_BUFFER_PAD = 0  # Buffer sentinel, not PARD mask_token_id.
             for request in scheduled_batch.generation_requests:
-                DRAFT_BUFFER_PAD = 0  # Buffer sentinel, not PARD mask_token_id.
                 current_num_draft_tokens = len(request.py_draft_tokens)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/pyexecutor/py_executor.py` at line 1652, DRAFT_BUFFER_PAD
is being set inside the loop each iteration; pull the constant definition out of
the loop by declaring DRAFT_BUFFER_PAD = 0 just once immediately before the
enclosing for loop (so the loop body uses the already-defined symbol), ensuring
any references inside the loop continue to use the same constant and no other
logic changes are needed.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tensorrt_llm/_torch/attention_backend/trtllm.py`:
- Around line 1602-1609: The code dereferences
spec_metadata.runtime_tokens_per_gen_step without guarding spec_metadata; update
the branch around runtime_draft_token_buffer_width calculation to first check
spec_metadata is not None (or explicitly enforce the precondition) and either
use a defined fixed-width fallback when spec_metadata is None or raise a clear
ValueError. Specifically, protect access to
spec_metadata.runtime_tokens_per_gen_step before computing
runtime_draft_token_buffer_width, then call
generate_spec_decoding_generation_length(runtime_draft_len=...), and compute
spec_decoding_position_offsets and spec_decoding_packed_mask only after
determining runtime_draft_token_buffer_width; reference spec_metadata,
runtime_tokens_per_gen_step, runtime_draft_token_buffer_width,
generate_spec_decoding_generation_length,
generate_spec_decoding_position_offsets, generate_spec_decoding_packed_mask, and
max_num_requests when making the guard or fallback change.

In `@tensorrt_llm/_torch/pyexecutor/model_engine.py`:
- Around line 1202-1204: The warmup-sizing uses
get_runtime_tokens_per_gen_step(draft_len) with a value that may already be a
buffer width (e.g. self.max_total_draft_tokens for non-dynamic path), inflating
sizes; change the call sites so _get_graphs_to_capture / warmup sizing use the
logical draft length (K) not the buffer width (2K-1). Concretely, compute a
logical runtime_draft_len from draft_len or from self.max_total_draft_tokens by
converting buffer-width to K when needed, then pass that logical value into
get_runtime_tokens_per_gen_step and use it to compute
runtime_draft_token_buffer_width, update any places that set
self.runtime_draft_len, the warmup request, and KV budgeting to use this logical
runtime_draft_len (symbols to adjust: get_runtime_tokens_per_gen_step,
runtime_tokens_per_gen_step, runtime_draft_token_buffer_width,
_get_graphs_to_capture, self.max_total_draft_tokens, self.runtime_draft_len).

In `@tensorrt_llm/_torch/speculative/interface.py`:
- Around line 285-287: Update the comment above the runtime_tokens_per_gen_step
variable to clarify the PARD edge case: explain that normally
runtime_tokens_per_gen_step equals 1 + runtime_draft_len, and for PARD it equals
2 * runtime_draft_len except when K=0 (in which case runtime_tokens_per_gen_step
is 1), referencing the PARD mode and the runtime_draft_len and K variables so
readers understand the K=0 special-case behavior for
runtime_tokens_per_gen_step.

In `@tensorrt_llm/_torch/speculative/mtp.py`:
- Around line 601-609: The THOP branch calling
torch.ops.trtllm.mtp_update_hidden_states_op currently passes runtime_draft_len
which causes THOP to only retain a shortened MTP history; change the argument to
max_draft_len (self.spec_config.num_nextn_predict_layers) so THOP refreshes the
full MTP history window the same way the eager path does, ensuring both branches
update the same number of draft entries (compare the call in the is_thop block
and the eager update that uses max_draft_len).

In `@tests/integration/defs/accuracy/test_llm_api_pytorch.py`:
- Around line 439-445: The test function test_pard_dynamic_draft_len is missing
the Hopper-gating decorator; add the `@skip_pre_hopper` decorator immediately
above the function definition so it matches other PARD tests and will be skipped
on pre-Hopper runners; ensure the decorator is imported/available where other
tests use skip_pre_hopper so the new annotation compiles and is applied to
test_pard_dynamic_draft_len.

---

Outside diff comments:
In `@tensorrt_llm/_torch/pyexecutor/model_engine.py`:
- Around line 2366-2412: The code in the branch that handles tree decoding
(inside model_engine where spec_tree_manager is used) assumes num_draft_tokens
equals spec_tree_manager.max_total_draft_tokens and appends the entire
spec_dec_position_offsets[0], which breaks dynamic PARD smaller-than-configured
batches; remove the strict equality/assertion and instead extend position_ids
with only the runtime slice of spec_tree_manager.spec_dec_position_offsets that
corresponds to the actual tokens (e.g. use
spec_tree_manager.spec_dec_position_offsets[0:1 + num_draft_tokens] or otherwise
index up to 1 + num_draft_tokens) so position_ids length matches 1 +
num_draft_tokens; apply the same runtime slicing anywhere else
spec_dec_position_offsets[0] is consumed.

---

Nitpick comments:
In `@tensorrt_llm/_torch/pyexecutor/py_executor.py`:
- Line 1652: DRAFT_BUFFER_PAD is being set inside the loop each iteration; pull
the constant definition out of the loop by declaring DRAFT_BUFFER_PAD = 0 just
once immediately before the enclosing for loop (so the loop body uses the
already-defined symbol), ensuring any references inside the loop continue to use
the same constant and no other logic changes are needed.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: d13b632c-a973-4679-97c2-b9e441e96b1d

📥 Commits

Reviewing files that changed from the base of the PR and between 0eab5b6 and de4c894.

📒 Files selected for processing (10)
  • tensorrt_llm/_torch/attention_backend/trtllm.py
  • tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py
  • tensorrt_llm/_torch/pyexecutor/model_engine.py
  • tensorrt_llm/_torch/pyexecutor/py_executor.py
  • tensorrt_llm/_torch/speculative/interface.py
  • tensorrt_llm/_torch/speculative/mtp.py
  • tensorrt_llm/_torch/speculative/pard.py
  • tensorrt_llm/llmapi/llm_args.py
  • tests/integration/defs/accuracy/test_llm_api_pytorch.py
  • tests/integration/test_lists/qa/llm_function_core.txt

Comment on lines +285 to +287
# Total runtime tokens per generation request for the current iteration,
# Normally, it equals 1 + runtime_draft_len. But for PARD, it equals 2 * runtime_draft_len.
runtime_tokens_per_gen_step: int = 1

⚠️ Potential issue | 🟡 Minor

Update the PARD runtime-token comment for K=0.

Line 286 states PARD uses 2 * runtime_draft_len, but the runtime behavior for K=0 is 1 token. The comment should reflect that edge case to avoid confusion.

Suggested doc fix
-    # Normally, it equals 1 + runtime_draft_len. But for PARD, it equals 2 * runtime_draft_len.
+    # Normally, it equals 1 + runtime_draft_len.
+    # For PARD, it equals 2 * runtime_draft_len when runtime_draft_len > 0, else 1.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/speculative/interface.py` around lines 285 - 287, Update
the comment above the runtime_tokens_per_gen_step variable to clarify the PARD
edge case: explain that normally runtime_tokens_per_gen_step equals 1 +
runtime_draft_len, and for PARD it equals 2 * runtime_draft_len except when K=0
(in which case runtime_tokens_per_gen_step is 1), referencing the PARD mode and
the runtime_draft_len and K variables so readers understand the K=0 special-case
behavior for runtime_tokens_per_gen_step.

Comment on lines +601 to +609
runtime_draft_len = spec_metadata.runtime_draft_len
max_draft_len = self.spec_config.num_nextn_predict_layers

if self.is_thop:
    _, _ = torch.ops.trtllm.mtp_update_hidden_states_op(
        input_ids, seq_lens, hidden_states,
        spec_metadata.mtp_hidden_states_ptrs,
        spec_metadata.mtp_past_tokens_ptrs, num_accepted_tokens,
-       mtp_num_modules, batch_size, num_contexts, hidden_size)
+       runtime_draft_len, batch_size, num_contexts, hidden_size)

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "== Python call sites =="
rg -n -C4 'mtp_update_hidden_states_op' tensorrt_llm

echo
echo "== Native op registration / implementation =="
rg -n -C6 'mtp_update_hidden_states(_op)?' . -g '!**/build/**' -g '!**/dist/**'

Repository: NVIDIA/TensorRT-LLM

Length of output: 6303


🏁 Script executed:

sed -n '150,250p' cpp/tensorrt_llm/thop/specDecOp.cpp

Repository: NVIDIA/TensorRT-LLM

Length of output: 4543


🏁 Script executed:

sed -n '610,660p' tensorrt_llm/_torch/speculative/mtp.py

Repository: NVIDIA/TensorRT-LLM

Length of output: 2889


🏁 Script executed:

sed -n '660,710p' tensorrt_llm/_torch/speculative/mtp.py

Repository: NVIDIA/TensorRT-LLM

Length of output: 2659


🏁 Script executed:

grep -rn "numMTPModules" cpp/tensorrt_llm --include="*.h" --include="*.cpp" -A2 -B2 | head -100

Repository: NVIDIA/TensorRT-LLM

Length of output: 8028


🏁 Script executed:

grep -rn "invokeMTPUpdateHiddenStates" cpp/tensorrt_llm --include="*.cu" -A30 | head -80

Repository: NVIDIA/TensorRT-LLM

Length of output: 5513


🏁 Script executed:

grep -n "mtpUpdateHiddenStatesKernel" cpp/tensorrt_llm/kernels/speculativeDecoding/mtpKernels.cu -A60 | head -100

Repository: NVIDIA/TensorRT-LLM

Length of output: 4929


🏁 Script executed:

grep -n "mtp_hidden_states_pool\|mtp_past_tokens_pool" tensorrt_llm/_torch/speculative/mtp.py -B2 -A2 | head -80

Repository: NVIDIA/TensorRT-LLM

Length of output: 1966


Verify THOP refreshes the full MTP history window.

The kernel code in mtpKernels.cu line 386 and 395 uses numMTPModules directly to control the history window size: for (int ii = 0; ii < numMTPModules; ii++). Since line 609 now passes runtime_draft_len as this parameter, THOP mode will maintain a history window of only runtime_draft_len entries. Meanwhile, the eager path at lines 674-676 always updates max_draft_len entries in the pools. When runtime_draft_len shrinks and later grows, THOP mode will have discarded history that eager mode preserved, causing a divergence.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/speculative/mtp.py` around lines 601 - 609, The THOP
branch calling torch.ops.trtllm.mtp_update_hidden_states_op currently passes
runtime_draft_len which causes THOP to only retain a shortened MTP history;
change the argument to max_draft_len (self.spec_config.num_nextn_predict_layers)
so THOP refreshes the full MTP history window the same way the eager path does,
ensuring both branches update the same number of draft entries (compare the call
in the is_thop block and the eager update that uses max_draft_len).

Comment on lines +439 to +445
@pytest.mark.skip_less_device_memory(60000)
@parametrize_with_ids("enable_max_concurrency,enable_draft_len_schedule", [
    (False, True),
    (True, False),
])
def test_pard_dynamic_draft_len(self, enable_max_concurrency,
                               enable_draft_len_schedule):

⚠️ Potential issue | 🟠 Major

Add Hopper gating for the new PARD dynamic-draft test.

test_pard_dynamic_draft_len is missing @skip_pre_hopper, unlike other PARD tests in this class. This can fail on unsupported pre-Hopper runners.

🔧 Suggested patch
+    @skip_pre_hopper
     @pytest.mark.skip_less_device_memory(60000)
     @parametrize_with_ids("enable_max_concurrency,enable_draft_len_schedule", [
         (False, True),
         (True, False),
     ])
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/integration/defs/accuracy/test_llm_api_pytorch.py` around lines 439 -
445, The test function test_pard_dynamic_draft_len is missing the Hopper-gating
decorator; add the `@skip_pre_hopper` decorator immediately above the function
definition so it matches other PARD tests and will be skipped on pre-Hopper
runners; ensure the decorator is imported/available where other tests use
skip_pre_hopper so the new annotation compiles and is applied to
test_pard_dynamic_draft_len.

@tensorrt-cicd
Collaborator

PR_Github #39151 [ run ] triggered by Bot. Commit: de4c894 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #39151 [ run ] completed with state FAILURE. Commit: de4c894
/LLM/main/L0_MergeRequest_PR pipeline #30410 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

Signed-off-by: Zheyu Fu <zheyuf@NVIDIA.com>
@zheyuf zheyuf changed the title [TRTLLM-10319][feat] Expand dynamic draft length to MTP and PARD. [TRTLLM-10319][feat] Expand dynamic speculation to MTP and PARD. Mar 17, 2026
@zheyuf
Collaborator Author

zheyuf commented Mar 17, 2026

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #39310 [ run ] triggered by Bot. Commit: 864f29b Link to invocation

Comment on lines +145 to +146
return self.is_mtp_one_model() or self.is_eagle3_one_model(
) or self.is_pard()

We should add draft/target support too

task.evaluate(llm, extra_acc_spec="use_sa_spec")

@pytest.mark.skip_less_device_memory(60000)
@parametrize_with_ids("enable_max_concurrency,enable_draft_len_schedule", [

Is there a constraint on the SM version?

@tensorrt-cicd
Collaborator

PR_Github #39310 [ run ] completed with state SUCCESS. Commit: 864f29b
/LLM/main/L0_MergeRequest_PR pipeline #30558 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation
