[TRTLLM-11508][refactor] Merge Eagle3 and MTP-eagle one-model workers by zhaoyangwang-nvidia · Pull Request #12353 · NVIDIA/TensorRT-LLM

zhaoyangwang-nvidia · 2026-03-19T09:24:15Z

Summary by CodeRabbit

Release Notes

New Features
- Added support for MTP Eagle speculative decoding mode for enhanced inference acceleration alongside existing Eagle3 support.
- Introduced relaxed acceptance configuration during reasoning/thinking phases to optimize token generation.
- Extended speculative decoding compatibility to additional model architectures (DeepseekV3, ExaoneMoe, GLM, Nemotron, Qwen3).

Description

Unify Eagle3OneModelWorker (Eagle3 one-model) and MTPEagleWorker
(MTP-eagle one-model) into a single worker class in
tensorrt_llm/_torch/speculative/eagle3.py, branching on
self.is_mtp_eagle = spec_dec_mode.is_mtp_eagle_one_model(). The two
code paths were ~85% duplicated; this PR collapses them into one
implementation while preserving backward-compatible imports.

MTPEagleWorker becomes a thin backward-compatible subclass in
eagle3.py. mtp.py retains a module-level __getattr__ shim so
from tensorrt_llm._torch.speculative.mtp import MTPEagleWorker
continues to resolve.
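The import shim described above relies on Python's module-level `__getattr__` hook (PEP 562). The following is a minimal sketch of that pattern, not the code from this PR: the module names `demo_eagle3` and `demo_mtp` are hypothetical stand-ins for eagle3.py and mtp.py, and the class bodies are placeholders.

```python
import importlib
import sys
import types

# Hypothetical stand-in for eagle3.py: the unified worker plus a thin subclass.
eagle3 = types.ModuleType("demo_eagle3")


class Eagle3OneModelWorker:
    def __init__(self, is_mtp_eagle=False):
        self.is_mtp_eagle = is_mtp_eagle


class MTPEagleWorker(Eagle3OneModelWorker):
    """Thin backward-compatible subclass that pins the MTP-eagle branch."""

    def __init__(self):
        super().__init__(is_mtp_eagle=True)


eagle3.Eagle3OneModelWorker = Eagle3OneModelWorker
eagle3.MTPEagleWorker = MTPEagleWorker
sys.modules["demo_eagle3"] = eagle3

# Hypothetical stand-in for mtp.py: it no longer defines the class itself.
mtp = types.ModuleType("demo_mtp")


def _mtp_getattr(name):
    # Called only when normal attribute lookup on the module fails (PEP 562).
    if name == "MTPEagleWorker":
        return getattr(importlib.import_module("demo_eagle3"), name)
    raise AttributeError(f"module 'demo_mtp' has no attribute {name!r}")


mtp.__getattr__ = _mtp_getattr
sys.modules["demo_mtp"] = mtp

# The historical import path still resolves to the relocated class.
from demo_mtp import MTPEagleWorker as LegacyImport
```

Resolving the name lazily inside the hook, rather than importing it at the top of the old module, is one common way to avoid a circular import between the two modules; whether that is the motivation here is an assumption.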

Key changes

- Unified worker (eagle3.py): new helpers
  _get_step_all_rank_num_tokens, _run_draft_forward,
  _prepare_flash_mla_generation_layout, draft_sampler (TP-aware);
  sample_and_accept_draft_tokens gains an input_ids parameter and
  the relaxed-thinking acceptance path previously exclusive to
  MTPEagleWorker.
- Unified metadata (Eagle3OneModelSpecMetadata): new fields
  slot_ids and subseq_all_rank_num_tokens. prepare() skips
  num_tokens subtraction for MTP-eagle and populates slot_ids from
  the resource manager.
- Eagle3ResourceManager owns the relaxed-acceptance
  relaxed_delta_pool so the Eagle3 path can also use thinking-phase
  relaxed acceptance.
- Eagle3DraftModel.forward takes an optional
  all_rank_num_tokens kwarg and wraps its body in try/finally that
  restores attn_metadata.all_rank_num_tokens on exit; the worker
  loop no longer mutates attn_metadata for Eagle3.
- EagleDecodingConfig gains the five relaxed-acceptance fields
  mirrored from MTPDecodingConfig.
- SpeculativeDecodingMode.is_mtp_one_model() is narrowed to
  vanilla MTP only; MTP_EAGLE_ONE_MODEL becomes a first-class
  one-model mode in use_one_engine, without_logits,
  needs_kv_cache_rewind, support_overlap_scheduler,
  support_capturable_guided_decoder, support_dynamic_draft_len,
  has_spec_decoder. Per-model checks in modeling_deepseekv3,
  modeling_glm, modeling_exaone_moe, modeling_nemotron_h,
  modeling_qwen3_next, modeling_speculative, and model_config
  are extended accordingly.
- Factory routing (utils.py) routes MTP_EAGLE_ONE_MODEL to
  Eagle3OneModelSpecMetadata, Eagle3OneModelSampler,
  Eagle3ResourceManager, and the unified worker.
- model_engine.py populates
  spec_metadata.subseq_all_rank_num_tokens for both Eagle3 and
  MTP-eagle one-model at all three attention-DP allgather sites.
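The try/finally restore mentioned for Eagle3DraftModel.forward can be illustrated with a small sketch. Everything here is a simplified assumption: `AttnMetadata` is a hypothetical stand-in dataclass, and `draft_forward` only mimics the override-then-restore shape, not the real forward pass.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class AttnMetadata:
    """Hypothetical stand-in for the real attention metadata object."""
    all_rank_num_tokens: List[int]


def draft_forward(attn_metadata: AttnMetadata,
                  all_rank_num_tokens: Optional[List[int]] = None,
                  fail: bool = False) -> List[int]:
    """Temporarily override the per-rank token counts for one draft step."""
    saved = attn_metadata.all_rank_num_tokens
    try:
        if all_rank_num_tokens is not None:
            attn_metadata.all_rank_num_tokens = all_rank_num_tokens
        if fail:
            raise RuntimeError("simulated failure inside the forward body")
        # Placeholder for the forward body: report the counts it observed.
        return list(attn_metadata.all_rank_num_tokens)
    finally:
        # Runs on both normal return and exception, so the caller's view of
        # attn_metadata is unchanged after the call.
        attn_metadata.all_rank_num_tokens = saved


meta = AttnMetadata(all_rank_num_tokens=[8, 6])
seen = draft_forward(meta, all_rank_num_tokens=[2, 2])
```

Keeping the override inside the callee means the caller never has to remember to undo it, which matches the stated goal of the worker loop no longer mutating attn_metadata.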

Test Coverage

PR Checklist

Please review the following before submitting your PR:

PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

zhaoyangwang-nvidia · 2026-03-19T10:31:09Z

/bot run

tensorrt-cicd · 2026-03-19T10:36:56Z

PR_Github #39589 [ run ] triggered by Bot. Commit: 9f923d9 Link to invocation

tensorrt-cicd · 2026-03-19T14:01:55Z

PR_Github #39589 [ run ] completed with state SUCCESS. Commit: 9f923d9
/LLM/main/L0_MergeRequest_PR pipeline #30802 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

zhaoyangwang-nvidia · 2026-03-20T10:16:35Z

/bot run

tensorrt-cicd · 2026-03-20T10:23:07Z

PR_Github #39733 [ run ] triggered by Bot. Commit: 01b1fe7 Link to invocation

tensorrt-cicd · 2026-03-20T14:00:19Z

PR_Github #39733 [ run ] completed with state SUCCESS. Commit: 01b1fe7
/LLM/main/L0_MergeRequest_PR pipeline #30928 completed with status: 'FAILURE'

zhaoyangwang-nvidia · 2026-03-22T08:23:13Z

/bot run

tensorrt-cicd · 2026-03-22T08:28:48Z

PR_Github #39814 [ run ] triggered by Bot. Commit: 5a1ef63 Link to invocation

tensorrt-cicd · 2026-03-22T10:21:48Z

PR_Github #39814 [ run ] completed with state SUCCESS. Commit: 5a1ef63
/LLM/main/L0_MergeRequest_PR pipeline #30991 completed with status: 'FAILURE'

zhaoyangwang-nvidia · 2026-03-22T10:39:43Z

/bot run

tensorrt-cicd · 2026-03-22T10:46:34Z

PR_Github #39819 [ run ] triggered by Bot. Commit: 5a1ef63 Link to invocation

tensorrt-cicd · 2026-03-22T12:43:21Z

PR_Github #39819 [ run ] completed with state SUCCESS. Commit: 5a1ef63
/LLM/main/L0_MergeRequest_PR pipeline #30996 completed with status: 'FAILURE'

zhaoyangwang-nvidia · 2026-03-22T14:03:10Z

/bot run

tensorrt-cicd · 2026-03-22T14:10:00Z

PR_Github #39828 [ run ] triggered by Bot. Commit: 5a1ef63 Link to invocation

tensorrt-cicd · 2026-03-22T16:06:18Z

PR_Github #39828 [ run ] completed with state SUCCESS. Commit: 5a1ef63
/LLM/main/L0_MergeRequest_PR pipeline #31005 completed with status: 'FAILURE'

zhaoyangwang-nvidia · 2026-03-26T07:05:07Z

/bot run

tensorrt-cicd · 2026-03-26T07:10:40Z

PR_Github #40431 [ run ] triggered by Bot. Commit: f4f8ac6 Link to invocation

tensorrt-cicd · 2026-03-26T11:13:45Z

PR_Github #40431 [ run ] completed with state FAILURE. Commit: f4f8ac6
/LLM/main/L0_MergeRequest_PR pipeline #31523 completed with status: 'FAILURE'

tensorrt-cicd · 2026-05-14T13:50:02Z

PR_Github #48373 [ run ] triggered by Bot. Commit: d2f8aee Link to invocation

tensorrt-cicd · 2026-05-14T14:26:29Z

PR_Github #48373 [ run ] completed with state FAILURE. Commit: d2f8aee
/LLM/main/L0_MergeRequest_PR pipeline #38178 completed with status: 'FAILURE'

zhaoyangwang-nvidia · 2026-05-15T03:19:17Z

/bot run

tensorrt-cicd · 2026-05-15T03:25:31Z

PR_Github #48499 [ run ] triggered by Bot. Commit: d01e724 Link to invocation

tensorrt-cicd · 2026-05-15T06:08:55Z

PR_Github #48499 [ run ] completed with state SUCCESS. Commit: d01e724
/LLM/main/L0_MergeRequest_PR pipeline #38296 completed with status: 'FAILURE'

zhaoyangwang-nvidia · 2026-05-15T06:23:25Z

/bot run

tensorrt-cicd · 2026-05-15T06:28:37Z

PR_Github #48545 [ run ] triggered by Bot. Commit: d01e724 Link to invocation

tensorrt-cicd · 2026-05-15T08:45:04Z

PR_Github #48545 [ run ] completed with state SUCCESS. Commit: d01e724
/LLM/main/L0_MergeRequest_PR pipeline #38336 completed with status: 'FAILURE'

zhaoyangwang-nvidia · 2026-05-15T09:07:15Z

/bot run

tensorrt-cicd · 2026-05-15T09:13:15Z

PR_Github #48571 [ run ] triggered by Bot. Commit: d01e724 Link to invocation

tensorrt-cicd · 2026-05-16T08:24:23Z

PR_Github #48571 [ run ] completed with state SUCCESS. Commit: d01e724
/LLM/main/L0_MergeRequest_PR pipeline #38358 completed with status: 'SUCCESS'

CI Report

Link to invocation

zhaoyangwang-nvidia · 2026-05-18T02:57:47Z

Hi @mikeiovine @sunnyqgg @ziyixiong-nv, could you help review this PR? All CI checks have passed and it is ready for review.

ziyixiong-nv · 2026-05-22T06:37:16Z

        self.model_nextn = 0
-        if model_config.spec_config is not None and model_config.spec_config.spec_dec_mode.is_mtp_one_model(
+        if model_config.spec_config is not None and (
+                model_config.spec_config.spec_dec_mode.is_mtp_one_model() or


It seems https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/_torch/models/modeling_deepseekv3.py#L347 was not updated, and I think this would be easy to break in the future.

is_mtp_vanilla is what you want for SpeculativeDecodingMode.MTP, so you can use is_mtp_vanilla in some places, and keep the previous is_mtp_one_model for SpeculativeDecodingMode.MTP or SpeculativeDecodingMode.MTP_EAGLE_ONE_MODEL.

Please also verify with a local test that the acceptance rate (AR) does not drop when using MTP Eagle one-model.

Done. Restored is_mtp_one_model() to union (MTP or MTP_EAGLE_ONE_MODEL) and used is_mtp_vanilla() where only vanilla MTP applies.
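The predicate split agreed on in this thread can be sketched as follows. `DemoSpecDecMode` is a hypothetical miniature enum, not the real SpeculativeDecodingMode; the member names MTP and MTP_EAGLE_ONE_MODEL come from the discussion, while EAGLE3_ONE_MODEL is an assumed name included only to show a non-MTP case.

```python
from enum import Enum, auto


class DemoSpecDecMode(Enum):
    """Hypothetical miniature of the speculative-decoding mode enum."""
    MTP = auto()
    MTP_EAGLE_ONE_MODEL = auto()
    EAGLE3_ONE_MODEL = auto()  # assumed name, for contrast only

    def is_mtp_vanilla(self) -> bool:
        # Matches vanilla MTP exclusively.
        return self is DemoSpecDecMode.MTP

    def is_mtp_eagle_one_model(self) -> bool:
        return self is DemoSpecDecMode.MTP_EAGLE_ONE_MODEL

    def is_mtp_one_model(self) -> bool:
        # Union predicate: vanilla MTP or MTP-eagle one-model.
        return self.is_mtp_vanilla() or self.is_mtp_eagle_one_model()


# With the union in place, "is_mtp_one_model() or is_mtp_eagle_one_model()"
# is redundant: the second operand can never change the result.
redundant_matches_union = all(
    (m.is_mtp_one_model() or m.is_mtp_eagle_one_model()) == m.is_mtp_one_model()
    for m in DemoSpecDecMode
)
```

Call sites that should react to both MTP flavors keep using the union predicate, and only the sites specific to vanilla MTP switch to the narrower one.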

Test result:
Verified MTP Eagle one-model AR is unchanged after the unified worker refactor:

| | Pre-refactor (ce788e0) | Post-refactor (this PR) |
| --- | --- | --- |
| Acceptance rate | 81.38% | 80.66% |
| Avg acceptance length | 1.814 / 2.0 | 1.805 / 2.0 |

Tested on Qwen3.5-9B, 16 real prompts × 128 output tokens, single B200. Difference is within run-to-run noise.

Unify the Eagle3 one-model and MTP-eagle one-model speculative-decoding workers into a single Eagle3OneModelWorker in eagle3.py, branching on self.is_mtp_eagle. MTPEagleWorker becomes a thin backward-compatible subclass; mtp.py keeps a module-level __getattr__ shim so the historical import path continues to resolve.

Key changes:
- Eagle3OneModelSpecMetadata gains slot_ids and subseq_all_rank_num_tokens; prepare() skips num_tokens adjustment for MTP-eagle and populates slot_ids from the resource manager.
- Eagle3ResourceManager owns the relaxed-acceptance delta pool for both modes.
- New helpers _get_step_all_rank_num_tokens, _run_draft_forward, and _prepare_flash_mla_generation_layout encapsulate the per-step branching.
- sample_and_accept_draft_tokens takes input_ids and supports the relaxed-thinking path previously exclusive to MTPEagleWorker.
- EagleDecodingConfig grows the relaxed-acceptance fields mirrored from MTPDecodingConfig.
- SpeculativeDecodingMode.is_mtp_one_model() now means vanilla MTP only; predicates and per-model checks are extended to recognize MTP_EAGLE_ONE_MODEL as a first-class one-model mode.
- The Eagle3 _saved_kv_lens_cuda save/restore is dropped (relying on attn_metadata.update_for_spec_dec() instead); needs verification under Eagle3 regression tests.
- Factory routing in utils.py routes MTP_EAGLE_ONE_MODEL to the unified Eagle3 metadata, sampler, resource manager, and worker.

Signed-off-by: ZhaoyangWang <zhaoyangw@nvidia.com>

Move the per-step ``all_rank_num_tokens`` plumbing into the draft model itself so the unified Eagle3OneModelWorker no longer mutates ``attn_metadata`` on the way into the draft loop.

- Eagle3DraftModel.forward takes an optional ``all_rank_num_tokens`` kwarg and wraps its body in try/finally that restores ``attn_metadata.all_rank_num_tokens`` on exit.
- _run_draft_forward in eagle3.py passes ``all_rank_num_tokens`` via ``inputs`` for Eagle3 (kwarg to Eagle3DraftModel) and as a direct kwarg to ``mtp_layers[0]`` for MTP Eagle; the worker no longer needs the old fallback parameter.
- _get_step_all_rank_num_tokens reads only from spec_metadata (all_rank_num_tokens at step 0, subseq_all_rank_num_tokens otherwise).
- model_engine.py populates ``spec_metadata.subseq_all_rank_num_tokens`` for both Eagle3 one-model and MTP-eagle one-model at all three sites that allgather per-rank token counts.

Signed-off-by: ZhaoyangWang <zhaoyangw@nvidia.com>
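The selection rule stated for _get_step_all_rank_num_tokens (step 0 versus later steps) can be sketched in a few lines. `DemoSpecMetadata` and the free function below are hypothetical simplifications; the real helper is a method on the worker and its exact signature is not shown in this PR text.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DemoSpecMetadata:
    """Hypothetical stand-in holding only the two fields named in the commit."""
    all_rank_num_tokens: Optional[List[int]] = None
    subseq_all_rank_num_tokens: Optional[List[int]] = None


def get_step_all_rank_num_tokens(spec_metadata: DemoSpecMetadata,
                                 step: int) -> Optional[List[int]]:
    """Pick the per-rank token counts for a given draft step."""
    if step == 0:
        # The first draft step sees the full per-rank token counts.
        return spec_metadata.all_rank_num_tokens
    # Every later draft step uses the counts prepared for subsequent steps.
    return spec_metadata.subseq_all_rank_num_tokens


spec = DemoSpecMetadata(all_rank_num_tokens=[12, 9],
                        subseq_all_rank_num_tokens=[3, 3])
per_step = [get_step_all_rank_num_tokens(spec, s) for s in range(3)]
```

Because both fields are filled in ahead of time (by model_engine.py, according to the commit message), the helper itself needs no fallback argument.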

Signed-off-by: ZhaoyangWang <zhaoyangw@nvidia.com>

Address review feedback: keep is_mtp_one_model() covering both MTP and MTP_EAGLE_ONE_MODEL (matches main), and use is_mtp_vanilla() only where the call should match vanilla MTP exclusively. Drop the JIRA tag from the NOTE comment in eagle3.py and simplify the now-redundant "is_mtp_one_model() or is_mtp_eagle_one_model()" patterns introduced by the merge. Signed-off-by: ZhaoyangWang <zhaoyangw@nvidia.com>

…rward The one-model worker forward signature (Eagle3 / MTP-Eagle) takes ``resource_manager``, and modeling_speculative.py forwards it unconditionally to ``self.spec_worker(...)``. On non-last PP ranks, ``forward`` is replaced by ``skip_forward`` via modeling_utils.skip_forward(), which raised ``TypeError: SpecWorkerBase.skip_forward() got an unexpected keyword argument 'resource_manager'`` and silently terminated the executor worker. Before the merge, MTP-Eagle used MTPWorker.skip_forward which already accepted ``resource_manager``; the merged path now inherits SpecWorkerBase.skip_forward, which did not. Add the parameter (unused) to restore PP compatibility. Validated on H200 x4 with TestDeepSeekV3Lite::test_bfloat16_4gpus[tp2pp2-mtp_nextn=2-attention_dp=True-cuda_graph=True-overlap_scheduler=True-torch_compile=False] (was: silent crash during warmup; now: PASSED, GSM8K 63.72). Signed-off-by: ZhaoyangWang <zhaoyangw@nvidia.com>
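The failure mode described in this commit message can be reproduced in miniature. The classes and the one-argument signatures below are hypothetical simplifications (the real skip_forward takes more parameters), and giving `resource_manager` a default of None is an assumption; the sketch only shows why forwarding an unexpected keyword raises TypeError and how accepting it fixes the call.

```python
class WorkerBefore:
    """Hypothetical base-class shape before the fix: no resource_manager."""

    def skip_forward(self, input_ids):
        return None


class WorkerAfter:
    """Hypothetical shape after the fix: the parameter is accepted but unused."""

    def skip_forward(self, input_ids, resource_manager=None):
        return None


def call_spec_worker(worker, input_ids, resource_manager):
    # Mirrors a caller that forwards resource_manager unconditionally.
    return worker.skip_forward(input_ids, resource_manager=resource_manager)


def try_call(worker) -> str:
    try:
        call_spec_worker(worker, input_ids=[1, 2, 3], resource_manager=object())
        return "ok"
    except TypeError as exc:
        return f"TypeError: {exc}"


before_result = try_call(WorkerBefore())
after_result = try_call(WorkerAfter())
```

When a replacement method such as skip_forward stands in for forward, its signature has to accept everything the caller passes to forward, even arguments it ignores.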

Pre-commit yapf reformatted the position_ids update in the unified linear draft loop. No functional change. Signed-off-by: ZhaoyangWang <zhaoyangw@nvidia.com>

zhaoyangwang-nvidia · 2026-05-25T10:55:52Z

/bot run

tensorrt-cicd · 2026-05-25T11:01:41Z

PR_Github #50210 [ run ] triggered by Bot. Commit: 9733b00 Link to invocation

tensorrt-cicd · 2026-05-25T12:37:12Z

PR_Github #50210 [ run ] completed with state SUCCESS. Commit: 9733b00
/LLM/main/L0_MergeRequest_PR pipeline #39747 completed with status: 'FAILURE'

zhaoyangwang-nvidia · 2026-05-25T13:57:51Z

/bot run

tensorrt-cicd · 2026-05-25T14:03:27Z

PR_Github #50222 [ run ] triggered by Bot. Commit: 9733b00 Link to invocation

tensorrt-cicd · 2026-05-25T15:26:52Z

PR_Github #50222 [ run ] completed with state SUCCESS. Commit: 9733b00
/LLM/main/L0_MergeRequest_PR pipeline #39759 completed with status: 'FAILURE'

github-actions Bot assigned zhaoyangwang-nvidia Mar 19, 2026

zhaoyangwang-nvidia force-pushed the merge-eagle-mtp branch from 98b1417 to 9f923d9 Compare March 19, 2026 09:52

zhaoyangwang-nvidia force-pushed the merge-eagle-mtp branch from 9f923d9 to 01b1fe7 Compare March 20, 2026 10:15

zhaoyangwang-nvidia force-pushed the merge-eagle-mtp branch from 01b1fe7 to 5a1ef63 Compare March 22, 2026 08:23

zhaoyangwang-nvidia force-pushed the merge-eagle-mtp branch 2 times, most recently from 7e82c9c to f4f8ac6 Compare March 24, 2026 09:06

zhaoyangwang-nvidia changed the title ~~[TRTLLM-11508][refactor] Merge eagle mtp~~ [TRTLLM-11508][refactor] Merge Eagle3 and MTP-eagle one-model workers May 11, 2026

zhaoyangwang-nvidia force-pushed the merge-eagle-mtp branch 2 times, most recently from 8ed37a4 to 67e4428 Compare May 12, 2026 09:57

zhaoyangwang-nvidia marked this pull request as ready for review May 12, 2026 09:58

zhaoyangwang-nvidia requested review from a team as code owners May 12, 2026 09:58

zhaoyangwang-nvidia force-pushed the merge-eagle-mtp branch from d2f8aee to d01e724 Compare May 15, 2026 03:15

ziyixiong-nv reviewed May 22, 2026

zhaoyangwang-nvidia force-pushed the merge-eagle-mtp branch from d01e724 to eb4643f Compare May 25, 2026 10:41

zhaoyangwang-nvidia added 7 commits May 25, 2026 18:55

clean code

d8c4cbe

Signed-off-by: ZhaoyangWang <zhaoyangw@nvidia.com>

fix some issue

59daa09

Signed-off-by: ZhaoyangWang <zhaoyangw@nvidia.com>

[TRTLLM-11508][chore] apply yapf formatting on Eagle3OneModelWorker

9733b00

Pre-commit yapf reformatted the position_ids update in the unified linear draft loop. No functional change. Signed-off-by: ZhaoyangWang <zhaoyangw@nvidia.com>

zhaoyangwang-nvidia force-pushed the merge-eagle-mtp branch from 79bbe75 to 9733b00 Compare May 25, 2026 10:55

ziyixiong-nv approved these changes May 25, 2026
