[FEAT][SpecDecode] Add DP attention support for DFLASH speculative decoding by EanWang211123 · Pull Request #29506 · sgl-project/sglang

EanWang211123 · 2026-06-27T09:40:24Z

Motivation

DFLASH speculative decoding previously rejected --enable-dp-attention at startup. This blocks deployments that combine DFLASH with data-parallel attention (e.g. --tensor-parallel-size 4 --dp-size 2 --enable-dp-attention), which is a common setup for large MoE models like GLM-5.

EAGLE3 already supports DP attention by running each draft worker inside the attention TP group (attn_tp_group). DFLASH should follow the same pattern, but it has additional constraints: it materializes target hidden states directly into the draft KV cache (instead of re-running a draft extend forward), and it performs draft greedy sampling over the target lm_head. These paths need explicit alignment with DP/EP padding, CUDA graph capture modes, and full-TP target verify collectives.

Modifications

`speculative_hook.py`

Remove the guard that blocked DFLASH + enable_dp_attention.
Auto-enable enable_dp_lm_head when DFLASH runs with DP attention, so draft greedy sampling's vocab-parallel all_gather stays within the attention TP group (matching lm_head sharding). Without this, a global-TP all_gather mixes tokens across DP groups and deadlocks when a peer DP group is IDLE.
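The two hook changes above can be sketched roughly as follows. This is a minimal illustration, not the actual contents of `speculative_hook.py`: the `ServerArgs` dataclass is a stand-in, the `speculative_algorithm` field and the function name `apply_dflash_dp_attention_defaults` are assumptions, and only the `enable_dp_attention` / `enable_dp_lm_head` names come from the description.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ServerArgs:
    """Hypothetical stand-in; only the two enable_* names come from the PR text."""

    speculative_algorithm: Optional[str] = None  # assumed field name
    enable_dp_attention: bool = False
    enable_dp_lm_head: bool = False


def apply_dflash_dp_attention_defaults(args: ServerArgs) -> ServerArgs:
    """Illustrative hook: DFLASH + DP attention is no longer rejected, and the
    DP lm_head is switched on so sampling collectives stay inside one DP group."""
    if args.speculative_algorithm == "DFLASH" and args.enable_dp_attention:
        args.enable_dp_lm_head = True
    return args
```

Other configurations are left untouched in this sketch, so a non-DFLASH run or a run without DP attention keeps whatever `enable_dp_lm_head` value the user passed.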

`dflash_worker_v2.py`

Draft worker initialization (mirrors EAGLE3 + dp_attention)

Disable dp_attention on the draft server args (draft is dense; keeps KV row count aligned with out_cache_loc).
Create the draft worker inside draft_tp_context(get_attention_tp_group()) so KV head partitioning matches token_to_kv_pool.row_dim (both use attn_tp_size).
Wrap draft init_attention_backends, init_cuda_graphs, and runtime draft forward + greedy sampling in draft_tp_context.
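The idea of running draft code "inside" a TP context can be illustrated with a small context manager that temporarily swaps which group is considered current and restores it on exit. This is only a sketch of the pattern: `_STATE`, `get_current_tp_group`, and `draft_tp_context_sketch` are hypothetical names, and strings stand in for real process-group handles. How the real `draft_tp_context` is implemented is not shown in this PR description.

```python
from contextlib import contextmanager

# Strings stand in for process-group handles (assumption for illustration).
_STATE = {"tp_group": "global_tp"}


def get_current_tp_group() -> str:
    """Return the group that collectives issued right now would use."""
    return _STATE["tp_group"]


@contextmanager
def draft_tp_context_sketch(group: str):
    """Temporarily make `group` the current TP group, restoring on exit."""
    previous = _STATE["tp_group"]
    _STATE["tp_group"] = group
    try:
        yield
    finally:
        # Restore even if the wrapped draft code raises.
        _STATE["tp_group"] = previous


with draft_tp_context_sketch("attn_tp"):
    # Draft init, CUDA graph capture, and draft forward would run here.
    inside = get_current_tp_group()
outside = get_current_tp_group()
```

Restoring in a `finally` block matters in this pattern: if draft initialization fails, the target model's collectives should not be left pointing at the attention TP group.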

Prefill / extend path

Return early on DP ranks that are not extending: is_extend_in_batch=True is broadcast globally, so a rank whose local batch is IDLE/DECODE would otherwise enter the extend path without extend_lens / prefix_lens.
Trim trailing EP padding rows from target hidden_states before writing into draft KV when moe_ep_size > 1 (fixes cache_loc vs target_hidden length mismatch on non-aligned token counts).
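A rough sketch of the two prefill fixes, under simplifying assumptions: plain Python lists stand in for tensors, and the helper names `trim_ep_padding` and `rows_to_write` are hypothetical. The sketch assumes, as the description says, that EP padding is appended as trailing rows, so the first `len(cache_loc)` rows are the real tokens.

```python
from typing import List, Sequence, Tuple

Row = Sequence[float]


def trim_ep_padding(hidden_rows: Sequence[Row], num_tokens: int) -> List[Row]:
    """Drop trailing padding rows so one hidden row remains per real token."""
    if len(hidden_rows) < num_tokens:
        raise ValueError("fewer hidden rows than cache locations")
    return list(hidden_rows[:num_tokens])


def rows_to_write(
    local_is_extend: bool,
    hidden_rows: Sequence[Row],
    cache_loc: Sequence[int],
) -> List[Tuple[int, Row]]:
    """Pair each KV cache slot with the hidden row to materialize into it."""
    if not local_is_extend:
        # Local rank is IDLE/DECODE even though another DP rank is extending:
        # there is nothing to write and no extend metadata to read.
        return []
    trimmed = trim_ep_padding(hidden_rows, len(cache_loc))
    return list(zip(cache_loc, trimmed))


# 5 real tokens whose hidden states were padded to 8 rows.
padded = [[float(i)] for i in range(5)] + [[0.0]] * 3
pairs = rows_to_write(True, padded, [10, 11, 12, 13, 14])
```

In this example `pairs` has five entries, one per cache location, and the three padding rows are never written.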

Decode path

IDLE DP ranks participate in target verify (a full-TP collective) with capture_hidden_mode=FULL to match the active ranks. If an IDLE rank used NULL instead, the mode mismatch would trigger a CUDA graph recapture whose internal barrier the active ranks never enter, causing a deadlock.
Move _greedy_sample_from_vocab_parallel_head inside draft_tp_context so its all_gather uses the attention TP group.
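To see why the gather group matters for greedy sampling, here is a single-process sketch of argmax over a vocab-parallel head. It is not the PR's `_greedy_sample_from_vocab_parallel_head`; the function names are hypothetical, lists stand in for per-rank logit shards, and a plain Python list plays the role of the `all_gather` result. Each shard reports its local best logit and the corresponding global token id, and the winner is picked among those candidates.

```python
from typing import List, Tuple


def local_best(logits: List[float], vocab_offset: int) -> Tuple[float, int]:
    """Best logit on one shard, with its token id in the full vocabulary."""
    idx = max(range(len(logits)), key=logits.__getitem__)
    return logits[idx], vocab_offset + idx


def greedy_sample_vocab_parallel(shards: List[List[float]]) -> int:
    """Greedy token for one position, given the lm_head shards of one group."""
    candidates = []
    offset = 0
    for logits in shards:
        candidates.append(local_best(logits, offset))
        offset += len(logits)
    # Stand-in for all_gather: every rank would see all candidates here.
    # Ties are broken toward the lowest token id.
    _, best_token = max(candidates, key=lambda c: (c[0], -c[1]))
    return best_token


# Two attention-TP ranks, each holding half of a 4-token vocabulary.
token = greedy_sample_vocab_parallel([[0.1, 2.0], [3.5, -1.0]])
```

The candidates are only comparable when they come from ranks that hold shards of the same lm_head for the same tokens, which is why the gather has to stay within the attention TP group rather than span DP groups.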

Accuracy Tests

Speed Tests and Profiling

Checklist

Format your code according to the Format code with pre-commit.
Add unit tests according to the Run and add unit tests.
Update documentation according to Write documentations.
Provide accuracy and speed benchmark results according to Test the accuracy and Benchmark the speed.
Follow the SGLang code style guidance.

Review and Merge Process

Ping Merge Oncalls to start the process. See the PR Merge Process.
Get approvals from CODEOWNERS and other reviewers.
Trigger CI tests with comments or contact authorized users to do so.
- Common commands include /tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci
After green CI and required approvals, ask Merge Oncalls or people with Write permission to merge the PR.

CI States

Latest PR Test (Base): ❌ Run #28285509303
Latest PR Test (Extra): ❌ Run #28285509277

Signed-off-by: EanWang211123 <wangyiheng@sangfor.com.cn>

gemini-code-assist

Code Review

This pull request adds support for DP attention in DFLASH speculative decoding. It automatically enables enable_dp_lm_head when DP attention is enabled, runs draft initialization and operations within the attention TP group context, and handles IDLE/DECODE ranks during prefill and verification to prevent deadlocks. Additionally, it handles right-padded hidden states when using MoE Expert Parallelism. The reviewer identified a critical runtime bug where self.draft_tp_context is assigned to empty_context when DP attention is disabled, which will raise a TypeError because empty_context does not accept the positional argument passed to it in multiple places. The reviewer suggested using a lambda wrapper to safely ignore the argument.


gemini-code-assist · 2026-06-27T09:41:37Z

+        self.draft_tp_context = (
+            draft_tp_context if server_args.enable_dp_attention else empty_context
+        )


When enable_dp_attention is False, self.draft_tp_context is assigned to empty_context. However, empty_context is a 0-argument context manager, whereas self.draft_tp_context is called with a positional argument (e.g., self.draft_tp_context(self.draft_model_runner.tp_group)) in multiple places (lines 314, 328, 1562). This will raise a TypeError at runtime and crash the server on startup when DP attention is disabled. Using a lambda wrapper like lambda _: empty_context() ensures that the single positional argument is safely ignored.

Suggested change

-        self.draft_tp_context = (
-            draft_tp_context if server_args.enable_dp_attention else empty_context
-        )
+        self.draft_tp_context = (
+            draft_tp_context if server_args.enable_dp_attention else lambda _: empty_context()
+        )
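The failure mode described in the review can be reproduced in isolation. The sketch below assumes, as the review states, that `empty_context` takes no parameters; the real helper's signature is not shown in this thread, so both context managers here are hypothetical stand-ins, and `pick_context` is an invented name.

```python
from contextlib import contextmanager


@contextmanager
def empty_context():
    """Stand-in with zero parameters, per the review's description."""
    yield


@contextmanager
def draft_tp_context(tp_group):
    """Stand-in that accepts the group as one positional argument."""
    yield


def pick_context(enable_dp_attention: bool, safe: bool):
    """Choose the callable that call sites invoke as ctx(tp_group)."""
    if enable_dp_attention:
        return draft_tp_context
    # The lambda swallows the positional argument before calling empty_context.
    return (lambda _group: empty_context()) if safe else empty_context


try:
    with pick_context(False, safe=False)("tp_group"):
        pass
    unsafe_result = "ok"
except TypeError:
    # A zero-parameter context manager rejects the extra argument.
    unsafe_result = "TypeError"

with pick_context(False, safe=True)("tp_group"):
    safe_result = "ok"
```

If the real `empty_context` already accepts arbitrary arguments, the unwrapped assignment would work as written and the lambda would be unnecessary; checking the helper's signature is the quickest way to settle this.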

[feat] add dp-attn support for dflash

a5c32f0

Signed-off-by: EanWang211123 <wangyiheng@sangfor.com.cn>

github-actions Bot added the speculative-decoding label Jun 27, 2026

gemini-code-assist Bot reviewed Jun 27, 2026


EanWang211123 marked this pull request as ready for review June 27, 2026 09:42

EanWang211123 requested review from Qiaolin-Yu, Ying1123, hnyls2002 and merrymercy as code owners June 27, 2026 09:42

Merge branch 'main' into feat/dflash/dp-attn

4ef98dc

EanWang211123 marked this pull request as draft June 27, 2026 09:42

EanWang211123 marked this pull request as ready for review June 27, 2026 09:49

EanWang211123 wants to merge 2 commits into sgl-project:main from EanWang211123:feat/dflash/dp-attn
