
[FEAT][SpecDecode] Add DP attention support for DFLASH speculative decoding#29506

Open
EanWang211123 wants to merge 2 commits into sgl-project:main from EanWang211123:feat/dflash/dp-attn

@EanWang211123 EanWang211123 commented Jun 27, 2026

Contributor

Motivation

DFLASH speculative decoding previously rejected --enable-dp-attention at startup. That guard blocked deployments that combine DFLASH with data-parallel attention (e.g. --tensor-parallel-size 4 --dp-size 2 --enable-dp-attention), a common setup for large MoE models like GLM-5.

EAGLE3 already supports DP attention by running each draft worker inside the attention TP group (attn_tp_group). DFLASH should follow the same pattern, but it has additional constraints: it materializes target hidden states directly into the draft KV cache (instead of re-running a draft extend forward), and it performs draft greedy sampling over the target lm_head. These paths need explicit alignment with DP/EP padding, CUDA graph capture modes, and full-TP target verify collectives.

Modifications

speculative_hook.py

  • Remove the guard that blocked DFLASH + enable_dp_attention.
  • Auto-enable enable_dp_lm_head when DFLASH runs with DP attention, so draft greedy sampling's vocab-parallel all_gather stays within the attention TP group (matching lm_head sharding). Without this, a global-TP all_gather mixes tokens across DP groups and deadlocks when a peer DP group is IDLE.
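
The auto-enable rule above can be sketched as a small normalization step. This is a minimal illustration, not the code in speculative_hook.py: ServerArgsSketch and apply_dflash_dp_defaults are hypothetical names, and only the three fields mentioned in this description are modeled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ServerArgsSketch:
    # Hypothetical stand-in for the real server args; only the fields
    # named in this PR description are modeled here.
    speculative_algorithm: Optional[str] = None
    enable_dp_attention: bool = False
    enable_dp_lm_head: bool = False


def apply_dflash_dp_defaults(args: ServerArgsSketch) -> ServerArgsSketch:
    """Turn on enable_dp_lm_head when DFLASH runs with DP attention."""
    if args.speculative_algorithm == "DFLASH" and args.enable_dp_attention:
        # Keeps the vocab-parallel all_gather inside the attention TP group.
        args.enable_dp_lm_head = True
    return args


args = apply_dflash_dp_defaults(
    ServerArgsSketch(speculative_algorithm="DFLASH", enable_dp_attention=True)
)
print(args.enable_dp_lm_head)
```

Other combinations (DFLASH without DP attention, or a different speculative algorithm) are left untouched in this sketch.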

dflash_worker_v2.py

Draft worker initialization (mirrors EAGLE3 + dp_attention)

  • Disable dp_attention on the draft server args (draft is dense; keeps KV row count aligned with out_cache_loc).
  • Create the draft worker inside draft_tp_context(get_attention_tp_group()) so KV head partitioning matches token_to_kv_pool.row_dim (both use attn_tp_size).
  • Wrap draft init_attention_backends, init_cuda_graphs, and runtime draft forward + greedy sampling in draft_tp_context.
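
As a rough mental model of what wrapping code in draft_tp_context does, the toy below swaps a process-wide "active TP group" for the duration of a with block and restores it afterwards. Everything here is an assumption made for illustration (the module-level variable, the string group names, the _sketch suffix); the real draft_tp_context may be implemented differently.

```python
from contextlib import contextmanager

# Toy process-wide state standing in for "the TP group collectives use".
_ACTIVE_TP_GROUP = "global_tp"


def get_active_tp_group() -> str:
    return _ACTIVE_TP_GROUP


@contextmanager
def draft_tp_context_sketch(tp_group: str):
    """Temporarily route collectives through tp_group (hypothetical model)."""
    global _ACTIVE_TP_GROUP
    previous = _ACTIVE_TP_GROUP
    _ACTIVE_TP_GROUP = tp_group
    try:
        yield
    finally:
        # Always restore, even if draft init or the draft forward raises.
        _ACTIVE_TP_GROUP = previous


with draft_tp_context_sketch("attn_tp"):
    # Draft worker creation, backend init, graph capture, and draft
    # forward + sampling would run here, seeing the attention TP group.
    inside = get_active_tp_group()

print(inside, get_active_tp_group())
```

The try/finally matters in this model: a failure inside the block must not leave later target-model collectives pointed at the attention TP group.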

Prefill / extend path

  • Return early on DP ranks that have no extend work. is_extend_in_batch=True is broadcast globally, so a rank that is locally IDLE or DECODE would otherwise enter the extend path without extend_lens / prefix_lens.
  • Trim trailing EP padding rows from target hidden_states before writing into draft KV when moe_ep_size > 1 (fixes cache_loc vs target_hidden length mismatch on non-aligned token counts).
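
The two prefill fixes above can be illustrated with plain Python lists in place of tensors. This is a sketch under stated assumptions: ForwardModeSketch, rank_has_extend_work, and trim_trailing_padding are hypothetical names, and the padding is assumed to sit only at the end of the hidden-state rows, as the description says.

```python
from enum import Enum, auto
from typing import List, Sequence


class ForwardModeSketch(Enum):
    # Hypothetical stand-in for the per-rank forward mode.
    IDLE = auto()
    DECODE = auto()
    EXTEND = auto()


def rank_has_extend_work(local_mode: ForwardModeSketch) -> bool:
    """A globally broadcast 'extend in batch' flag does not imply local extend work."""
    return local_mode is ForwardModeSketch.EXTEND


def trim_trailing_padding(
    hidden_rows: Sequence[List[float]], num_cache_slots: int
) -> List[List[float]]:
    """Drop trailing padding rows so rows line up one-to-one with cache slots."""
    if len(hidden_rows) < num_cache_slots:
        raise ValueError("fewer hidden-state rows than cache slots")
    return list(hidden_rows[:num_cache_slots])


# Five real tokens, padded to eight rows for expert-parallel alignment.
cache_loc = [10, 11, 12, 13, 14]
padded = [[float(i)] for i in range(5)] + [[0.0]] * 3
trimmed = trim_trailing_padding(padded, len(cache_loc))
print(len(padded), len(trimmed))
```

With tensors the same idea is a leading-dimension slice; the length check is there so a genuine shortfall is reported instead of silently writing a short block into the draft KV cache.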

Decode path

  • IDLE DP ranks participate in target verify (a full-TP collective) with capture_hidden_mode=FULL, matching the active ranks. If an IDLE rank used NULL instead, the mode mismatch would trigger a CUDA graph recapture whose internal barrier the active ranks never enter, causing a deadlock.
  • Move _greedy_sample_from_vocab_parallel_head inside draft_tp_context so its all_gather uses the attention TP group.
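
The capture-mode rule for target verify can be modeled without any GPU code. In the sketch below, CaptureHiddenModeSketch, verify_capture_mode, and ranks_needing_recapture are hypothetical names; the only behavior taken from the description is that a rank requesting a mode different from the captured one would recapture on its own.

```python
from enum import Enum, auto
from typing import List, Sequence


class CaptureHiddenModeSketch(Enum):
    # Hypothetical stand-in with the two modes named in the description.
    NULL = auto()
    FULL = auto()


def verify_capture_mode(rank_is_idle: bool) -> CaptureHiddenModeSketch:
    """Idle and active ranks request the same mode during target verify."""
    # rank_is_idle is deliberately ignored: that is the point of the fix.
    return CaptureHiddenModeSketch.FULL


def ranks_needing_recapture(
    captured: CaptureHiddenModeSketch,
    requested: Sequence[CaptureHiddenModeSketch],
) -> List[int]:
    """Ranks whose requested mode differs from the captured graph's mode."""
    return [rank for rank, mode in enumerate(requested) if mode is not captured]


FULL = CaptureHiddenModeSketch.FULL
NULL = CaptureHiddenModeSketch.NULL

aligned = [verify_capture_mode(idle) for idle in (False, True)]
print(ranks_needing_recapture(FULL, aligned))
print(ranks_needing_recapture(FULL, [FULL, NULL]))
```

A non-empty result is the hazardous case: only that subset of ranks would enter the recapture barrier while the others proceed into the verify collective.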

Accuracy Tests

Speed Tests and Profiling

Checklist

Review and Merge Process

  1. Ping Merge Oncalls to start the process. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • Common commands include /tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci
  4. After green CI and required approvals, ask Merge Oncalls or people with Write permission to merge the PR.

CI States

Latest PR Test (Base): ❌ Run #28285509303
Latest PR Test (Extra): ❌ Run #28285509277

Signed-off-by: EanWang211123 <wangyiheng@sangfor.com.cn>

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request adds support for DP attention in DFLASH speculative decoding. It automatically enables enable_dp_lm_head when DP attention is enabled, runs draft initialization and operations within the attention TP group context, and handles IDLE/DECODE ranks during prefill and verification to prevent deadlocks. Additionally, it handles right-padded hidden states when using MoE Expert Parallelism. The reviewer identified a critical runtime bug where self.draft_tp_context is assigned to empty_context when DP attention is disabled, which will raise a TypeError because empty_context does not accept the positional argument passed to it in multiple places. The reviewer suggested using a lambda wrapper to safely ignore the argument.

Comment on lines +179 to +181
    self.draft_tp_context = (
        draft_tp_context if server_args.enable_dp_attention else empty_context
    )

Contributor

high

When enable_dp_attention is False, self.draft_tp_context is assigned to empty_context. However, empty_context is a 0-argument context manager, whereas self.draft_tp_context is called with a positional argument (e.g., self.draft_tp_context(self.draft_model_runner.tp_group)) in multiple places (lines 314, 328, 1562). This will raise a TypeError at runtime and crash the server on startup when DP attention is disabled. Using a lambda wrapper like lambda _: empty_context() ensures that the single positional argument is safely ignored.

Suggested change
-    self.draft_tp_context = (
-        draft_tp_context if server_args.enable_dp_attention else empty_context
-    )
+    self.draft_tp_context = (
+        draft_tp_context if server_args.enable_dp_attention else lambda _: empty_context()
+    )
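
The failure mode and the suggested wrapper can be reproduced in isolation. The sketch below assumes, as the review comment states, that empty_context takes no arguments; empty_context_stub and draft_tp_context_stub are stand-ins defined here, not the project's helpers, so the actual signatures should be checked before relying on this.

```python
from contextlib import contextmanager


@contextmanager
def empty_context_stub():
    # Assumed zero-argument no-op context manager, per the review comment.
    yield


@contextmanager
def draft_tp_context_stub(tp_group):
    # Stand-in for a context manager that takes the TP group positionally.
    yield


def pick_draft_tp_context(enable_dp_attention: bool):
    """Return a callable that always accepts exactly one positional argument."""
    if enable_dp_attention:
        return draft_tp_context_stub
    # The wrapper discards the TP group so every call site can pass it.
    return lambda _tp_group: empty_context_stub()


# Without the wrapper, passing the group to the zero-arg manager fails.
try:
    empty_context_stub("attn_tp")
    direct_call_failed = False
except TypeError:
    direct_call_failed = True

# With the wrapper, both configurations share one calling convention.
for enabled in (True, False):
    with pick_draft_tp_context(enabled)("attn_tp"):
        pass

print(direct_call_failed)
```

An alternative with the same effect would be to give the no-op manager a matching signature, which avoids the lambda but requires changing the helper itself.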

@EanWang211123 EanWang211123 marked this pull request as ready for review June 27, 2026 09:42
@EanWang211123 EanWang211123 marked this pull request as draft June 27, 2026 09:42
@EanWang211123 EanWang211123 marked this pull request as ready for review June 27, 2026 09:49