[Spec Decoding] Add DFlash model and proposer #1868
aaronzhfeng wants to merge 1 commit into vllm-project:main from
Conversation
This is a large PR. Can we break it down into several small PRs to make review easier?
Signed-off-by: aaronzhfeng <fzx333578@gmail.com>
Sorry about the large PR. The model, proposer, and attention kernel are tightly coupled (the proposer calls the model's forward pass, and the model uses the attention kernel), so splitting them further would leave each PR non-functional on its own. All files here are new additions with no changes to existing code, which should make it easier to review. I broke the original PR down into 3:
PRs 2 and 3 coming shortly.
kyuyeunk left a comment
Hi @aaronzhfeng! Thank you for the contribution. A couple of questions:
- For those who aren't familiar with DFlash (like myself), can you give a brief overview and maybe a link where we can find more info?
- Is my understanding correct that this feature is not available in vLLM's PyTorch model implementation? If so, is there a way for a backend that uses vLLM's model implementation to leverage this spec decoding?
- Can you share a sample command people can use to try out this feature while going through the review process?
Thanks for taking a look!

DFlash overview: DFlash is a block-diffusion speculative decoding method that predicts multiple tokens in parallel using discrete diffusion, instead of generating them one at a time autoregressively. Given a context, the draft model takes a block of masked/noise positions and denoises them in a single forward pass to produce K candidate tokens simultaneously. This makes drafting O(1) in block size rather than O(K). Paper: "DFlash: Block Diffusion for Flash Speculative Decoding" (Chen et al., arXiv:2602.06036). The reference GPU implementation is at https://github.com/z-lab/dflash.

PyTorch/vLLM availability: Right now there is no DFlash support in vLLM's PyTorch backend. The DFlash authors have confirmed vLLM integration is still in progress on their end (see z-lab/dflash#6). SGLang has DFlash support via sgl-project/sglang#16818, but this PR would be the first DFlash integration in the vLLM ecosystem. It targets the JAX/TPU backend specifically, since the draft model uses non-causal attention, which required a different attention path from the standard causal pipeline. A PyTorch port is feasible but not in scope for this PR.

Sample command: The unit tests in this PR can be run without a full serving setup:

```
pytest tests/models/jax/test_qwen3_dflash_attention.py
pytest tests/models/jax/test_qwen3_dflash.py
pytest tests/spec_decode/test_dflash.py
```

End-to-end serving requires the pipeline integration in PR #1869 (already open). Once both are merged, with Qwen3-4B on a TPU v5p-8:

```
python -m tpu_inference.entrypoint \
  --model Qwen/Qwen3-4B \
  --speculative_config '{"model": "z-lab/Qwen3-4B-DFlash-b16", "num_speculative_tokens": 15, "method": "dflash", "draft_tensor_parallel_size": 1}'
```
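To make the O(1)-vs-O(K) drafting claim concrete, here is a toy numpy sketch (a hypothetical stand-in model, not this PR's code) contrasting autoregressive drafting, which needs K forward calls, with block-diffusion drafting, which denoises all K masked slots in a single call:

```python
import numpy as np

rng = np.random.default_rng(0)
D, VOCAB, K = 8, 50, 4           # hidden size, vocab size, speculative block size
W = rng.normal(size=(D, VOCAB))  # toy "draft model": a single linear head

def forward(hidden):             # one model call over all positions
    return hidden @ W            # [T, VOCAB] logits

ctx = rng.normal(size=(3, D))    # embeddings of the accepted context

# Autoregressive drafting: K forward calls, one draft token per call.
ar_calls, h = 0, ctx
for _ in range(K):
    logits = forward(h)
    ar_calls += 1
    tok = int(np.argmax(logits[-1]))             # greedy next token
    h = np.vstack([h, rng.normal(size=(1, D))])  # stand-in for embed(tok)

# Block-diffusion drafting: append K masked slots, denoise them in ONE call.
masked_block = np.zeros((K, D))                   # embeddings of K [MASK] slots
logits = forward(np.vstack([ctx, masked_block]))  # single forward pass
draft_tokens = np.argmax(logits[-K:], axis=-1)    # K candidates at once
dd_calls = 1

print(ar_calls, dd_calls, len(draft_tokens))  # prints "4 1 4"
```

The real draft model is of course a transformer rather than a linear head, but the call-count contrast is the point: drafting cost no longer scales with the number of speculative tokens.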
Can you elaborate on what feature is missing from the existing attention implementation such that it requires its own separate code? If it's due to bi-directional attention, we already have an implementation for that.
```python
@functools.partial(jax.jit, static_argnames=("max_query_len", ))
def dflash_concat_attention(
```
In general, I think this function is lacking a lot of comments explaining what each line does.
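As a reading aid for reviewers, here is a heavily commented single-head numpy sketch of what a "concat attention" of this shape typically computes. The structure follows the description of `dflash_concat_attention` (draft queries attending to cached context plus the draft block itself), but this simplification is hypothetical and is not the jitted TPU kernel under review:

```python
import numpy as np

def concat_attention_sketch(q, ctx_k, ctx_v, blk_k, blk_v):
    """Single-head toy sketch (hypothetical; not the PR's TPU kernel).

    q:            [K, D] queries for the K draft (masked) positions
    ctx_k, ctx_v: [T, D] cached keys/values for the accepted context
    blk_k, blk_v: [K, D] keys/values for the draft block itself
    """
    # Every draft query may see the whole context AND the whole block
    # (non-causal within the block), so K/V are simply concatenated.
    k = np.concatenate([ctx_k, blk_k])       # [T+K, D]
    v = np.concatenate([ctx_v, blk_v])       # [T+K, D]
    # Scaled dot-product scores; note that NO causal mask is applied.
    scores = q @ k.T / np.sqrt(q.shape[-1])  # [K, T+K]
    # Row-wise softmax over all T+K visible positions.
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v                             # [K, D] attended output

rng = np.random.default_rng(0)
T, K, D = 5, 3, 4
out = concat_attention_sketch(
    rng.normal(size=(K, D)),
    rng.normal(size=(T, D)), rng.normal(size=(T, D)),
    rng.normal(size=(K, D)), rng.normal(size=(K, D)))
print(out.shape)  # prints "(3, 4)"
```

Comments of roughly this density in the kernel itself would address the review concern.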
@Lumosis can you help take a look at the spec decoding part?
Description
Add DFlash draft model and proposer for block-diffusion speculative decoding on JAX/TPU. DFlash predicts multiple tokens in parallel using discrete diffusion, unlike Eagle3's autoregressive drafting. This follows the same proposer pattern as Eagle3.
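The Eagle3-style proposer flow mentioned above can be sketched as follows. The method names mirror this PR's `DFlashProposer` (`prepare_inputs`, `propose`), but the bodies are hypothetical stand-ins for illustration, not the real implementation:

```python
import numpy as np

class DFlashProposerSketch:
    """Illustrative skeleton of the proposer flow (hypothetical)."""

    def __init__(self, draft_forward, num_speculative_tokens):
        self.draft_forward = draft_forward  # denoises a masked block
        self.k = num_speculative_tokens

    def prepare_inputs(self, context_ids):
        # Append K mask placeholders (-1 here) after the accepted context.
        return np.concatenate([context_ids, -np.ones(self.k, dtype=int)])

    def propose(self, context_ids):
        ids = self.prepare_inputs(context_ids)
        logits = self.draft_forward(ids)    # ONE pass, [len(ids), VOCAB]
        # Greedy-sample all K draft tokens from the denoised block.
        return np.argmax(logits[-self.k:], axis=-1)

# Toy draft model: vocab of 10, logits favor (position mod 10).
def toy_forward(ids):
    T, V = len(ids), 10
    logits = np.zeros((T, V))
    logits[np.arange(T), np.arange(T) % V] = 1.0
    return logits

proposer = DFlashProposerSketch(toy_forward, num_speculative_tokens=3)
print(proposer.propose(np.array([1, 2, 3])))  # prints "[3 4 5]"
```

The target model then verifies the K drafts in one pass, exactly as with Eagle3; only the drafting step differs.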
This is PR 1 of 3 for DFlash support:
New files:

- tpu_inference/models/jax/dflash.py -- DFlash draft model (DFlashForCausalLM)
- tpu_inference/models/jax/qwen3_dflash.py -- Qwen3-specific DFlash variant with attention
- tpu_inference/layers/common/dflash_attention_interface.py -- dflash_concat_attention kernel
- tpu_inference/spec_decode/jax/dflash.py -- DFlashProposer (prepare_inputs, propose, sampling)
- tests/models/jax/test_qwen3_dflash_attention.py -- DFlash attention unit tests
- tests/models/jax/test_qwen3_dflash.py -- target layer ID selection tests
- tests/spec_decode/test_dflash.py -- proposer sampling tests

Tests

- tests/models/jax/test_qwen3_dflash_attention.py
- tests/models/jax/test_qwen3_dflash.py
- tests/spec_decode/test_dflash.py

Checklist