[Spec Decoding] Integrate DFlash into speculative decoding pipeline by aaronzhfeng · Pull Request #1869 · vllm-project/tpu-inference

aaronzhfeng · 2026-03-05T21:55:55Z

Description

Wire DFlash block-diffusion speculative decoding into the existing TPU inference pipeline. The DFlash model and proposer were added in #1868; this PR connects them to the runner, KV cache manager, and speculative decoding manager so DFlash can be used end-to-end.

No changes to existing Eagle3 or ngram code paths: DFlash gets its own propose_dflash_draft_token_ids method and a separate elif "dflash" dispatch branch.
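A minimal sketch of the dispatch shape described above, not the code in this PR. The class name `SpecDecodeDispatcher`, the `propose` entry point, the Eagle3/ngram method names, and the placeholder return values are hypothetical; only `propose_dflash_draft_token_ids` and the `"dflash"` branch come from the description.

```python
from typing import List, Tuple


class SpecDecodeDispatcher:
    """Hypothetical stand-in for the speculative decoding manager."""

    def __init__(self, method: str) -> None:
        self.method = method

    # Existing paths (names assumed for illustration); left untouched.
    def propose_ngram_draft_token_ids(self, token_ids: List[int]) -> Tuple[str, List[int]]:
        return ("ngram", token_ids)

    def propose_eagle3_draft_token_ids(self, token_ids: List[int]) -> Tuple[str, List[int]]:
        return ("eagle3", token_ids)

    # New path: DFlash has its own method instead of sharing Eagle3's.
    def propose_dflash_draft_token_ids(self, token_ids: List[int]) -> Tuple[str, List[int]]:
        return ("dflash", token_ids)

    def propose(self, token_ids: List[int]) -> Tuple[str, List[int]]:
        if self.method == "ngram":
            return self.propose_ngram_draft_token_ids(token_ids)
        elif self.method == "eagle3":
            return self.propose_eagle3_draft_token_ids(token_ids)
        elif self.method == "dflash":
            # Separate branch, so the branches above are not modified.
            return self.propose_dflash_draft_token_ids(token_ids)
        raise NotImplementedError(f"Unknown speculative method: {self.method}")
```

Keeping DFlash in its own branch means a bug in the new path cannot change what the Eagle3 or ngram branches return.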

Modified files:

tpu_inference/models/common/model_loader.py -- register DFlashDraftModel in model registry
tpu_inference/models/jax/qwen3.py -- collect aux_hidden_states from target layers during forward pass (needed by DFlash proposer to inject target context)
tpu_inference/runner/tpu_runner.py -- add DFlashProposer initialization for method="dflash"
tpu_inference/runner/speculative_decoding_manager.py -- add dflash method dispatch and propose_dflash_draft_token_ids (uses accepted_attn_metadata with correct seq_lens for drafter)
tpu_inference/runner/kv_cache_manager.py -- extend draft KV cache allocation to cover dflash, read num_hidden_layers from config instead of hardcoding 1
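The `kv_cache_manager.py` change can be illustrated roughly as follows. This is a sketch under stated assumptions: the helper `num_draft_kv_layers`, the `DRAFT_METHODS_WITH_KV_CACHE` tuple, and the use of a plain namespace as the draft config are hypothetical and do not mirror the real manager's interface. The fallback of 1 reflects the previously hardcoded value.

```python
from types import SimpleNamespace

# Assumption: methods whose drafter needs its own KV cache.
# ngram has no draft model, so it allocates nothing here.
DRAFT_METHODS_WITH_KV_CACHE = ("eagle3", "dflash")


def num_draft_kv_layers(method: str, draft_config) -> int:
    """Return how many draft-model layers need KV cache allocated."""
    if method not in DRAFT_METHODS_WITH_KV_CACHE:
        return 0
    # Read the layer count from the config; fall back to the old
    # hardcoded value of 1 when the attribute is absent.
    return getattr(draft_config, "num_hidden_layers", 1)


# Hypothetical draft config with five layers.
dflash_config = SimpleNamespace(num_hidden_layers=5)
layers = num_draft_kv_layers("dflash", dflash_config)
```

Reading the count from the config matters once a drafter has more than one layer, since a hardcoded 1 would under-allocate cache for it.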

Usage (after both #1868 and this PR):

args['speculative_config'] = {
    'model': 'z-lab/Qwen3-4B-DFlash-b16',
    'num_speculative_tokens': 5,
    'method': 'dflash',
    'draft_tensor_parallel_size': 1,
}
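For context, a sketch of how the `args` dict above might be assembled before it is handed to the engine. The helper `build_engine_args` and the target model name `Qwen/Qwen3-4B` are assumptions for illustration; the description only names the draft model. The engine construction is left as a comment because it requires a TPU environment with both PRs applied.

```python
def build_engine_args(target_model: str) -> dict:
    """Assemble engine arguments with DFlash speculative decoding enabled."""
    args = {"model": target_model}
    args["speculative_config"] = {
        "model": "z-lab/Qwen3-4B-DFlash-b16",  # draft model from the PR description
        "num_speculative_tokens": 5,
        "method": "dflash",                    # selects the DFlash dispatch branch
        "draft_tensor_parallel_size": 1,
    }
    return args


# Assumed target model; substitute the model the drafter was trained against.
args = build_engine_args("Qwen/Qwen3-4B")

# With vLLM installed on a TPU host, the dict would typically be passed as:
#   from vllm import LLM
#   llm = LLM(**args)
```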

Tests

End-to-end tests are in a follow-up PR (#1870).

Checklist

I have performed a self-review of my code.
I have necessary comments in my code, particularly in hard-to-understand areas.
I have made or will make corresponding changes to any relevant documentation.

Signed-off-by: aaronzhfeng <fzx333578@gmail.com>

[Spec Decoding] Integrate DFlash into speculative decoding pipeline

20ec612

Signed-off-by: aaronzhfeng <fzx333578@gmail.com>

aaronzhfeng requested review from Lumosis, jrplatin, kyuyeunk, mrjunwan-lang, sixiang-google, vipannalla and wenxindongwork as code owners March 5, 2026 21:55

This was referenced Mar 5, 2026

[Spec Decoding] Add DFlash e2e tests and Buildkite CI #1870

Open

[Spec Decoding] Add DFlash model and proposer #1868

Open

aaronzhfeng wants to merge 1 commit into vllm-project:main from aaronzhfeng:pr_dflash_1b
