[Spec] Add DSpark speculative decoding for DeepSeek-V4 by adityakamat24 · Pull Request #29538 · sgl-project/sglang

adityakamat24 · 2026-06-28T02:18:29Z

Motivation

DeepSeek shipped DeepSeek-V4-Flash-DSpark and DeepSeek-V4-Pro-DSpark: the V4 checkpoints with a built-in speculative module ("DSpark") under the mtp.* namespace. SGLang already serves DeepSeek-V4 and already has block-draft speculative decoding, but not the DSpark drafter. This PR adds it.

DSpark is a block drafter. At each decode step it proposes a block of block_size tokens (default 5) in one draft forward, and the target verifies the whole block in one pass. The drafter is a three-stage MTP stack with a Markov head (refines the block autoregressively) and a confidence head (scores each position). It runs off the target's fused hidden states rather than its own token KV, so there is no separate draft model to download and the extra memory is small. Verification is greedy and lossless: the accepted output is identical to the target's greedy decoding for any block size or confidence threshold.
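For reviewers, the acceptance rule described above (longest greedy-matching prefix plus one bonus token) can be sketched in a few lines of plain Python. This is an illustration only, not the code in dspark_worker_v2.py: the function name `greedy_verify`, the list-based inputs, and the exact block layout are assumptions made for the example.

```python
from typing import List, Tuple


def greedy_verify(draft_block: List[int], target_argmax: List[int]) -> Tuple[List[int], int]:
    """Accept the longest prefix of the draft that matches the target's greedy choices.

    draft_block:   k drafted tokens d[0..k-1] (hypothetical layout).
    target_argmax: k + 1 tokens from one target pass; target_argmax[i] is the
                   target's greedy next token given the context plus d[0..i-1].
    Returns (emitted_tokens, number_of_matching_draft_tokens).
    """
    if len(target_argmax) != len(draft_block) + 1:
        raise ValueError("target_argmax must have one more entry than draft_block")

    n_match = 0
    for drafted, wanted in zip(draft_block, target_argmax):
        if drafted != wanted:
            break
        n_match += 1

    # Bonus token: the target's own greedy token at the first mismatch,
    # or the token after the block when the whole block matched.
    emitted = draft_block[:n_match] + [target_argmax[n_match]]
    return emitted, n_match


print(greedy_verify([5, 7, 9], [5, 7, 8, 1]))  # two draft tokens match, then the target's token 8
```

Every emitted token is either a draft token that equals the target's greedy choice or the target's own token, which is why the output cannot differ from plain greedy decoding regardless of how good the draft is.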

Modifications

Core implementation

python/sglang/srt/models/deepseek_v4_dspark.py (new, 261 lines): the DSpark draft model (three-stage MTP backbone, Markov head, confidence head).
python/sglang/srt/speculative/dspark_worker_v2.py (new, 573 lines): block draft, target verify, longest greedy-matching prefix plus bonus token.
python/sglang/srt/speculative/dspark_info.py (new, 328 lines): DSparkDraftInputV2 and DSparkVerifyInput with filter/merge for batching and the overlap scheduler.
python/sglang/srt/models/deepseek_v4.py (+102): target hidden-state capture seam and the per-layer draft KV materialization (MQALayer.kv_from_hidden). Draft and target share the embedding and lm_head; the checkpoint has no draft-private copies.

Registration and dispatch

python/sglang/srt/speculative/spec_info.py (+37): registers SpeculativeAlgorithm.DSPARK, the is_dspark() predicate, and the DSpark draft/verify input types.
python/sglang/srt/speculative/spec_registry.py (+3): conformance stub.
python/sglang/srt/speculative/spec_utils.py (+5): skip the hidden-states placeholder indexing for DSpark under the overlap scheduler.
python/sglang/srt/arg_groups/speculative_hook.py (+101): _handle_dspark (pins topk and num-steps to 1, defaults the draft path to the target, validates block size and confidence threshold).
python/sglang/srt/configs/model_config.py (+18), configs/deepseek_v4.py (+5), arg_groups/deepseek_v4_hook.py: draft architecture swap and config wiring.
python/sglang/srt/model_executor/model_runner.py (+42), model_executor/pool_configurator.py (+21), model_executor/runner/{decode,prefill}_cuda_graph_runner.py, layers/attention/{deepseek_v4_backend,flashinfer_backend}.py, managers/overlap_utils.py: dispatch sites, draft KV pool sizing, and CUDA graph capture.
python/sglang/srt/server_args.py (+8): the DSpark CLI flags.
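A rough sketch of the argument handling that _handle_dspark is described as doing (pin topk and num-steps to 1, default the draft path to the target, reject a conflicting block-size alias, bound-check the threshold). The `SpecArgs` dataclass, its field names, the lower bound of 1 on the block size, and the choice to overwrite rather than reject user-supplied topk/num-steps are all assumptions for illustration; the real hook operates on SGLang's server arguments and may behave differently.

```python
from dataclasses import dataclass
from typing import Optional

DEFAULT_BLOCK_SIZE = 5  # default stated in the PR description


@dataclass
class SpecArgs:
    """Hypothetical stand-in for the relevant server arguments."""
    model_path: str
    draft_model_path: Optional[str] = None
    dspark_block_size: Optional[int] = None
    num_draft_tokens: Optional[int] = None
    confidence_threshold: float = 0.0
    eagle_topk: Optional[int] = None
    num_steps: Optional[int] = None


def handle_dspark(args: SpecArgs) -> SpecArgs:
    # The block size and num-draft-tokens are aliases; reject disagreeing values.
    a, b = args.dspark_block_size, args.num_draft_tokens
    if a is not None and b is not None and a != b:
        raise ValueError(f"block size {a} conflicts with num draft tokens {b}")
    block = a if a is not None else (b if b is not None else DEFAULT_BLOCK_SIZE)

    if block < 1:  # assumed lower bound, not taken from the PR
        raise ValueError("block size must be at least 1")
    if not 0.0 <= args.confidence_threshold <= 1.0:
        raise ValueError("confidence threshold must be in [0, 1]")

    args.dspark_block_size = args.num_draft_tokens = block
    args.eagle_topk = 1  # pinned (assumed: overwritten silently)
    args.num_steps = 1   # pinned (assumed: overwritten silently)
    if args.draft_model_path is None:
        args.draft_model_path = args.model_path  # draft weights live in the target checkpoint
    return args
```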

Configuration parameters

--speculative-algorithm DSPARK: enable DSpark. The draft path defaults to the target checkpoint.
--speculative-dspark-block-size (default 5): draft block length; alias of --speculative-num-draft-tokens for DSpark.
--speculative-dspark-confidence-threshold (default 0.0, range [0, 1]): truncate the draft block at the first position whose confidence-head probability falls below the threshold; 0 verifies the full block.
--speculative-eagle-topk and --speculative-num-steps are pinned to 1.
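The confidence-threshold rule can be illustrated with a small standalone function. This is a sketch under assumptions: `truncate_by_confidence` is a hypothetical name, the inputs are plain Python sequences rather than tensors, and it assumes the first low-confidence position is itself dropped together with everything after it.

```python
from typing import List, Sequence


def truncate_by_confidence(
    draft_block: Sequence[int],
    confidences: Sequence[float],
    threshold: float,
) -> List[int]:
    """Keep draft tokens up to (not including) the first low-confidence position."""
    if len(draft_block) != len(confidences):
        raise ValueError("one confidence value is required per draft token")
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be in [0, 1]")

    for i, prob in enumerate(confidences):
        if prob < threshold:
            return list(draft_block[:i])
    # No position fell below the threshold: verify the full block.
    return list(draft_block)


block = [11, 12, 13, 14, 15]
conf = [0.9, 0.8, 0.3, 0.95, 0.9]
print(truncate_by_confidence(block, conf, 0.5))  # [11, 12]
print(truncate_by_confidence(block, conf, 0.0))  # [11, 12, 13, 14, 15]
```

With a threshold of 0 no probability can fall below it, so the whole block always goes to verification. Because verification is greedy, truncation only changes how many tokens are checked, not which tokens are finally emitted.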

Documentation

docs/advanced_features/dspark_speculative_decoding.md (new): user guide (how it works, launch command, request example, parameters, confidence-threshold tuning, accuracy and performance, constraints).
docs/advanced_features/speculative_decoding.md: DSpark added to the quick-guidance list and the method comparison table.
docs/advanced_features/server_arguments.md: DSPARK added to the algorithm options plus the two DSpark flags.
docs/index.rst: the new doc wired into the Advanced Features toctree.

Tests

test/registered/unit/spec/test_dspark.py (new, 187 lines): predicate truth table, _handle_dspark arg handling (block-size alias conflict, bounds, dp-attention and pipeline-parallel rejection, draft-path defaulting), and a DSparkDraftInputV2 filter/merge regression (the asymmetric merge that previously left a stale-sized array for filter_batch to index out of bounds, plus the overlap placeholder fields). Registered on base-a-test-cpu.
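To make the filter/merge regression concrete, here is a toy model of the invariant it guards: every per-request array in the draft input must be merged and filtered together, otherwise a later filter indexes past the end of a stale array. `DraftBatchState` and its fields are hypothetical and are not the DSparkDraftInputV2 fields.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DraftBatchState:
    """Hypothetical per-batch draft state: one entry per running request."""
    last_tokens: List[int]
    hidden_rows: List[List[float]]

    def __post_init__(self) -> None:
        self._check()

    def _check(self) -> None:
        if len(self.last_tokens) != len(self.hidden_rows):
            raise ValueError("per-request arrays have diverged in length")

    def filter_batch(self, keep_indices: List[int]) -> None:
        # Apply the same index selection to every per-request array.
        self.last_tokens = [self.last_tokens[i] for i in keep_indices]
        self.hidden_rows = [self.hidden_rows[i] for i in keep_indices]

    def merge_batch(self, other: "DraftBatchState") -> None:
        # Merge symmetrically: extending only one array would leave the other
        # stale-sized and make a later filter_batch index out of bounds.
        self.last_tokens = self.last_tokens + other.last_tokens
        self.hidden_rows = self.hidden_rows + other.hidden_rows
        self._check()


a = DraftBatchState([1, 2], [[0.1], [0.2]])
b = DraftBatchState([3], [[0.3]])
a.merge_batch(b)
a.filter_batch([0, 2])  # valid only because both arrays grew to length 3
print(a.last_tokens, a.hidden_rows)
```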

Accuracy Tests

Greedy (temperature 0), 200 examples each, DeepSeek-V4-Flash-DSpark on 8x B200. Accuracy matches the no-speculation target within run-to-run noise (FP4 MoE is not byte-exact across runs, so greedy accuracy match is the correctness bar):

Eval	No spec	DSpark
GSM8K	0.985	0.990
MMLU	0.900	0.895

MMLU no-spec category breakdown: stem 0.952, humanities 0.839, social sciences 0.918, other 0.915.

Benchmarking and Profiling

Configuration

Model: deepseek-ai/DeepSeek-V4-Flash-DSpark (284B, FP4 MoE)
Hardware: 8x B200, tp8 ep8
MoE backend: flashinfer_mxfp4 (target and draft)
Sampling: greedy (temperature 0)
Acceptance length: mean accepted tokens per target forward, bonus included; no spec equals 1.00
The no-spec baseline runs at --mem-fraction-static 0.78 --max-running-requests 16 --cuda-graph-max-bs 16; without a draft model it does not fit at the DSpark memory settings, which is also why the no-spec sweep stops at concurrency 16. The single-stream and low-concurrency rows are unaffected; treat the concurrency 8 and 16 rows as indicative.

Commands

Launch (DSpark):

python3 -m sglang.launch_server \
    --model-path deepseek-ai/DeepSeek-V4-Flash-DSpark \
    --trust-remote-code --tp 8 --ep-size 8 \
    --moe-runner-backend flashinfer_mxfp4 \
    --speculative-moe-runner-backend flashinfer_mxfp4 \
    --speculative-algorithm DSPARK \
    --speculative-eagle-topk 1 --speculative-num-steps 1 \
    --mem-fraction-static 0.85 --context-length 4096 \
    --cuda-graph-max-bs 32 --max-running-requests 32 \
    --disable-overlap-schedule

Launch (no spec baseline): the same command without the --speculative-* flags and --disable-overlap-schedule, with --mem-fraction-static 0.78 --max-running-requests 16 --cuda-graph-max-bs 16.

Accuracy:

python3 -m sglang.test.run_eval --port 30000 --eval-name gsm8k --num-examples 200
python3 -m sglang.test.run_eval --port 30000 --eval-name mmlu --num-examples 200

Throughput and latency sweep (per concurrency cc):

python3 -m sglang.bench_serving --backend sglang --port 30000 \
    --dataset-name random --random-input-len 1024 --random-output-len 1024 \
    --random-range-ratio 1.0 --num-prompts $((cc*6)) \
    --max-concurrency $cc --request-rate inf
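For anyone reproducing the sweep, the per-concurrency command above can be generated from Python. This sketch only builds and prints the command lines; the helper name `bench_cmd` is made up, and the concurrency list is taken from the results table rather than from a script in the PR.

```python
import shlex
from typing import List

CONCURRENCIES = [1, 4, 8, 16, 32]  # values that appear in the results table


def bench_cmd(cc: int, port: int = 30000) -> List[str]:
    """Build the bench_serving argv for one concurrency level."""
    return [
        "python3", "-m", "sglang.bench_serving",
        "--backend", "sglang", "--port", str(port),
        "--dataset-name", "random",
        "--random-input-len", "1024", "--random-output-len", "1024",
        "--random-range-ratio", "1.0",
        "--num-prompts", str(cc * 6),  # six prompts per concurrent slot
        "--max-concurrency", str(cc),
        "--request-rate", "inf",
    ]


for cc in CONCURRENCIES:
    print(shlex.join(bench_cmd(cc)))
```

Passing each list to `subprocess.run(cmd, check=True)` would execute the sweep against a running server.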

Acceptance length by domain: the mean of the per-request meta_info.spec_accept_length values returned by the server, over 40 greedy prompts per dataset.
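A minimal sketch of that measurement, assuming the server's native /generate endpoint on port 30000 and responses that carry meta_info.spec_accept_length as described above. The helper names and the max_new_tokens value of 256 are assumptions; the script actually used for the PR is not shown here.

```python
import json
import urllib.request
from statistics import mean
from typing import Dict, List


def generate(prompt: str, port: int = 30000, max_new_tokens: int = 256) -> Dict:
    """Send one greedy request to a running server and return the JSON response."""
    payload = {
        "text": prompt,
        "sampling_params": {"temperature": 0, "max_new_tokens": max_new_tokens},
    }
    req = urllib.request.Request(
        f"http://127.0.0.1:{port}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def mean_accept_length(responses: List[Dict]) -> float:
    """Average the per-request acceptance length over a set of responses."""
    return mean(r["meta_info"]["spec_accept_length"] for r in responses)
```

Usage would be along the lines of `mean_accept_length([generate(p) for p in prompts])` with 40 prompts per dataset.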

Results

Acceptance length by domain (tokens per target forward, bonus included; no spec equals 1.00):

Method	GSM8K	MATH500	HumanEval	MBPP	MT-Bench	Alpaca
DSpark	3.21	3.53	3.61	3.43	2.91	3.09

Per-domain mean: math 3.37, code 3.52, chat 3.00.
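The per-domain means follow directly from the table; a short check, using the two-decimal values above and the dataset-to-domain grouping given in the raw output (the function name is illustrative):

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Tuple

# dataset -> (domain, acceptance length from the table above)
ACCEPT: Dict[str, Tuple[str, float]] = {
    "GSM8K": ("math", 3.21),
    "MATH500": ("math", 3.53),
    "HumanEval": ("code", 3.61),
    "MBPP": ("code", 3.43),
    "MT-Bench": ("chat", 2.91),
    "Alpaca": ("chat", 3.09),
}


def per_domain_mean(rows: Dict[str, Tuple[str, float]]) -> Dict[str, float]:
    """Group acceptance lengths by domain and average each group."""
    grouped = defaultdict(list)
    for domain, value in rows.values():
        grouped[domain].append(value)
    return {domain: mean(values) for domain, values in grouped.items()}


for domain, value in per_domain_mean(ACCEPT).items():
    print(f"{domain}: {value:.2f}")
```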

Single-stream speedup (same server, with and without DSpark):

Input / output	No spec (tok/s)	DSpark (tok/s)	Speedup
512 / 256	154	228	1.48x
1024 / 1024	164	297	1.81x

Throughput and latency across concurrency (random 1024/1024, request-rate inf):

Concurrency	No spec (tok/s)	DSpark (tok/s)	Speedup	DSpark median TPOT (ms)	DSpark accept len
1	164.7	297.6	1.81x	2.93	3.41
4	585.6	607.8	1.04x	4.65	3.54
8	608.4	735.1	1.21x	6.75	3.74
16	601.8	788.3	1.31x	5.11	4.01
32	n/a	761.3	n/a	5.16	4.25

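DSpark gives the largest gain on single-stream decoding (1.81x at concurrency 1). The gain is smallest at concurrency 4 (1.04x) and widens again to 1.21x and 1.31x at concurrency 8 and 16: as the batch fills and decode becomes compute-bound, the no-spec baseline plateaus around 600 tok/s while DSpark keeps climbing to roughly 790 tok/s at concurrency 16, then dips slightly to about 760 tok/s at concurrency 32.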
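The speedup column is simply the ratio of the two throughput columns. The snippet below recomputes it from the sweep numbers so the table can be re-checked if the numbers are updated; the variable and function names are illustrative.

```python
from typing import Optional

# Output throughput in tok/s, keyed by concurrency (from the sweep table).
NO_SPEC = {1: 164.7, 4: 585.6, 8: 608.4, 16: 601.8}
DSPARK = {1: 297.6, 4: 607.8, 8: 735.1, 16: 788.3, 32: 761.3}


def speedup(cc: int) -> Optional[float]:
    """DSpark throughput divided by no-spec throughput, or None without a baseline."""
    base = NO_SPEC.get(cc)
    return None if base is None else DSPARK[cc] / base


for cc in sorted(DSPARK):
    s = speedup(cc)
    print(f"c={cc:<3}", "n/a" if s is None else f"{s:.2f}x")
```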

Raw benchmark output

Accuracy (sglang.test.run_eval, 200 examples, greedy):

GSM8K  no-spec : Score: 0.985
GSM8K  DSpark  : Score: 0.990
MMLU   no-spec : Score: 0.900   (stem 0.952, humanities 0.839, social_sciences 0.918, other 0.915)
MMLU   DSpark  : Score: 0.895

Acceptance length by domain (DSpark, greedy, n=40 per dataset):

gsm8k     [math]  3.214
math500   [math]  3.531
humaneval [code]  3.608
mbpp      [code]  3.431
mtbench   [chat]  2.913
alpaca    [chat]  3.094
per-domain: math 3.373   code 3.520   chat 3.004

Throughput sweep (sglang.bench_serving, random 1024/1024, request-rate inf):

DSpark   c=1   297.6 tok/s   median TPOT 2.93 ms   accept 3.41
DSpark   c=4   607.8 tok/s   median TPOT 4.65 ms   accept 3.54
DSpark   c=8   735.1 tok/s   median TPOT 6.75 ms   accept 3.74
DSpark   c=16  788.3 tok/s   median TPOT 5.11 ms   accept 4.01
DSpark   c=32  761.3 tok/s   median TPOT 5.16 ms   accept 4.25
no-spec  c=1   164.7 tok/s   median TPOT 6.00 ms
no-spec  c=4   585.6 tok/s   median TPOT 6.68 ms
no-spec  c=8   608.4 tok/s   median TPOT 7.24 ms
no-spec  c=16  601.8 tok/s   median TPOT 7.26 ms

Scope

Greedy (temperature 0) verification. Requests with temperature > 0 are served with greedy verification and log a one-time warning; rejection sampling on the draft distribution (which the Markov-argmax drafter does not expose yet) is a follow-up. The throughput sweep was run with --disable-overlap-schedule; the draft-input structures carry the overlap-scheduler fields, and validating overlap-on throughput is a follow-up.

Checklist

Format your code according to "Format code with pre-commit".
Add unit tests according to "Run and add unit tests".
Update documentation according to "Write documentations".
Provide accuracy and speed benchmark results according to "Test the accuracy" and "Benchmark the speed".
Follow the SGLang code style guidance.

CI States

Latest PR Test (Base): ❌ Run #28309481835
Latest PR Test (Extra): ❌ Run #28309481732

gemini-code-assist · 2026-06-28T02:18:32Z

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

whybeyoung · 2026-06-28T03:36:36Z

/tag-and-rerun-ci

[Spec] Add DSpark block speculative decoding for DeepSeek-V4

4b4585d

adityakamat24 requested review from Fridge003, HaiShaw, Qiaolin-Yu, Ying1123, hebiao064, hnyls2002, ispobock, merrymercy and xiezhq-hermann as code owners June 28, 2026 02:18

github-actions Bot added deepseek speculative-decoding labels Jun 28, 2026

adityakamat24 mentioned this pull request Jun 28, 2026

[Feature] Support DSpark Speculative Decoding for DeepSeek V4 #29488

Open

2 tasks

[Docs] Document DSpark speculative decoding

ef2b9f0

adityakamat24 requested review from JustinTong0323, sogalin, wisclmy0611 and zijiexia as code owners June 28, 2026 03:00

github-actions Bot added the documentation Improvements or additions to documentation label Jun 28, 2026

adityakamat24 changed the title from "[Spec] Add DSpark block speculative decoding for DeepSeek-V4" to "[Spec] Add DSpark speculative decoding for DeepSeek-V4" Jun 28, 2026

github-actions Bot added the run-ci label Jun 28, 2026

Fridge003 removed the run-ci label Jun 28, 2026

adityakamat24 wants to merge 2 commits into sgl-project:main from adityakamat24:dspark
