[Spec] Add DSpark speculative decoding for DeepSeek-V4 #29538

Open
adityakamat24 wants to merge 2 commits into sgl-project:main from adityakamat24:dspark

@adityakamat24 adityakamat24 commented Jun 28, 2026

Contributor

Closes #29488.

Motivation

DeepSeek shipped DeepSeek-V4-Flash-DSpark and DeepSeek-V4-Pro-DSpark: the V4 checkpoints with a built-in speculative module ("DSpark") under the mtp.* namespace. SGLang already serves DeepSeek-V4 and already has block-draft speculative decoding, but not the DSpark drafter. This PR adds it.

DSpark is a block drafter. At each decode step it proposes a block of block_size tokens (default 5) in a single draft forward pass, and the target verifies the whole block in one pass. The drafter is a three-stage MTP stack with a Markov head, which refines the block autoregressively, and a confidence head, which scores each position. It runs off the target's fused hidden states rather than its own token KV, so there is no separate draft model to download and the extra memory is small. Verification is greedy and lossless: for any block size or confidence threshold, the accepted output is identical to the target's greedy decoding.
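The verification rule (longest greedy-matching prefix plus one bonus token) can be sketched in plain Python. This is a minimal illustration, not the code in dspark_worker_v2.py: the function name `greedy_verify` and the list-based layout (one target prediction per draft position, plus one extra) are assumptions made for the example.

```python
from typing import List


def greedy_verify(draft_tokens: List[int], target_tokens: List[int]) -> List[int]:
    """Return the tokens emitted by one verify step (hypothetical helper).

    draft_tokens[i]  : the drafter's proposal for position i of the block.
    target_tokens[i] : the target's greedy (argmax) token for position i,
                       computed with draft_tokens[:i] as context. It has one
                       more entry than draft_tokens.
    """
    if len(target_tokens) != len(draft_tokens) + 1:
        raise ValueError("expected one target prediction per draft token plus one")

    # Accept draft tokens for as long as they match the target's greedy choice.
    accepted: List[int] = []
    for draft, target in zip(draft_tokens, target_tokens):
        if draft != target:
            break
        accepted.append(draft)

    # Bonus token: the target's own choice at the first mismatch, or the
    # extra prediction after a fully accepted block.
    accepted.append(target_tokens[len(accepted)])
    return accepted
```

Every emitted token is one the target would have produced greedily, so a worse draft only shortens the accepted prefix and never changes the output. A step always emits at least one token, which is why the no-spec acceptance length is 1.00.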

Modifications

Core implementation

  • python/sglang/srt/models/deepseek_v4_dspark.py (new, 261 lines): the DSpark draft model (three-stage MTP backbone, Markov head, confidence head).
  • python/sglang/srt/speculative/dspark_worker_v2.py (new, 573 lines): block draft, target verify, longest greedy-matching prefix plus bonus token.
  • python/sglang/srt/speculative/dspark_info.py (new, 328 lines): DSparkDraftInputV2 and DSparkVerifyInput with filter/merge for batching and the overlap scheduler.
  • python/sglang/srt/models/deepseek_v4.py (+102): target hidden-state capture seam and the per-layer draft KV materialization (MQALayer.kv_from_hidden). Draft and target share the embedding and lm_head; the checkpoint has no draft-private copies.

Registration and dispatch

  • python/sglang/srt/speculative/spec_info.py (+37): registers SpeculativeAlgorithm.DSPARK, the is_dspark() predicate, and the DSpark draft/verify input types.
  • python/sglang/srt/speculative/spec_registry.py (+3): conformance stub.
  • python/sglang/srt/speculative/spec_utils.py (+5): skip the hidden-states placeholder indexing for DSpark under the overlap scheduler.
  • python/sglang/srt/arg_groups/speculative_hook.py (+101): _handle_dspark (pins topk and num-steps to 1, defaults the draft path to the target, validates block size and confidence threshold).
  • python/sglang/srt/configs/model_config.py (+18), configs/deepseek_v4.py (+5), arg_groups/deepseek_v4_hook.py: draft architecture swap and config wiring.
  • python/sglang/srt/model_executor/model_runner.py (+42), model_executor/pool_configurator.py (+21), model_executor/runner/{decode,prefill}_cuda_graph_runner.py, layers/attention/{deepseek_v4_backend,flashinfer_backend}.py, managers/overlap_utils.py: dispatch sites, draft KV pool sizing, and CUDA graph capture.
  • python/sglang/srt/server_args.py (+8): the DSpark CLI flags.

Configuration parameters

  • --speculative-algorithm DSPARK: enable DSpark. The draft path defaults to the target checkpoint.
  • --speculative-dspark-block-size (default 5): draft block length; alias of --speculative-num-draft-tokens for DSpark.
  • --speculative-dspark-confidence-threshold (default 0.0, range [0, 1]): truncate the draft block at the first position whose confidence-head probability falls below the threshold; 0 verifies the full block.
  • --speculative-eagle-topk and --speculative-num-steps are pinned to 1.
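The effect of --speculative-dspark-confidence-threshold can be illustrated with a small sketch. It assumes that "truncate at the first position below the threshold" drops that position and everything after it; the helper name `truncate_block` and its list-based inputs are hypothetical and are not taken from the PR's code.

```python
from typing import List, Sequence


def truncate_block(
    tokens: Sequence[int],
    confidences: Sequence[float],
    threshold: float = 0.0,
) -> List[int]:
    """Keep the draft prefix whose confidence stays at or above the threshold.

    Assumption: the first position whose confidence falls below the
    threshold is dropped together with every later position.
    """
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be in [0, 1]")
    if len(tokens) != len(confidences):
        raise ValueError("need one confidence value per draft token")

    kept: List[int] = []
    for token, confidence in zip(tokens, confidences):
        if confidence < threshold:
            break
        kept.append(token)
    return kept
```

With the default threshold of 0.0 no probability can fall below it, so the full block is sent to the target for verification. A higher threshold sends shorter blocks, trading potential accepted tokens for less verify work on positions the drafter is unsure about.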

Documentation

  • docs/advanced_features/dspark_speculative_decoding.md (new): user guide (how it works, launch command, request example, parameters, confidence-threshold tuning, accuracy and performance, constraints).
  • docs/advanced_features/speculative_decoding.md: DSpark added to the quick-guidance list and the method comparison table.
  • docs/advanced_features/server_arguments.md: DSPARK added to the algorithm options plus the two DSpark flags.
  • docs/index.rst: the new doc wired into the Advanced Features toctree.

Tests

  • test/registered/unit/spec/test_dspark.py (new, 187 lines): predicate truth table, _handle_dspark arg handling (block-size alias conflict, bounds, dp-attention and pipeline-parallel rejection, draft-path defaulting), and a DSparkDraftInputV2 filter/merge regression (the asymmetric merge that previously left a stale-sized array for filter_batch to index out of bounds, plus the overlap placeholder fields). Registered on base-a-test-cpu.

Accuracy Tests

Greedy (temperature 0), 200 examples each, DeepSeek-V4-Flash-DSpark on 8x B200. Accuracy matches the no-speculation target within run-to-run noise (FP4 MoE is not byte-exact across runs, so greedy accuracy match is the correctness bar):

| Eval | No spec | DSpark |
|---|---|---|
| GSM8K | 0.985 | 0.990 |
| MMLU | 0.900 | 0.895 |

MMLU no-spec category breakdown: stem 0.952, humanities 0.839, social sciences 0.918, other 0.915.

Benchmarking and Profiling

Configuration

  • Model: deepseek-ai/DeepSeek-V4-Flash-DSpark (284B, FP4 MoE)
  • Hardware: 8x B200, tp8 ep8
  • MoE backend: flashinfer_mxfp4 (target and draft)
  • Sampling: greedy (temperature 0)
  • Acceptance length: mean accepted tokens per target forward, bonus included; no spec equals 1.00
  • The no-spec baseline runs at --mem-fraction-static 0.78 --max-running-requests 16 --cuda-graph-max-bs 16; without a draft model it does not fit at the DSpark memory settings, which is also why the no-spec sweep stops at concurrency 16. The single-stream and low-concurrency rows are unaffected; treat the concurrency 8 and 16 rows as indicative.

Commands

Launch (DSpark):

python3 -m sglang.launch_server \
    --model-path deepseek-ai/DeepSeek-V4-Flash-DSpark \
    --trust-remote-code --tp 8 --ep-size 8 \
    --moe-runner-backend flashinfer_mxfp4 \
    --speculative-moe-runner-backend flashinfer_mxfp4 \
    --speculative-algorithm DSPARK \
    --speculative-eagle-topk 1 --speculative-num-steps 1 \
    --mem-fraction-static 0.85 --context-length 4096 \
    --cuda-graph-max-bs 32 --max-running-requests 32 \
    --disable-overlap-schedule

Launch (no spec baseline): the same command without the --speculative-* flags and --disable-overlap-schedule, with --mem-fraction-static 0.78 --max-running-requests 16 --cuda-graph-max-bs 16.

Accuracy:

python3 -m sglang.test.run_eval --port 30000 --eval-name gsm8k --num-examples 200
python3 -m sglang.test.run_eval --port 30000 --eval-name mmlu --num-examples 200

Throughput and latency sweep (per concurrency cc):

python3 -m sglang.bench_serving --backend sglang --port 30000 \
    --dataset-name random --random-input-len 1024 --random-output-len 1024 \
    --random-range-ratio 1.0 --num-prompts $((cc*6)) \
    --max-concurrency $cc --request-rate inf
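For reference, the sweep can also be driven from Python. The sketch below only builds the argument list for each concurrency level, using the flags from the command above and the concurrency values reported in the results; the helper name `build_bench_cmd` is hypothetical.

```python
from typing import List

# Concurrency levels reported in the results of this PR.
CONCURRENCIES = (1, 4, 8, 16, 32)


def build_bench_cmd(cc: int, port: int = 30000) -> List[str]:
    """Build the bench_serving argv for one concurrency level (hypothetical helper)."""
    return [
        "python3", "-m", "sglang.bench_serving",
        "--backend", "sglang",
        "--port", str(port),
        "--dataset-name", "random",
        "--random-input-len", "1024",
        "--random-output-len", "1024",
        "--random-range-ratio", "1.0",
        "--num-prompts", str(cc * 6),  # six prompts per concurrent stream
        "--max-concurrency", str(cc),
        "--request-rate", "inf",
    ]


commands = [build_bench_cmd(cc) for cc in CONCURRENCIES]
```

Each entry in `commands` can be passed to `subprocess.run(cmd, check=True)` against a running server.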

Acceptance length by domain: the per-request meta_info.spec_accept_length returned by the server, averaged over 40 greedy prompts per dataset.
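A minimal sketch of that averaging follows. It assumes each server response has been parsed into a dict that exposes `meta_info["spec_accept_length"]`; the helper names and the domain grouping dict are illustrative, and the per-dataset values are the ones listed in the raw benchmark output of this PR.

```python
from statistics import mean
from typing import Any, Dict, Iterable, List, Mapping


def mean_accept_length(responses: Iterable[Mapping[str, Any]]) -> float:
    """Average the per-request spec_accept_length over a set of responses."""
    return mean(r["meta_info"]["spec_accept_length"] for r in responses)


def per_domain_mean(
    per_dataset: Mapping[str, float],
    domains: Mapping[str, List[str]],
) -> Dict[str, float]:
    """Unweighted mean of dataset-level acceptance lengths within each domain."""
    return {
        domain: mean(per_dataset[name] for name in names)
        for domain, names in domains.items()
    }


# Per-dataset acceptance lengths from the raw benchmark output.
per_dataset = {
    "gsm8k": 3.214, "math500": 3.531,
    "humaneval": 3.608, "mbpp": 3.431,
    "mtbench": 2.913, "alpaca": 3.094,
}
domains = {
    "math": ["gsm8k", "math500"],
    "code": ["humaneval", "mbpp"],
    "chat": ["mtbench", "alpaca"],
}
domain_means = per_domain_mean(per_dataset, domains)
```

The unweighted per-domain mean is appropriate here because every dataset contributes the same number of prompts (40).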

Results

Acceptance length by domain (tokens per target forward, bonus included; no spec equals 1.00):

| | GSM8K | MATH500 | HumanEval | MBPP | MT-Bench | Alpaca |
|---|---|---|---|---|---|---|
| DSpark | 3.21 | 3.53 | 3.61 | 3.43 | 2.91 | 3.09 |

Per-domain mean: math 3.37, code 3.52, chat 3.00.

Single-stream speedup (same server, with and without DSpark):

| Input / output | No spec (tok/s) | DSpark (tok/s) | Speedup |
|---|---|---|---|
| 512 / 256 | 154 | 228 | 1.48x |
| 1024 / 1024 | 164 | 297 | 1.81x |

Throughput and latency across concurrency (random 1024/1024, request-rate inf):

| Concurrency | No spec (tok/s) | DSpark (tok/s) | Speedup | DSpark median TPOT (ms) | DSpark accept len |
|---|---|---|---|---|---|
| 1 | 164.7 | 297.6 | 1.81x | 2.93 | 3.41 |
| 4 | 585.6 | 607.8 | 1.04x | 4.65 | 3.54 |
| 8 | 608.4 | 735.1 | 1.21x | 6.75 | 3.74 |
| 16 | 601.8 | 788.3 | 1.31x | 5.11 | 4.01 |
| 32 | n/a | 761.3 | n/a | 5.16 | 4.25 |

DSpark gives its largest gain single-stream (1.81x). The speedup dips to 1.04x at concurrency 4, then recovers to 1.21x and 1.31x at concurrency 8 and 16: as the batch fills and decode becomes compute-bound, the no-spec baseline plateaus around 600 tok/s while DSpark keeps climbing to roughly 790 tok/s at concurrency 16.
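The speedup column is the ratio of DSpark to no-spec throughput at the same concurrency. The short sketch below recomputes it from the throughput numbers reported in this PR; the dict layout and the `speedups` helper are illustrative only.

```python
from typing import Dict, Mapping

# Output throughput in tok/s, keyed by concurrency (values from the sweep).
dspark_tps = {1: 297.6, 4: 607.8, 8: 735.1, 16: 788.3, 32: 761.3}
no_spec_tps = {1: 164.7, 4: 585.6, 8: 608.4, 16: 601.8}


def speedups(spec: Mapping[int, float], base: Mapping[int, float]) -> Dict[int, float]:
    """Ratio of speculative to baseline throughput where both were measured."""
    return {cc: round(spec[cc] / base[cc], 2) for cc in sorted(spec) if cc in base}


for cc, ratio in speedups(dspark_tps, no_spec_tps).items():
    print(f"c={cc:<3} {ratio:.2f}x")
```

Concurrency 32 is skipped because there is no no-spec measurement to compare against.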

Raw benchmark output

Accuracy (sglang.test.run_eval, 200 examples, greedy):

GSM8K  no-spec : Score: 0.985
GSM8K  DSpark  : Score: 0.990
MMLU   no-spec : Score: 0.900   (stem 0.952, humanities 0.839, social_sciences 0.918, other 0.915)
MMLU   DSpark  : Score: 0.895

Acceptance length by domain (DSpark, greedy, n=40 per dataset):

gsm8k     [math]  3.214
math500   [math]  3.531
humaneval [code]  3.608
mbpp      [code]  3.431
mtbench   [chat]  2.913
alpaca    [chat]  3.094
per-domain: math 3.373   code 3.520   chat 3.004

Throughput sweep (sglang.bench_serving, random 1024/1024, request-rate inf):

DSpark   c=1   297.6 tok/s   median TPOT 2.93 ms   accept 3.41
DSpark   c=4   607.8 tok/s   median TPOT 4.65 ms   accept 3.54
DSpark   c=8   735.1 tok/s   median TPOT 6.75 ms   accept 3.74
DSpark   c=16  788.3 tok/s   median TPOT 5.11 ms   accept 4.01
DSpark   c=32  761.3 tok/s   median TPOT 5.16 ms   accept 4.25
no-spec  c=1   164.7 tok/s   median TPOT 6.00 ms
no-spec  c=4   585.6 tok/s   median TPOT 6.68 ms
no-spec  c=8   608.4 tok/s   median TPOT 7.24 ms
no-spec  c=16  601.8 tok/s   median TPOT 7.26 ms

Scope

This PR covers greedy (temperature 0) verification only. Requests with temperature > 0 are still served with greedy verification and log a one-time warning; rejection sampling on the draft distribution (which the Markov-argmax drafter does not expose yet) is a follow-up. The throughput sweep was run with --disable-overlap-schedule; the draft-input structures already carry the overlap-scheduler fields, but validating overlap-on throughput is also a follow-up.

Checklist


CI States

Latest PR Test (Base): ❌ Run #28309481835
Latest PR Test (Extra): ❌ Run #28309481732

@gemini-code-assist

Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@github-actions github-actions Bot added the documentation label Jun 28, 2026
@adityakamat24 adityakamat24 changed the title from "[Spec] Add DSpark block speculative decoding for DeepSeek-V4" to "[Spec] Add DSpark speculative decoding for DeepSeek-V4" Jun 28, 2026
@whybeyoung

Collaborator

/tag-and-rerun-ci

Labels

deepseek, documentation, speculative-decoding

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Feature] Support DSpark Speculative Decoding for DeepSeek V4

3 participants