[Spec] Add DSpark speculative decoding for DeepSeek-V4 #29538
Open
adityakamat24 wants to merge 2 commits into
Closes #29488.
Motivation
DeepSeek shipped `DeepSeek-V4-Flash-DSpark` and `DeepSeek-V4-Pro-DSpark`: the V4 checkpoints with a built-in speculative module ("DSpark") under the `mtp.*` namespace. SGLang already serves DeepSeek-V4 and already has block-draft speculative decoding, but not the DSpark drafter. This PR adds it.
DSpark is a block drafter. On each decode step it proposes a block of `block_size` tokens (default 5) in one draft forward, and the target verifies the whole block in one pass. The drafter is a three-stage MTP stack with a Markov head (refines the block autoregressively) and a confidence head (scores each position). It runs off the target's fused hidden states rather than its own token KV, so there is no separate draft model to download and the extra memory is small. Verification is greedy and lossless: the accepted output is identical to the target's greedy decoding for any block size or confidence threshold.
Modifications
Core implementation
- `python/sglang/srt/models/deepseek_v4_dspark.py` (new, 261 lines): the DSpark draft model (three-stage MTP backbone, Markov head, confidence head).
- `python/sglang/srt/speculative/dspark_worker_v2.py` (new, 573 lines): block draft, target verify, longest greedy-matching prefix plus bonus token.
- `python/sglang/srt/speculative/dspark_info.py` (new, 328 lines): `DSparkDraftInputV2` and `DSparkVerifyInput` with filter/merge for batching and the overlap scheduler.
- `python/sglang/srt/models/deepseek_v4.py` (+102): target hidden-state capture seam and the per-layer draft KV materialization (`MQALayer.kv_from_hidden`). Draft and target share the embedding and lm_head; the checkpoint has no draft-private copies.
Registration and dispatch
- `python/sglang/srt/speculative/spec_info.py` (+37): registers `SpeculativeAlgorithm.DSPARK`, the `is_dspark()` predicate, and the DSpark draft/verify input types.
- `python/sglang/srt/speculative/spec_registry.py` (+3): conformance stub.
- `python/sglang/srt/speculative/spec_utils.py` (+5): skips the hidden-states placeholder indexing for DSpark under the overlap scheduler.
- `python/sglang/srt/arg_groups/speculative_hook.py` (+101): `_handle_dspark` (pins topk and num-steps to 1, defaults the draft path to the target, validates block size and confidence threshold).
- `python/sglang/srt/configs/model_config.py` (+18), `configs/deepseek_v4.py` (+5), `arg_groups/deepseek_v4_hook.py`: draft architecture swap and config wiring.
- `python/sglang/srt/model_executor/model_runner.py` (+42), `model_executor/pool_configurator.py` (+21), `model_executor/runner/{decode,prefill}_cuda_graph_runner.py`, `layers/attention/{deepseek_v4_backend,flashinfer_backend}.py`, `managers/overlap_utils.py`: dispatch sites, draft KV pool sizing, and CUDA graph capture.
- `python/sglang/srt/server_args.py` (+8): the DSpark CLI flags.
Configuration parameters
- `--speculative-algorithm DSPARK`: enable DSpark. The draft path defaults to the target checkpoint.
- `--speculative-dspark-block-size` (default 5): draft block length; alias of `--speculative-num-draft-tokens` for DSpark.
- `--speculative-dspark-confidence-threshold` (default 0.0, range [0, 1]): truncate the draft block at the first position whose confidence-head probability falls below the threshold; 0 verifies the full block.
- `--speculative-eagle-topk` and `--speculative-num-steps` are pinned to 1.
Documentation
- `docs/advanced_features/dspark_speculative_decoding.md` (new): user guide (how it works, launch command, request example, parameters, confidence-threshold tuning, accuracy and performance, constraints).
- `docs/advanced_features/speculative_decoding.md`: DSpark added to the quick-guidance list and the method comparison table.
- `docs/advanced_features/server_arguments.md`: DSPARK added to the algorithm options plus the two DSpark flags.
- `docs/index.rst`: the new doc wired into the Advanced Features toctree.
Tests
- `test/registered/unit/spec/test_dspark.py` (new, 187 lines): predicate truth table, `_handle_dspark` arg handling (block-size alias conflict, bounds, dp-attention and pipeline-parallel rejection, draft-path defaulting), and a `DSparkDraftInputV2` filter/merge regression (the asymmetric merge that previously left a stale-sized array for `filter_batch` to index out of bounds, plus the overlap placeholder fields). Registered on `base-a-test-cpu`.
Accuracy Tests
Greedy (temperature 0), 200 examples each, `DeepSeek-V4-Flash-DSpark` on 8x B200. Accuracy matches the no-speculation target within run-to-run noise (FP4 MoE is not byte-exact across runs, so greedy accuracy match is the correctness bar):
MMLU no-spec category breakdown: stem 0.952, humanities 0.839, social sciences 0.918, other 0.915.
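The accuracy parity above follows from the verification rule stated in the Motivation. The sketch below illustrates that rule on plain token-id lists, under simplifying assumptions: `truncate_by_confidence` and `verify_greedy` are hypothetical names, not functions from this PR, and the real worker operates on batched tensors rather than Python lists.

```python
from typing import List, Sequence


def truncate_by_confidence(
    draft: Sequence[int], confidence: Sequence[float], threshold: float
) -> List[int]:
    """Keep draft tokens up to, but not including, the first position whose
    confidence is below the threshold. A threshold of 0.0 keeps the full block."""
    kept: List[int] = []
    for token, prob in zip(draft, confidence):
        if prob < threshold:
            break
        kept.append(token)
    return kept


def verify_greedy(draft: Sequence[int], target_argmax: Sequence[int]) -> List[int]:
    """Return the longest greedy-matching prefix of the draft plus one bonus token.

    Assumption: target_argmax[i] is the target's greedy token at block position i,
    computed with draft[:i] as context, so it has len(draft) + 1 entries.
    """
    if len(target_argmax) != len(draft) + 1:
        raise ValueError("target_argmax must have one more entry than draft")
    accepted: List[int] = []
    for i, token in enumerate(draft):
        if token != target_argmax[i]:
            break
        accepted.append(token)
    # Bonus token: the target's own choice at the first mismatch,
    # or the token after the block when every draft token matched.
    accepted.append(target_argmax[len(accepted)])
    return accepted


if __name__ == "__main__":
    # Illustrative token ids and confidences, not data from the PR.
    block = truncate_by_confidence([11, 12, 13], [0.9, 0.8, 0.2], threshold=0.5)
    print(block)
    print(verify_greedy(block, [11, 99, 42]))
```

Every emitted token is the target's greedy choice given the tokens before it, so in this sketch the block size and the threshold change only how many tokens are accepted per target forward, not which tokens are produced.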
Benchmarking and Profiling
Configuration
- Model: `deepseek-ai/DeepSeek-V4-Flash-DSpark` (284B, FP4 MoE)
- MoE runner backend: `flashinfer_mxfp4` (target and draft)
- No-spec baseline: `--mem-fraction-static 0.78 --max-running-requests 16 --cuda-graph-max-bs 16`; without a draft model it does not fit at the DSpark memory settings, which is also why the no-spec sweep stops at concurrency 16. The single-stream and low-concurrency rows are unaffected; treat the concurrency 8 and 16 rows as indicative.
Commands
Launch (DSpark):

    python3 -m sglang.launch_server \
      --model-path deepseek-ai/DeepSeek-V4-Flash-DSpark \
      --trust-remote-code --tp 8 --ep-size 8 \
      --moe-runner-backend flashinfer_mxfp4 \
      --speculative-moe-runner-backend flashinfer_mxfp4 \
      --speculative-algorithm DSPARK \
      --speculative-eagle-topk 1 --speculative-num-steps 1 \
      --mem-fraction-static 0.85 --context-length 4096 \
      --cuda-graph-max-bs 32 --max-running-requests 32 \
      --disable-overlap-schedule

Launch (no spec baseline): the same command without the `--speculative-*` flags and `--disable-overlap-schedule`, with `--mem-fraction-static 0.78 --max-running-requests 16 --cuda-graph-max-bs 16`.
Accuracy:
Throughput and latency sweep (per concurrency `cc`):

    python3 -m sglang.bench_serving --backend sglang --port 30000 \
      --dataset-name random --random-input-len 1024 --random-output-len 1024 \
      --random-range-ratio 1.0 --num-prompts $((cc*6)) \
      --max-concurrency $cc --request-rate inf

Acceptance length by domain: averaged the per-request `meta_info.spec_accept_length` returned by the server over 40 greedy prompts per dataset.
Results
Acceptance length by domain (tokens per target forward, bonus included; no spec equals 1.00):
Per-domain mean: math 3.37, code 3.52, chat 3.00.
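The per-domain means come from averaging the per-request `meta_info.spec_accept_length` values. A minimal sketch of that averaging step follows. It assumes each response has already been parsed into a dict with a `meta_info` mapping; the helper names and the sample values are hypothetical and are not measurements from this PR.

```python
from statistics import mean
from typing import Dict, Iterable, List


def mean_accept_length(responses: Iterable[dict]) -> float:
    """Average meta_info.spec_accept_length over a set of parsed responses."""
    lengths = [r["meta_info"]["spec_accept_length"] for r in responses]
    if not lengths:
        raise ValueError("no responses to average")
    return mean(lengths)


def per_domain_means(by_domain: Dict[str, List[dict]]) -> Dict[str, float]:
    """Map each domain name to the mean acceptance length of its responses."""
    return {domain: mean_accept_length(rs) for domain, rs in by_domain.items()}


if __name__ == "__main__":
    # Made-up values, for illustration only.
    sample = {
        "math": [
            {"meta_info": {"spec_accept_length": 3.0}},
            {"meta_info": {"spec_accept_length": 4.0}},
        ],
        "chat": [{"meta_info": {"spec_accept_length": 2.5}}],
    }
    print(per_domain_means(sample))
```

Because the bonus token is included in the count, a server running without speculation would report 1.00 under this metric.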
Single-stream speedup (same server, with and without DSpark):
Throughput and latency across concurrency (random 1024/1024, request-rate inf):
DSpark gives the largest gain on single-stream and low-to-mid concurrency. As the batch fills and decode becomes compute-bound, the no-spec baseline plateaus around 600 tok/s while DSpark keeps climbing to roughly 790 tok/s at concurrency 16.
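To rerun the sweep behind these numbers, the per-concurrency `sglang.bench_serving` command from the Commands section can be generated programmatically. The sketch below only builds the argument lists and does not launch anything. The helper name `bench_serving_argv` is hypothetical, and the concurrency list is an assumption, since the PR text names only single-stream, 8, and 16 explicitly.

```python
from typing import List, Sequence


def bench_serving_argv(cc: int, port: int = 30000) -> List[str]:
    """Build the bench_serving command line for one concurrency level.

    Mirrors the flags quoted in the Commands section: random 1024/1024
    prompts, six prompts per concurrency slot, and an unbounded request rate.
    """
    if cc < 1:
        raise ValueError("concurrency must be at least 1")
    return [
        "python3", "-m", "sglang.bench_serving",
        "--backend", "sglang",
        "--port", str(port),
        "--dataset-name", "random",
        "--random-input-len", "1024",
        "--random-output-len", "1024",
        "--random-range-ratio", "1.0",
        "--num-prompts", str(cc * 6),
        "--max-concurrency", str(cc),
        "--request-rate", "inf",
    ]


# Assumed sweep points; adjust to the concurrency levels actually measured.
CONCURRENCIES: Sequence[int] = (1, 2, 4, 8, 16)

if __name__ == "__main__":
    for cc in CONCURRENCIES:
        print(" ".join(bench_serving_argv(cc)))
```

Each list can be passed to `subprocess.run` once the server from the launch command is listening on the chosen port.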
Raw benchmark output
Accuracy (`sglang.test.run_eval`, 200 examples, greedy):
Acceptance length by domain (DSpark, greedy, n=40 per dataset):
Throughput sweep (`sglang.bench_serving`, random 1024/1024, request-rate inf):
Scope
This PR covers greedy (temperature 0) verification only. Requests with temperature > 0 are served with greedy verification and log a one-time warning; rejection sampling on the draft distribution (which the Markov-argmax drafter does not expose yet) is a follow-up. The throughput sweep was run with `--disable-overlap-schedule`; the draft-input structures carry the overlap-scheduler fields, and validating overlap-on throughput is a follow-up.
Checklist
CI States
Latest PR Test (Base): ❌ Run #28309481835
Latest PR Test (Extra): ❌ Run #28309481732