
Support overlap scheduling for speculative decoding#9588

Open
timmy-feng wants to merge 2 commits into sgl-project:main from modal-labs:eagle-overlap

Conversation


@timmy-feng timmy-feng commented Aug 25, 2025

Motivation

Speculative decoding currently does not support overlap scheduling due to the sequential dependency between the draft and target models. However, overlap scheduling has been shown to achieve up to 10% performance gains in non-speculative use cases. This PR achieves host overlap in speculative decoding, with a 5-10% improvement at various batch sizes.

Feature parity is in the works.

To enable this experimental feature, the SGLANG_ENABLE_EXPERIMENTAL_EAGLE_OVERLAP_SCHEDULE=1 environment variable must be set. Additionally, Flash Attention 3 is recommended, as the FlashInfer backend still contains a host sync.

Modifications

There should be no behavior change if the SGLANG_ENABLE_EXPERIMENTAL_EAGLE_OVERLAP_SCHEDULE environment variable is not set.

Host Syncs

The following was done to remove host syncs:

  • Remove dynamic shapes from target verify by always padding the accept length to spec_steps and adding padding handlers
  • Moved finished-request checking and batch filtering to process_batch_result_decode and filter_batch, respectively
  • Handle both allocation and freeing of pages on the scheduler (resolve_last_batch_result now returns an eviction mask to the scheduler)
  • Overestimate seq_lens_cpu since the host can only know the exact sequence length from one step ago
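As a minimal sketch of the fixed-shape padding idea above (the names here are illustrative, not SGLang's actual API): padding accepted tokens to a constant spec_steps width keeps tensor shapes static, so the scheduler never needs a host sync to learn the dynamic accept length — it carries a mask instead.

```python
# Hypothetical sketch: pad variable-length accept results to a fixed
# spec_steps width so downstream shapes stay static (no host sync needed).
SPEC_STEPS = 5
PAD_TOKEN = -1


def pad_accept(tokens):
    """Pad an accepted-token list to exactly SPEC_STEPS entries.

    Returns the padded tokens and a validity mask; consumers use the
    mask instead of reading a dynamic accept length from the device.
    """
    padded = tokens[:SPEC_STEPS] + [PAD_TOKEN] * (SPEC_STEPS - len(tokens))
    mask = [t != PAD_TOKEN for t in padded]
    return padded, mask
```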

Eagle Client

After removal of all syncs, EagleWorkerClient was implemented with:

  • A mock forward_speculative_batch_generation function which puts work on a queue for the forward thread
  • A FutureSpecInfo class which contains future buffers corresponding to each tensor in EagleDraftInput
  • forward_thread_func_ and resolve_last_batch_result, which mirror their counterparts in tp_worker_overlap_thread.py
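The pieces above can be sketched with a plain-Python analogue (illustrative only — in the PR, FutureSpecInfo holds future tensor buffers for EagleDraftInput on the GPU, not a threading primitive): the scheduler enqueues work and immediately receives a future, while a dedicated forward thread fills it in later.

```python
import queue
import threading


class FutureSpecInfo:
    """Hypothetical stand-in for a future result the forward thread fills in."""

    def __init__(self):
        self._done = threading.Event()
        self._value = None

    def set(self, value):
        self._value = value
        self._done.set()

    def resolve(self):
        # Analogous to resolve_last_batch_result: block until ready.
        self._done.wait()
        return self._value


class MockWorkerClient:
    """Sketch of the queue-based client: enqueue work, return a future."""

    def __init__(self):
        self._queue = queue.Queue()
        self._thread = threading.Thread(target=self._forward_thread_func, daemon=True)
        self._thread.start()

    def _forward_thread_func(self):
        while True:
            batch, future = self._queue.get()
            if batch is None:
                break
            # Stand-in for the actual draft/target forward pass.
            future.set([tok + 1 for tok in batch])

    def forward_speculative_batch_generation(self, batch):
        future = FutureSpecInfo()
        self._queue.put((batch, future))
        return future
```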

Future Work

We hope these items can be addressed in future PRs.

  • Support for page_size > 1 exists in this branch of work, which supports paged attention for all backends other than fa3
  • Support return logprobs
  • Support grammar (likely with syncs)
  • Support for P/D disaggregation -- we briefly implemented a proof-of-concept showing that P/D, overlap scheduling, and speculative decoding can work together.
  • Reduce code duplication between eagle_worker.py and eagle_worker_for_overlap_scheduler.py. We separated these two files for now to reduce the risk of breaking changes.

Accuracy Tests

I ran GSM8K on an H100.

# Main
SGLANG_ENABLE_EXPERIMENTAL_EAGLE_OVERLAP_SCHEDULE=1 python -m sglang.launch_server --model-path Qwen/Qwen3-8B --speculative-draft-model-path Tengyunw/qwen3_8b_eagle3 --speculative-algorithm EAGLE3 --speculative-num-steps 5 --speculative-eagle-topk 10 --speculative-num-draft-tokens 32 --attention-backend fa3 --mem-fraction-static 0.7 --dtype bfloat16 --port 30000
python benchmark/gsm8k/bench_sglang.py --num-questions 200
Accuracy: 0.955
Invalid: 0.000
Latency: 12.379 s
Output throughput: 1922.860 token/s

# This branch
Accuracy: 0.950
Invalid: 0.000
Latency: 11.918 s
Output throughput: 2039.966 token/s

Benchmarking and Profiling

Benchmarks were run on an H200.

Before

With concurrency 1:

============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    inf       
Max request concurrency:                 1         
Successful requests:                     200       
Benchmark duration (s):                  121.97    
Total input tokens:                      64205     
Total generated tokens:                  42957     
Total generated tokens (retokenized):    42956     
Request throughput (req/s):              1.64      
Input token throughput (tok/s):          526.41    
Output token throughput (tok/s):         352.20    
Total token throughput (tok/s):          878.60    
Concurrency:                             1.00      
Accept length:                           3.48      
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   609.39    
Median E2E Latency (ms):                 464.94    
---------------Time to First Token----------------
Mean TTFT (ms):                          24.10     
Median TTFT (ms):                        21.63     
P99 TTFT (ms):                           70.97     
---------------Inter-Token Latency----------------
Mean ITL (ms):                           2.74      
Median ITL (ms):                         2.04      
P95 ITL (ms):                            4.98      
P99 ITL (ms):                            9.74      
Max ITL (ms):                            17.18     
==================================================

With concurrency 4:

============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    inf       
Max request concurrency:                 4         
Successful requests:                     200       
Benchmark duration (s):                  40.23     
Total input tokens:                      64205     
Total generated tokens:                  42957     
Total generated tokens (retokenized):    42956     
Request throughput (req/s):              4.97      
Input token throughput (tok/s):          1596.02   
Output token throughput (tok/s):         1067.83   
Total token throughput (tok/s):          2663.85   
Concurrency:                             3.94      
Accept length:                           3.46      
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   792.78    
Median E2E Latency (ms):                 593.81    
---------------Time to First Token----------------
Mean TTFT (ms):                          31.66     
Median TTFT (ms):                        24.22     
P99 TTFT (ms):                           158.63    
---------------Inter-Token Latency----------------
Mean ITL (ms):                           3.56      
Median ITL (ms):                         2.82      
P95 ITL (ms):                            8.44      
P99 ITL (ms):                            11.96     
Max ITL (ms):                            267.40    
==================================================

After

With concurrency 1:

============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    inf       
Max request concurrency:                 1         
Successful requests:                     200       
Benchmark duration (s):                  112.04    
Total input tokens:                      64205     
Total generated tokens:                  42957     
Total generated tokens (retokenized):    42956     
Request throughput (req/s):              1.79      
Input token throughput (tok/s):          573.05    
Output token throughput (tok/s):         383.40    
Total token throughput (tok/s):          956.45    
Concurrency:                             1.00      
Accept length:                           3.54      
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   559.91    
Median E2E Latency (ms):                 431.09    
---------------Time to First Token----------------
Mean TTFT (ms):                          30.63     
Median TTFT (ms):                        28.67     
P99 TTFT (ms):                           73.23     
---------------Inter-Token Latency----------------
Mean ITL (ms):                           2.48      
Median ITL (ms):                         1.83      
P95 ITL (ms):                            4.48      
P99 ITL (ms):                            8.80      
Max ITL (ms):                            10.16     
==================================================

With concurrency 4:

============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    inf       
Max request concurrency:                 4         
Successful requests:                     200       
Benchmark duration (s):                  36.16     
Total input tokens:                      64205     
Total generated tokens:                  42957     
Total generated tokens (retokenized):    42955     
Request throughput (req/s):              5.53      
Input token throughput (tok/s):          1775.41   
Output token throughput (tok/s):         1187.86   
Total token throughput (tok/s):          2963.27   
Concurrency:                             3.95      
Accept length:                           3.57      
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   713.59    
Median E2E Latency (ms):                 529.06    
---------------Time to First Token----------------
Mean TTFT (ms):                          39.49     
Median TTFT (ms):                        29.25     
P99 TTFT (ms):                           213.86    
---------------Inter-Token Latency----------------
Mean ITL (ms):                           3.15      
Median ITL (ms):                         2.33      
P95 ITL (ms):                            7.01      
P99 ITL (ms):                            11.79     
Max ITL (ms):                            85.62     
==================================================

Repro Script

This script was run on an H200:

#! /bin/bash

# Start SGLang server
SGLANG_ENABLE_EXPERIMENTAL_EAGLE_OVERLAP_SCHEDULE=1 python -m sglang.launch_server \
    --model-path Qwen/Qwen3-8B \
    --speculative-draft-model-path Tengyunw/qwen3_8b_eagle3 \
    --speculative-algorithm EAGLE3 \
    --speculative-num-steps 5 \
    --speculative-eagle-topk 10 \
    --speculative-num-draft-tokens 32 \
    --attention-backend fa3 \
    --mem-fraction-static 0.9 \
    --dtype bfloat16 \
    --port 30000 &
PID=$!


# Wait for server to start
while ! curl -s http://localhost:30000/health > /dev/null; do
    sleep 1
done

# Run accuracy benchmark
python benchmark/gsm8k/bench_sglang.py --num-questions 200

# Flush cache
curl -X POST http://localhost:30000/flush_cache

# Run latency benchmark (bs1)
python -m sglang.bench_serving \
    --backend sglang \
    --num-prompts 200 \
    --max-concurrency 1

# Flush cache
curl -X POST http://localhost:30000/flush_cache

# Run latency benchmark (bs4)
python -m sglang.bench_serving \
    --backend sglang \
    --num-prompts 200 \
    --max-concurrency 4

# Kill server
kill $PID


zhyncs commented Aug 26, 2025

Support grammar is the most important todo item


This method and the overlap_prepare_for_verify method were deliberately kept separate to reduce the risk of breaking existing code.

timmy-feng and others added 2 commits September 2, 2025 15:46
# but we copy seq_lens in the scheduler's stream. This is a problem because seq_lens may
# not have been mutated by EagleWorkerClient before the scheduler stream starts making
# a copy of it. To avoid this, we synchronize all streams before copying seq_lens.
torch.cuda.synchronize()

Quick suggestion: is the torch.cuda.synchronize() here unintentionally serializing execution? It forces a full device barrier and can kill overlap throughput. Do you think something like the following preserves overlap rather than stalling the whole GPU?

# Sketch: draft_stream and tp_stream stand for the draft worker's and
# TP worker's CUDA streams, respectively.
event = torch.cuda.Event(blocking=False)
event.record(draft_stream)

# later, when the TP worker needs the results
event.wait(tp_stream)

This way only the TP stream waits on draft completion rather than synchronizing the entire device.
