
Support overlap scheduling for speculative decoding#9588

Open
timmy-feng wants to merge 2 commits into sgl-project:main from modal-labs:eagle-overlap

Conversation


@timmy-feng timmy-feng commented Aug 25, 2025

Motivation

Speculative decoding currently does not support overlap scheduling due to the sequential dependency between the draft and target models. However, overlap scheduling has been shown to achieve up to 10% performance gains in non-speculative use cases. This PR achieves host overlap in speculative decoding, with a 5-10% improvement at various batch sizes.

Feature parity is in the works.

To enable this experimental feature, the SGLANG_ENABLE_EXPERIMENTAL_EAGLE_OVERLAP_SCHEDULE=1 environment variable must be set. Additionally, Flash Attention 3 is recommended, as the FlashInfer backend still contains a host sync.

Modifications

There should be no behavior change if the SGLANG_ENABLE_EXPERIMENTAL_EAGLE_OVERLAP_SCHEDULE environment variable is not set.

Host Syncs

The following was done to remove host syncs:

  • Remove dynamic shapes from target verify by always padding the accept length to spec_steps and adding padding handlers
  • Moved finished-request checking and batch filtering to process_batch_result_decode and filter_batch, respectively
  • Handle both allocation and freeing of pages on the scheduler (resolve_last_batch_result now returns an eviction mask to the scheduler)
  • Overestimate seq_lens_cpu since the host can only know the exact sequence length from one step ago
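As a minimal sketch of the fixed-shape padding idea above (the names here are illustrative, not SGLang's actual API): padding accepted tokens to a constant spec_steps width keeps tensor shapes static, so the scheduler never needs a host sync to learn the dynamic accept length — it carries a mask instead.

```python
# Hypothetical sketch: pad variable-length accept results to a fixed
# spec_steps width so downstream shapes stay static (no host sync needed).
SPEC_STEPS = 5
PAD_TOKEN = -1


def pad_accept(tokens):
    """Pad an accepted-token list to exactly SPEC_STEPS entries.

    Returns the padded tokens and a validity mask; consumers use the
    mask instead of reading a dynamic accept length from the device.
    """
    padded = tokens[:SPEC_STEPS] + [PAD_TOKEN] * (SPEC_STEPS - len(tokens))
    mask = [t != PAD_TOKEN for t in padded]
    return padded, mask
```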

Eagle Client

After removal of all syncs, EagleWorkerClient was implemented with:

  • A mock forward_speculative_batch_generation function which puts work on a queue for the forward thread
  • A FutureSpecInfo class which contains future buffers corresponding to each tensor in EagleDraftInput
  • forward_thread_func_ and resolve_last_batch_result, which mirror their counterparts in tp_worker_overlap_thread.py
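The pieces above can be sketched with a plain-Python analogue (illustrative only — in the PR, FutureSpecInfo holds future tensor buffers for EagleDraftInput on the GPU, not a threading primitive): the scheduler enqueues work and immediately receives a future, while a dedicated forward thread fills it in later.

```python
import queue
import threading


class FutureSpecInfo:
    """Hypothetical stand-in for a future result the forward thread fills in."""

    def __init__(self):
        self._done = threading.Event()
        self._value = None

    def set(self, value):
        self._value = value
        self._done.set()

    def resolve(self):
        # Analogous to resolve_last_batch_result: block until ready.
        self._done.wait()
        return self._value


class MockWorkerClient:
    """Sketch of the queue-based client: enqueue work, return a future."""

    def __init__(self):
        self._queue = queue.Queue()
        self._thread = threading.Thread(target=self._forward_thread_func, daemon=True)
        self._thread.start()

    def _forward_thread_func(self):
        while True:
            batch, future = self._queue.get()
            if batch is None:
                break
            # Stand-in for the actual draft/target forward pass.
            future.set([tok + 1 for tok in batch])

    def forward_speculative_batch_generation(self, batch):
        future = FutureSpecInfo()
        self._queue.put((batch, future))
        return future
```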

Future Work

We hope these items can be addressed in future PRs.

  • Support for page_size > 1 exists in this branch of work, which supports paged attention for all backends other than fa3
  • Support return logprobs
  • Support grammar (likely with syncs)
  • Support for P/D disaggregation -- we briefly implemented a proof-of-concept showing that P/D, overlap scheduling, and speculative decoding can work together.
  • Reduce code duplication between eagle_worker.py and eagle_worker_for_overlap_scheduler.py. We separated these two files for now to reduce the risk of breaking changes.

Accuracy Tests

I ran GSM8K on an H100.

# Main
SGLANG_ENABLE_EXPERIMENTAL_EAGLE_OVERLAP_SCHEDULE=1 python -m sglang.launch_server --model-path Qwen/Qwen3-8B --speculative-draft-model-path Tengyunw/qwen3_8b_eagle3 --speculative-algorithm EAGLE3 --speculative-num-steps 5 --speculative-eagle-topk 10 --speculative-num-draft-tokens 32 --attention-backend fa3 --mem-fraction-static 0.7 --dtype bfloat16 --port 30000
python benchmark/gsm8k/bench_sglang.py --num-questions 200
Accuracy: 0.955
Invalid: 0.000
Latency: 12.379 s
Output throughput: 1922.860 token/s

# This branch
Accuracy: 0.950
Invalid: 0.000
Latency: 11.918 s
Output throughput: 2039.966 token/s

Benchmarking and Profiling

Benchmarks were run on an H200.

Before

With concurrency 1:

============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    inf       
Max request concurrency:                 1         
Successful requests:                     200       
Benchmark duration (s):                  121.97    
Total input tokens:                      64205     
Total generated tokens:                  42957     
Total generated tokens (retokenized):    42956     
Request throughput (req/s):              1.64      
Input token throughput (tok/s):          526.41    
Output token throughput (tok/s):         352.20    
Total token throughput (tok/s):          878.60    
Concurrency:                             1.00      
Accept length:                           3.48      
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   609.39    
Median E2E Latency (ms):                 464.94    
---------------Time to First Token----------------
Mean TTFT (ms):                          24.10     
Median TTFT (ms):                        21.63     
P99 TTFT (ms):                           70.97     
---------------Inter-Token Latency----------------
Mean ITL (ms):                           2.74      
Median ITL (ms):                         2.04      
P95 ITL (ms):                            4.98      
P99 ITL (ms):                            9.74      
Max ITL (ms):                            17.18     
==================================================

With concurrency 4:

============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    inf       
Max request concurrency:                 4         
Successful requests:                     200       
Benchmark duration (s):                  40.23     
Total input tokens:                      64205     
Total generated tokens:                  42957     
Total generated tokens (retokenized):    42956     
Request throughput (req/s):              4.97      
Input token throughput (tok/s):          1596.02   
Output token throughput (tok/s):         1067.83   
Total token throughput (tok/s):          2663.85   
Concurrency:                             3.94      
Accept length:                           3.46      
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   792.78    
Median E2E Latency (ms):                 593.81    
---------------Time to First Token----------------
Mean TTFT (ms):                          31.66     
Median TTFT (ms):                        24.22     
P99 TTFT (ms):                           158.63    
---------------Inter-Token Latency----------------
Mean ITL (ms):                           3.56      
Median ITL (ms):                         2.82      
P95 ITL (ms):                            8.44      
P99 ITL (ms):                            11.96     
Max ITL (ms):                            267.40    
==================================================

After

With concurrency 1:

============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    inf       
Max request concurrency:                 1         
Successful requests:                     200       
Benchmark duration (s):                  112.04    
Total input tokens:                      64205     
Total generated tokens:                  42957     
Total generated tokens (retokenized):    42956     
Request throughput (req/s):              1.79      
Input token throughput (tok/s):          573.05    
Output token throughput (tok/s):         383.40    
Total token throughput (tok/s):          956.45    
Concurrency:                             1.00      
Accept length:                           3.54      
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   559.91    
Median E2E Latency (ms):                 431.09    
---------------Time to First Token----------------
Mean TTFT (ms):                          30.63     
Median TTFT (ms):                        28.67     
P99 TTFT (ms):                           73.23     
---------------Inter-Token Latency----------------
Mean ITL (ms):                           2.48      
Median ITL (ms):                         1.83      
P95 ITL (ms):                            4.48      
P99 ITL (ms):                            8.80      
Max ITL (ms):                            10.16     
==================================================

With concurrency 4:

============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    inf       
Max request concurrency:                 4         
Successful requests:                     200       
Benchmark duration (s):                  36.16     
Total input tokens:                      64205     
Total generated tokens:                  42957     
Total generated tokens (retokenized):    42955     
Request throughput (req/s):              5.53      
Input token throughput (tok/s):          1775.41   
Output token throughput (tok/s):         1187.86   
Total token throughput (tok/s):          2963.27   
Concurrency:                             3.95      
Accept length:                           3.57      
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   713.59    
Median E2E Latency (ms):                 529.06    
---------------Time to First Token----------------
Mean TTFT (ms):                          39.49     
Median TTFT (ms):                        29.25     
P99 TTFT (ms):                           213.86    
---------------Inter-Token Latency----------------
Mean ITL (ms):                           3.15      
Median ITL (ms):                         2.33      
P95 ITL (ms):                            7.01      
P99 ITL (ms):                            11.79     
Max ITL (ms):                            85.62     
==================================================

Repro Script

This script was run on an H200:

#! /bin/bash

# Start SGLang server
SGLANG_ENABLE_EXPERIMENTAL_EAGLE_OVERLAP_SCHEDULE=1 python -m sglang.launch_server \
    --model-path Qwen/Qwen3-8B \
    --speculative-draft-model-path Tengyunw/qwen3_8b_eagle3 \
    --speculative-algorithm EAGLE3 \
    --speculative-num-steps 5 \
    --speculative-eagle-topk 10 \
    --speculative-num-draft-tokens 32 \
    --attention-backend fa3 \
    --mem-fraction-static 0.9 \
    --dtype bfloat16 \
    --port 30000 &
PID=$!


# Wait for server to start
while ! curl -s http://localhost:30000/health > /dev/null; do
    sleep 1
done

# Run accuracy benchmark
python benchmark/gsm8k/bench_sglang.py --num-questions 200

# Flush cache
curl -X POST http://localhost:30000/flush_cache

# Run latency benchmark (bs1)
python -m sglang.bench_serving \
    --backend sglang \
    --num-prompts 200 \
    --max-concurrency 1

# Flush cache
curl -X POST http://localhost:30000/flush_cache

# Run latency benchmark (bs4)
python -m sglang.bench_serving \
    --backend sglang \
    --num-prompts 200 \
    --max-concurrency 4

# Kill server
kill $PID


zhyncs commented Aug 26, 2025

Support grammar is the most important todo item


This method and the overlap_prepare_for_verify method were deliberately kept separate to reduce the risk of breaking existing code.

timmy-feng and others added 2 commits September 2, 2025 15:46
# but we copy seq_lens in the scheduler's stream. This is a problem because seq_lens may
# not have been mutated by EagleWorkerClient before the scheduler stream starts making
# a copy of it. To avoid this, we synchronize all streams before copying seq_lens.
torch.cuda.synchronize()

Quick suggestion: is the torch.cuda.synchronize() here unintentionally serializing execution? It forces a full device barrier and can kill overlap throughput. Do you think something like the following preserves overlap rather than stalling the whole GPU?

# Sketch: draft_stream and tp_stream stand for the draft worker's and
# TP worker's CUDA streams, respectively.
event = torch.cuda.Event(blocking=False)
event.record(draft_stream)

# later, when the TP worker needs the results
event.wait(tp_stream)

This way only the TP stream waits on draft completion rather than synchronizing the entire device.
