PD streaming: batch notify + SSE fast path#22658

Open
inkcherry wants to merge 1 commit into sgl-project:main from inkcherry:pd_streaming_opt

Conversation

@inkcherry (Contributor) commented Apr 13, 2026

Motivation

cc @ZhaiFeiyue @Duyi-Wang
Under high-concurrency PD disaggregation streaming (e.g., 2048 concurrent requests), the decode-side tokenizer_manager becomes a CPU bottleneck due to two issues:

  1. Asyncio wakeup storms: Each event.set() in _handle_batch_output immediately wakes an asyncio coroutine. With hundreds of requests per decode batch, this causes excessive context switching.
  2. Per-token Pydantic overhead: Every SSE streaming chunk constructs 3 Pydantic objects (DeltaMessage, ChatCompletionResponseStreamChoice, ChatCompletionStreamResponse) and calls model_dump_json(), which involves schema validation, field traversal, and recursive serialization — unnecessary for a fixed-structure streaming chunk.

Both optimizations target the post-decode CPU path only (tokenizer manager + API entrypoint), improving throughput and TPOT without increasing ITL.

Changes

1. Batch Notify (tokenizer_manager.py)

  • Change _handle_batch_output from sync to async
  • Instead of calling state.event.set() per request, collect pending notifications and flush in groups of 16 with await asyncio.sleep(0) yield points
  • Final flush after the loop ensures no notifications are lost
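The batching pattern described above can be sketched as follows. This is an illustrative reconstruction, not the actual sglang code: names such as `handle_batch_output`, `FLUSH_EVERY`, and the `state.event` attribute mirror the description in this PR but are assumptions about the real `tokenizer_manager.py` internals.

```python
import asyncio

FLUSH_EVERY = 16  # flush pending notifications in groups of 16

async def handle_batch_output(states):
    """Sketch of batched event notification: instead of waking one
    coroutine per request, wake them in groups and yield control so
    the woken coroutines can run between flushes."""
    pending = []
    for state in states:
        # ... per-request output bookkeeping would happen here ...
        pending.append(state)
        if len(pending) >= FLUSH_EVERY:
            for s in pending:
                s.event.set()          # wake waiters as a batch
            pending.clear()
            await asyncio.sleep(0)     # yield point between flush groups
    # final flush after the loop so no notification is lost
    for s in pending:
        s.event.set()
```

Grouping the `event.set()` calls bounds the number of scheduler wakeup bursts per batch, which is the context-switching cost the PR targets.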

2. SSE Fast Path (serving_chat.py)

  • Add _fast_sse_content() helper that constructs plain Python dicts and uses orjson.dumps() instead of Pydantic model_dump_json()
  • Replace Pydantic serialization in the 4 hot-path streaming yield points: first chunk (role), reasoning content, regular content, and finish_reason
  • Non-hot-path yields (hidden_states, routed_experts, final usage, tool calls) remain unchanged

Note on IPC serialization

During profiling we also identified pickle serialization on the detokenizer→tokenizer IPC path as a major bottleneck. We noticed that #21643 is already in progress to migrate this from pickle to msgpack/msgspec, which we believe will provide significant additional throughput gains for high-concurrency PD disagg workloads.

Test Setup

Hardware: 2-node PD disagg (1× prefill MI355, 1× decode MI355)
Model: DeepSeek-R1-0528-MXFP4
Benchmark: 10,240 prompts, random input/output len 1024 (range ratio 0.8), max-concurrency 2048, 2,048 warmup requests

Prefill server (node 1):

python3 -m sglang.launch_server \
    --model-path DeepSeek-R1-0528-MXFP4 \
    --disaggregation-mode prefill \
    --tp-size 4 --ep-size 4 --dp-size 4 \
    --max-running-requests 256 \
    --chunked-prefill-size 49152 \
    --kv-cache-dtype fp8_e4m3 \
    --attention-backend aiter \
    --enable-dp-attention \
    --disable-radix-cache

Decode server (node 2):

python3 -m sglang.launch_server \
    --model-path DeepSeek-R1-0528-MXFP4 \
    --disaggregation-mode decode \
    --tp-size 8 --ep-size 8 --dp-size 8 \
    --max-running-requests 2048 \
    --kv-cache-dtype fp8_e4m3 \
    --attention-backend aiter \
    --enable-dp-attention \
    --cuda-graph-bs 1..560

Router:

python3 -m sglang_router.launch_router \
    --pd-disaggregation --port 30000 \
    --policy random --prefill-policy random --decode-policy random \
    --prefill http://<prefill_ip>:8000 --decode http://<decode_ip>:8000

Benchmark:

python3 benchmark_serving.py \
    --backend openai --base-url http://0.0.0.0:30000 \
    --model DeepSeek-R1-0528-MXFP4 \
    --dataset-name random --random-input-len 1024 --random-output-len 1024 --random-range-ratio 0.8 \
    --num-prompts 10240 --max-concurrency 2048 --request-rate inf \
    --ignore-eos --num-warmups 2048

Results

Interleaved validation (baseline and this PR alternated across rounds to control for thermal drift):

| Round | Config   | Output tok/s | Mean TPOT (ms) | Mean ITL (ms) |
|-------|----------|--------------|----------------|---------------|
| 1     | Baseline | 11,533       | 111.67         | 111.53        |
| 2     | This PR  | 13,929       | 91.93          | 91.84         |
| 3     | Baseline | 11,443       | 109.81         | 109.68        |
| 4     | This PR  | 13,674       | 93.30          | 93.21         |
Averaged over rounds: output throughput 11,488 → 13,802 tok/s (+20.1%), mean TPOT 110.74 → 92.62 ms (-16.4%), mean ITL 110.61 → 92.53 ms (-16.4%).
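The averaged figures quoted above can be reproduced directly from the per-round table:

```python
# Averages over the interleaved rounds (baseline: rounds 1 and 3,
# this PR: rounds 2 and 4), as reported in the results table.
baseline_tps = (11533 + 11443) / 2      # output tok/s, baseline
pr_tps       = (13929 + 13674) / 2      # output tok/s, this PR
baseline_tpot = (111.67 + 109.81) / 2   # mean TPOT (ms), baseline
pr_tpot       = (91.93 + 93.30) / 2     # mean TPOT (ms), this PR

print(f"throughput: {baseline_tps:.0f} -> {pr_tps:.0f} tok/s "
      f"({pr_tps / baseline_tps - 1:+.1%})")
print(f"TPOT: {baseline_tpot:.2f} -> {pr_tpot:.2f} ms "
      f"({pr_tpot / baseline_tpot - 1:+.1%})")
```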

Checklist

Review and Merge Process

  1. Ping Merge Oncalls to start the process. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • Common commands include /tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci
  4. After green CI and required approvals, ask Merge Oncalls or people with Write permission to merge the PR.

P1 (Batch Notify): Batch event.set() calls in groups of 16 with
asyncio.sleep(0) yield points to reduce asyncio wakeup storms under
high-concurrency PD disagg streaming.

A1 (SSE Fast Path): Replace Pydantic model_dump_json() with direct
dict construction + orjson.dumps() in the SSE streaming hot path,
eliminating per-chunk Pydantic overhead.
@Kangyan-Zhou Kangyan-Zhou requested a review from alexnails April 13, 2026 05:55
@HaiShaw (Collaborator) commented Apr 13, 2026

/tag-and-rerun-ci

@HaiShaw (Collaborator) commented Apr 13, 2026

@hnyls2002 please help review.

