PD streaming: batch notify + SSE fast path#22658

Open
inkcherry wants to merge 1 commit into sgl-project:main from inkcherry:pd_streaming_opt

Conversation

@inkcherry (Contributor) commented Apr 13, 2026

Motivation

cc @ZhaiFeiyue @Duyi-Wang
Under high-concurrency PD disaggregation streaming (e.g., 2048 concurrent requests), the decode-side tokenizer_manager becomes a CPU bottleneck due to two issues:

  1. Asyncio wakeup storms: Each event.set() in _handle_batch_output immediately wakes an asyncio coroutine. With hundreds of requests per decode batch, this causes excessive context switching.
  2. Per-token Pydantic overhead: Every SSE streaming chunk constructs 3 Pydantic objects (DeltaMessage, ChatCompletionResponseStreamChoice, ChatCompletionStreamResponse) and calls model_dump_json(), which involves schema validation, field traversal, and recursive serialization — unnecessary for a fixed-structure streaming chunk.

Both optimizations target the post-decode CPU path only (tokenizer manager + API entrypoint), improving throughput and TPOT without increasing ITL.

Changes

1. Batch Notify (tokenizer_manager.py)

  • Change _handle_batch_output from sync to async
  • Instead of calling state.event.set() per request, collect pending notifications and flush in groups of 16 with await asyncio.sleep(0) yield points
  • Final flush after the loop ensures no notifications are lost
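The batching pattern described above can be sketched as follows. This is an illustrative reconstruction, not the actual sglang code: names such as `handle_batch_output`, `FLUSH_EVERY`, and the `state.event` attribute mirror the description in this PR but are assumptions about the real `tokenizer_manager.py` internals.

```python
import asyncio

FLUSH_EVERY = 16  # flush pending notifications in groups of 16

async def handle_batch_output(states):
    """Sketch of batched event notification: instead of waking one
    coroutine per request, wake them in groups and yield control so
    the woken coroutines can run between flushes."""
    pending = []
    for state in states:
        # ... per-request output bookkeeping would happen here ...
        pending.append(state)
        if len(pending) >= FLUSH_EVERY:
            for s in pending:
                s.event.set()          # wake waiters as a batch
            pending.clear()
            await asyncio.sleep(0)     # yield point between flush groups
    # final flush after the loop so no notification is lost
    for s in pending:
        s.event.set()
```

Grouping the `event.set()` calls bounds the number of scheduler wakeup bursts per batch, which is the context-switching cost the PR targets.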

2. SSE Fast Path (serving_chat.py)

  • Add _fast_sse_content() helper that constructs plain Python dicts and uses orjson.dumps() instead of Pydantic model_dump_json()
  • Replace Pydantic serialization in the 4 hot-path streaming yield points: first chunk (role), reasoning content, regular content, and finish_reason
  • Non-hot-path yields (hidden_states, routed_experts, final usage, tool calls) remain unchanged

Note on IPC serialization

During profiling we also identified pickle serialization on the detokenizer→tokenizer IPC path as a major bottleneck. We noticed that #21643 is already in progress to migrate this from pickle to msgpack/msgspec, which we believe will provide significant additional throughput gains for high-concurrency PD disagg workloads.

Test Setup

Hardware: 2-node PD disagg (1× prefill MI355, 1× decode MI355)
Model: DeepSeek-R1-0528-MXFP4
Benchmark: 10,240 prompts, random input/output len 1024 (range ratio 0.8), max-concurrency 2048, 2,048 warmup requests

Prefill server (node 1):

python3 -m sglang.launch_server \
    --model-path DeepSeek-R1-0528-MXFP4 \
    --disaggregation-mode prefill \
    --tp-size 4 --ep-size 4 --dp-size 4 \
    --max-running-requests 256 \
    --chunked-prefill-size 49152 \
    --kv-cache-dtype fp8_e4m3 \
    --attention-backend aiter \
    --enable-dp-attention \
    --disable-radix-cache

Decode server (node 2):

python3 -m sglang.launch_server \
    --model-path DeepSeek-R1-0528-MXFP4 \
    --disaggregation-mode decode \
    --tp-size 8 --ep-size 8 --dp-size 8 \
    --max-running-requests 2048 \
    --kv-cache-dtype fp8_e4m3 \
    --attention-backend aiter \
    --enable-dp-attention \
    --cuda-graph-bs 1..560

Router:

python3 -m sglang_router.launch_router \
    --pd-disaggregation --port 30000 \
    --policy random --prefill-policy random --decode-policy random \
    --prefill http://<prefill_ip>:8000 --decode http://<decode_ip>:8000

Benchmark:

python3 benchmark_serving.py \
    --backend openai --base-url http://0.0.0.0:30000 \
    --model DeepSeek-R1-0528-MXFP4 \
    --dataset-name random --random-input-len 1024 --random-output-len 1024 --random-range-ratio 0.8 \
    --num-prompts 10240 --max-concurrency 2048 --request-rate inf \
    --ignore-eos --num-warmups 2048

Results

Interleaved validation (baseline and this PR alternated across rounds to control for thermal drift):

| Round | Config   | Output tok/s | Mean TPOT (ms) | Mean ITL (ms) |
|-------|----------|--------------|----------------|---------------|
| 1     | Baseline | 11,533       | 111.67         | 111.53        |
| 2     | This PR  | 13,929       | 91.93          | 91.84         |
| 3     | Baseline | 11,443       | 109.81         | 109.68        |
| 4     | This PR  | 13,674       | 93.30          | 93.21         |
Averaged over rounds: output throughput 11,488 → 13,802 tok/s (+20.1%), mean TPOT 110.74 → 92.62 ms (-16.4%), mean ITL 110.61 → 92.53 ms (-16.4%).
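The averaged figures quoted above can be reproduced directly from the per-round table:

```python
# Averages over the interleaved rounds (baseline: rounds 1 and 3,
# this PR: rounds 2 and 4), as reported in the results table.
baseline_tps = (11533 + 11443) / 2      # output tok/s, baseline
pr_tps       = (13929 + 13674) / 2      # output tok/s, this PR
baseline_tpot = (111.67 + 109.81) / 2   # mean TPOT (ms), baseline
pr_tpot       = (91.93 + 93.30) / 2     # mean TPOT (ms), this PR

print(f"throughput: {baseline_tps:.0f} -> {pr_tps:.0f} tok/s "
      f"({pr_tps / baseline_tps - 1:+.1%})")
print(f"TPOT: {baseline_tpot:.2f} -> {pr_tpot:.2f} ms "
      f"({pr_tpot / baseline_tpot - 1:+.1%})")
```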

Checklist

Review and Merge Process

  1. Ping Merge Oncalls to start the process. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • Common commands include /tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci
  4. After green CI and required approvals, ask Merge Oncalls or people with Write permission to merge the PR.

P1 (Batch Notify): Batch event.set() calls in groups of 16 with
asyncio.sleep(0) yield points to reduce asyncio wakeup storms under
high-concurrency PD disagg streaming.

A1 (SSE Fast Path): Replace Pydantic model_dump_json() with direct
dict construction + orjson.dumps() in the SSE streaming hot path,
eliminating per-chunk Pydantic overhead.
@Kangyan-Zhou Kangyan-Zhou requested a review from alexnails April 13, 2026 05:55
@HaiShaw (Collaborator) commented Apr 13, 2026

/tag-and-rerun-ci

@HaiShaw (Collaborator) commented Apr 13, 2026

@hnyls2002 please help review.

