
Optimize TTFT: send first token immediately after prefill for streaming#701

Open
RishabhSaini wants to merge 3 commits into llm-d:main from RishabhSaini:streamPrefill

Conversation


@RishabhSaini RishabhSaini commented Mar 10, 2026

Reduces Time To First Token (TTFT) for streaming clients by sending the first token immediately after prefill completes, before KV cache transfer to decode.

  • Convert non-streaming prefill response to SSE and forward first token to streaming clients
  • Fix token budget: decrement max_tokens by 1 for decode stage
  • Fix TTFT metrics to measure when first token actually reaches user (streaming vs non-streaming)
  • Strip internal kv_transfer_params from user-facing responses
  • Add headersSentWriter to prevent duplicate WriteHeader errors

RishabhSaini (Author) commented:

On H200s, with a 1P/1D (TP=2) deployment of GPT-OSS-120B using always_pd_disagg_decider, on the sanity_concurrent benchmark, averaged across 3 runs per configuration:

| Metric | Baseline (runs 1-3) | Optimized (runs 4-6) | Change | % Change |
| --- | --- | --- | --- | --- |
| TTFT p50 | 50.24 ms | 31.41 ms | -18.83 ms | -37.5% |
| TTFT p75 | 51.41 ms | 32.57 ms | -18.84 ms | -36.6% |
| TTFT p99 | 65.00 ms | 52.35 ms | -12.65 ms | -19.5% |
| TPOT p50 | 3.70 ms/token | 3.99 ms/token | +0.29 ms/token | +7.8% |
| TPOT p75 | 3.71 ms/token | 4.00 ms/token | +0.29 ms/token | +7.8% |
| TPOT p99 | 3.97 ms/token | 5.96 ms/token | +1.99 ms/token | +50.1% |

@RishabhSaini RishabhSaini force-pushed the streamPrefill branch 3 times, most recently from 8d4df8c to 528c60b Compare March 10, 2026 16:53
@RishabhSaini RishabhSaini requested a review from kfswain March 10, 2026 19:29
Signed-off-by: RishabhSaini <rishabhsaini01@gmail.com>
vMaroon (Member) commented Mar 12, 2026

How does this affect the standard UX of staring at a blank screen and then getting a fast stream of tokens? Does the time between the first and second tokens match the average ITL, or would the user experience the equivalent of "lag"?


github-actions bot commented Apr 4, 2026

This PR is marked as stale after 21d of inactivity. After an additional 14d of inactivity (7d to become rotten, then 7d more), it will be closed. To prevent this PR from being closed, add a comment or remove the lifecycle/stale label.


3 participants