
[WIP] Investigating why batching streaming is not helping with the streaming overhead #52766

Draft: wants to merge 2 commits into master
Conversation

@kouroshHakha (Contributor) commented May 3, 2025

This PR refactors LLMServer and vLLMEngine so that streaming batching happens at the LLMServer layer instead of inside vLLMEngine.

I also noticed that when vLLMEngine does the batching, LLMServer unpacks the batch and streams the individual items through the remote channel back to the router. This is a problem at high QPS, where Ray's streaming becomes the bottleneck and end-to-end latency is dominated by the streaming overhead. With proper batching we can mitigate this, which was not possible before because of the extra unpacking done at the wrong layer. Unpacking should happen at the very last stage, when the router sends the results back to the HTTP proxy.
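To make the intended flow concrete, here is a minimal, hypothetical sketch (not the actual Ray Serve LLM code) of how the server layer could re-batch the engine's token stream before it crosses the remote channel. The function name `stream_batched` and the flush interval are assumptions for illustration only:

```python
import asyncio
from typing import AsyncGenerator, List


async def stream_batched(
    engine_stream: AsyncGenerator[str, None],
    flush_interval_s: float = 0.05,
) -> AsyncGenerator[List[str], None]:
    """Re-batch an upstream token stream into lists flushed every flush_interval_s.

    Yielding lists instead of individual chunks keeps the number of messages
    crossing the remote streaming channel low; the router would unpack the
    lists only at the last step, right before writing to the HTTP proxy.
    """
    loop = asyncio.get_running_loop()
    batch: List[str] = []
    deadline = loop.time() + flush_interval_s
    async for chunk in engine_stream:
        batch.append(chunk)
        if loop.time() >= deadline:
            yield batch
            batch = []
            deadline = loop.time() + flush_interval_s
    if batch:  # flush any trailing chunks after the engine stream ends
        yield batch
```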

TODO:

  • Benchmark the impact of batching with this PR
  • See whether pickle / custom serialization helps
  • Add unit tests for batching at the LLMServer layer

Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
@kouroshHakha kouroshHakha requested a review from a team as a code owner May 3, 2025 19:11
@kouroshHakha kouroshHakha marked this pull request as draft May 3, 2025 19:11
Signed-off-by: kouroshhakha <[email protected]>