
[WIP] Investigating why batching streaming is not helping with the streaming overhead #52766

Draft: wants to merge 2 commits into master
Conversation

@kouroshHakha (Contributor) commented May 3, 2025

This PR refactors LLMServer and vLLMEngine so that streaming batching happens at the LLMServer layer instead of inside vLLMEngine.

I also noticed that when vLLMEngine does the batching, LLMServer unpacks the batch and streams the individual items through the remote channel back to the router. This is a problem at high QPS, where Ray's streaming becomes the bottleneck and end-to-end latency is dominated by the streaming overhead. With proper batching we can mitigate this, which was not possible before because of the extra unpacking done at the wrong layer. Unpacking should happen at the very last stage, when the router sends the results back to the HTTP proxy.
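To make the intended flow concrete, here is a minimal, hypothetical sketch (not the actual Ray Serve LLM code) of how the server layer could re-batch the engine's token stream before it crosses the remote channel. The function name `stream_batched` and the flush interval are assumptions for illustration only:

```python
import asyncio
from typing import AsyncGenerator, List


async def stream_batched(
    engine_stream: AsyncGenerator[str, None],
    flush_interval_s: float = 0.05,
) -> AsyncGenerator[List[str], None]:
    """Re-batch an upstream token stream into lists flushed every flush_interval_s.

    Yielding lists instead of individual chunks keeps the number of messages
    crossing the remote streaming channel low; the router would unpack the
    lists only at the last step, right before writing to the HTTP proxy.
    """
    loop = asyncio.get_running_loop()
    batch: List[str] = []
    deadline = loop.time() + flush_interval_s
    async for chunk in engine_stream:
        batch.append(chunk)
        if loop.time() >= deadline:
            yield batch
            batch = []
            deadline = loop.time() + flush_interval_s
    if batch:  # flush any trailing chunks after the engine stream ends
        yield batch
```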

TODO:

  • Benchmark the impact of batching with this PR
  • See whether pickle / custom serialization helps
  • Add unit tests for batching at the LLMServer layer

Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
@kouroshHakha kouroshHakha requested a review from a team as a code owner May 3, 2025 19:11
@kouroshHakha kouroshHakha marked this pull request as draft May 3, 2025 19:11
Signed-off-by: kouroshhakha <[email protected]>