
[Serve.llm] Refactor LLMServer and LLMEngine to not diverge too much from vllm chat formatting logic #52597

Open
wants to merge 15 commits into master

Conversation

Contributor

@kouroshHakha kouroshHakha commented Apr 25, 2025

Fixes #52594.

LLMServer should be agnostic to engine-specific concepts (e.g. vLLM in our case), but it isn't. Moreover, the LLM engine does not have a standard interface. This PR standardizes it a bit.

These two changes allow us to change the implementation of VLLMEngine's prepare_request to follow vLLM's internal server implementation, which reduces the divergence in prompt-formatting logic between how vllm serve behaves and how the Serve LLM APIs behave. There are still some potential differences, but it is better than before.
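To make the intended split concrete, here is a rough, hypothetical sketch of an engine-agnostic interface of the kind described above. Class and method names (other than prepare_request and generate, which come up in the review below) are illustrative assumptions, not the exact code in this PR:

```python
from typing import Any, AsyncGenerator, Dict


class GenerationRequest:
    """Engine-agnostic description of a single generation request (illustrative stub)."""

    def __init__(self, request_id: str, prompt: Any, sampling_params: Dict[str, Any]):
        self.request_id = request_id
        self.prompt = prompt
        self.sampling_params = sampling_params


class LLMEngine:
    """Interface the LLMServer talks to; no vLLM-specific concepts leak through it."""

    async def prepare_request(
        self, request_id: str, prompt: Any, stream: bool
    ) -> GenerationRequest:
        """Apply chat formatting and build an engine request. A vLLM implementation
        would mirror vLLM's own server-side prompt handling here."""
        raise NotImplementedError

    async def generate(self, request: GenerationRequest) -> AsyncGenerator[Any, None]:
        """Stream results for a previously prepared request."""
        raise NotImplementedError
```

With an interface shaped like this, LLMServer only deals in GenerationRequest objects, and all vLLM-specific chat formatting lives inside the VLLMEngine implementation of prepare_request.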

release tests: https://buildkite.com/ray-project/release/builds/40000

@kouroshHakha kouroshHakha requested a review from a team as a code owner April 25, 2025 02:39
@kouroshHakha kouroshHakha added the "go" label (add ONLY when ready to merge, run all tests) Apr 25, 2025
@kouroshHakha kouroshHakha changed the title [WIP] Refactor LLMServer and LLMEngine [WIP] Refactor LLMServer and LLMEngine to not diverge too much from vllm chat formatting logic Apr 28, 2025
@kouroshHakha kouroshHakha changed the title [WIP] Refactor LLMServer and LLMEngine to not diverge too much from vllm chat formatting logic [Serve.llm] Refactor LLMServer and LLMEngine to not diverge too much from vllm chat formatting logic Apr 28, 2025
@kouroshHakha kouroshHakha requested a review from GeneDer April 28, 2025 18:58
Contributor

@GeneDer GeneDer left a comment

Some nitpicks and questions, but generally LGTM!

"""Wakes up the engine"""
pass

def shutdown(self):
GeneDer (Contributor):

Are check_health, sleep, wakeup, and shutdown required to be implemented on the engine? I think check_health is, but not the others? If so, let's add the abstractmethod decorator to those that need an implementation, and just remove the methods that are not relevant.

kouroshHakha (Author):

For now I have kept them as placeholders. This is based on my recent learnings from the post-training side, which needs to be able to call endpoints to sleep and wake up an engine. We should ultimately support that through the Serve LLM APIs.

GeneDer (Contributor):

OK, but can we at least add the abstractmethod decorator to check_health if that is required at this point?

kouroshHakha (Author):

I don't want to make any of these hard requirements.

kouroshHakha (Author):

They can remain a no-op.

GeneDer (Contributor):

Even check_health? I feel we can at least return True as the implementation here. Leaving it unimplemented without forcing the child class to implement it will be confusing for future devs reading this code... Also, please add a note about where and how these will be used in the future, since there is currently no usage for them.
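For reference, a minimal sketch of the compromise being discussed: check_health gets a harmless default instead of an abstractmethod, and the remaining methods stay as documented no-ops. The names follow the snippet quoted above, but the bodies are illustrative assumptions, not the final code:

```python
class LLMEngine:
    async def check_health(self) -> bool:
        # Default implementation: report healthy. Concrete engines can override
        # this with a real probe of the underlying engine process.
        return True

    async def sleep(self) -> None:
        """No-op placeholder, kept for future post-training workflows that want
        to put an engine to sleep through the Serve LLM APIs."""

    async def wakeup(self) -> None:
        """No-op placeholder; counterpart to sleep()."""

    def shutdown(self) -> None:
        """No-op placeholder; engines may override it to release resources."""
```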

"request_id": request_id,
"sampling_params": VLLMSamplingParams.from_prompt(prompt),
"disk_multiplex_config": disk_lora_model,
"serve_request_context": serve.context._serve_request_context.get(),
GeneDer (Contributor):

I think this is not going to work because the engine is running in a different actor than the LLM server. You would need to pass it from the LLM server. But I could be wrong; maybe you have already tested and confirmed this works?

kouroshHakha (Author):

Actually, I am not quite sure why we need to pass this through. Do you know? I just copied it over and did not realize that the Serve context could be different. Could it just work because there are no ops on it? The release tests as well as my local tests passed, so I am wondering whether this is needed at all.

GeneDer (Contributor):

I don't think there are tests for this (?), but it is needed for Serve's structured logging to work on the logs produced by the engine actor. The Serve logger is configured to take its attributes from Serve's request context. I think we just need to get it from the LLM server and call .set on the engine actor, then it should work.
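A rough sketch of what "get it from the LLM server and call .set on the engine actor" could look like. The two helper functions are hypothetical illustrations, serve.context._serve_request_context is a private Serve contextvar (already referenced in the diff above), and per the follow-up below this plumbing was ultimately dropped from the PR:

```python
from ray import serve


def capture_serve_request_context():
    """On the LLMServer replica: grab Serve's request context for the current request."""
    return serve.context._serve_request_context.get()


def install_serve_request_context(request_context) -> None:
    """On the engine actor: re-install the context before emitting logs, so Serve's
    structured logger picks up the request attributes (route, request id, etc.)."""
    serve.context._serve_request_context.set(request_context)
```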

kouroshHakha (Author):

Discussed offline; let's remove it.

GeneDer (Contributor):

still missed

serve_request_context: Optional[serve.context._RequestContext] = None

else:
    disk_lora_model = None

prompt_output = self._llm_config.prompt_format.generate_prompt(prompt)
GeneDer (Contributor):

This can be done in a separate PR, but it seems like we no longer need prompt_format? Maybe add a TODO on the LLMConfig if it's not done together in this PR.

kouroshHakha (Author):

Yep, added the TODO already.

GeneDer (Contributor):

where? 😅

prompt_text = prompt_output.text
image_input = prompt_output.image
image = []
if not self._llm_config.supports_vision and image_input:
GeneDer (Contributor):

Same goes for this supports_vision.

# Let vllm decide the content format.
given_format="auto",
tokenizer=self._tokenizer,
trust_remote_code=self.model_config.trust_remote_code,
GeneDer (Contributor):

I feel this should come from self.engine_config.trust_remote_code to be more direct.

kouroshHakha (Author):

They should be the same. model_config comes from the engine side after it has started and been compiled.

GeneDer (Contributor):

Yep, because model_config comes from the engine side, I feel we should use engine_config to be more direct, since we control its implementation. model_config might not be used as intended and might not even have trust_remote_code implemented. (This exists on vLLM's engine today, but a future engine might not come with one at all.)
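In code terms, the suggestion amounts to something like the following hypothetical helper (attribute names follow the discussion above; the fallback logic is an illustrative assumption, not the actual diff):

```python
def resolve_trust_remote_code(engine_config, model_config=None) -> bool:
    """Prefer the config the Serve layer controls (engine_config) over the
    engine-provided model_config, which a future non-vLLM engine may not expose."""
    value = getattr(engine_config, "trust_remote_code", None)
    if value is not None:
        return bool(value)
    # Fall back to the engine's model_config only if it exists and carries the flag.
    return bool(getattr(model_config, "trust_remote_code", False))
```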


async def generate(
    self,
    request: VLLMGenerationRequest,
GeneDer (Contributor):

Nit: I feel it's necessary to distinguish VLLMGenerationRequest from GenerationRequest; maybe we should name the ones here vllm_request.

kouroshHakha (Author):

I think the function signature should remain identical to the parent class's, but I generally agree.

GeneDer (Contributor):

Oh, yes, the signature should match. Then this should really be typed as GenerationRequest...
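A small sketch of the constraint being settled here: the override keeps the parent's GenerationRequest annotation, and the vLLM-specific type is only assumed internally. All four classes below are illustrative stubs, not the actual code:

```python
class GenerationRequest:
    """Engine-agnostic request type (stub)."""


class VLLMGenerationRequest(GenerationRequest):
    """vLLM-specific request type (stub)."""


class LLMEngine:
    async def generate(self, request: GenerationRequest):
        """Base signature that LLMServer codes against."""
        raise NotImplementedError


class VLLMEngine(LLMEngine):
    async def generate(self, request: GenerationRequest):
        # Signature matches the parent; vLLM-specific fields are only touched
        # after narrowing the type at runtime.
        assert isinstance(request, VLLMGenerationRequest)
        ...
```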

kouroshHakha and others added 4 commits April 28, 2025 18:13
Labels
go (add ONLY when ready to merge, run all tests)

Successfully merging this pull request may close these issues.

[Serve.llm] The LLM serve apis don't work on some VLMs like OpenGVLab/InternVL2_5-1B-MPO
2 participants