[Serve.llm] Refactor LLMServer and LLMEngine to not diverge too much from vllm chat formatting logic #52597
Conversation
Some nitpicks and questions, but generally LGTM!
"""Wakes up the engine""" | ||
pass | ||
|
||
def shutdown(self): |
Are check_health, sleep, wakeup, and shutdown required to be implemented on the engine? I think check_health is, but not the others. If so, let's add the abstractmethod decorator to those that need an implementation, and just remove the methods that are not relevant.
For now I have kept them as placeholders. This is based on my recent learnings from the post-training side, where we would need to be able to call endpoints to sleep and wake up an engine. We should ultimately support that through the serve llm APIs.
kk, but can we at least add the abstractmethod decorator to check_health if that is required at this point?
I don't want to make any of these hard requirements.
They can remain a no-op.
Even check_health? I feel we can at least return True as the implementation here. Leaving it unimplemented while not forcing the child to implement it will be confusing for a future dev reading this code. Also, please add a note on where/how these will be used in the future, since there is currently no usage of them.
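A minimal sketch of the compromise discussed in this thread, with illustrative names rather than the actual ray.serve.llm classes: the lifecycle hooks stay optional no-ops, while check_health gets a harmless default so child engines are not forced to implement it.

```python
# Illustrative sketch only; the class and method names are assumptions.
class BaseEngineLifecycle:
    async def check_health(self) -> bool:
        # Default implementation: report healthy unless a child overrides this.
        return True

    async def sleep(self) -> None:
        # Placeholder for future post-training workflows that pause an engine.
        pass

    async def wakeup(self) -> None:
        """Wakes up the engine (placeholder until serve llm APIs expose it)."""
        pass

    async def shutdown(self) -> None:
        # Placeholder; a concrete engine may release resources here.
        pass
```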
"request_id": request_id, | ||
"sampling_params": VLLMSamplingParams.from_prompt(prompt), | ||
"disk_multiplex_config": disk_lora_model, | ||
"serve_request_context": serve.context._serve_request_context.get(), |
I think this is not going to work because the engine is running in a different actor than the LLM server; we would need to pass it from the LLM server. But I could be wrong, and maybe you have already tested and confirmed this works?
Actually, I am not quite sure why we need to pass this through. Do you know? I just copied it over and did not realize that the Serve context could be different. It could just work because nothing operates on it. The release tests as well as my local tests worked, so I am wondering whether this is needed at all.
I don't think there are tests for this (?), but it is needed for Serve's structured logging to work on the logs produced by the engine actor. The Serve logger is configured to take its attributes from Serve's request context. I think we just need to get it from the LLM server and call .set on the engine actor, then it should work.
Discussed offline: let's remove it.
Still missed:

```python
serve_request_context: Optional[serve.context._RequestContext] = None
```
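For reference, a rough sketch of the context propagation described earlier in this thread (it relies on Serve internals, and the helper names are hypothetical): the LLM server captures Serve's request context and the engine actor re-installs it so the structured logger sees the request attributes.

```python
# Rough sketch; serve.context._serve_request_context is a Ray Serve internal.
from typing import Optional

from ray import serve


def capture_serve_request_context() -> Optional["serve.context._RequestContext"]:
    # Runs inside the LLM server replica, where Serve has populated the context.
    return serve.context._serve_request_context.get()


def install_serve_request_context(
    ctx: Optional["serve.context._RequestContext"],
) -> None:
    # Runs inside the engine actor; without this, logs emitted there would lack
    # Serve's request attributes (route, request id, etc.).
    if ctx is not None:
        serve.context._serve_request_context.set(ctx)
```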
```python
else:
    disk_lora_model = None

prompt_output = self._llm_config.prompt_format.generate_prompt(prompt)
```
This can be done in a separate PR, but it seems like we no longer need prompt_format? Maybe add a TODO on LLMConfig if it is not done together in this PR.
Yep, I added the TODO already.
where? 😅
```python
prompt_text = prompt_output.text
image_input = prompt_output.image
image = []
if not self._llm_config.supports_vision and image_input:
```
The same goes for supports_vision.
```python
# Let vllm decide the content format.
given_format="auto",
tokenizer=self._tokenizer,
trust_remote_code=self.model_config.trust_remote_code,
```
I feel this should come from self.engine_config.trust_remote_code to be more direct.
They should be the same; model_config comes from the engine side after it has started and been compiled.
Yep, because model_config comes from the engine side, I feel we should use engine_config to be more direct, since we control its implementation. model_config might not be used as intended and might not even have trust_remote_code implemented. (It exists on vLLM's engine today, but a future engine might not come with one at all.)
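A small sketch of the suggestion above; the attribute lookups mirror the snippet and are assumptions about the config objects: prefer the Serve-controlled engine_config and fall back to the engine's compiled model_config only when necessary.

```python
# Illustrative helper; engine_config/model_config shapes are assumptions.
from typing import Any


def resolve_trust_remote_code(engine_config: Any, model_config: Any) -> bool:
    value = getattr(engine_config, "trust_remote_code", None)
    if value is None:
        # A future, non-vLLM engine may not expose this on its model config at
        # all, which is the argument for reading engine_config first.
        value = getattr(model_config, "trust_remote_code", False)
    return bool(value)
```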
```python
async def generate(
    self,
    request: VLLMGenerationRequest,
```
Nit: I feel it's necessary to distinguish VLLMGenerationRequest from GenerationRequest; maybe we should name the ones here vllm_request.
I think the function signature should remain identical to the parent class, but I generally agree.
Oh, yes, the signature should match. Then this should really be of type GenerationRequest...
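A minimal sketch of the resolution, using simplified stand-in classes: the override keeps the parent's GenerationRequest-typed signature and only narrows to the vLLM-specific type inside the body.

```python
# Simplified stand-ins for the real classes; illustrative only.
from abc import ABC, abstractmethod
from typing import AsyncGenerator


class GenerationRequest:
    pass


class VLLMGenerationRequest(GenerationRequest):
    pass


class LLMEngine(ABC):
    @abstractmethod
    async def generate(
        self, request: GenerationRequest
    ) -> AsyncGenerator[str, None]:
        """Stream generated text for the given request."""


class VLLMEngine(LLMEngine):
    async def generate(
        self, request: GenerationRequest
    ) -> AsyncGenerator[str, None]:
        # Same signature as the parent; narrow to the vLLM-specific type only
        # when engine-specific fields are needed.
        assert isinstance(request, VLLMGenerationRequest)
        yield "generated text"  # placeholder for the real streaming output
```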
Fixes #52594.
LLMServer should be agnostic to engine concepts (e.g. vLLM in our case), but it isn't. Moreover, the LLM engine does not have a standard interface. This PR standardizes it a bit.
These two changes allow us to change the implementation of VLLMEngine's prepare_request to follow vLLM's internal server implementation. This reduces the divergence in prompt-formatting logic between how vllm serve behaves and how the serve llm APIs behave. There are still some potential differences, but it is better than before.

Release tests: https://buildkite.com/ray-project/release/builds/40000 ✅
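To make the shape of the refactor concrete, here is a rough sketch with hypothetical names (not the PR's actual code): the server only talks to the standardized engine interface, so all vLLM-specific chat formatting stays inside the vLLM engine's prepare_request.

```python
# Hypothetical sketch of the standardized flow; names are illustrative.
from typing import Any, AsyncGenerator


class EngineInterface:
    async def prepare_request(self, request: Any) -> Any:
        # Engine-specific prompt/chat formatting happens here; the vLLM
        # implementation would delegate to vLLM's own chat formatting logic.
        raise NotImplementedError

    async def generate(self, engine_request: Any) -> AsyncGenerator[str, None]:
        raise NotImplementedError
        yield  # makes this an async generator for typing purposes


class EngineAgnosticLLMServer:
    def __init__(self, engine: EngineInterface):
        # The server holds only the generic interface and never imports vLLM.
        self._engine = engine

    async def chat(self, request: Any) -> AsyncGenerator[str, None]:
        engine_request = await self._engine.prepare_request(request)
        async for chunk in self._engine.generate(engine_request):
            yield chunk
```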