[Serve.llm] Refactor LLMServer and LLMEngine to not diverge too much from vllm chat formatting logic #52597
Conversation
Some nitpicks and questions, but generally LGTM!
"""Wakes up the engine""" | ||
pass | ||
|
||
def shutdown(self): |
Are check_health, sleep, wakeup, and shutdown required to be implemented on the engine? I think check_health is, but not the others. If so, let's add the abstractmethod decorator to those that need an implementation, and just remove the methods that are not relevant.
For now I have kept them as placeholders. This is based on my recent learnings from the post-training side, where we would need to be able to call endpoints to sleep and wake up an engine. We should ultimately support that through the serve llm APIs.
kk, but can we at least add the abstractmethod decorator to check_health if that is required at this point?
I don't want to make any of these hard requirements.
They can remain a no-op.
Even check_health? I feel we can at least return True as the implementation here. Leaving it unimplemented while not forcing the child to implement it will be confusing for a future dev reading this code. Also, please add a note on where/how these will be used in the future, since there is currently no usage of them.
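A minimal sketch of the compromise discussed in this thread, with illustrative names rather than the actual ray.serve.llm classes: the lifecycle hooks stay optional no-ops, while check_health gets a harmless default so child engines are not forced to implement it.

```python
# Illustrative sketch only; the class and method names are assumptions.
class BaseEngineLifecycle:
    async def check_health(self) -> bool:
        # Default implementation: report healthy unless a child overrides this.
        return True

    async def sleep(self) -> None:
        # Placeholder for future post-training workflows that pause an engine.
        pass

    async def wakeup(self) -> None:
        """Wakes up the engine (placeholder until serve llm APIs expose it)."""
        pass

    async def shutdown(self) -> None:
        # Placeholder; a concrete engine may release resources here.
        pass
```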
"request_id": request_id, | ||
"sampling_params": VLLMSamplingParams.from_prompt(prompt), | ||
"disk_multiplex_config": disk_lora_model, | ||
"serve_request_context": serve.context._serve_request_context.get(), |
I think this is not going to work because the engine is running in a different actor than the LLM server; we would need to pass it from the LLM server. But I could be wrong, and maybe you have already tested and confirmed this works?
Actually, I am not quite sure why we need to pass this through. Do you know? I just copied it over and did not realize that the Serve context could be different. It could just work because nothing operates on it. The release tests as well as my local tests worked, so I am wondering whether this is needed at all.
I don't think there are tests for this (?), but it is needed for Serve's structured logging to work on the logs produced by the engine actor. The Serve logger is configured to take its attributes from Serve's request context. I think we just need to get it from the LLM server and call .set on the engine actor, then it should work.
Discussed offline: let's remove it.
Still missed:

```python
serve_request_context: Optional[serve.context._RequestContext] = None
```
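For reference, a rough sketch of the context propagation described earlier in this thread (it relies on Serve internals, and the helper names are hypothetical): the LLM server captures Serve's request context and the engine actor re-installs it so the structured logger sees the request attributes.

```python
# Rough sketch; serve.context._serve_request_context is a Ray Serve internal.
from typing import Optional

from ray import serve


def capture_serve_request_context() -> Optional["serve.context._RequestContext"]:
    # Runs inside the LLM server replica, where Serve has populated the context.
    return serve.context._serve_request_context.get()


def install_serve_request_context(
    ctx: Optional["serve.context._RequestContext"],
) -> None:
    # Runs inside the engine actor; without this, logs emitted there would lack
    # Serve's request attributes (route, request id, etc.).
    if ctx is not None:
        serve.context._serve_request_context.set(ctx)
```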
```python
else:
    disk_lora_model = None

prompt_output = self._llm_config.prompt_format.generate_prompt(prompt)
```
This can be done in a separate PR, but it seems like we no longer need prompt_format? Maybe add a TODO on LLMConfig if it is not done together in this PR.
Yep, I added the TODO already.
where? 😅
```python
prompt_text = prompt_output.text
image_input = prompt_output.image
image = []
if not self._llm_config.supports_vision and image_input:
```
The same goes for supports_vision.
```python
# Let vllm decide the content format.
given_format="auto",
tokenizer=self._tokenizer,
trust_remote_code=self.model_config.trust_remote_code,
```
I feel this should come from self.engine_config.trust_remote_code to be more direct.
They should be the same; model_config comes from the engine side after it has started and been compiled.
Yep, because model_config comes from the engine side, I feel we should use engine_config to be more direct, since we control its implementation. model_config might not be used as intended and might not even have trust_remote_code implemented. (It exists on vLLM's engine today, but a future engine might not come with one at all.)
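A small sketch of the suggestion above; the attribute lookups mirror the snippet and are assumptions about the config objects: prefer the Serve-controlled engine_config and fall back to the engine's compiled model_config only when necessary.

```python
# Illustrative helper; engine_config/model_config shapes are assumptions.
from typing import Any


def resolve_trust_remote_code(engine_config: Any, model_config: Any) -> bool:
    value = getattr(engine_config, "trust_remote_code", None)
    if value is None:
        # A future, non-vLLM engine may not expose this on its model config at
        # all, which is the argument for reading engine_config first.
        value = getattr(model_config, "trust_remote_code", False)
    return bool(value)
```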
```python
async def generate(
    self,
    request: VLLMGenerationRequest,
```
Nit: I feel it's necessary to distinguish VLLMGenerationRequest from GenerationRequest; maybe we should name the ones here vllm_request.
I think the function signature should remain identical to the parent class, but I generally agree.
Oh, yes, the signature should match. Then this should really be of type GenerationRequest...
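A minimal sketch of the resolution, using simplified stand-in classes: the override keeps the parent's GenerationRequest-typed signature and only narrows to the vLLM-specific type inside the body.

```python
# Simplified stand-ins for the real classes; illustrative only.
from abc import ABC, abstractmethod
from typing import AsyncGenerator


class GenerationRequest:
    pass


class VLLMGenerationRequest(GenerationRequest):
    pass


class LLMEngine(ABC):
    @abstractmethod
    async def generate(
        self, request: GenerationRequest
    ) -> AsyncGenerator[str, None]:
        """Stream generated text for the given request."""


class VLLMEngine(LLMEngine):
    async def generate(
        self, request: GenerationRequest
    ) -> AsyncGenerator[str, None]:
        # Same signature as the parent; narrow to the vLLM-specific type only
        # when engine-specific fields are needed.
        assert isinstance(request, VLLMGenerationRequest)
        yield "generated text"  # placeholder for the real streaming output
```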
Fixes #52594.
LLMServer should be agnostic to engine concepts (e.g. vLLM in our case), but it isn't. Moreover, the LLM engine does not have a standard interface. This PR standardizes it a bit.
These two changes allow us to change the implementation of VLLMEngine's prepare_request to follow vLLM's internal server implementation. This reduces the divergence in prompt-formatting logic between how vllm serve behaves and how the serve llm APIs behave. There are still some potential differences, but it is better than before.

Release tests: https://buildkite.com/ray-project/release/builds/40000 ✅
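To make the shape of the refactor concrete, here is a rough sketch with hypothetical names (not the PR's actual code): the server only talks to the standardized engine interface, so all vLLM-specific chat formatting stays inside the vLLM engine's prepare_request.

```python
# Hypothetical sketch of the standardized flow; names are illustrative.
from typing import Any, AsyncGenerator


class EngineInterface:
    async def prepare_request(self, request: Any) -> Any:
        # Engine-specific prompt/chat formatting happens here; the vLLM
        # implementation would delegate to vLLM's own chat formatting logic.
        raise NotImplementedError

    async def generate(self, engine_request: Any) -> AsyncGenerator[str, None]:
        raise NotImplementedError
        yield  # makes this an async generator for typing purposes


class EngineAgnosticLLMServer:
    def __init__(self, engine: EngineInterface):
        # The server holds only the generic interface and never imports vLLM.
        self._engine = engine

    async def chat(self, request: Any) -> AsyncGenerator[str, None]:
        engine_request = await self._engine.prepare_request(request)
        async for chunk in self._engine.generate(engine_request):
            yield chunk
```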