[llm] Embedding api #52229
Conversation
Good first draft! Have you tested it manually and seen it working? Maybe you can share a screenshot here. We should also add unit tests for all the new logic added in this PR.
A few things can be follow-ups:
- docs and examples
- release tests using this end to end
- telemetry for this feature
async def embed(
    self, vllm_embedding_request: VLLMEmbeddingRequest
) -> Tuple[List[List[float]], int]:  # Return (embeddings, num_prompt_tokens)
    def floats_to_base64(float_list):
Let's not use nested functions like this. Move it to https://github.com/ray-project/ray/blob/ea9c3038d56883b563f06837a69ddeb21ff2a78d/python/ray/llm/_internal/serve/deployments/utils/server_utils.py and also add type hints, a docstring, and unit tests.
Moved it to server_utils. Where do unit tests for it belong?
Oh, there's currently no test file for it. You can create a new file, python/ray/llm/tests/serve/deployments/test_server_utils.py, and add your tests there.
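For reference, a sketch of what the extracted helper and its test might look like; the struct-based float32 packing and the exact placement are my assumptions about the intent, not the final code in this PR:

# Hypothetical contents for server_utils.py
import base64
import struct
from typing import List

def floats_to_base64(float_list: List[float]) -> str:
    """Encode a list of floats as base64 over little-endian float32 bytes.

    Mirrors the OpenAI embeddings API behavior when the client requests
    encoding_format="base64".
    """
    packed = struct.pack(f"<{len(float_list)}f", *float_list)
    return base64.b64encode(packed).decode("utf-8")

# Hypothetical contents for test_server_utils.py
def test_floats_to_base64_roundtrip():
    values = [0.0, 1.5, -2.25]  # exactly representable in float32
    decoded = struct.unpack("<3f", base64.b64decode(floats_to_base64(values)))
    assert list(decoded) == values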
generators: List[AsyncGenerator["PoolingRequestOutput", None]] = []

prompts = vllm_embedding_request.prompt
Nit (non-blocker): not sure if it makes sense, but maybe some of this should be refactored into the prompt format. If it's very different from the existing logic, you can create a new method, embedding_prompt or something, and move all of it there.
I did not understand this, @GeneDer. What do you mean?
There's a whole block here that iterates over and processes vllm_embedding_request.prompt into TextPrompt, which could be refactored into the prompt format. Again, this is not a blocker for me, just a suggestion to better organize the code, since we already have the prompt format object to do similar things.
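For what it's worth, a minimal sketch of that kind of refactor; the helper name and the str-vs-list handling are assumptions on my part:

from typing import List, Union

from vllm.inputs import TextPrompt

def embedding_prompts(prompt: Union[str, List[str]]) -> List[TextPrompt]:
    """Normalize the request's prompt field into vLLM TextPrompt inputs."""
    if isinstance(prompt, str):
        prompt = [prompt]
    return [TextPrompt(prompt=p) for p in prompt]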
class VLLMEmbeddingRequest(EmbeddingRequest):
    model_config = ConfigDict(arbitrary_types_allowed=True)
    encoding_format: Optional[Literal["float", "base64"]] = "float"
Nit (non-blocker): would suggest making this an enum.
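Presumably something along these lines (a sketch only; the enum name and keeping the Optional default are assumptions):

from enum import Enum
from typing import Optional

from pydantic import BaseModel, ConfigDict

class EncodingFormat(str, Enum):
    """Accepted values for the OpenAI-style encoding_format field."""
    FLOAT = "float"
    BASE64 = "base64"

class VLLMEmbeddingRequest(BaseModel):  # the PR subclasses EmbeddingRequest instead
    model_config = ConfigDict(arbitrary_types_allowed=True)
    encoding_format: Optional[EncodingFormat] = EncodingFormat.FLOAT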
Run this script to serve the model, then connect to it, for example using httpie.
It gives the same results as plain vLLM started standalone; connect using the same command, just with a different port.
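For anyone trying this out, a request along these lines should work once the Serve app is up; the port, model id, and api_key value are assumptions, and any OpenAI-compatible client (including httpie) can be used the same way:

from openai import OpenAI

# Assumes the Serve app exposes its OpenAI-compatible API on localhost:8000.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="fake-key")

response = client.embeddings.create(
    model="BAAI/bge-m3",  # hypothetical embedding model id
    input=["Ray Serve LLM embedding test"],
    encoding_format="float",
)
print(len(response.data[0].embedding))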
Great stuff, thanks @janimo. Just waiting for unit tests. If we can also add documentation to this PR, it would be really great; the documentation goes into https://github.com/ray-project/ray/blob/master/doc/source/serve/llm/serving-llms.rst, maybe as a new advanced use case for now. For follow-ups we need release tests (can be on CPU) and telemetry. The release tests would go in the https://github.com/ray-project/ray/blob/master/release/llm_tests/serve/ directory. We can basically add some new conditional embedding tests to the probes that only activate when the underlying model is an embedding model (discoverable through some metadata emitted by the server).
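As a rough illustration of such a conditional probe; the endpoint, model id, and the metadata field used to detect embedding models are all hypothetical here, since the PR does not define them:

import pytest
from openai import OpenAI

BASE_URL = "http://localhost:8000/v1"  # assumed release-test endpoint
MODEL_ID = "BAAI/bge-m3"  # hypothetical embedding model id

def _is_embedding_model(client: OpenAI, model_id: str) -> bool:
    # Hypothetical check: relies on the server emitting some metadata
    # (e.g. a task field) that identifies embedding models.
    model = client.models.retrieve(model_id)
    return getattr(model, "task", None) == "embed"

def test_embedding_probe():
    client = OpenAI(base_url=BASE_URL, api_key="fake-key")
    if not _is_embedding_model(client, MODEL_ID):
        pytest.skip("Deployed model is not an embedding model")
    response = client.embeddings.create(model=MODEL_ID, input="probe")
    assert len(response.data[0].embedding) > 0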
@@ -675,6 +677,48 @@ def _handle_input_too_long(
            len(request_output.prompt_token_ids), self.model_config.max_model_len
        ).exception

    async def embed(
        self, vllm_embedding_request: VLLMEmbeddingRequest
    ) -> Tuple[List[List[float]], int]:  # Return (embeddings, num_prompt_tokens)
Can you add this comment as a standard docstring to clarify what it returns?
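For example (just a suggested wording, not the final docstring):

async def embed(
    self, vllm_embedding_request: VLLMEmbeddingRequest
) -> Tuple[List[List[float]], int]:
    """Compute embeddings for the prompts in the request.

    Returns:
        A tuple (embeddings, num_prompt_tokens): one float vector per input
        prompt, plus the total number of prompt tokens processed.
    """
    ...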
Signed-off-by: Jani Monoses <[email protected]>
Why are these changes needed?
Expose an embedding API, like https://platform.openai.com/docs/api-reference/embeddings, using vLLM.
It still needs tests.
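For context, serving an embedding model through the existing Ray Serve LLM entrypoint would presumably look something like the sketch below; the model id, the task="embed" engine kwarg, and the autoscaling settings are assumptions rather than part of this PR's diff:

from ray import serve
from ray.serve.llm import LLMConfig, build_openai_app

llm_config = LLMConfig(
    model_loading_config=dict(
        model_id="BAAI/bge-m3",  # hypothetical embedding model
        model_source="BAAI/bge-m3",
    ),
    deployment_config=dict(
        autoscaling_config=dict(min_replicas=1, max_replicas=1),
    ),
    # Passed through to the vLLM engine; task="embed" selects vLLM's
    # pooling/embedding runner.
    engine_kwargs=dict(task="embed"),
)

app = build_openai_app({"llm_configs": [llm_config]})
serve.run(app)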
Related issue number
Checks
- I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- If I added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.