
[llm] Embedding api #52229


Open: janimo wants to merge 4 commits into master from embedding-api

Conversation

@janimo commented Apr 10, 2025

Why are these changes needed?

Expose an embedding API like https://platform.openai.com/docs/api-reference/embeddings using vLLM. It still needs tests.

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@janimo requested a review from a team as a code owner April 10, 2025 23:13
@richardliaw (Contributor)

hey @janimo - awesome to see this! @GeneDer will review this tomorrow.

@richardliaw changed the title from Embedding api to [llm] Embedding api Apr 11, 2025
@GeneDer (Contributor) left a comment

Good first draft! Have you tested it manually and seen it working? Maybe you can share a screenshot here. We should also add some unit tests for all the new logic added in this PR.

A few things can be follow-ups:

  • docs and examples
  • release tests using this end to end
  • telemetry for this feature

async def embed(
    self, vllm_embedding_request: VLLMEmbeddingRequest
) -> Tuple[List[List[float]], int]:  # Return (embeddings, num_prompt_tokens)
    def floats_to_base64(float_list):
@GeneDer (Contributor):

Let's not do nested functions like this. Move it to https://github.com/ray-project/ray/blob/ea9c3038d56883b563f06837a69ddeb21ff2a78d/python/ray/llm/_internal/serve/deployments/utils/server_utils.py and also add type hints, a docstring, and unit tests.
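For illustration, here is a minimal sketch of what the extracted helper could look like in server_utils.py. The little-endian float32 byte layout is an assumption chosen to mirror the OpenAI embeddings API's base64 encoding_format; the PR's actual implementation may differ.

import base64
import struct
from typing import List


def floats_to_base64(float_list: List[float]) -> str:
    """Encode a list of floats as base64 over little-endian float32 bytes.

    Assumes the float32 layout used by the OpenAI embeddings API's
    "base64" encoding_format.
    """
    packed = struct.pack(f"<{len(float_list)}f", *float_list)
    return base64.b64encode(packed).decode("utf-8")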

@janimo (Author):

Moved it to server_utils. Where do unit tests for it belong?

@GeneDer (Contributor):

Oh, there's currently no test file for it. You can create a new file, python/ray/llm/tests/serve/deployments/test_server_utils.py, and add your tests there.
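A sketch of such a test, assuming the helper keeps the name from this PR and the float32 layout sketched above (both assumptions). The values are exactly representable in float32, so the round trip is exact.

# python/ray/llm/tests/serve/deployments/test_server_utils.py
import base64
import struct

from ray.llm._internal.serve.deployments.utils.server_utils import floats_to_base64


def test_floats_to_base64_roundtrip():
    values = [0.0, 1.5, -2.25]
    encoded = floats_to_base64(values)
    # Decode the base64 payload back into float32 values and compare.
    decoded = struct.unpack(f"<{len(values)}f", base64.b64decode(encoded))
    assert list(decoded) == values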


generators: List[AsyncGenerator["PoolingRequestOutput", None]] = []

prompts = vllm_embedding_request.prompt
@GeneDer (Contributor):

Nit (non-blocker): not sure if it makes sense, but maybe some of this should be refactored into the prompt format. If it's very different from the existing logic, you can create a new method, embedding_prompt or something, and move it all there.

Contributor:

I did not understand this, @GeneDer. What do you mean?

@GeneDer (Contributor):

There's a whole block here trying to iterate and process vllm_embedding_request.prompt into TextPrompt that could be refactored into the prompt format. But again, this is not a blocker for me, just a suggestion to better organize the code, since we already have the prompt format object to do similar things.
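As a hypothetical illustration of the refactor being suggested (the helper name is invented here, and the import assumes vLLM's vllm.inputs module; this is not code from the PR):

from typing import List, Union

from vllm.inputs import TextPrompt


def to_text_prompts(prompt: Union[str, List[str]]) -> List[TextPrompt]:
    # Normalize the request's `prompt` field, which may be a single string
    # or a list of strings, into a list of vLLM TextPrompt inputs.
    if isinstance(prompt, str):
        prompt = [prompt]
    return [TextPrompt(prompt=p) for p in prompt]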


class VLLMEmbeddingRequest(EmbeddingRequest):
    model_config = ConfigDict(arbitrary_types_allowed=True)
    encoding_format: Optional[Literal["float", "base64"]] = "float"
@GeneDer (Contributor):

Nit (non-blocker): would suggest making this an enum.
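A sketch of the suggested enum (a hypothetical refactor, not code from the PR). A str-backed enum keeps pydantic validation working against the raw "float"/"base64" strings, so the field could become encoding_format: EncodingFormat = EncodingFormat.FLOAT.

from enum import Enum


class EncodingFormat(str, Enum):
    # str-backed so pydantic can still validate raw "float"/"base64" values.
    FLOAT = "float"
    BASE64 = "base64"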

@janimo (Author) commented Apr 11, 2025

import time
import ray
from ray import serve
from ray.serve.llm import LLMConfig, LLMServer, LLMRouter

ray.init(num_gpus=1, num_cpus=2)

# A few possible models that support embeddings
qwen = dict(
    model_id="qwen-0.5b",
    model_source="Qwen/Qwen2.5-0.5B-Instruct",
)

roberta = dict(
    model_id="sentence-transformers/all-roberta-large-v1",
    model="sentence-transformers/all-roberta-large-v1",
)

baai = dict(model_id="BAAI/bge-base-en-v1.5", model_source="BAAI/bge-base-en-v1.5")

llm_config = LLMConfig(
    model_loading_config=baai,
    deployment_config=dict(
        autoscaling_config=dict(
            min_replicas=0,
            max_replicas=1,
        ),
    ),
    engine_kwargs=dict(
        tensor_parallel_size=1,
        device="cpu",
        task="embed",
    ),
)

# Deploy the application
deployment = LLMServer.as_deployment(
    llm_config.get_serve_options(name_prefix="vLLM:")
).bind(llm_config)
llm_app = LLMRouter.as_deployment().bind([deployment])
serve.run(llm_app)

print("Ray is up")

while True:
    time.sleep(1)

Run this script to serve the model, then connect to it, for example using httpie:

http POST localhost:8000/v1/embeddings model="BAAI/bge-base-en-v1.5" input:='["a simple text", "another text"]' encoding_format="float"

It gives the same results as plain vLLM started with:

vllm serve BAAI/bge-base-en-v1.5 --port 9000

Connect using the same command, just with a different port:

http POST localhost:9000/v1/embeddings model="BAAI/bge-base-en-v1.5" input:='["a simple text", "another text"]' encoding_format="float"
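The same check can be scripted with the openai Python client (a sketch; the api_key value is a placeholder, assuming the local server does not validate it):

from openai import OpenAI

# Point the client at the Ray Serve endpoint started above.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="placeholder")

response = client.embeddings.create(
    model="BAAI/bge-base-en-v1.5",
    input=["a simple text", "another text"],
    encoding_format="float",
)
print(len(response.data), len(response.data[0].embedding))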

@hainesmichaelc added the community-contribution label Apr 11, 2025
@kouroshHakha (Contributor) left a comment

Great stuff, thanks @janimo. Just waiting for unit tests. If we can also add documentation to this PR, that would be really great. The documentation goes into https://github.com/ray-project/ray/blob/master/doc/source/serve/llm/serving-llms.rst, maybe as a new advanced use case for now. For follow-ups we need release tests (can be on CPU) and telemetry. The release tests would go in the https://github.com/ray-project/ray/blob/master/release/llm_tests/serve/ directory. We can basically add some new conditional embedding tests to the probes that only activate when the underlying model is an embedding model (discoverable through some metadata emitted by the server).
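A rough sketch of what such a conditional probe might look like (entirely hypothetical: the metadata shape and the "embed" task marker are assumptions about what the server could emit):

import requests


def probe_embeddings_if_supported(base_url: str) -> None:
    # Hypothetical: only exercise /v1/embeddings for models whose (assumed)
    # metadata marks them as embedding models.
    models = requests.get(f"{base_url}/v1/models").json().get("data", [])
    for model in models:
        if model.get("metadata", {}).get("task") != "embed":
            continue
        resp = requests.post(
            f"{base_url}/v1/embeddings",
            json={"model": model["id"], "input": ["probe text"]},
        )
        assert resp.status_code == 200
        assert resp.json()["data"][0]["embedding"]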

@@ -675,6 +677,48 @@ def _handle_input_too_long(
            len(request_output.prompt_token_ids), self.model_config.max_model_len
        ).exception

    async def embed(
        self, vllm_embedding_request: VLLMEmbeddingRequest
    ) -> Tuple[List[List[float]], int]:  # Return (embeddings, num_prompt_tokens)
@kouroshHakha (Contributor):

Can you add this comment as a standard docstring to clarify what it returns?
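A sketch of the requested docstring (wording is illustrative, not from the PR; the class scaffolding is only there to make the fragment self-contained):

from typing import List, Tuple


class _Sketch:
    async def embed(
        self, vllm_embedding_request: "VLLMEmbeddingRequest"
    ) -> Tuple[List[List[float]], int]:
        """Compute embeddings for the prompts in the request.

        Returns:
            A tuple of (embeddings, num_prompt_tokens): one embedding
            vector per input prompt, and the total number of prompt
            tokens consumed across all prompts.
        """
        ...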


@kouroshHakha added the go (add ONLY when ready to merge, run all tests) label Apr 15, 2025
@janimo force-pushed the embedding-api branch 3 times, most recently from eb433f3 to 4f6388c May 4, 2025 17:03
@janimo requested review from edoakes, zcin, akshay-anyscale and a team as code owners May 4, 2025 17:03
janimo added 2 commits May 5, 2025 02:05
Signed-off-by: Jani Monoses <[email protected]>
Signed-off-by: Jani Monoses <[email protected]>
Labels: community-contribution, go, llm