
[llm] Embedding api #52229


Open: janimo wants to merge 4 commits into master from embedding-api

Conversation

@janimo commented Apr 10, 2025

Why are these changes needed?

Expose an embedding API like https://platform.openai.com/docs/api-reference/embeddings using vLLM. It still needs tests.

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@janimo requested a review from a team as a code owner April 10, 2025 23:13
@richardliaw (Contributor)

hey @janimo - awesome to see this! @GeneDer will review this tomorrow.

@richardliaw changed the title from Embedding api to [llm] Embedding api Apr 11, 2025
@GeneDer (Contributor) left a comment

Good first draft! Have you tested it manually and seen it working? Maybe you can share a screenshot here. We should also add some unit tests for all the new logic added in this PR.

A few things can be follow-ups:

  • docs and examples
  • release tests using this end to end
  • telemetry for this feature

async def embed(
    self, vllm_embedding_request: VLLMEmbeddingRequest
) -> Tuple[List[List[float]], int]:  # Return (embeddings, num_prompt_tokens)
    def floats_to_base64(float_list):
@GeneDer (Contributor):

Let's not do nested functions like this. Move it to https://github.com/ray-project/ray/blob/ea9c3038d56883b563f06837a69ddeb21ff2a78d/python/ray/llm/_internal/serve/deployments/utils/server_utils.py and also add type hints, a docstring, and unit tests.
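For illustration, here is a minimal sketch of what the extracted helper could look like in server_utils.py. The little-endian float32 byte layout is an assumption chosen to mirror the OpenAI embeddings API's base64 encoding_format; the PR's actual implementation may differ.

import base64
import struct
from typing import List


def floats_to_base64(float_list: List[float]) -> str:
    """Encode a list of floats as base64 over little-endian float32 bytes.

    Assumes the float32 layout used by the OpenAI embeddings API's
    "base64" encoding_format.
    """
    packed = struct.pack(f"<{len(float_list)}f", *float_list)
    return base64.b64encode(packed).decode("utf-8")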

@janimo (Author):

Moved it to server_utils. Where do unit tests for it belong?

@GeneDer (Contributor):

Oh, there's currently no test file for it. You can create a new file, python/ray/llm/tests/serve/deployments/test_server_utils.py, and add your tests there.
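A sketch of such a test, assuming the helper keeps the name from this PR and the float32 layout sketched above (both assumptions). The values are exactly representable in float32, so the round trip is exact.

# python/ray/llm/tests/serve/deployments/test_server_utils.py
import base64
import struct

from ray.llm._internal.serve.deployments.utils.server_utils import floats_to_base64


def test_floats_to_base64_roundtrip():
    values = [0.0, 1.5, -2.25]
    encoded = floats_to_base64(values)
    # Decode the base64 payload back into float32 values and compare.
    decoded = struct.unpack(f"<{len(values)}f", base64.b64decode(encoded))
    assert list(decoded) == values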


generators: List[AsyncGenerator["PoolingRequestOutput", None]] = []

prompts = vllm_embedding_request.prompt
@GeneDer (Contributor):

Nit (non-blocker): not sure if it makes sense, but maybe some of this should be refactored into the prompt format. If it's very different from the existing logic, you can create a new method, embedding_prompt or something, and move it all there.

Contributor:

I did not understand this, @GeneDer. What do you mean?

@GeneDer (Contributor):

There's a whole block here trying to iterate and process vllm_embedding_request.prompt into TextPrompt that could be refactored into the prompt format. But again, this is not a blocker for me, just a suggestion to better organize the code, since we already have the prompt format object to do similar things.
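As a hypothetical illustration of the refactor being suggested (the helper name is invented here, and the import assumes vLLM's vllm.inputs module; this is not code from the PR):

from typing import List, Union

from vllm.inputs import TextPrompt


def to_text_prompts(prompt: Union[str, List[str]]) -> List[TextPrompt]:
    # Normalize the request's `prompt` field, which may be a single string
    # or a list of strings, into a list of vLLM TextPrompt inputs.
    if isinstance(prompt, str):
        prompt = [prompt]
    return [TextPrompt(prompt=p) for p in prompt]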


class VLLMEmbeddingRequest(EmbeddingRequest):
    model_config = ConfigDict(arbitrary_types_allowed=True)
    encoding_format: Optional[Literal["float", "base64"]] = "float"
@GeneDer (Contributor):

Nit (non-blocker): would suggest making this an enum.
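A sketch of the suggested enum (a hypothetical refactor, not code from the PR). A str-backed enum keeps pydantic validation working against the raw "float"/"base64" strings, so the field could become encoding_format: EncodingFormat = EncodingFormat.FLOAT.

from enum import Enum


class EncodingFormat(str, Enum):
    # str-backed so pydantic can still validate raw "float"/"base64" values.
    FLOAT = "float"
    BASE64 = "base64"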

@janimo (Author) commented Apr 11, 2025

import time
import ray
from ray import serve
from ray.serve.llm import LLMConfig, LLMServer, LLMRouter

ray.init(num_gpus=1, num_cpus=2)

# A few possible models that support embeddings
qwen = dict(
    model_id="qwen-0.5b",
    model_source="Qwen/Qwen2.5-0.5B-Instruct",
)

roberta = dict(
    model_id="sentence-transformers/all-roberta-large-v1",
    model="sentence-transformers/all-roberta-large-v1",
)

baai = dict(model_id="BAAI/bge-base-en-v1.5", model_source="BAAI/bge-base-en-v1.5")

llm_config = LLMConfig(
    model_loading_config=baai,
    deployment_config=dict(
        autoscaling_config=dict(
            min_replicas=0,
            max_replicas=1,
        ),
    ),
    engine_kwargs=dict(
        tensor_parallel_size=1,
        device="cpu",
        task="embed",
    ),
)

# Deploy the application
deployment = LLMServer.as_deployment(
    llm_config.get_serve_options(name_prefix="vLLM:")
).bind(llm_config)
llm_app = LLMRouter.as_deployment().bind([deployment])
serve.run(llm_app)

print("Ray is up")

while True:
    time.sleep(1)

Run this script to serve the model, then connect to it, for example using httpie:

http POST localhost:8000/v1/embeddings model="BAAI/bge-base-en-v1.5" input:='["a simple text", "another text"]' encoding_format="float"

It gives the same results as plain vLLM started with:

vllm serve BAAI/bge-base-en-v1.5 --port 9000

Connect using the same command, just with a different port:

http POST localhost:9000/v1/embeddings model="BAAI/bge-base-en-v1.5" input:='["a simple text", "another text"]' encoding_format="float"
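The same check can be scripted with the openai Python client (a sketch; the api_key value is a placeholder, assuming the local server does not validate it):

from openai import OpenAI

# Point the client at the Ray Serve endpoint started above.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="placeholder")

response = client.embeddings.create(
    model="BAAI/bge-base-en-v1.5",
    input=["a simple text", "another text"],
    encoding_format="float",
)
print(len(response.data), len(response.data[0].embedding))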

@hainesmichaelc added the community-contribution label Apr 11, 2025
@kouroshHakha (Contributor) left a comment

Great stuff, thanks @janimo. Just waiting for unit tests. If we can also add documentation to this PR, that would be really great. The documentation goes into https://github.com/ray-project/ray/blob/master/doc/source/serve/llm/serving-llms.rst, maybe as a new advanced use case for now. For follow-ups we need release tests (can be on CPU) and telemetry. The release tests would go in the https://github.com/ray-project/ray/blob/master/release/llm_tests/serve/ directory. We can basically add some new conditional embedding tests to the probes that only activate when the underlying model is an embedding model (discoverable through some metadata emitted by the server).
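A rough sketch of what such a conditional probe might look like (entirely hypothetical: the metadata shape and the "embed" task marker are assumptions about what the server could emit):

import requests


def probe_embeddings_if_supported(base_url: str) -> None:
    # Hypothetical: only exercise /v1/embeddings for models whose (assumed)
    # metadata marks them as embedding models.
    models = requests.get(f"{base_url}/v1/models").json().get("data", [])
    for model in models:
        if model.get("metadata", {}).get("task") != "embed":
            continue
        resp = requests.post(
            f"{base_url}/v1/embeddings",
            json={"model": model["id"], "input": ["probe text"]},
        )
        assert resp.status_code == 200
        assert resp.json()["data"][0]["embedding"]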

@@ -675,6 +677,48 @@ def _handle_input_too_long(
            len(request_output.prompt_token_ids), self.model_config.max_model_len
        ).exception

    async def embed(
        self, vllm_embedding_request: VLLMEmbeddingRequest
    ) -> Tuple[List[List[float]], int]:  # Return (embeddings, num_prompt_tokens)
@kouroshHakha (Contributor):

Can you add this comment as a standard docstring to clarify what it returns?
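A sketch of the requested docstring (wording is illustrative, not from the PR; the class scaffolding is only there to make the fragment self-contained):

from typing import List, Tuple


class _Sketch:
    async def embed(
        self, vllm_embedding_request: "VLLMEmbeddingRequest"
    ) -> Tuple[List[List[float]], int]:
        """Compute embeddings for the prompts in the request.

        Returns:
            A tuple of (embeddings, num_prompt_tokens): one embedding
            vector per input prompt, and the total number of prompt
            tokens consumed across all prompts.
        """
        ...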


@kouroshHakha added the go (add ONLY when ready to merge, run all tests) label Apr 15, 2025
@janimo force-pushed the embedding-api branch 3 times, most recently from eb433f3 to 4f6388c May 4, 2025 17:03
@janimo requested review from edoakes, zcin, akshay-anyscale and a team as code owners May 4, 2025 17:03
janimo added 2 commits May 5, 2025 02:05
Signed-off-by: Jani Monoses <[email protected]>
Signed-off-by: Jani Monoses <[email protected]>
Labels: community-contribution, go, llm