Search before asking
- I searched the issues and found no similar issues.
KubeRay Component
ray-operator
What happened + What you expected to happen
Problem we want to solve
We have servers with large GPUs, but we only want to serve small LLMs and embedding models. That is why we want to deploy multiple models on a single GPU.
To do this, we want to use fractional GPU serving: https://docs.ray.io/en/latest/serve/llm/user-guides/fractional-gpu.html
Implementation
As explained in the linked document, we use the following placement group configuration (a Ray-level sketch of the same request follows the snippet):
placement_group_config:
bundles:
- GPU: "0.40"
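For context, here is a minimal sketch of the same fractional request at the Ray core level. It assumes a running cluster with at least one GPU node and is only an illustration, not the code path that Ray Serve LLM uses internally:

```python
# Minimal sketch: Ray core itself accepts fractional GPU bundles.
# Assumes a running Ray cluster with at least one GPU node.
import ray
from ray.util.placement_group import placement_group

ray.init()

# Request a bundle with 0.4 of a GPU, analogous to the
# placement_group_config above.
pg = placement_group(bundles=[{"GPU": 0.4}])
ray.get(pg.ready())  # resolves once the bundle can be reserved
```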
Expected behavior
The model is placed on a worker that has at least 0.4 GPU available.
Actual behavior
The autoscaler accepts only integer resource values and returns the following error (the failing check is reproduced in isolation in the sketch below):
2026-01-27 23:22:45,038 ERROR (monitor) autoscaler.py:222 -- 0.4 is not of type 'integer'
Failed validating 'type' in schema['properties']['available_node_types']['patternProperties']['.*']['properties']['resources']['patternProperties']['.*']:
{'type': 'integer', 'minimum': 0}
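For illustration, the failing validation can be reproduced on its own with the jsonschema package. The assumption here is that the autoscaler applies the per-resource schema fragment quoted in the traceback; this snippet is not a copy of the autoscaler code:

```python
# Standalone sketch reproducing the validation failure.
# Assumption: the per-resource schema is the fragment quoted in the
# error above, i.e. {'type': 'integer', 'minimum': 0}.
import jsonschema

resource_schema = {"type": "integer", "minimum": 0}

jsonschema.validate(1, resource_schema)    # whole GPUs pass
jsonschema.validate(0.4, resource_schema)  # raises ValidationError: 0.4 is not of type 'integer'
```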
Reproduction script
apiVersion: ray.io/v1
kind: RayService
metadata:
name: ray-serve
spec:
rayClusterConfig:
headGroupSpec:
...
workerGroupSpecs:
...
serveConfigV2: |
applications:
- name: embeddings
import_path: ray.serve.llm:build_openai_app
route_prefix: "/"
args:
llm_configs:
- model_loading_config:
model_id: gemma-300m
model_source: google/embeddinggemma-300m
engine_kwargs:
dtype: auto
max_model_len: 2048
gpu_memory_utilization: 0.40
enforce_eager: false
deployment_config:
num_replicas: 1
max_ongoing_requests: 256 # max requests per instance
placement_group_config:
bundles:
- GPU: "0.40"
runtime_env:
env_vars:
VLLM_USE_V1: "0"
VLLM_DISABLE_COMPILE_CACHE: "1"
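Once the RayService is healthy, the embeddings application can be exercised with the OpenAI client. The host, port, and API key below are assumptions about how the Serve proxy is exposed, not values taken from the manifest:

```python
# Example request against the embeddings application once it is running.
# Assumptions: the Serve proxy is reachable at localhost:8000 and no real
# API key is required; adjust to your Service/Ingress setup.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.embeddings.create(
    model="gemma-300m",  # model_id from the config above
    input="fractional GPU test",
)
print(len(response.data[0].embedding))
```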
Anything else
Please let me know if you need further details.
Versions:
- vLLM: 0.12.0
- Ray: 2.53.0
- Kuberay Operator: 1.5.1
Are you willing to submit a PR?
- Yes I am willing to submit a PR!