
[Bug] Support for fractional GPU serving #4447

@mathias-polarise

Description

Search before asking

  • I searched the issues and found no similar issues.

KubeRay Component

ray-operator

What happened + What you expected to happen

Problem we want to solve

We have servers with large GPUs but only want to serve small LLMs and embedding models, so we want to deploy multiple models to a single GPU.

For this we want to use fractional GPU serving: https://docs.ray.io/en/latest/serve/llm/user-guides/fractional-gpu.html

Implementation

As explained in the linked document, we are using:

          placement_group_config:
            bundles:
            - GPU: "0.40"
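
For context, Ray core itself accepts fractional GPU requests on tasks and actors; a minimal sketch (assuming a running cluster with at least one GPU):

import ray

ray.init()

# Ray core allows fractional GPU requests, so e.g. two of these
# tasks can share a single physical GPU.
@ray.remote(num_gpus=0.4)
def infer() -> str:
    return "ok"

print(ray.get(infer.remote()))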

Expected behavior

The model is placed on a worker that has at least 0.4 of a GPU available.

Actual behavior

The autoscaler accepts only integer resource values and returns the following error:

2026-01-27 23:22:45,038	ERROR (monitor) autoscaler.py:222 -- 0.4 is not of type 'integer'
Failed validating 'type' in schema['properties']['available_node_types']['patternProperties']['.*']['properties']['resources']['patternProperties']['.*']:
    {'type': 'integer', 'minimum': 0}
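
The message looks like a plain jsonschema type check; a minimal sketch reproducing the same failure, with the schema fragment copied from the error output above (not from the actual Ray source):

import jsonschema

# Schema fragment for a single resource value, as shown in the error.
resource_schema = {"type": "integer", "minimum": 0}

try:
    jsonschema.validate(0.4, resource_schema)
except jsonschema.ValidationError as e:
    print(e.message)  # "0.4 is not of type 'integer'"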

Reproduction script

apiVersion: ray.io/v1
kind: RayService
metadata:
  name: ray-serve
spec:
  rayClusterConfig:
    headGroupSpec:
      ...
    workerGroupSpecs:
      ...
  serveConfigV2: |
    applications:
    - name: embeddings
      import_path: ray.serve.llm:build_openai_app
      route_prefix: "/"
      args:
        llm_configs:
        - model_loading_config:
            model_id: gemma-300m
            model_source: google/embeddinggemma-300m
          engine_kwargs:
            dtype: auto
            max_model_len: 2048
            gpu_memory_utilization: 0.40
            enforce_eager: false
          deployment_config:
            num_replicas: 1
            max_ongoing_requests: 256 # max requests per instance
          placement_group_config:
            bundles:
            - GPU: "0.40"
          runtime_env:
            env_vars:
              VLLM_USE_V1: "0"
              VLLM_DISABLE_COMPILE_CACHE: "1"
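
As a cross-check, the same fractional bundle is accepted by Ray's placement group API directly, which suggests the rejection happens in the autoscaler's config schema rather than in Ray core; a minimal sketch (assuming a running cluster with at least one GPU):

import ray
from ray.util.placement_group import placement_group

ray.init()

# A bundle asking for 0.4 of a GPU is valid at the Ray core level.
pg = placement_group(bundles=[{"CPU": 1, "GPU": 0.4}])
ray.get(pg.ready())
print(pg.bundle_specs)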

Anything else

Please let me know if you need further details.

Versions:

  • vLLM: 0.12.0
  • Ray: 2.53.0
  • KubeRay Operator: 1.5.1

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

Metadata

Labels

bug, triage
