Search before asking
- I searched the issues and found no similar issues.
KubeRay Component
ray-operator
What happened + What you expected to happen
Problem we want to solve
We have servers with large GPUs, but we only want to serve small LLMs and embedding models. That is why we want to deploy multiple models on a single GPU.
To do this, we want to use fractional GPU serving: https://docs.ray.io/en/latest/serve/llm/user-guides/fractional-gpu.html
Implementation
As explained in the linked document, we use the following placement group configuration (a Ray-level sketch of the same request follows the snippet):
placement_group_config:
bundles:
- GPU: "0.40"
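For context, here is a minimal sketch of the same fractional request at the Ray core level. It assumes a running cluster with at least one GPU node and is only an illustration, not the code path that Ray Serve LLM uses internally:

```python
# Minimal sketch: Ray core itself accepts fractional GPU bundles.
# Assumes a running Ray cluster with at least one GPU node.
import ray
from ray.util.placement_group import placement_group

ray.init()

# Request a bundle with 0.4 of a GPU, analogous to the
# placement_group_config above.
pg = placement_group(bundles=[{"GPU": 0.4}])
ray.get(pg.ready())  # resolves once the bundle can be reserved
```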
Expected behavior
The model is placed on a worker that has at least 0.4 GPU available.
Actual behavior
The autoscaler accepts only integer resource values and returns the following error (the failing check is reproduced in isolation in the sketch below):
2026-01-27 23:22:45,038 ERROR (monitor) autoscaler.py:222 -- 0.4 is not of type 'integer'
Failed validating 'type' in schema['properties']['available_node_types']['patternProperties']['.*']['properties']['resources']['patternProperties']['.*']:
{'type': 'integer', 'minimum': 0}
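For illustration, the failing validation can be reproduced on its own with the jsonschema package. The assumption here is that the autoscaler applies the per-resource schema fragment quoted in the traceback; this snippet is not a copy of the autoscaler code:

```python
# Standalone sketch reproducing the validation failure.
# Assumption: the per-resource schema is the fragment quoted in the
# error above, i.e. {'type': 'integer', 'minimum': 0}.
import jsonschema

resource_schema = {"type": "integer", "minimum": 0}

jsonschema.validate(1, resource_schema)    # whole GPUs pass
jsonschema.validate(0.4, resource_schema)  # raises ValidationError: 0.4 is not of type 'integer'
```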
Reproduction script
apiVersion: ray.io/v1
kind: RayService
metadata:
name: ray-serve
spec:
rayClusterConfig:
headGroupSpec:
...
workerGroupSpecs:
...
serveConfigV2: |
applications:
- name: embeddings
import_path: ray.serve.llm:build_openai_app
route_prefix: "/"
args:
llm_configs:
- model_loading_config:
model_id: gemma-300m
model_source: google/embeddinggemma-300m
engine_kwargs:
dtype: auto
max_model_len: 2048
gpu_memory_utilization: 0.40
enforce_eager: false
deployment_config:
num_replicas: 1
max_ongoing_requests: 256 # max requests per instance
placement_group_config:
bundles:
- GPU: "0.40"
runtime_env:
env_vars:
VLLM_USE_V1: "0"
VLLM_DISABLE_COMPILE_CACHE: "1"
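Once the RayService is healthy, the embeddings application can be exercised with the OpenAI client. The host, port, and API key below are assumptions about how the Serve proxy is exposed, not values taken from the manifest:

```python
# Example request against the embeddings application once it is running.
# Assumptions: the Serve proxy is reachable at localhost:8000 and no real
# API key is required; adjust to your Service/Ingress setup.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.embeddings.create(
    model="gemma-300m",  # model_id from the config above
    input="fractional GPU test",
)
print(len(response.data[0].embedding))
```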
Anything else
Please let me know if you need further details.
Versions:
- vLLM: 0.12.0
- Ray: 2.53.0
- Kuberay Operator: 1.5.1
Are you willing to submit a PR?
- Yes I am willing to submit a PR!