[Test] Add support for fractional GPU values in Ray start parameters …#4454

Draft
tiennguyentony wants to merge 7 commits into ray-project:master from tiennguyentony:fix/4447-fractional-gpu-support

@tiennguyentony tiennguyentony commented Jan 28, 2026

[Feature] Add support for fractional GPU values in Ray start parameters and corresponding tests

Why are these changes needed?

This PR adds support for fractional GPU values in Ray start parameters, addressing issue #4447.

Problem: Users need to serve multiple small LLMs on a single GPU using Ray's fractional GPU serving feature (e.g., 0.4 GPU per model), but the autoscaler rejected fractional GPU values with the error: "0.4 is not of type 'integer'".

Solution:

  • Modified pod.go: Changed GPU resource conversion from int64() to float64() to support fractional values
  • Added unit test in pod_test.go: TestUpdateRayStartParamsResources_WithFractionalGPU validates the conversion logic
  • Added e2e test in raycluster_test.go: TestRayClusterWithFractionalGPU validates end-to-end integration

This enables users to specify fractional GPU allocations like GPU: "0.4" in their Ray placement groups for efficient multi-model serving.

Related issue number

Closes #4447

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests - TestUpdateRayStartParamsResources_WithFractionalGPU validates GPU conversion logic
    • Manual tests - Ran e2e test TestRayClusterWithFractionalGPU locally (passes in 1.07s)
    • This PR is not tested :(

Test Results

=== RUN   TestRayClusterWithFractionalGPU
    raycluster_test.go:327: [2026-01-28] Created RayCluster for testing fractional GPU conversion
    raycluster_test.go:343: [2026-01-28] RayCluster pods created successfully
    raycluster_test.go:366: ✓ Test passed: RayCluster with fractional GPU configuration created successfully
--- PASS: TestRayClusterWithFractionalGPU (1.07s)
PASS

Changes Summary

| File | Lines Changed | Description |
| --- | --- | --- |
| ray-operator/controllers/ray/common/pod.go | 4 (+3, -1) | Core fix: Convert GPU resources using float64 instead of int64 |
| ray-operator/controllers/ray/common/pod_test.go | 41 (+41) | Unit test for fractional GPU conversion |
| ray-operator/test/e2e/raycluster_test.go | 102 (+102) | E2E test for RayCluster with fractional GPU config |
| Total | 147 (+146, -1) | |

@tiennguyentony tiennguyentony marked this pull request as draft January 28, 2026 20:05
@tiennguyentony tiennguyentony marked this pull request as ready for review January 28, 2026 20:06
…cceleratorResources

Critical bug fix: The addWellKnownAcceleratorResources function was using
strconv.FormatInt which truncated fractional GPU values to integers. When
users specify GPU resources via container.Resources.Limits (the standard
Kubernetes pattern), values like 400m (0.4 GPU) were truncated to 0.

This fix applies the same FormatFloat conversion used in updateRayStartParamsResources,
ensuring both code paths properly handle fractional GPU values:
  - 400m -> 0.4 GPU
  - 1 -> 1 GPU
  - 4 -> 4 GPUs

Added unit test TestAddWellKnownAcceleratorResources_WithFractionalGPU to
validate the fix covers container resource limits.

Fixes Issue ray-project#4447: Enable fractional GPU serving support
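
The difference between the two formatting paths in this commit message can be sketched as follows. This is a standalone illustration under stated assumptions: the GPU amount is modeled as a plain int64 count of milli-units (400 for 400m), and `truncatedGPU` and `fractionalGPU` are hypothetical names, not the KubeRay functions. The integer path models the truncation as the commit message describes it.

```go
package main

import (
	"fmt"
	"strconv"
)

// truncatedGPU models the integer path described in the commit message:
// only whole units survive, so any amount below 1000 milli-units becomes "0".
func truncatedGPU(milli int64) string {
	return strconv.FormatInt(milli/1000, 10)
}

// fractionalGPU models the float path: the milli-unit count is divided as a
// float64 and formatted with the shortest exact decimal representation.
func fractionalGPU(milli int64) string {
	return strconv.FormatFloat(float64(milli)/1000, 'f', -1, 64)
}

func main() {
	// Inputs correspond to 400m, 1, and 4 from the commit message.
	for _, milli := range []int64{400, 1000, 4000} {
		fmt.Printf("%dm: int path %q, float path %q\n",
			milli, truncatedGPU(milli), fractionalGPU(milli))
	}
}
```

Whole-GPU values format identically on both paths, so only fractional requests change behavior.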

@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

@tiennguyentony tiennguyentony marked this pull request as draft January 29, 2026 00:43
…of fractional GPU resources to Ray start parameters
…PU by removing unnecessary GroupResource wrapper
…sterWithFractionalGPU

- Changed WithResources(rayv1ac.GroupResource().WithRequestedResources(...)) to WithResources(map[string]string{...})
- Fixed API usage to match the correct signature for setting resource specs in worker group
- Added 2-second graceful shutdown to allow operator cleanup before namespace deletion
- Prevents race condition where test cleanup happens before operator finishes cleanup operations
- Fixes issue ray-project#4447: Add support for fractional GPU values in Ray start parameters
…nt namespace termination race

- Added 2-second sleep before namespace deletion in TestRayClusterWithResourceQuota
- Prevents 'unable to create new content in namespace because it is being terminated' error
- Same fix as applied to TestRayClusterWithFractionalGPU
- Addresses CI test flakiness during cleanup phase
Copilot AI mentioned this pull request Feb 5, 2026
4 tasks
Development

Successfully merging this pull request may close these issues.

[Bug] Support for fractional GPU serving

1 participant