
[Feature] Support multiple Ray containers per Pod #4455

@ryanaoleary


Search before asking

  • I had searched in the issues and found no similar feature requirement.

Description

It may be beneficial to support running multiple Ray worker containers in the same Pod created by KubeRay.

There is more context in this PR: ai-on-gke/kuberay-tpu-webhook#19, which supports multiple TPU containers with KubeRay, and specifically this comment: ai-on-gke/kuberay-tpu-webhook#19 (comment).

I was able to create a RayCluster with 2 worker Pods, each running 2 Ray containers (4 Ray nodes total), and run a workload on it. However, to avoid port conflicts and pass the correct Ray resources, the ray start command for the second container must be constructed manually. KubeRay could reduce this manual intervention by extending its resource detection logic and port assignment to handle multiple containers automatically.
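For illustration, here is a rough sketch of what the manual workaround looks like today. This is an assumption-laden example, not KubeRay's documented behavior: the container names, image tag, port numbers, resource values, and the `$RAY_IP` address are all hypothetical, and only the `ray start` flags shown (`--address`, `--node-manager-port`, `--object-manager-port`, `--dashboard-agent-listen-port`, `--metrics-export-port`, `--num-cpus`, `--resources`, `--block`) are real CLI options.

```yaml
# Hypothetical worker group with two Ray containers in one Pod.
# KubeRay injects the ray start command into the first container only;
# the second container must start its own Ray node with non-conflicting
# ports and explicitly declared resources.
workerGroupSpecs:
  - groupName: dual-container-workers
    replicas: 2
    template:
      spec:
        containers:
          - name: ray-worker-0            # managed by KubeRay
            image: rayproject/ray:2.9.0
          - name: ray-worker-1            # second Ray node, started manually
            image: rayproject/ray:2.9.0
            command: ["/bin/bash", "-lc"]
            args:
              - >
                ray start --address=$RAY_IP:6379
                --node-manager-port=6390
                --object-manager-port=6391
                --dashboard-agent-listen-port=52366
                --metrics-export-port=8081
                --num-cpus=4
                --resources='{"TPU": 2}'
                --block
```

Every port in the second container is shifted away from the defaults used by the first container, and the custom resources must be repeated by hand; this duplication is exactly what automatic multi-container support would eliminate.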

Use case

TPU v7x introduces a dual-chiplet architecture where a standard 4-chip VM spans two distinct NUMA nodes. To optimize memory bandwidth and avoid cross-NUMA latency, v7x workloads can now run as multiple NUMA-aligned containers within a single Pod. This would entail multiple Ray containers in the same Pod created by KubeRay.

More context on the new accelerator: https://docs.cloud.google.com/tpu/docs/tpu7x

Related issues

N/A

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

Metadata

Labels

enhancement (New feature or request)
