Search before asking
- I had searched in the issues and found no similar feature requirement.
Description
It may be beneficial to support creating multiple Ray workers on the same Pod created with KubeRay.
There is more context in this PR: ai-on-gke/kuberay-tpu-webhook#19, which supports multiple TPU containers with KubeRay, and specifically this comment: ai-on-gke/kuberay-tpu-webhook#19 (comment).
I was able to create a RayCluster with 2 workers, each running 2 Ray containers (so 4 Ray nodes total), and run a workload on it. However, to avoid port conflicts and to pass the correct Ray resources, it's currently necessary to manually construct a `ray start` command for the second container. KubeRay could reduce the manual intervention required of the user by updating its resource detection and port assignment logic to handle multiple Ray containers automatically.
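As a rough illustration of the manual workaround described above, the second container's `ray start` command has to override the default ports and declare its own resources by hand. The sketch below is hypothetical: the container names, image tag, port numbers, resource values, and the `$RAY_HEAD_IP` variable are illustrative assumptions, not part of any existing KubeRay API.

```yaml
# Hypothetical worker Pod template fragment with two Ray containers.
# KubeRay injects the default ray start command into the first container;
# the second container must be configured manually today.
spec:
  containers:
    - name: ray-worker-0
      image: rayproject/ray:2.9.0
      # Default KubeRay-managed container: standard ports (e.g. 6380/6381),
      # resources auto-detected from the container spec.
    - name: ray-worker-1
      image: rayproject/ray:2.9.0
      command: ["/bin/bash", "-c", "--"]
      args:
        # Manually shift every port the first container already uses,
        # and pass this container's share of the resources explicitly.
        - >
          ray start --block
          --address=$RAY_HEAD_IP:6379
          --node-manager-port=6390
          --object-manager-port=6391
          --metrics-export-port=8081
          --min-worker-port=11000
          --max-worker-port=11999
          --resources='{"TPU": 2}'
```

If KubeRay handled this automatically, the operator could compute non-conflicting port offsets and split the detected resources across containers, so the user would only declare the containers themselves.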
Use case
TPU v7x introduces a dual-chiplet architecture in which a standard 4-chip VM spans two distinct NUMA nodes. To optimize memory bandwidth and avoid cross-NUMA latency, v7x workloads can now run as multiple NUMA-aligned containers within a single Pod. This would entail multiple Ray containers in the same Pod created by KubeRay.
More context on the new accelerator: https://docs.cloud.google.com/tpu/docs/tpu7x
Related issues
N/A
Are you willing to submit a PR?
- Yes I am willing to submit a PR!