What happened + What you expected to happen
What happened
In Ray 2.53, ray.init(address="auto") worked when the raylet ran in a separate container of the same Kubernetes pod, sharing /tmp/ray via an emptyDir volume. Each container had its own PID namespace (the Kubernetes default).
In Ray 2.55, the same setup fails. ray.init() hangs for 60s then raises:
RuntimeError: No node info found matching attributes: '' when trying to resolve node to connect to.
Root cause
PR #59229 changed node discovery from reading session files in temp_dir to scanning local processes via psutil.process_iter() to find the raylet process and extract its internally-assigned --node_id from the cmdline. PR #61029 added a retry loop but kept the hard requirement on the raylet being visible in the local process table.
When the raylet runs in a sibling container with a separate PID namespace, psutil cannot see it. The session files in the shared /tmp/ray volume contain all the information needed to connect, but they are no longer consulted.
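The failure mode can be illustrated with a minimal sketch. This is not Ray's actual implementation: `extract_node_id` and `scan_for_node_ids` are hypothetical helpers, and the command lines are made-up stand-ins for what a process scan would return. The only things taken from the description above are the `--node_id` flag and the idea of scanning the local process table.

```python
import os
from typing import Iterable, List, Optional


def extract_node_id(cmdline: List[str]) -> Optional[str]:
    """Return the --node_id value if this command line belongs to a raylet."""
    if not cmdline or os.path.basename(cmdline[0]) != "raylet":
        return None
    for i, arg in enumerate(cmdline):
        # Accept both "--node_id=<id>" and "--node_id <id>" spellings.
        if arg.startswith("--node_id="):
            return arg.split("=", 1)[1]
        if arg == "--node_id" and i + 1 < len(cmdline):
            return cmdline[i + 1]
    return None


def scan_for_node_ids(process_cmdlines: Iterable[List[str]]) -> List[str]:
    """Collect node IDs from every raylet in the given process table."""
    found = []
    for cmdline in process_cmdlines:
        node_id = extract_node_id(cmdline)
        if node_id is not None:
            found.append(node_id)
    return found


# Process table as seen when the raylet shares the driver's PID namespace.
shared_namespace = [
    ["sleep", "infinity"],
    ["/opt/ray/raylet", "--node_id=abc123"],
]
# Process table as seen from the driver container when the raylet
# runs in a sibling container with its own PID namespace.
isolated_namespace = [
    ["sleep", "infinity"],
    ["python", "-c", "import ray; ray.init()"],
]

print(scan_for_node_ids(shared_namespace))    # ['abc123']
print(scan_for_node_ids(isolated_namespace))  # []
```

With an isolated PID namespace the scan has nothing to match, however long it retries, which is consistent with the 60s hang followed by the error above.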
Why this matters
Running the raylet in a dedicated sidecar container (separate from the driver/application container) is a natural Kubernetes pattern. It provides independent lifecycle management, resource limits, health probes, and restart policies for the raylet without coupling it to the application process. Merging PID namespaces to work around this weakens security isolation between containers in the pod.
Related
- find_node_ids()
Versions / Dependencies
- Ray (broken): 2.55.0
- Ray (worked): 2.53.0
- OS: Linux (container images based on rayproject/ray:2.55.0)
- Python: 3.10 (bundled in the Ray image)
- Kubernetes: 1.29+ (EKS)
- KubeRay: v1.6.0
Reproduction script
Reproduction
apiVersion: v1
kind: Pod
spec:
  # shareProcessNamespace NOT set (default: false)
  containers:
    - name: driver
      image: rayproject/ray:2.55.0
      command: ["sleep", "infinity"]
      env:
        - name: RAY_ADDRESS
          value: "auto"
      volumeMounts:
        - name: ray-tmp
          mountPath: /tmp/ray
    - name: raylet-sidecar
      image: rayproject/ray:2.55.0
      command: ["bash", "-c", "ray start --address=<head>:6379 --num-cpus=0 --temp-dir=/tmp/ray --block"]
      volumeMounts:
        - name: ray-tmp
          mountPath: /tmp/ray
  volumes:
    - name: ray-tmp
      emptyDir: {}
kubectl exec <pod> -c driver -- python -c "import ray; ray.init()"
# Hangs 60s → RuntimeError
Setting shareProcessNamespace: true fixes it, but this is not always acceptable: other containers in the pod may handle credentials and must remain PID-isolated from user code.
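To confirm that a given pod is hitting PID isolation rather than a missing volume mount, a check along these lines can be run inside the driver container. It is a rough diagnostic sketch using only the standard library. `visible_raylet_pids` and `shared_session_entries` are hypothetical names, and the script assumes a Linux `/proc` and the `/tmp/ray` mount path from the manifest above.

```python
import os
from typing import List


def visible_raylet_pids(proc_root: str = "/proc") -> List[int]:
    """Return PIDs of raylet processes visible in this PID namespace."""
    try:
        entries = os.listdir(proc_root)
    except OSError:
        return []
    pids = []
    for entry in entries:
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        try:
            with open(os.path.join(proc_root, entry, "cmdline"), "rb") as f:
                argv = f.read().split(b"\0")
        except OSError:
            continue  # process exited or is not readable
        if argv and os.path.basename(argv[0]) == b"raylet":
            pids.append(int(entry))
    return sorted(pids)


def shared_session_entries(temp_dir: str = "/tmp/ray") -> List[str]:
    """List what the shared temp dir contains, or [] if it is not mounted."""
    try:
        return sorted(os.listdir(temp_dir))
    except OSError:
        return []


if __name__ == "__main__":
    print("raylet PIDs visible here:", visible_raylet_pids())
    print("entries in shared temp dir:", shared_session_entries())
```

If the first list is empty while the second shows session data written by the sidecar, the driver can read the shared volume but cannot see the raylet process, which matches the situation described in this report.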
Issue Severity
None