[Core] ray.init() requires local raylet process visibility — breaks Kubernetes sidecar deployments (regression from 2.53) #63923

@HariniNarayanan

What happened + What you expected to happen

What happened

In Ray 2.53, ray.init(address="auto") worked when the raylet ran in a separate container of the same Kubernetes pod, sharing /tmp/ray via an emptyDir volume. Each container had its own PID namespace (the Kubernetes default).

In Ray 2.55, the same setup fails. ray.init() hangs for 60 seconds and then raises:

RuntimeError: No node info found matching attributes: '' when trying to resolve node to connect to.

Root cause

PR #59229 changed node discovery: instead of reading session files in temp_dir, it scans local processes via psutil.process_iter() to find the raylet and extracts the internally assigned --node_id from its command line. PR #61029 added a retry loop but kept the hard requirement that the raylet be visible in the local process table.

When the raylet runs in a sibling container with a separate PID namespace, psutil cannot see it. The session files in the shared /tmp/ray volume contain all the information needed to connect, but they are no longer consulted.
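The visibility problem can be checked directly from the driver container. The sketch below is not Ray's discovery code; it is a stdlib-only approximation that walks /proc (the same source psutil reads on Linux) and looks for a process whose executable is named raylet. The function names, the assumption that the binary's basename is `raylet`, and the two accepted spellings of the flag (`--node_id=<id>` and `--node_id <id>`) are illustrative guesses, not confirmed details of Ray's command line.

```python
import os
from typing import List, Optional


def read_cmdline(pid_dir: str) -> List[str]:
    """Return the argv of a process from its <proc_root>/<pid>/cmdline file."""
    try:
        with open(os.path.join(pid_dir, "cmdline"), "rb") as f:
            raw = f.read()
    except OSError:
        return []  # process exited, or its entry is not readable
    # Arguments in /proc/<pid>/cmdline are NUL-separated.
    return [a.decode(errors="replace") for a in raw.split(b"\0") if a]


def extract_node_id(argv: List[str]) -> Optional[str]:
    """Return the value passed as --node_id, or None if the flag is absent."""
    for i, arg in enumerate(argv):
        if arg.startswith("--node_id="):
            return arg.split("=", 1)[1]
        if arg == "--node_id" and i + 1 < len(argv):
            return argv[i + 1]
    return None


def visible_raylet_node_ids(proc_root: str = "/proc") -> List[str]:
    """List node IDs of raylet processes visible in this PID namespace."""
    found: List[str] = []
    try:
        entries = os.listdir(proc_root)
    except OSError:
        return found  # no procfs here (for example, not Linux)
    for entry in entries:
        if not entry.isdigit():
            continue  # only numeric entries are PIDs
        argv = read_cmdline(os.path.join(proc_root, entry))
        # Assumption: the raylet executable's basename is "raylet".
        if argv and os.path.basename(argv[0]) == "raylet":
            node_id = extract_node_id(argv)
            if node_id:
                found.append(node_id)
    return found


if __name__ == "__main__":
    ids = visible_raylet_node_ids()
    print(ids or "no raylet visible in this PID namespace")
```

Run inside the driver container of the pod from the reproduction, this would be expected to report no raylet, while the same script in the raylet-sidecar container should see one, if the assumptions about the command line hold.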

Why this matters

Running the raylet in a dedicated sidecar container (separate from the driver/application container) is a natural Kubernetes pattern. It provides independent lifecycle management, resource limits, health probes, and restart policies for the raylet without coupling it to the application process. Merging PID namespaces to work around this weakens security isolation between containers in the pod.

Related

Versions / Dependencies

  • Ray (broken): 2.55.0
  • Ray (worked): 2.53.0
  • OS: Linux (container images based on rayproject/ray:2.55.0)
  • Python: 3.10 (bundled in the Ray image)
  • Kubernetes: 1.29+ (EKS)
  • KubeRay: v1.6.0

Reproduction script

Reproduction

apiVersion: v1
kind: Pod
spec:
  # shareProcessNamespace NOT set (default: false)
  containers:
    - name: driver
      image: rayproject/ray:2.55.0
      command: ["sleep", "infinity"]
      env:
        - name: RAY_ADDRESS
          value: "auto"
      volumeMounts:
        - name: ray-tmp
          mountPath: /tmp/ray
    - name: raylet-sidecar
      image: rayproject/ray:2.55.0
      command: ["bash", "-c", "ray start --address=<head>:6379 --num-cpus=0 --temp-dir=/tmp/ray --block"]
      volumeMounts:
        - name: ray-tmp
          mountPath: /tmp/ray
  volumes:
    - name: ray-tmp
      emptyDir: {}

kubectl exec <pod> -c driver -- python -c "import ray; ray.init()"
# Hangs 60s → RuntimeError

Setting shareProcessNamespace: true fixes it, but this is not always acceptable: other containers in the pod may handle credentials and must remain PID-isolated from user code.
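To tell this failure apart from a misconfigured volume mount, it may help to confirm from the driver container that the shared directory really exposes the raylet's session. The sketch below assumes Ray's usual on-disk layout (a `session_latest` entry under the temp dir, with `sockets/raylet` and `sockets/plasma_store` inside it); those paths and the helper name `describe_session_dir` are assumptions for illustration and may differ between Ray versions.

```python
import os
from typing import Dict


def describe_session_dir(temp_dir: str = "/tmp/ray") -> Dict[str, bool]:
    """Report which expected session artifacts exist under temp_dir.

    Assumed layout (not verified against a specific Ray version):
      <temp_dir>/session_latest/sockets/raylet
      <temp_dir>/session_latest/sockets/plasma_store
    """
    session = os.path.join(temp_dir, "session_latest")
    sockets = os.path.join(session, "sockets")
    return {
        "session_dir": os.path.isdir(session),
        "raylet_socket": os.path.exists(os.path.join(sockets, "raylet")),
        "plasma_socket": os.path.exists(os.path.join(sockets, "plasma_store")),
    }


if __name__ == "__main__":
    for name, present in describe_session_dir().items():
        print(f"{name}: {'found' if present else 'missing'}")
```

If every entry is reported as found in the driver container while ray.init() still times out, the shared volume is working and the missing piece is process visibility, which matches the behavior described in this issue.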

Issue Severity

None

Metadata

Labels

  • P1 (issue that should be fixed within a few weeks)
  • bug (something that is supposed to be working, but isn't)
  • community-backlog
  • core (issues that should be addressed in Ray Core)
  • regression
  • stability

Projects

Status
In progress
