
Readiness probe for multi-node setup cannot be customized and might make the Ray head pod unready #742

@mpaulazamin

Description

Describe the bug

I have a Kubernetes cluster with RKE2 distribution. I'm trying to run GPT Oss 20B with multi-node setup (Ray head node + Ray worker node). I used the following Helm chart:

servingEngineSpec:
  runtimeClassName: "nvidia"

  modelSpec:
    - name: "gpt-oss-20b"
      repository: "vllm/vllm-openai"
      tag: "latest" 
      modelURL: "openai/gpt-oss-20b"

      vllmConfig:
        v0: 0
        v1: 1
        maxModelLen: 32768
        tensorParallelSize: 2
        pipelineParallelSize: 2
        gpuMemoryUtilization: 0.9
        enablePrefixCaching: true
        enableChunkedPrefill: true
        extraArgs:
          - "--trust-remote-code"
          - "--disable-log-requests"
          - "--served-model-name"
          - "gpt-oss-20b"

      replicaCount: 1

      requestCPU: 16
      requestMemory: "32Gi"
      requestGPU: 2

      raySpec:
        headNode:
          requestCPU: 16
          requestMemory: "32Gi"
          requestGPU: 2

      lmcacheConfig:
        enabled: false

      env:
        - name: PYTHONHASHSEED
          value: "0"
        - name: CUDA_LAUNCH_BLOCKING 
          value: "0"
        - name: LMCACHE_TRACK_USAGE
          value: "false"

routerSpec:
  replicaCount: 3
  repository: "lmcache/lmstack-router"
  tag: "0.1.8.dev19-g3db93b87f.d20251008"

However, when I deploy the Helm chart, the Ray worker pod starts running, but describing the Ray head pod shows the following events:

Normal   Started    Started container vllm-ray-head
Warning  Unhealthy  3s (x19 over 2m37s)  Readiness probe failed: Get "http://10.42.0.123:8000/health": dial tcp 10.42.0.123:8000: connect: connection refused

I did some investigation, and it looks like this issue could be related to the pod's readiness probe. The default values for the head pod are failureThreshold: 1 and periodSeconds: 10, as you can see in the Helm template below. In my understanding, this means the probe tolerates only one failure before marking the pod as unready, so no traffic is routed to the pod.

      spec:
        terminationGracePeriodSeconds: 0
        containers:
          - name: vllm-ray-head
            image: "vllm/vllm-openai:latest"
            command:
              - >-
                /bin/bash -c "
                cp /entrypoint/vllm-entrypoint.sh \$HOME/vllm-entrypoint.sh &&
                chmod +x \$HOME/vllm-entrypoint.sh &&
                \$HOME/vllm-entrypoint.sh &
                echo \"Running vllm command in the background.\""
            env:
              - name: VLLM_HOST_IP
                valueFrom:
                  fieldRef:
                    fieldPath: status.podIP
              - name: EXPECTED_NODES
                value: "2"
              - name: HF_HOME
                value: /tmp
              - name: LMCACHE_LOG_LEVEL
                value: "DEBUG"
              - name: VLLM_USE_V1
                value: "1"
              - name: HF_TOKEN
                valueFrom:
                  secretKeyRef:
                    name: vllm-secrets
                    key: hf_token_gpt-oss-20b
              - name: PYTHONHASHSEED
                value: "0"
              - name: CUDA_LAUNCH_BLOCKING
                value: "1"
              - name: LMCACHE_TRACK_USAGE
                value: "false"
            ports:
              - name: "container-port"
                containerPort: 8000
            readinessProbe:
              httpGet:
                path: /health
                port: 8000
              failureThreshold: 1
              periodSeconds: 10
            livenessProbe:
              exec:
                command: ["/bin/bash", "-c", "echo TBD"]
            resources:
              limits:
                cpu: 16
                memory: 32Gi
                nvidia.com/gpu: 2
            startupProbe:
              exec:
                command: ["/bin/bash", "-c", "python3 /scripts/wait_for_ray.py"]
              failureThreshold: 30
              periodSeconds: 15
              timeoutSeconds: 10

I'm trying to customize the readiness probe for the Ray head node, but I haven't been able to. I tried adding the parameters under servingEngineSpec, modelSpec, and raySpec, regenerated the template, and nothing works; every time I still see the default values.
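
To check whether an override actually takes effect, the chart can be rendered locally and the head container's probe inspected, without waiting for a full deployment. A minimal sketch, assuming the chart is in ./helm and the values file is values.yaml (both paths are assumptions):

# Render the chart locally and print each container's readiness probe
helm template vllm ./helm -f values.yaml | grep -B 2 -A 6 "readinessProbe:"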

My setup takes a long time to bring everything up, so if the readiness probe starts too early it is likely to fail. As a result, I can't tell whether the multi-node setup is failing because of the readiness probe, and I'm not able to customize the probe to test different values.
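
As a temporary workaround while the chart does not expose this setting, the probe can be relaxed on the already-deployed object, since probe fields in a pod template are mutable on the owning workload. This is only a sketch, assuming the head pod is managed by a Deployment; the name below is hypothetical and should be replaced with the real one from kubectl get deployments:

# Hypothetical Deployment name for the Ray head; look it up with `kubectl get deployments`
kubectl patch deployment vllm-gpt-oss-20b-ray-head --type=json -p \
  '[{"op": "replace", "path": "/spec/template/spec/containers/0/readinessProbe/failureThreshold", "value": 10},
    {"op": "replace", "path": "/spec/template/spec/containers/0/readinessProbe/periodSeconds", "value": 30}]'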

To Reproduce

Generate templates for the following Helm values files (three attempts, placing the readinessProbe override under servingEngineSpec, under raySpec, and under the modelSpec entry, respectively):

# First attempt: readinessProbe under servingEngineSpec
servingEngineSpec:
  runtimeClassName: "nvidia"

  readinessProbe:
    httpGet:
      path: /health
      port: 8000
    failureThreshold: 10
    periodSeconds: 30

  modelSpec:
    - name: "gpt-oss-20b"
      repository: "vllm/vllm-openai"
      tag: "latest" 
      modelURL: "openai/gpt-oss-20b"

      vllmConfig:
        v0: 0
        v1: 1
        maxModelLen: 32768
        tensorParallelSize: 2
        pipelineParallelSize: 2
        gpuMemoryUtilization: 0.9
        enablePrefixCaching: true
        enableChunkedPrefill: true
        extraArgs:
          - "--trust-remote-code"
          - "--disable-log-requests"
          - "--served-model-name"
          - "gpt-oss-20b"

      replicaCount: 1

      requestCPU: 16
      requestMemory: "32Gi"
      requestGPU: 2

      raySpec:
        headNode:
          requestCPU: 16
          requestMemory: "32Gi"
          requestGPU: 2

      lmcacheConfig:
        enabled: false

      env:
        - name: PYTHONHASHSEED
          value: "0"
        - name: CUDA_LAUNCH_BLOCKING 
          value: "0"
        - name: LMCACHE_TRACK_USAGE
          value: "false"

routerSpec:
  replicaCount: 3
  repository: "lmcache/lmstack-router"
  tag: "0.1.8.dev19-g3db93b87f.d20251008"

# Second attempt: readinessProbe under raySpec
servingEngineSpec:
  runtimeClassName: "nvidia"

  modelSpec:
    - name: "gpt-oss-20b"
      repository: "vllm/vllm-openai"
      tag: "latest" 
      modelURL: "openai/gpt-oss-20b"

      vllmConfig:
        v0: 0
        v1: 1
        maxModelLen: 32768
        tensorParallelSize: 2
        pipelineParallelSize: 2
        gpuMemoryUtilization: 0.9
        enablePrefixCaching: true
        enableChunkedPrefill: true
        extraArgs:
          - "--trust-remote-code"
          - "--disable-log-requests"
          - "--served-model-name"
          - "gpt-oss-20b"

      replicaCount: 1

      requestCPU: 16
      requestMemory: "32Gi"
      requestGPU: 2

      raySpec:
        headNode:
          requestCPU: 16
          requestMemory: "32Gi"
          requestGPU: 2
        readinessProbe:
          httpGet:
            path: /health
            port: 8000
          failureThreshold: 10
          periodSeconds: 30

      lmcacheConfig:
        enabled: false

      env:
        - name: PYTHONHASHSEED
          value: "0"
        - name: CUDA_LAUNCH_BLOCKING 
          value: "0"
        - name: LMCACHE_TRACK_USAGE
          value: "false"

routerSpec:
  replicaCount: 3
  repository: "lmcache/lmstack-router"
  tag: "0.1.8.dev19-g3db93b87f.d20251008"

# Third attempt: readinessProbe under the modelSpec entry
servingEngineSpec:
  runtimeClassName: "nvidia"

  modelSpec:
    - name: "gpt-oss-20b"
      repository: "vllm/vllm-openai"
      tag: "latest" 
      modelURL: "openai/gpt-oss-20b"

      vllmConfig:
        v0: 0
        v1: 1
        maxModelLen: 32768
        tensorParallelSize: 2
        pipelineParallelSize: 2
        gpuMemoryUtilization: 0.9
        enablePrefixCaching: true
        enableChunkedPrefill: true
        extraArgs:
          - "--trust-remote-code"
          - "--disable-log-requests"
          - "--served-model-name"
          - "gpt-oss-20b"

      replicaCount: 1

      requestCPU: 16
      requestMemory: "32Gi"
      requestGPU: 2

      readinessProbe:
        httpGet:
          path: /health
          port: 8000
        failureThreshold: 10
        periodSeconds: 30

      raySpec:
        headNode:
          requestCPU: 16
          requestMemory: "32Gi"
          requestGPU: 2

      lmcacheConfig:
        enabled: false

      env:
        - name: PYTHONHASHSEED
          value: "0"
        - name: CUDA_LAUNCH_BLOCKING 
          value: "0"
        - name: LMCACHE_TRACK_USAGE
          value: "false"

routerSpec:
  replicaCount: 3
  repository: "lmcache/lmstack-router"
  tag: "0.1.8.dev19-g3db93b87f.d20251008"

Expected behavior

Users should be able to customize the readiness probe for the Ray head node in a multi-node setup.
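
One way the chart could support this is sketched below: honor an optional readinessProbe block under raySpec.headNode and fall back to the current defaults when it is absent. This is only an illustration, not the current template; the values key and the $modelSpec loop variable are assumptions about the chart's structure.

# Sketch: use raySpec.headNode.readinessProbe when provided, else keep the defaults
{{- $probe := $modelSpec.raySpec.headNode.readinessProbe }}
readinessProbe:
{{- if $probe }}
  {{- toYaml $probe | nindent 2 }}
{{- else }}
  httpGet:
    path: /health
    port: 8000
  failureThreshold: 1
  periodSeconds: 10
{{- end }}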

Additional context

No response
