
Readiness probe for multi-node setup cannot be customized and might make the Ray head pod unready #742

@mpaulazamin

Description

Describe the bug

I have a Kubernetes cluster with RKE2 distribution. I'm trying to run GPT Oss 20B with multi-node setup (Ray head node + Ray worker node). I used the following Helm chart:

servingEngineSpec:
  runtimeClassName: "nvidia"

  modelSpec:
    - name: "gpt-oss-20b"
      repository: "vllm/vllm-openai"
      tag: "latest" 
      modelURL: "openai/gpt-oss-20b"

      vllmConfig:
        v0: 0
        v1: 1
        maxModelLen: 32768
        tensorParallelSize: 2
        pipelineParallelSize: 2
        gpuMemoryUtilization: 0.9
        enablePrefixCaching: true
        enableChunkedPrefill: true
        extraArgs:
          - "--trust-remote-code"
          - "--disable-log-requests"
          - "--served-model-name"
          - "gpt-oss-20b"

      replicaCount: 1

      requestCPU: 16
      requestMemory: "32Gi"
      requestGPU: 2

      raySpec:
        headNode:
          requestCPU: 16
          requestMemory: "32Gi"
          requestGPU: 2

      lmcacheConfig:
        enabled: false

      env:
        - name: PYTHONHASHSEED
          value: "0"
        - name: CUDA_LAUNCH_BLOCKING 
          value: "0"
        - name: LMCACHE_TRACK_USAGE
          value: "false"

routerSpec:
  replicaCount: 3
  repository: "lmcache/lmstack-router"
  tag: "0.1.8.dev19-g3db93b87f.d20251008"

However, when I deploy the Helm chart, the Ray worker pod starts running, but describing the Ray head pod shows the following events:

Normal   Started    Started container vllm-ray-head
Warning  Unhealthy  3s (x19 over 2m37s)  Readiness probe failed: Get "http://10.42.0.123:8000/health": dial tcp 10.42.0.123:8000: connect: connection refused

I did some investigation, and it looks like this issue could be related to the pod's readiness probe. The default values for the head pod are failureThreshold: 1 and periodSeconds: 10, as you can see in the Helm template below. In my understanding, this means the probe tolerates only one failure before marking the pod as unready, so no traffic is routed to the pod.

      spec:
        terminationGracePeriodSeconds: 0
        containers:
          - name: vllm-ray-head
            image: "vllm/vllm-openai:latest"
            command:
              - >-
                /bin/bash -c "
                cp /entrypoint/vllm-entrypoint.sh \$HOME/vllm-entrypoint.sh &&
                chmod +x \$HOME/vllm-entrypoint.sh &&
                \$HOME/vllm-entrypoint.sh &
                echo \"Running vllm command in the background.\""
            env:
              - name: VLLM_HOST_IP
                valueFrom:
                  fieldRef:
                    fieldPath: status.podIP
              - name: EXPECTED_NODES
                value: "2"
              - name: HF_HOME
                value: /tmp
              - name: LMCACHE_LOG_LEVEL
                value: "DEBUG"
              - name: VLLM_USE_V1
                value: "1"
              - name: HF_TOKEN
                valueFrom:
                  secretKeyRef:
                    name: vllm-secrets
                    key: hf_token_gpt-oss-20b
              - name: PYTHONHASHSEED
                value: "0"
              - name: CUDA_LAUNCH_BLOCKING
                value: "1"
              - name: LMCACHE_TRACK_USAGE
                value: "false"
            ports:
              - name: "container-port"
                containerPort: 8000
            readinessProbe:
              httpGet:
                path: /health
                port: 8000
              failureThreshold: 1
              periodSeconds: 10
            livenessProbe:
              exec:
                command: ["/bin/bash", "-c", "echo TBD"]
            resources:
              limits:
                cpu: 16
                memory: 32Gi
                nvidia.com/gpu: 2
            startupProbe:
              exec:
                command: ["/bin/bash", "-c", "python3 /scripts/wait_for_ray.py"]
              failureThreshold: 30
              periodSeconds: 15
              timeoutSeconds: 10

I'm trying to customize the readiness probe for the Ray head node, but I haven't been able to. I tried adding the parameters under servingEngineSpec, modelSpec, and raySpec, regenerated the template, and nothing works; every time I still see the default values.
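
To check whether an override actually takes effect, the chart can be rendered locally and the head container's probe inspected, without waiting for a full deployment. A minimal sketch, assuming the chart is in ./helm and the values file is values.yaml (both paths are assumptions):

# Render the chart locally and print each container's readiness probe
helm template vllm ./helm -f values.yaml | grep -B 2 -A 6 "readinessProbe:"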

My setup takes a long time to bring everything up, so if the readiness probe starts too early it is likely to fail. As a result, I can't tell whether the multi-node setup is failing because of the readiness probe, and I'm not able to customize the probe to test different values.
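
As a temporary workaround while the chart does not expose this setting, the probe can be relaxed on the already-deployed object, since probe fields in a pod template are mutable on the owning workload. This is only a sketch, assuming the head pod is managed by a Deployment; the name below is hypothetical and should be replaced with the real one from kubectl get deployments:

# Hypothetical Deployment name for the Ray head; look it up with `kubectl get deployments`
kubectl patch deployment vllm-gpt-oss-20b-ray-head --type=json -p \
  '[{"op": "replace", "path": "/spec/template/spec/containers/0/readinessProbe/failureThreshold", "value": 10},
    {"op": "replace", "path": "/spec/template/spec/containers/0/readinessProbe/periodSeconds", "value": 30}]'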

To Reproduce

Generate templates for the following Helm values files (three attempts, placing the readinessProbe override under servingEngineSpec, under raySpec, and under the modelSpec entry, respectively):

# First attempt: readinessProbe under servingEngineSpec
servingEngineSpec:
  runtimeClassName: "nvidia"

  readinessProbe:
    httpGet:
      path: /health
      port: 8000
    failureThreshold: 10
    periodSeconds: 30

  modelSpec:
    - name: "gpt-oss-20b"
      repository: "vllm/vllm-openai"
      tag: "latest" 
      modelURL: "openai/gpt-oss-20b"

      vllmConfig:
        v0: 0
        v1: 1
        maxModelLen: 32768
        tensorParallelSize: 2
        pipelineParallelSize: 2
        gpuMemoryUtilization: 0.9
        enablePrefixCaching: true
        enableChunkedPrefill: true
        extraArgs:
          - "--trust-remote-code"
          - "--disable-log-requests"
          - "--served-model-name"
          - "gpt-oss-20b"

      replicaCount: 1

      requestCPU: 16
      requestMemory: "32Gi"
      requestGPU: 2

      raySpec:
        headNode:
          requestCPU: 16
          requestMemory: "32Gi"
          requestGPU: 2

      lmcacheConfig:
        enabled: false

      env:
        - name: PYTHONHASHSEED
          value: "0"
        - name: CUDA_LAUNCH_BLOCKING 
          value: "0"
        - name: LMCACHE_TRACK_USAGE
          value: "false"

routerSpec:
  replicaCount: 3
  repository: "lmcache/lmstack-router"
  tag: "0.1.8.dev19-g3db93b87f.d20251008"

# Second attempt: readinessProbe under raySpec
servingEngineSpec:
  runtimeClassName: "nvidia"

  modelSpec:
    - name: "gpt-oss-20b"
      repository: "vllm/vllm-openai"
      tag: "latest" 
      modelURL: "openai/gpt-oss-20b"

      vllmConfig:
        v0: 0
        v1: 1
        maxModelLen: 32768
        tensorParallelSize: 2
        pipelineParallelSize: 2
        gpuMemoryUtilization: 0.9
        enablePrefixCaching: true
        enableChunkedPrefill: true
        extraArgs:
          - "--trust-remote-code"
          - "--disable-log-requests"
          - "--served-model-name"
          - "gpt-oss-20b"

      replicaCount: 1

      requestCPU: 16
      requestMemory: "32Gi"
      requestGPU: 2

      raySpec:
        headNode:
          requestCPU: 16
          requestMemory: "32Gi"
          requestGPU: 2
        readinessProbe:
          httpGet:
            path: /health
            port: 8000
          failureThreshold: 10
          periodSeconds: 30

      lmcacheConfig:
        enabled: false

      env:
        - name: PYTHONHASHSEED
          value: "0"
        - name: CUDA_LAUNCH_BLOCKING 
          value: "0"
        - name: LMCACHE_TRACK_USAGE
          value: "false"

routerSpec:
  replicaCount: 3
  repository: "lmcache/lmstack-router"
  tag: "0.1.8.dev19-g3db93b87f.d20251008"

# Third attempt: readinessProbe under the modelSpec entry
servingEngineSpec:
  runtimeClassName: "nvidia"

  modelSpec:
    - name: "gpt-oss-20b"
      repository: "vllm/vllm-openai"
      tag: "latest" 
      modelURL: "openai/gpt-oss-20b"

      vllmConfig:
        v0: 0
        v1: 1
        maxModelLen: 32768
        tensorParallelSize: 2
        pipelineParallelSize: 2
        gpuMemoryUtilization: 0.9
        enablePrefixCaching: true
        enableChunkedPrefill: true
        extraArgs:
          - "--trust-remote-code"
          - "--disable-log-requests"
          - "--served-model-name"
          - "gpt-oss-20b"

      replicaCount: 1

      requestCPU: 16
      requestMemory: "32Gi"
      requestGPU: 2

      readinessProbe:
        httpGet:
          path: /health
          port: 8000
        failureThreshold: 10
        periodSeconds: 30

      raySpec:
        headNode:
          requestCPU: 16
          requestMemory: "32Gi"
          requestGPU: 2

      lmcacheConfig:
        enabled: false

      env:
        - name: PYTHONHASHSEED
          value: "0"
        - name: CUDA_LAUNCH_BLOCKING 
          value: "0"
        - name: LMCACHE_TRACK_USAGE
          value: "false"

routerSpec:
  replicaCount: 3
  repository: "lmcache/lmstack-router"
  tag: "0.1.8.dev19-g3db93b87f.d20251008"

Expected behavior

Users should be able to customize the readiness probe for the Ray head node in a multi-node setup.
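
One way the chart could support this is sketched below: honor an optional readinessProbe block under raySpec.headNode and fall back to the current defaults when it is absent. This is only an illustration, not the current template; the values key and the $modelSpec loop variable are assumptions about the chart's structure.

# Sketch: use raySpec.headNode.readinessProbe when provided, else keep the defaults
{{- $probe := $modelSpec.raySpec.headNode.readinessProbe }}
readinessProbe:
{{- if $probe }}
  {{- toYaml $probe | nindent 2 }}
{{- else }}
  httpGet:
    path: /health
    port: 8000
  failureThreshold: 1
  periodSeconds: 10
{{- end }}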

Additional context

No response
