Describe the bug
I have a Kubernetes cluster running the RKE2 distribution. I'm trying to run GPT-OSS 20B in a multi-node setup (Ray head node + Ray worker node) with the following Helm values:
servingEngineSpec:
  runtimeClassName: "nvidia"
  modelSpec:
    - name: "gpt-oss-20b"
      repository: "vllm/vllm-openai"
      tag: "latest"
      modelURL: "openai/gpt-oss-20b"
      vllmConfig:
        v0: 0
        v1: 1
        maxModelLen: 32768
        tensorParallelSize: 2
        pipelineParallelSize: 2
        gpuMemoryUtilization: 0.9
        enablePrefixCaching: true
        enableChunkedPrefill: true
        extraArgs:
          - "--trust-remote-code"
          - "--disable-log-requests"
          - "--served-model-name"
          - "gpt-oss-20b"
      replicaCount: 1
      requestCPU: 16
      requestMemory: "32Gi"
      requestGPU: 2
      raySpec:
        headNode:
          requestCPU: 16
          requestMemory: "32Gi"
          requestGPU: 2
      lmcacheConfig:
        enabled: false
      env:
        - name: PYTHONHASHSEED
          value: "0"
        - name: CUDA_LAUNCH_BLOCKING
          value: "0"
        - name: LMCACHE_TRACK_USAGE
          value: "false"

routerSpec:
  replicaCount: 3
  repository: "lmcache/lmstack-router"
  tag: "0.1.8.dev19-g3db93b87f.d20251008"

However, when I deploy the Helm chart, my Ray worker pod starts to run, but my Ray head pod gives me the following message when I describe it:
Started container vllm-ray-head
Warning  Unhealthy  3s (x19 over 2m37s)  Readiness probe failed: Get "http://10.42.0.123:8000/health": dial tcp 10.42.0.123:8000: connect: connection refused
I did some investigation, and it looks like this issue could be related to the pod's readiness probe. The defaults for the head pod are failureThreshold: 1 and periodSeconds: 10, as you can see in the rendered Helm template below. As I understand it, this means the probe tolerates only a single failure before marking the pod as not ready, so no traffic is routed to it.
spec:
  terminationGracePeriodSeconds: 0
  containers:
    - name: vllm-ray-head
      image: "vllm/vllm-openai:latest"
      command:
        - >-
          /bin/bash -c "
          cp /entrypoint/vllm-entrypoint.sh \$HOME/vllm-entrypoint.sh &&
          chmod +x \$HOME/vllm-entrypoint.sh &&
          \$HOME/vllm-entrypoint.sh &
          echo \"Running vllm command in the background.\""
      env:
        - name: VLLM_HOST_IP
          valueFrom:
            fieldRef:
              fieldPath: status.podIP
        - name: EXPECTED_NODES
          value: "2"
        - name: HF_HOME
          value: /tmp
        - name: LMCACHE_LOG_LEVEL
          value: "DEBUG"
        - name: VLLM_USE_V1
          value: "1"
        - name: HF_TOKEN
          valueFrom:
            secretKeyRef:
              name: vllm-secrets
              key: hf_token_gpt-oss-20b
        - name: PYTHONHASHSEED
          value: "0"
        - name: CUDA_LAUNCH_BLOCKING
          value: "1"
        - name: LMCACHE_TRACK_USAGE
          value: "false"
      ports:
        - name: "container-port"
          containerPort: 8000
      readinessProbe:
        httpGet:
          path: /health
          port: 8000
        failureThreshold: 1
        periodSeconds: 10
      livenessProbe:
        exec:
          command: ["/bin/bash", "-c", "echo TBD"]
      resources:
        limits:
          cpu: 16
          memory: 32Gi
          nvidia.com/gpu: 2
      startupProbe:
        exec:
          command: ["/bin/bash", "-c", "python3 /scripts/wait_for_ray.py"]
        failureThreshold: 30
        periodSeconds: 15
        timeoutSeconds: 10

I'm trying to customize the readiness probe for the Ray head node, but I am not able to. I tried adding the readinessProbe parameters shown below under servingEngineSpec, under the modelSpec entry, and under raySpec.headNode, and regenerated the templates, but nothing works: every time I get the default values.
My setup takes a long time to deploy, so if the readiness probe starts running too early it is bound to fail. As a result, I can't tell whether the multi-node deployment is failing because of the readiness probe, and I am not able to customize the probe to test different values.
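What I'm effectively after is relaxing the probe on the head pod so it survives the long startup. A manual patch along these lines would probably work as a stopgap (the Deployment name is a placeholder for whatever the chart actually renders for the head node), but it would be overwritten on the next helm upgrade, which is why I'd like to set it through the values:

# Illustrative stopgap only: relax the readiness probe on the rendered head Deployment.
# <ray-head-deployment> is a placeholder; substitute the name the chart actually creates.
kubectl patch deployment <ray-head-deployment> --type='json' -p='[
  {"op": "replace", "path": "/spec/template/spec/containers/0/readinessProbe/failureThreshold", "value": 10},
  {"op": "replace", "path": "/spec/template/spec/containers/0/readinessProbe/periodSeconds", "value": 30}
]'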
To Reproduce
Generate templates from the following Helm values (three variants, identical except for where readinessProbe is placed; the command I used to inspect the output follows the third variant):

# Variant 1: readinessProbe under servingEngineSpec
servingEngineSpec:
  runtimeClassName: "nvidia"
  readinessProbe:
    httpGet:
      path: /health
      port: 8000
    failureThreshold: 10
    periodSeconds: 30
  modelSpec:
    - name: "gpt-oss-20b"
      repository: "vllm/vllm-openai"
      tag: "latest"
      modelURL: "openai/gpt-oss-20b"
      vllmConfig:
        v0: 0
        v1: 1
        maxModelLen: 32768
        tensorParallelSize: 2
        pipelineParallelSize: 2
        gpuMemoryUtilization: 0.9
        enablePrefixCaching: true
        enableChunkedPrefill: true
        extraArgs:
          - "--trust-remote-code"
          - "--disable-log-requests"
          - "--served-model-name"
          - "gpt-oss-20b"
      replicaCount: 1
      requestCPU: 16
      requestMemory: "32Gi"
      requestGPU: 2
      raySpec:
        headNode:
          requestCPU: 16
          requestMemory: "32Gi"
          requestGPU: 2
      lmcacheConfig:
        enabled: false
      env:
        - name: PYTHONHASHSEED
          value: "0"
        - name: CUDA_LAUNCH_BLOCKING
          value: "0"
        - name: LMCACHE_TRACK_USAGE
          value: "false"

routerSpec:
  replicaCount: 3
  repository: "lmcache/lmstack-router"
  tag: "0.1.8.dev19-g3db93b87f.d20251008"

# Variant 2: readinessProbe under raySpec.headNode
servingEngineSpec:
  runtimeClassName: "nvidia"
  modelSpec:
    - name: "gpt-oss-20b"
      repository: "vllm/vllm-openai"
      tag: "latest"
      modelURL: "openai/gpt-oss-20b"
      vllmConfig:
        v0: 0
        v1: 1
        maxModelLen: 32768
        tensorParallelSize: 2
        pipelineParallelSize: 2
        gpuMemoryUtilization: 0.9
        enablePrefixCaching: true
        enableChunkedPrefill: true
        extraArgs:
          - "--trust-remote-code"
          - "--disable-log-requests"
          - "--served-model-name"
          - "gpt-oss-20b"
      replicaCount: 1
      requestCPU: 16
      requestMemory: "32Gi"
      requestGPU: 2
      raySpec:
        headNode:
          requestCPU: 16
          requestMemory: "32Gi"
          requestGPU: 2
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            failureThreshold: 10
            periodSeconds: 30
      lmcacheConfig:
        enabled: false
      env:
        - name: PYTHONHASHSEED
          value: "0"
        - name: CUDA_LAUNCH_BLOCKING
          value: "0"
        - name: LMCACHE_TRACK_USAGE
          value: "false"

routerSpec:
  replicaCount: 3
  repository: "lmcache/lmstack-router"
  tag: "0.1.8.dev19-g3db93b87f.d20251008"

# Variant 3: readinessProbe under the modelSpec entry
servingEngineSpec:
  runtimeClassName: "nvidia"
  modelSpec:
    - name: "gpt-oss-20b"
      repository: "vllm/vllm-openai"
      tag: "latest"
      modelURL: "openai/gpt-oss-20b"
      vllmConfig:
        v0: 0
        v1: 1
        maxModelLen: 32768
        tensorParallelSize: 2
        pipelineParallelSize: 2
        gpuMemoryUtilization: 0.9
        enablePrefixCaching: true
        enableChunkedPrefill: true
        extraArgs:
          - "--trust-remote-code"
          - "--disable-log-requests"
          - "--served-model-name"
          - "gpt-oss-20b"
      replicaCount: 1
      requestCPU: 16
      requestMemory: "32Gi"
      requestGPU: 2
      readinessProbe:
        httpGet:
          path: /health
          port: 8000
        failureThreshold: 10
        periodSeconds: 30
      raySpec:
        headNode:
          requestCPU: 16
          requestMemory: "32Gi"
          requestGPU: 2
      lmcacheConfig:
        enabled: false
      env:
        - name: PYTHONHASHSEED
          value: "0"
        - name: CUDA_LAUNCH_BLOCKING
          value: "0"
        - name: LMCACHE_TRACK_USAGE
          value: "false"

routerSpec:
  replicaCount: 3
  repository: "lmcache/lmstack-router"
  tag: "0.1.8.dev19-g3db93b87f.d20251008"
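In all three cases, the regenerated manifests still show the default probe. I inspected the output with something along these lines (release name, chart path, and values file are placeholders for my actual setup):

# Placeholders: adjust release name, chart path, and values file to the real setup.
helm template vllm ./helm -f values.yaml | grep -B 2 -A 6 "readinessProbe:"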
Expected behavior
The user should be able to customize the readiness probe for the Ray head node in a multi-node setup.
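For illustration only, a fallback along these lines in the head-node container template would cover it. The value path ($modelSpec.raySpec.headNode.readinessProbe), the assumption that raySpec.headNode is always set, and the indentation level are guesses on my part, not the chart's actual internals:

# Sketch only; variable names and indentation are assumed, not taken from the real template.
readinessProbe:
  {{- if $modelSpec.raySpec.headNode.readinessProbe }}
  {{- toYaml $modelSpec.raySpec.headNode.readinessProbe | nindent 10 }}
  {{- else }}
  httpGet:
    path: /health
    port: 8000
  failureThreshold: 1
  periodSeconds: 10
  {{- end }}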
Additional context
No response