Contact Details
No response
What happened?
Discovered during llm-d-benchmark runs with 20 iterations by @manoelmarques. More details below:
When a vLLM instance fails to start (e.g. due to insufficient GPU memory), FMA does not
detect the failure. The requester pod stays in `Running 0/1` indefinitely, waiting for a
vLLM that will never become ready, and the launcher pod that was assigned a `dual` label
ends up with a finalizer that blocks deletion — requiring manual cleanup to recover.
Observation
- Launcher pod stays `Running 2/2` but its vLLM process failed during startup
- Requester pod stays `Running 0/1` (never becomes Ready) — waits forever on the failed vLLM
- Example failure from launcher logs:
  ```
  ValueError: Free memory on device cuda:0 (61.8/79.19 GiB) on startup is less than
  desired GPU memory utilization (0.95, 75.23 GiB). Decrease GPU memory utilization
  or reduce GPU memory used by other processes.
  ```
- Benchmark times out waiting for the requester:
  ```
  Timed out waiting for requester fma-requester-* pods to become ready after 903.9 secs.
  RuntimeError: Unable to scale replicaset fma-requester-* to 1.
  ```
- After teardown, the launcher pod that had the `dual-pods.llm-d.ai/dual` label set has a
  finalizer attached and cannot be deleted without manual intervention to remove the finalizer
Root cause
Once a launcher is assigned to a requester (dual label set + finalizer added), if vLLM fails
to load, FMA has no mechanism to:
- Detect the vLLM startup failure
- Unbind the launcher from the requester
- Assign a different (healthy) launcher to the requester
The finalizer is placed on the launcher pod when binding occurs. If vLLM then fails and FMA
"forgets" about the pod (e.g. because it was an ad-hoc launcher created on demand), the
finalizer prevents deletion and leaves a stuck pod that requires a manual `kubectl patch` to remove.
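For reference, the manual recovery amounts to clearing the pod's finalizer list with a merge patch. A minimal sketch of the patch body (the pod name below is a placeholder, not taken from the logs):

```python
import json

def finalizer_clear_patch():
    """Merge-patch body that removes all finalizers from a pod."""
    return {"metadata": {"finalizers": None}}

# Equivalent one-off command (pod name is a placeholder):
#   kubectl patch pod <stuck-launcher> --type merge -p '{"metadata": {"finalizers": null}}'
print(json.dumps(finalizer_clear_patch()))
```

Clearing finalizers this way bypasses whatever cleanup the finalizer was guarding, which is why it should be a last resort rather than the documented recovery path.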
Expected behavior
If a vLLM inside a bound launcher fails to load (detectable via the launcher admin API
`/v2/vllm/instances` returning no running instance, or via a timeout on the vLLM port):
- FMA should detect the vLLM load failure
- Unbind the launcher (remove the `dual` label, remove the finalizer)
- Either assign a different healthy launcher to the requester, or delete the requester pod
so the ReplicaSet recreates it with a fresh binding attempt
- Mark the failed launcher as unhealthy so it is not reused
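The detection step above could be sketched as a pure decision function polled against the admin API. This is only an illustration of the expected logic, not FMA code; the `state` field names and the timeout default are assumptions, and the real `/v2/vllm/instances` response schema may differ:

```python
def should_unbind(instances, elapsed_s, startup_timeout_s=900.0):
    """Decide whether FMA should unbind a launcher whose vLLM may have failed.

    `instances` is the parsed list returned by the launcher admin API
    /v2/vllm/instances; the "state" values used here are assumptions.
    """
    if any(i.get("state") == "running" for i in instances):
        return False  # healthy vLLM: keep the binding
    if any(i.get("state") == "failed" for i in instances):
        return True   # explicit startup failure (e.g. GPU memory exhausted)
    return elapsed_s >= startup_timeout_s  # no instance reported: time out

# An explicit startup failure should trigger unbinding immediately:
print(should_unbind([{"state": "failed"}], elapsed_s=5.0))  # True
```

Keying the timeout branch off elapsed time (rather than only the failure state) also covers the case where the vLLM process dies before the admin API ever registers an instance.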
Notes
- The problem is specific to launchers that were successfully bound (dual label + finalizer set)
but whose vLLM subsequently fails to start
- Occurs more frequently with ad-hoc launchers since GPU memory may already be partially
consumed by other instances on the same node
- Observed consistently at 20 benchmark iterations; rarely at 5–10 iterations
Version
v0.5.1-alpha.6
Branch name
No response
Commit SHA
No response
Relevant log output