
[Bug]: vLLM start failure leaves Requester stuck and Launcher with orphaned finalizer #444

@aavarghese

Description

Contact Details

No response

What happened?

Discovered during llm-d-benchmark runs with 20 iterations by @manoelmarques. More details below:

When a vLLM instance fails to start (e.g. due to insufficient GPU memory), FMA does not
detect the failure. The requester pod stays in Running 0/1 indefinitely waiting for a
vLLM that will never become ready, and the launcher pod that was assigned a dual label
ends up with a finalizer that blocks deletion — requiring manual cleanup to recover.

Observation

  • Launcher pod stays Running 2/2 but its vLLM process failed during startup
  • Requester pod stays Running 0/1 (never becomes Ready) — waits forever on the failed vLLM
  • Example failure from launcher logs:
    ValueError: Free memory on device cuda:0 (61.8/79.19 GiB) on startup is less than
    desired GPU memory utilization (0.95, 75.23 GiB). Decrease GPU memory utilization
    or reduce GPU memory used by other processes.
  • Benchmark times out waiting for the requester:
    Timed out waiting for requester fma-requester-* pods to become ready after 903.9 secs.
    RuntimeError: Unable to scale replicaset fma-requester-* to 1.
  • After teardown, the launcher pod that had a dual-pods.llm-d.ai/dual label set has a
    finalizer attached and cannot be deleted without manual intervention to remove the finalizer

Root cause

Once a launcher is assigned to a requester (dual label set + finalizer added), if vLLM fails
to load, FMA has no mechanism to:

  1. Detect the vLLM startup failure
  2. Unbind the launcher from the requester
  3. Assign a different (healthy) launcher to the requester

The finalizer is placed on the launcher pod when binding occurs. If vLLM then fails and FMA
"forgets" about the pod (e.g. because it was an ad-hoc launcher created on demand), the
finalizer prevents deletion and leaves a stuck pod that requires manual kubectl patch to remove.
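For reference, the manual cleanup currently required looks roughly like the following. The pod name is a placeholder, and clearing all finalizers this way bypasses whatever cleanup the finalizer was guarding, so it should only be used once the launcher is known to be orphaned:

```shell
# Inspect the stuck launcher pod's finalizers (pod name is a placeholder).
kubectl get pod <launcher-pod> -o jsonpath='{.metadata.finalizers}'

# Clear the finalizers so the pod can be deleted. This is the manual
# intervention described above; it forcibly unblocks deletion.
kubectl patch pod <launcher-pod> --type=merge -p '{"metadata":{"finalizers":null}}'
kubectl delete pod <launcher-pod>
```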

Expected behavior

If a vLLM inside a bound launcher fails to load (detectable via launcher admin API
/v2/vllm/instances returning no running instance, or via a timeout on the vLLM port):

  1. FMA should detect the vLLM load failure
  2. Unbind the launcher (remove dual label, remove finalizer)
  3. Either assign a different healthy launcher to the requester, or delete the requester pod
    so the ReplicaSet recreates it with a fresh binding attempt
  4. Mark the failed launcher as unhealthy so it is not reused
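The detection step above could be sketched as a small reconcile decision. This is only an illustration of the proposed logic, not FMA's actual controller code; the names (`LauncherState`, `reconcile`) and the startup deadline are hypothetical:

```python
from dataclasses import dataclass
from enum import Enum, auto


class Action(Enum):
    NONE = auto()
    # Remove the dual label and finalizer, then mark the launcher unhealthy
    # so it is not reused (steps 2 and 4 above).
    UNBIND_AND_QUARANTINE = auto()


@dataclass
class LauncherState:
    bound: bool                # dual-pods.llm-d.ai/dual label is set
    vllm_running: bool         # admin API /v2/vllm/instances reports a running instance
    seconds_since_bind: float  # elapsed time since the binding occurred


# Hypothetical startup deadline; a real implementation would make this configurable.
VLLM_STARTUP_TIMEOUT_S = 600.0


def reconcile(state: LauncherState) -> Action:
    """Decide whether a bound launcher must be unbound.

    A launcher is treated as failed when it is bound but no running vLLM
    instance is reported after the startup deadline has elapsed.
    """
    if not state.bound or state.vllm_running:
        return Action.NONE
    if state.seconds_since_bind < VLLM_STARTUP_TIMEOUT_S:
        return Action.NONE  # still within the startup grace period
    return Action.UNBIND_AND_QUARANTINE
```

A failed launcher would then trigger either rebinding the requester to a healthy launcher or deleting the requester pod so its ReplicaSet retries the binding.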

Notes

  • The problem is specific to launchers that were successfully bound (dual label + finalizer set)
    but whose vLLM subsequently fails to start
  • Occurs more frequently with ad-hoc launchers since GPU memory may already be partially
    consumed by other instances on the same node
  • Observed consistently at 20 benchmark iterations; rarely at 5–10 iterations

Version

v0.5.1-alpha.6

Branch name

No response

Commit SHA

No response

Relevant log output

Labels

bug: Something isn't working
