Contact Details
No response
What happened?
Discovered during llm-d-benchmark runs with 20 iterations by @manoelmarques. More details below:
When a vLLM instance fails to start (e.g. due to insufficient GPU memory), FMA does not
detect the failure. The requester pod stays in `Running 0/1` indefinitely, waiting for a
vLLM that will never become ready, and the launcher pod that was assigned a `dual` label
ends up with a finalizer that blocks deletion — requiring manual cleanup to recover.
Observation
- Launcher pod stays `Running 2/2` but its vLLM process failed during startup
- Requester pod stays `Running 0/1` (never becomes Ready) — waits forever on the failed vLLM
- Example failure from launcher logs:
  ```
  ValueError: Free memory on device cuda:0 (61.8/79.19 GiB) on startup is less than
  desired GPU memory utilization (0.95, 75.23 GiB). Decrease GPU memory utilization
  or reduce GPU memory used by other processes.
  ```
- Benchmark times out waiting for the requester:
  ```
  Timed out waiting for requester fma-requester-* pods to become ready after 903.9 secs.
  RuntimeError: Unable to scale replicaset fma-requester-* to 1.
  ```
- After teardown, the launcher pod that had the `dual-pods.llm-d.ai/dual` label set has a
  finalizer attached and cannot be deleted without manual intervention to remove the finalizer
Root cause
Once a launcher is assigned to a requester (dual label set + finalizer added), if vLLM fails
to load, FMA has no mechanism to:
- Detect the vLLM startup failure
- Unbind the launcher from the requester
- Assign a different (healthy) launcher to the requester
The finalizer is placed on the launcher pod when binding occurs. If vLLM then fails and FMA
"forgets" about the pod (e.g. because it was an ad-hoc launcher created on demand), the
finalizer prevents deletion and leaves a stuck pod that requires a manual `kubectl patch` to remove.
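For reference, the manual recovery amounts to clearing the pod's finalizer list with a merge patch. A minimal sketch of the patch body (the pod name below is a placeholder, not taken from the logs):

```python
import json

def finalizer_clear_patch():
    """Merge-patch body that removes all finalizers from a pod."""
    return {"metadata": {"finalizers": None}}

# Equivalent one-off command (pod name is a placeholder):
#   kubectl patch pod <stuck-launcher> --type merge -p '{"metadata": {"finalizers": null}}'
print(json.dumps(finalizer_clear_patch()))
```

Clearing finalizers this way bypasses whatever cleanup the finalizer was guarding, which is why it should be a last resort rather than the documented recovery path.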
Expected behavior
If a vLLM inside a bound launcher fails to load (detectable via the launcher admin API
`/v2/vllm/instances` returning no running instance, or via a timeout on the vLLM port):
- FMA should detect the vLLM load failure
- Unbind the launcher (remove the `dual` label, remove the finalizer)
- Either assign a different healthy launcher to the requester, or delete the requester pod
so the ReplicaSet recreates it with a fresh binding attempt
- Mark the failed launcher as unhealthy so it is not reused
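The detection step above could be sketched as a pure decision function polled against the admin API. This is only an illustration of the expected logic, not FMA code; the `state` field names and the timeout default are assumptions, and the real `/v2/vllm/instances` response schema may differ:

```python
def should_unbind(instances, elapsed_s, startup_timeout_s=900.0):
    """Decide whether FMA should unbind a launcher whose vLLM may have failed.

    `instances` is the parsed list returned by the launcher admin API
    /v2/vllm/instances; the "state" values used here are assumptions.
    """
    if any(i.get("state") == "running" for i in instances):
        return False  # healthy vLLM: keep the binding
    if any(i.get("state") == "failed" for i in instances):
        return True   # explicit startup failure (e.g. GPU memory exhausted)
    return elapsed_s >= startup_timeout_s  # no instance reported: time out

# An explicit startup failure should trigger unbinding immediately:
print(should_unbind([{"state": "failed"}], elapsed_s=5.0))  # True
```

Keying the timeout branch off elapsed time (rather than only the failure state) also covers the case where the vLLM process dies before the admin API ever registers an instance.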
Notes
- The problem is specific to launchers that were successfully bound (dual label + finalizer set)
but whose vLLM subsequently fails to start
- Occurs more frequently with ad-hoc launchers since GPU memory may already be partially
consumed by other instances on the same node
- Observed consistently at 20 benchmark iterations; rarely at 5–10 iterations
Version
v0.5.1-alpha.6
Branch name
No response
Commit SHA
No response
Relevant log output