Create launcher pod in bound state #443
waltforme wants to merge 11 commits into llm-d-incubation:main
Conversation
| "k8s.io/apimachinery/pkg/util/sets" | ||
| k8svalidation "k8s.io/apimachinery/pkg/util/validation" | ||
| "k8s.io/apimachinery/pkg/util/strategicpatch" | ||
| k8svalidation "k8s.io/apimachinery/pkg/util/validation" |
Some linter suggested this reordering.
I am surprised that this disorder even got merged. I thought that we had stuff both preventing and checking that. Worth an Issue.
If this fix were a separate PR then I would have already merged it.
Pull request overview
Fixes a race in the dual-pods controller where newly created launcher Pods could be temporarily “unbound” and therefore eligible for deletion by the launcher-populator before the vLLM instance is created/bound.
Changes:
- Create launcher Pods with the requester annotation (and provider finalizer) already set, so they’re considered bound immediately.
- Add reconciliation logic to detect "pre-bound but not yet fully bound" launcher Pods (no dual label, no instance ID) and create/ensure the named vLLM instance before calling bind().
- Refactor instance-ensure logic into a helper (ensureNamedLauncherInstance).
MikeSpreitzer
left a comment
I left some independent comments.
/ok-to-test
🚀 E2E tests triggered by /ok-to-test
On further thought, I think that the problem is a little bigger than fixing #421, and the solution needs to address the whole problem. Regarding the problem: note that the dual-pods controller can crash right after waking or creating a vllm instance in a launcher, in which case the following call to ctl.bind does not happen (before the crash).
Note also this unwritten invariant, designed in milestone 2 and built into the definition of launcher population control: vllm awake implies provider Pod is bound. Put another way: in a wake-up scenario the provider Pod is first bound and then the vllm instance is woken up, and in a go-to-sleep scenario the vllm instance is first put to sleep and then the provider Pod is unbound. But the current milestone 3 code violates this invariant in the cases where a suitable unbound launcher is sought and found.
Let me make a couple of observations while warming up to the solution. Note that the condition && providingPod.Status.PodIP != "" is unnecessary, because serverDat.InstanceID != "" implies providingPod.Status.PodIP != "" (in the current code). BTW, it would be nice to have invariants like that written in the code; in this case, a comment somewhere in the definition of serverData. Also, this particular invariant might go away, because of the following.
Note that the vllm instance ID is known as early as it is computed, which precedes the calls to bind.
So here are my current thoughts about the solution.
Signed-off-by: Jun Duan <jun.duan.phd@outlook.com>
- address 409 when ensuring instance inside a created-as-bound launcher
- skip waking up a freshly created instance
- don't rely on the value of the dual label

Signed-off-by: Jun Duan <jun.duan.phd@outlook.com>
…e stage

Signed-off-by: Jun Duan <jun.duan.phd@outlook.com>
8bf5078 to
83a5514
desiredLauncherPod.Annotations = utils.MapSet(desiredLauncherPod.Annotations, requesterAnnotationKey, string(requestingPod.UID)+" "+requestingPod.Name)
desiredLauncherPod.Labels = utils.MapSet(desiredLauncherPod.Labels, api.DualLabelName, requestingPod.Name)
FYI, BuildLauncherPodFromTemplate ensures that desiredLauncherPod.Annotations and desiredLauncherPod.Labels are not nil, so these statements can use bare indexing instead of utils.MapSet.
Signed-off-by: Jun Duan <jun.duan.phd@outlook.com>
Signed-off-by: Jun Duan <jun.duan.phd@outlook.com>
Tried to implement the idea from this comment by 045535f. Also created #455 for point 4 of the comment.
// IsInstanceAlreadyExistsError returns true when the launcher reports that the
// instance already exists (HTTP 409 Conflict).
func IsInstanceAlreadyExistsError(err error) bool {
Technically this func only determines whether the error is an HTTP 409 Conflict, and it is the caller that makes the deduction that a 409 implies that the vllm instance already exists.
_, deletedStopped := syncResult.deletedStoppedInstanceIDs[serverDat.InstanceID]
if deletedStopped || !instancePresent {
if serverDat.InstanceExists != nil && *serverDat.InstanceExists {
To properly resolve the create/delete ambiguity (is this instance absent because it wasn't created yet, or because it was created and then deleted?), we cannot rely on the controller's memory: the controller can crash and restart at any moment. We need to orchestrate clarity through the kube API objects. The following is the first idea that occurs to me. When reacting to a vllm instance being stopped, this controller should first relay that by deleting the server-requesting Pod, and only then command deletion of the vllm instance.
This can be addressed in a follow-on PR if you prefer.
Unless and until the above is addressed, should the condition in this if statement start with deletedStopped || ?
I tried to address this within this PR, using the suggested approach.
return fmt.Errorf("failed to delete server-requesting Pod for stopped instance: %w", err), true
// InstanceExists is nil (unknown) — instance hasn't been created yet
// (bind-first path) or controller restarted and lost tracking.
// ensureNamedLauncherInstance is idempotent: GET first, create if not found.
That is no longer necessary. We just did ctl.syncLauncherInstances above, so we know that this controller has not commanded creation of the instance. It is OK to unconditionally command creation of the instance here. In the very rare circumstance that something else concurrently commanded creation of the instance: returning (error, true) would be adequate handling; alternatively, I think that it might be OK to assume that the concurrent thing was another copy of this controller, and the concurrently-created instance is just what this copy wanted to happen.
}
cfg, iscHash, err := ctl.configInferenceServer(isc, serverDat.GPUIDs)
_, iscHash, err := ctl.configInferenceServer(isc, serverDat.GPUIDs)
Call this once, right after setting serverDat.GPUIDs, and be done with it. Save both outputs in serverDat.
Removed the redundancy.
Signed-off-by: Jun Duan <jun.duan.phd@outlook.com>
}
logger.V(5).Info("Ensured vLLM instance", "instance_id", result.InstanceID, "status", result.Status)
// If ISC tracking annotations are missing (pre-bound pod), complete the bind metadata.
if _, bound := providingPod.Annotations[iscLabelKeysAnnotationKey]; !bound {
bound is not the right name for the condition here. The launcher is certainly bound. The question is whether the ISC labels and annotations have been propagated yet.
MikeSpreitzer
left a comment
I left some individual comments.
Signed-off-by: Jun Duan <jun.duan.phd@outlook.com>
Signed-off-by: Jun Duan <jun.duan.phd@outlook.com>
Signed-off-by: Jun Duan <jun.duan.phd@outlook.com>
Signed-off-by: Jun Duan <jun.duan.phd@outlook.com>
/ok-to-test
🚀 E2E tests triggered by /ok-to-test
Signed-off-by: Jun Duan <jun.duan.phd@outlook.com>
/ok-to-test
🚀 E2E tests triggered by /ok-to-test
Fixes #421