Create launcher pod in bound state#443

Open
waltforme wants to merge 11 commits intollm-d-incubation:mainfrom
waltforme:create-in-bound

Conversation

@waltforme
Collaborator

Fixes #421

Copilot AI review requested due to automatic review settings April 21, 2026 13:13
"k8s.io/apimachinery/pkg/util/sets"
k8svalidation "k8s.io/apimachinery/pkg/util/validation"
"k8s.io/apimachinery/pkg/util/strategicpatch"
k8svalidation "k8s.io/apimachinery/pkg/util/validation"
Collaborator Author

@waltforme waltforme Apr 21, 2026

Some linter suggested this reordering.

Collaborator

I am surprised that this disorder even got merged. I thought that we had stuff both preventing and checking that. Worth an Issue.

If this fix were a separate PR then I would have already merged it.

Collaborator

I opened #447 for the tooling issue.

Contributor

Copilot AI left a comment

Pull request overview

Fixes a race in the dual-pods controller where newly created launcher Pods could be temporarily “unbound” and therefore eligible for deletion by the launcher-populator before the vLLM instance is created/bound.

Changes:

  • Create launcher Pods with the requester annotation (and provider finalizer) already set, so they’re considered bound immediately.
  • Add reconciliation logic to detect “pre-bound but not yet fully bound” launcher Pods (no dual label, no instance ID) and create/ensure the named vLLM instance before calling bind().
  • Refactor instance-ensure logic into a helper (ensureNamedLauncherInstance).
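To make the first bullet concrete, here is a minimal sketch of what "created in bound state" means at the metadata level. All names here (the annotation key value, the finalizer value, `podMeta`, `newBoundLauncherMeta`) are illustrative stand-ins, not the controller's actual constants or helpers:

```go
package main

import "fmt"

// Hypothetical keys; the real controller defines its own constants.
const (
	requesterAnnotationKey = "dual-pods.llm-d.ai/requester"
	providerFinalizer      = "dual-pods.llm-d.ai/provider"
)

type podMeta struct {
	Annotations map[string]string
	Labels      map[string]string
	Finalizers  []string
}

// newBoundLauncherMeta builds launcher metadata so the Pod is "bound" from
// the moment the apiserver create call succeeds: the requester annotation and
// the provider finalizer are present in the create request itself, leaving no
// window in which the launcher-populator could see an unbound launcher and
// delete it.
func newBoundLauncherMeta(requesterUID, requesterName string) podMeta {
	return podMeta{
		Annotations: map[string]string{
			requesterAnnotationKey: requesterUID + " " + requesterName,
		},
		Labels:     map[string]string{},
		Finalizers: []string{providerFinalizer},
	}
}

func main() {
	m := newBoundLauncherMeta("uid-123", "req-pod")
	fmt.Println(m.Annotations[requesterAnnotationKey])
}
```

The point of the design is ordering: binding happens atomically with creation, rather than in a later Update that the populator could race against.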

Comment thread pkg/controller/dual-pods/inference-server.go Outdated
Comment thread pkg/controller/dual-pods/inference-server.go Outdated
Comment thread pkg/controller/dual-pods/inference-server.go Outdated
Comment thread pkg/controller/dual-pods/inference-server.go
Collaborator

@MikeSpreitzer MikeSpreitzer left a comment

I left some independent comments.

@MikeSpreitzer
Collaborator

/ok-to-test

@github-actions

🚀 E2E tests triggered by /ok-to-test

View the OpenShift E2E workflow run

Comment thread pkg/controller/dual-pods/inference-server.go Outdated
Comment thread pkg/controller/dual-pods/inference-server.go Outdated
Comment thread pkg/controller/dual-pods/inference-server.go Outdated
@MikeSpreitzer
Collaborator

MikeSpreitzer commented Apr 24, 2026

On further thought, I think that the problem is a little bigger than fixing #421, and the solution needs to address the whole problem.

Regarding the problem: note that the dual-pods controller can crash right after waking (

    err := ctl.wakeupInstance(ctx, lClient, iscHash, isc.Spec.ModelServerConfig.Port)

) or creating (

    result, err := lClient.CreateNamedInstance(ctx, iscHash, *cfg)

) a vllm instance in a launcher --- in which case the following call to ctl.bind does not happen (before the crash).

Note also this unwritten invariant, designed in milestone 2 and built into the definition of launcher population control: vllm awake implies provider Pod is bound. Put another way: in a wake-up scenario the provider Pod is first bound and then the vllm instance is woken up, and in a go-to-sleep scenario the vllm instance is first put to sleep and then the provider Pod is unbound. But the current milestone 3 code violates this invariant in the cases where a suitable unbound launcher is sought (

    launcherPod, hasSleepingInstance, someNotReady, err := ctl.selectBestLauncherPod(ctx, launcherPodAnys, iscHash, desiredPort, int(lc.Spec.MaxSleepingInstances), nodeDat)

) and found.

Let me make a couple of observations while warming up to the solution.

Note that in

if launcherBased && serverDat.InstanceID != "" && providingPod.Status.PodIP != "" {
, the && providingPod.Status.PodIP != "" is unnecessary, because serverDat.InstanceID != "" implies providingPod.Status.PodIP != "" (in the current code). BTW, it would be nice to have invariants like that written in the code; in this case, a comment somewhere in the definition of serverData. Also, this particular invariant might go away, because of the following.

Note that the vllm instance ID is known as early as it is computed --- at

cfg, iscHash, err := ctl.configInferenceServer(isc, serverDat.GPUIDs)
, which precedes the calls to bind.

So here are my current thoughts about the solution.

  1. Change serverDat.InstanceID so that it no longer conveys information about vllm instance existence/discovery; it only holds the instance ID returned as iscHash at
    cfg, iscHash, err := ctl.configInferenceServer(isc, serverDat.GPUIDs)
  2. Add a field to serverData, meaningful only when launcher based, InstanceExists *bool. It is given a non-nil value after the instance sync call that is currently at
    syncResult, err, retry := ctl.syncLauncherInstances(ctx, nodeDat, providingPod)
    but: that sync will be done regardless of serverDat.InstanceID and serverDat.InstanceExists.
  3. Factor bind into two parts. The first part just updates the providerPod in-memory object, and the second part does most of the rest. The first part is called before the API call to create the launcher. In the case of waking or creating a vllm instance in a previously-unbound launcher (

        if hasSleepingInstance {
            // Fast path: wake up existing sleeping instance
            logger.V(5).Info("Waking up existing vLLM instance", "iscHash", iscHash)
            err := ctl.wakeupInstance(ctx, lClient, iscHash, isc.Spec.ModelServerConfig.Port)
            if err != nil {
                return fmt.Errorf("wake up vLLM instance: %w", err), true
            }
            launcherDat.Instances[iscHash] = time.Now()
            // TODO(waltforme): the bind method may need more revision to fully handle launcher-based server providing Pods
            return ctl.bind(ctx, serverDat, requestingPod, launcherPod, &iscHash, int16(isc.Spec.ModelServerConfig.Port), isc.Spec.ModelServerConfig.Labels, isc.Spec.ModelServerConfig.Annotations)
        } else {
            // Slower path: create new instance in launcher with capacity
            logger.V(5).Info("Creating new vLLM instance", "iscHash", iscHash)
            result, err := lClient.CreateNamedInstance(ctx, iscHash, *cfg)
            if err != nil {
                return fmt.Errorf("create vLLM instance: %w", err), true
            }
            logger.V(5).Info("Created new vLLM instance",
                "instance_id", result.InstanceID,
                "status", result.Status,
            )
            launcherDat.Instances[iscHash] = time.Now()
            // TODO(waltforme): the bind method may need more revision to fully handle launcher-based server providing Pods
            return ctl.bind(ctx, serverDat, requestingPod, launcherPod, &iscHash, int16(isc.Spec.ModelServerConfig.Port), isc.Spec.ModelServerConfig.Labels, isc.Spec.ModelServerConfig.Annotations)
        }

    ), the first part is called and then the kube apiserver call is made to Update the launcher Pod; once that Update returns, this invocation of infSvrItem.process completes (no more wake or create a vllm instance here) successfully, relying on the notification of the update to trigger the remaining processing. That will get to

        if launcherBased && serverDat.InstanceID != "" && providingPod.Status.PodIP != "" {

    and the call to ctl.syncLauncherInstances. After the sync, we need handling for the new possibility of a bound launcher with the vllm instance not existing. Here is where the wake or create logic goes. Around here is where the remainder of bind goes.
  4. Here is a separate problem that is noticeable here but pre-existing and could be addressed in a follow-on PR: while a vllm instance starts up, the dual-pods controller should poll for readiness (which it effectively does by querying /is_sleeping) at something like every 5 seconds, rather than go through an exponential backoff that starts much shorter and grows to something much longer.
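Points 1–3 above can be condensed into a small sketch. Everything here is an illustrative stand-in (the field `InstanceExists`, the helpers `bindMetadata` and `afterSync` are not the PR's actual API), assuming the controller re-processes on every update notification:

```go
package main

import "fmt"

// serverData sketch per points 1 and 2: InstanceID is just the computed
// iscHash and makes no claim about existence; InstanceExists is nil until a
// syncLauncherInstances call observes the launcher's actual state, and that
// sync runs regardless of either field.
type serverData struct {
	InstanceID     string // iscHash from configInferenceServer
	InstanceExists *bool  // nil = unknown (not yet synced)
}

// bindMetadata is the first half of the factored bind (point 3): it only
// mutates in-memory Pod metadata, so it can run before the apiserver Update
// call that makes the binding durable.
func bindMetadata(annotations map[string]string, requesterKey, requesterVal string) {
	annotations[requesterKey] = requesterVal
}

// afterSync records what the sync observed and decides the remaining work:
// wake/create only when the instance is known not to exist, otherwise finish
// the second half of bind.
func afterSync(d *serverData, exists bool) string {
	d.InstanceExists = &exists
	if exists {
		return "finish-bind"
	}
	return "create-or-wake"
}

func main() {
	d := serverData{InstanceID: "isc-hash"}
	fmt.Println(afterSync(&d, false))
}
```

The crash-safety property this buys: because the durable bind (the Pod Update) happens before any wake/create call, a crash between the two leaves a bound launcher whose missing instance is rediscovered and created on the next sync, instead of an awake instance in an unbound (deletable) launcher.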

Signed-off-by: Jun Duan <jun.duan.phd@outlook.com>
- address 409 when ensuring instance inside a created-as-bound launcher
- skip waking up a freshly created instance
- don't rely on the value of the dual label

Signed-off-by: Jun Duan <jun.duan.phd@outlook.com>
…e stage

Signed-off-by: Jun Duan <jun.duan.phd@outlook.com>
@waltforme
Collaborator Author

The force-push to 83a5514 was a rebase onto main due to small and nonfunctional conflicts introduced by the merge of #453.

Comment thread pkg/controller/dual-pods/inference-server.go
Comment on lines +717 to +718
desiredLauncherPod.Annotations = utils.MapSet(desiredLauncherPod.Annotations, requesterAnnotationKey, string(requestingPod.UID)+" "+requestingPod.Name)
desiredLauncherPod.Labels = utils.MapSet(desiredLauncherPod.Labels, api.DualLabelName, requestingPod.Name)
Collaborator

FYI, BuildLauncherPodFromTemplate ensures that desiredLauncherPod.Annotations and desiredLauncherPod.Labels are not nil, so these statements can use bare indexing instead of utils.MapSet.
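For illustration, the only thing a nil-tolerant setter adds over bare indexing is the allocation check; once the template builder guarantees non-nil maps, bare indexing is enough. This `mapSet` is a stand-in sketch, not the actual `utils.MapSet`:

```go
package main

import "fmt"

// mapSet mirrors what a helper like utils.MapSet must do when the map may be
// nil: allocate before writing, then return the (possibly new) map.
func mapSet(m map[string]string, k, v string) map[string]string {
	if m == nil {
		m = map[string]string{}
	}
	m[k] = v
	return m
}

func main() {
	// When the builder guarantees non-nil maps, bare indexing suffices:
	annotations := map[string]string{} // non-nil by construction
	annotations["requester"] = "uid name"
	fmt.Println(annotations["requester"])

	// The helper is only needed when nil is possible (writing to a nil map panics):
	var maybeNil map[string]string
	maybeNil = mapSet(maybeNil, "k", "v")
	fmt.Println(maybeNil["k"])
}
```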

Signed-off-by: Jun Duan <jun.duan.phd@outlook.com>
Signed-off-by: Jun Duan <jun.duan.phd@outlook.com>
@waltforme
Collaborator Author

waltforme commented Apr 24, 2026

Tried to implement the idea from this comment in 045535f.

Also created #455 for point 4 of the comment.


// IsInstanceAlreadyExistsError returns true when the launcher reports that the
// instance already exists (HTTP 409 Conflict).
func IsInstanceAlreadyExistsError(err error) bool {
Collaborator

@MikeSpreitzer MikeSpreitzer Apr 24, 2026

Technically this func only determines whether the error is an HTTP 409 Conflict, and it is the caller that makes the deduction that a 409 implies that the vllm instance already exists.

Collaborator Author

Revised.


_, deletedStopped := syncResult.deletedStoppedInstanceIDs[serverDat.InstanceID]
if deletedStopped || !instancePresent {
if serverDat.InstanceExists != nil && *serverDat.InstanceExists {
Collaborator

@MikeSpreitzer MikeSpreitzer Apr 24, 2026

To properly resolve the create/delete ambiguity (is this instance absent because it wasn't created yet, or because it was created and then deleted?), we cannot rely on the controller's memory --- the controller can crash and restart at any moment. We need to orchestrate clarity through the kube API objects. Following is the first idea that occurs to me. When reacting to a vllm instance being stopped, this controller should first delete the server-requesting Pod and only then command deletion of the vllm instance.

This can be addressed in a follow-on PR if you prefer.

Unless and until the above is addressed, should the condition in this if statement start with deletedStopped || ?

Collaborator Author

I tried to address this within this PR, using the suggested approach.

return fmt.Errorf("failed to delete server-requesting Pod for stopped instance: %w", err), true
// InstanceExists is nil (unknown) — instance hasn't been created yet
// (bind-first path) or controller restarted and lost tracking.
// ensureNamedLauncherInstance is idempotent: GET first, create if not found.
Collaborator

@MikeSpreitzer MikeSpreitzer Apr 24, 2026

That is no longer necessary. We just did ctl.syncLauncherInstances above, so we know that this controller has not commanded creation of the instance. It is OK to unconditionally command creation of the instance here. In the very rare circumstance that something else concurrently commanded creation of the instance: returning an (error, true) will be adequate handling; alternatively, I think that it might be OK to assume that the concurrent thing was another copy of this controller and the concurrently-created instance is just what this copy wanted to happen.
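The two alternatives can be sketched in one function; `ensureAfterSync`, `createFn`, and `errConflict` are hypothetical names, with the create call standing in for `lClient.CreateNamedInstance`:

```go
package main

import (
	"errors"
	"fmt"
)

// errConflict stands in for the launcher client's 409 error.
var errConflict = errors.New("409 conflict")

// createFn stands in for the launcher client's create-instance call.
type createFn func(id string) error

// ensureAfterSync sketches the simplification: since a syncLauncherInstances
// call just showed the instance absent, create it unconditionally. On the
// rare concurrent 409, either surface (err, retry=true), or treat the
// instance as already in the desired state (tolerateConflict).
func ensureAfterSync(create createFn, id string, tolerateConflict bool) (error, bool) {
	err := create(id)
	if err == nil {
		return nil, false
	}
	if errors.Is(err, errConflict) && tolerateConflict {
		// Assume another copy of this controller created exactly the
		// instance this copy wanted.
		return nil, false
	}
	return fmt.Errorf("create vLLM instance: %w", err), true
}

func main() {
	ok := createFn(func(string) error { return nil })
	conflict := createFn(func(string) error { return errConflict })
	err, retry := ensureAfterSync(ok, "isc-hash", false)
	fmt.Println(err == nil, retry)
	err, retry = ensureAfterSync(conflict, "isc-hash", true)
	fmt.Println(err == nil, retry)
}
```

Either branch removes the GET-before-create round trip; the choice is only about how to classify the rare concurrent 409.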

Collaborator Author

Changed.

}

-cfg, iscHash, err := ctl.configInferenceServer(isc, serverDat.GPUIDs)
+_, iscHash, err := ctl.configInferenceServer(isc, serverDat.GPUIDs)
Collaborator

@MikeSpreitzer MikeSpreitzer Apr 24, 2026

Call this once, right after setting serverDat.GPUIDs, and be done with it. Save both outputs in serverDat.

Collaborator Author

Removed the redundancy.

Signed-off-by: Jun Duan <jun.duan.phd@outlook.com>
}
logger.V(5).Info("Ensured vLLM instance", "instance_id", result.InstanceID, "status", result.Status)
// If ISC tracking annotations are missing (pre-bound pod), complete the bind metadata.
if _, bound := providingPod.Annotations[iscLabelKeysAnnotationKey]; !bound {
Collaborator

bound is not the right name for the condition here. The launcher is certainly bound. The question is whether the ISC labels and annotations have been propagated yet.

Collaborator Author

Renamed.

Collaborator

@MikeSpreitzer MikeSpreitzer left a comment

I left some individual comments.

Signed-off-by: Jun Duan <jun.duan.phd@outlook.com>
Signed-off-by: Jun Duan <jun.duan.phd@outlook.com>
Signed-off-by: Jun Duan <jun.duan.phd@outlook.com>
Signed-off-by: Jun Duan <jun.duan.phd@outlook.com>
@waltforme
Collaborator Author

/ok-to-test

@github-actions

🚀 E2E tests triggered by /ok-to-test

View the OpenShift E2E workflow run

Signed-off-by: Jun Duan <jun.duan.phd@outlook.com>
@waltforme
Collaborator Author

/ok-to-test

@github-actions

🚀 E2E tests triggered by /ok-to-test

View the OpenShift E2E workflow run

Development

Successfully merging this pull request may close these issues.

[Bug]: dual-pods controller should create launcher in bound state

3 participants