
Sync unbound launcher-based server-providing pods #362

Merged

MikeSpreitzer merged 9 commits into llm-d-incubation:main from waltforme:sync-unbound on Mar 25, 2026

Conversation

@waltforme
Collaborator

This PR brings unbound launcher-based server-providing pods into the dual-pods controller's node-local reconciliation flow, so that launcher readiness, deletion, and restart recovery update internal instance state and re-drive affected requesters.

This PR also extends launcher-based e2e coverage.

Copilot AI review requested due to automatic review settings March 17, 2026 22:28
Contributor

Copilot AI left a comment


Pull request overview

This PR brings unbound launcher-based server-providing pods into the dual-pods controller’s node-local reconciliation so the controller can resync launcher instance state (e.g., after restart) and re-drive requesters affected by launcher lifecycle events. It also extends the launcher-based E2E suite to cover controller restart recovery and unbound launcher deletion cleanup.

Changes:

  • Add a node-scoped launcherPodItem work item and a syncLauncherInstances path to refresh in-memory launcher instance state.
  • Adjust node reconciliation to process launcher sync items before other node-local items.
  • Extend test/e2e/run-launcher-based.sh with controller restart recovery and unbound launcher deletion scenarios.
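The prioritization in the second bullet can be sketched as follows. The type and function names here are illustrative, not the controller's real ones; this only shows the two-pass "launcher sync items first" ordering the summary describes.

```go
package main

import "fmt"

// nodeLocalItem is a hypothetical stand-in for the controller's
// node-local work items.
type nodeLocalItem struct {
	isLauncherSync bool
	name           string
}

// processNodeItems returns the names in processing order: launcher sync
// items first, then everything else, preserving relative order within
// each group.
func processNodeItems(items []nodeLocalItem) []string {
	var order []string
	for _, it := range items { // first pass: launcher sync items
		if it.isLauncherSync {
			order = append(order, it.name)
		}
	}
	for _, it := range items { // second pass: the remaining node-local items
		if !it.isLauncherSync {
			order = append(order, it.name)
		}
	}
	return order
}

func main() {
	items := []nodeLocalItem{
		{false, "requester-1"},
		{true, "launcher-sync"},
		{false, "requester-2"},
	}
	fmt.Println(processNodeItems(items)) // launcher-sync comes out first
}
```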

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

Files changed:

  • test/e2e/run-launcher-based.sh: Adds E2E scenarios for controller restart state recovery and unbound launcher deletion cleanup.
  • pkg/controller/dual-pods/inference-server.go: Adds launcher sync work item processing and a helper to sync launcher instances into controller state; prioritizes launcher items in node reconciliation.
  • pkg/controller/dual-pods/controller.go: Adds launcherPodItem, updates pod event handling to enqueue launcher sync, and adds enqueueRequestersOnNode / clearLauncherData.

}

// Query instances from this launcher
// Query instances from this launcher.
Collaborator

@MikeSpreitzer, Mar 18, 2026


It is still spurious to call lClient.ListInstances here, right after ensuring that the launcherDat holds an accurate list of instances. If we are not willing to trust what is in launcherDat.Instances then why maintain it at all?

There is a problem in proper maintenance of launcherDat.Instances: the true listing can change without any object in the Kube apiservers changing. So either we abandon maintaining a downstream copy, or we fix its maintenance. Possible fixes:

  • Simplest: query lClient.ListInstances every time we care and write the result into launcherDat.Instances, which effectively gives up on maintaining a local copy of this answer.
  • Introduce to the launcher a nonce (fetchable by a GET) that it changes every time the instance listing changes; prefix any use of launcherDat.Instances with a check that the nonce has not changed.
  • A variant of the nonce fix: have the launcher Pod itself maintain this nonce in a Pod annotation.
  • Have the dual-pods controller periodically refresh launcherDat.Instances even without any signal that there has been a change, and accept the consequent variable latency in reactions.
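The nonce-based idea can be sketched like this. Everything below is hypothetical: the launcher exposes no GetNonce today (that is the proposal), and launcherClient/launcherData are stand-ins for the controller's actual types. The cache is trusted only while the launcher's nonce is unchanged.

```go
package main

import (
	"fmt"
	"time"
)

// launcherClient stands in for the real launcher client (lClient).
type launcherClient struct {
	nonce     string
	instances []string
}

func (c *launcherClient) GetNonce() string        { return c.nonce }
func (c *launcherClient) ListInstances() []string { return append([]string{}, c.instances...) }

// launcherData mirrors the controller-side cache of instances plus the
// nonce observed at the last refresh.
type launcherData struct {
	Instances   map[string]time.Time
	cachedNonce string
}

// instancesFor returns the cached listing when the launcher's nonce is
// unchanged, and refreshes the cache from ListInstances otherwise.
func (d *launcherData) instancesFor(c *launcherClient) map[string]time.Time {
	if nonce := c.GetNonce(); nonce != d.cachedNonce {
		d.Instances = map[string]time.Time{}
		for _, id := range c.ListInstances() {
			d.Instances[id] = time.Now()
		}
		d.cachedNonce = nonce
	}
	return d.Instances
}

func main() {
	client := &launcherClient{nonce: "n1", instances: []string{"a"}}
	dat := &launcherData{}
	fmt.Println(len(dat.instancesFor(client))) // refresh: 1 instance

	// The true listing changes; the launcher bumps its nonce with it.
	client.instances = append(client.instances, "b")
	client.nonce = "n2"
	fmt.Println(len(dat.instancesFor(client))) // refresh again: 2 instances
}
```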

Collaborator Author


This looks tricky enough that it probably deserves its own PR. Would it make sense to address it separately?

Collaborator


Yes

Collaborator Author


Created Issue #375 to track this.

Collaborator

@MikeSpreitzer left a comment


I left some independent comments. Also, nodeData.Launchers is and should remain only accessed while processing a Node, so no more synchronization is needed.

@waltforme
Collaborator Author

Removed unnecessary synchronization for the access of nodeData.Launchers.


InferenceServers map[apitypes.UID]*serverData

// Launchers maps name of launcher-based server-providing Pod to launcherData.
// Access only while holding controller mutex.
Collaborator

@MikeSpreitzer, Mar 20, 2026


Include a statement about what does synchronize access. Perhaps something like the following.

// Access only inside `nodeItem.process()`

Similarly, every func (below nodeItem.process()) that (directly or in a called func) accesses this field should have a comment stating the restriction. Perhaps something like the following.

// Call this func only from within `nodeItem.process()`

Comment on lines +369 to +370
// For launcher pods, use the pod's own UID and name as the item identifier
return infSvrItem{pod.UID, pod.Name}, infSvrItemLauncherBasedProvider
Collaborator

@MikeSpreitzer, Mar 20, 2026


This is hacky. The first value being returned here is not really a reference to an inference server. It would be more accurate for this func's return type to be (infSvrItem, launcherPodItem, infSvrItemType) (and, yeah, generalize that latter type name).

This is not a critical problem.

Collaborator

@MikeSpreitzer left a comment


I have finished another round of review.


// Launchers maps name of launcher-based server-providing Pod to launcherData.
// Access only while holding controller mutex.
// Access only inside the calling hierarchy that `nodeItem.process()` is the root caller.
Collaborator


"root" is not right, since it is not at the coldest end of the stack. Maybe something like the following?

Access only while nodeItem.process is on the call stack.

Collaborator Author


A subtree has a root as well.

I think the current expression and the suggested expression are equivalent.

if _, instExists := launcherDat.Instances[iscHash]; instExists {
hasSleepingInstance := false
for _, inst := range insts.Instances {
if inst.InstanceID == iscHash {
Collaborator


launcherPodAnys contains all launchers made from the right LauncherConfig object, right? Including ones with an awake child whose InstanceID == iscHash, right?

Collaborator Author


Not including an awake child because of the same logic as

// They have to be sleeping, the Kube scheduler and kubelet would not have assigned the same
// node/gpus to the requester if there was another one awake.

Collaborator

@MikeSpreitzer, Mar 25, 2026


Not exactly analogous. You cited logic for the direct (milestone 2) case, in which the index value is a hash that takes the node name and the GPU list into account. In that case it is reasonable to expect the mentioned Kube Pod scheduler behavior.

But here in the launcher-based case, the index is on the hash of the LauncherConfig augmented by the node name and "gpus=all". So this index is not so discriminating, and could include other existing launcher Pods with the right LauncherConfig and node but being used for an awake vllm instance using different GPU(s).
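The difference in discriminating power can be seen with a toy hash. The key layouts and the use of FNV here are assumptions for illustration only, not the controller's real hashing: the point is that a GPU-specific key separates two launchers on the same node, while a "gpus=all" key does not.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// key hashes its parts with a separator byte so that distinct part
// lists yield distinct byte streams.
func key(parts ...string) uint64 {
	h := fnv.New64a()
	for _, p := range parts {
		h.Write([]byte(p))
		h.Write([]byte{0})
	}
	return h.Sum64()
}

func main() {
	// Two launcher Pods on the same node serving different GPU sets.
	// A direct-style index that hashes the GPU list tells them apart;
	// a launcher-style index that hashes "gpus=all" lumps them together.
	direct := key("node1", "gpus=0,1") == key("node1", "gpus=2,3")
	launcher := key("cfgHash", "node1", "gpus=all") == key("cfgHash", "node1", "gpus=all")
	fmt.Println(direct)   // false: the direct index separates them
	fmt.Println(launcher) // true: the launcher index does not
}
```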

Collaborator Author


Info for GPUs is hashed into iscHash, right?

Collaborator

@MikeSpreitzer, Mar 25, 2026


Collaborator Author


Yes. The reason that I mentioned iscHash as well is that the two hashes work here together.

Collaborator


Oh, right. If the iscHash matches then the GPU sets are the same and so the instance being considered must be sleeping.

Collaborator


Maybe this is tricky enough to warrant a comment too.

newInstances[inst.InstanceID] = time.Now()
}
}

Collaborator


If an awake instance evaporated then this should cause the relevant infSvrItem to be enqueued.
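The suggested reaction could be sketched as follows. The set representation and the enqueue callback are hypothetical, not the controller's real work-queue API; the sketch only shows "re-enqueue anything that was in the old listing but is gone from the new one."

```go
package main

import "fmt"

// enqueueVanished calls enqueue for every instance ID present in prev
// but absent from cur, i.e. instances that evaporated between refreshes.
func enqueueVanished(prev, cur map[string]bool, enqueue func(id string)) {
	for id := range prev {
		if !cur[id] {
			enqueue(id)
		}
	}
}

func main() {
	prev := map[string]bool{"inst-a": true, "inst-b": true}
	cur := map[string]bool{"inst-b": true}
	enqueueVanished(prev, cur, func(id string) {
		fmt.Println("enqueue infSvrItem for", id)
	})
}
```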

Collaborator Author


This is tracked by #375.

@MikeSpreitzer
Collaborator

/ok-to-test

@github-actions

🚀 E2E tests triggered by /ok-to-test

View the OpenShift E2E workflow run

Collaborator

@MikeSpreitzer left a comment


LGTM, provided the E2E test on OpenShift succeeds.

Collaborator

@MikeSpreitzer left a comment


This needs to be rebased onto main so that the E2E test on OpenShift can succeed. (In main the workflow invokes a script that was added recently.)

Signed-off-by: Jun Duan <jun.duan.phd@outlook.com>
@waltforme
Collaborator Author

/ok-to-test

@github-actions

🚀 E2E tests triggered by /ok-to-test

View the OpenShift E2E workflow run

@waltforme
Collaborator Author

/ok-to-test

@github-actions

🚀 E2E tests triggered by /ok-to-test

View the OpenShift E2E workflow run

@MikeSpreitzer
Collaborator

Taints are a thing in Kubernetes; FMA should work correctly in their presence.

@MikeSpreitzer
Collaborator

/ok-to-test

@github-actions

🚀 E2E tests triggered by /ok-to-test

View the OpenShift E2E workflow run

@MikeSpreitzer
Collaborator

The E2E test on OpenShift passed.

Collaborator

@MikeSpreitzer left a comment


LGTM

MikeSpreitzer merged commit 5570d18 into llm-d-incubation:main on Mar 25, 2026
25 checks passed
waltforme deleted the sync-unbound branch March 25, 2026 15:57