Sync unbound launcher-based server-providing pods#362
MikeSpreitzer merged 9 commits into llm-d-incubation:main
Conversation
Pull request overview
This PR brings unbound launcher-based server-providing pods into the dual-pods controller’s node-local reconciliation so the controller can resync launcher instance state (e.g., after restart) and re-drive requesters affected by launcher lifecycle events. It also extends the launcher-based E2E suite to cover controller restart recovery and unbound launcher deletion cleanup.
Changes:
- Add a node-scoped `launcherPodItem` work item and a `syncLauncherInstances` path to refresh in-memory launcher instance state.
- Adjust node reconciliation to process launcher sync items before other node-local items.
- Extend `test/e2e/run-launcher-based.sh` with controller restart recovery and unbound launcher deletion scenarios.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| test/e2e/run-launcher-based.sh | Adds E2E scenarios for controller restart state recovery and unbound launcher deletion cleanup. |
| pkg/controller/dual-pods/inference-server.go | Adds launcher sync work item processing and a helper to sync launcher instances into controller state; prioritizes launcher items in node reconciliation. |
| pkg/controller/dual-pods/controller.go | Adds launcherPodItem, updates pod event handling to enqueue launcher sync, and adds enqueueRequestersOnNode / clearLauncherData. |
```diff
 }

-// Query instances from this launcher
+// Query instances from this launcher.
```
It is still spurious to call lClient.ListInstances here, right after ensuring that launcherDat holds an accurate list of instances. If we are not willing to trust what is in launcherDat.Instances, then why maintain it at all?

There is a problem in the proper maintenance of launcherDat.Instances: the true listing can change without any object in the Kube apiservers changing. So either we abandon maintaining a downstream copy, or we fix its maintenance.

- The simplest fix is to query lClient.ListInstances every time we care and write the result into launcherDat.Instances, which effectively gives up on maintaining a local copy of this answer.
- Another fix would be to introduce to the launcher a nonce (fetchable by a GET) that it changes every time the instance listing changes, and to prefix any use of launcherDat.Instances with a check that the nonce has not changed.
- A variant of that fix would be to have the launcher Pod itself maintain this nonce in a Pod annotation.
- Another possible fix would be to have the dual-pods controller periodically refresh launcherDat.Instances even without any signal that there has been a change, and accept the consequent variable latency in reactions.
This looks tricky enough that it probably deserves its own PR. Would it make sense to address it separately?
MikeSpreitzer
left a comment
I left some independent comments. Also, nodeData.Launchers is and should remain only accessed while processing a Node, so no more synchronization is needed.
Removed unnecessary synchronization for the access of
```go
InferenceServers map[apitypes.UID]*serverData

// Launchers maps name of launcher-based server-providing Pod to launcherData.
// Access only while holding controller mutex.
```
Include a statement about what does synchronize access. Perhaps something like the following:

```go
// Access only inside `nodeItem.process()`
```

Similarly, every func (below `nodeItem.process()`) that accesses this field (directly or in a called func) should have a comment stating the restriction. Perhaps something like the following:

```go
// Call this func only from within `nodeItem.process()`
```
```go
// For launcher pods, use the pod's own UID and name as the item identifier
return infSvrItem{pod.UID, pod.Name}, infSvrItemLauncherBasedProvider
```
This is hacky. The first value being returned here is not really a reference to an inference server. It would be more accurate for this func's return type to be (infSvrItem, launcherPodItem, infSvrItemType) (and, yeah, generalize that latter type name).
This is not a critical problem.
MikeSpreitzer
left a comment
I have finished another round of review.
```diff
 // Launchers maps name of launcher-based server-providing Pod to launcherData.
-// Access only while holding controller mutex.
+// Access only inside the calling hierarchy that `nodeItem.process()` is the root caller.
```
"root" is not right, since it is not at the coldest end of the stack. Maybe something like the following?

```go
// Access only while `nodeItem.process` is on the call stack.
```
A subtree has a root as well.
I think the current expression and the suggested expression are equivalent.
```go
if _, instExists := launcherDat.Instances[iscHash]; instExists {
	hasSleepingInstance := false
	for _, inst := range insts.Instances {
		if inst.InstanceID == iscHash {
```
launcherPodAnys contains all launchers made from the right LauncherConfig object, right? Including ones with an awake child whose InstanceID == iscHash, right?
Not including an awake child, because of the same logic as `pkg/controller/dual-pods/inference-server.go` lines 430 to 431 in 0583aea.
Not exactly analogous. You cited logic for the direct (milestone 2) case, in which the index value is a hash that takes the node name and the GPU list into account. In that case it is reasonable to expect the mentioned Kube Pod scheduler behavior.
But here in the launcher-based case, the index is on the hash of the LauncherConfig augmented by the node name and "gpus=all". So this index is not so discriminating, and could include other existing launcher Pods with the right LauncherConfig and node but being used for an awake vllm instance using different GPU(s).
Info for GPUs is hashed into iscHash, right?
launcherPodAnys is the launchers that match the launcher hash produced in https://github.com/waltforme/llm-d-fast-model-actuation/blob/42c4ee8d59cb35f109119e78a5035b41a03cad91/pkg/controller/dual-pods/inference-server.go#L476, extracted at https://github.com/waltforme/llm-d-fast-model-actuation/blob/42c4ee8d59cb35f109119e78a5035b41a03cad91/pkg/controller/dual-pods/inference-server.go#L480, and used to get launcherPodAnys at https://github.com/waltforme/llm-d-fast-model-actuation/blob/42c4ee8d59cb35f109119e78a5035b41a03cad91/pkg/controller/dual-pods/inference-server.go#L482 .
(The commit cited is the current rev of this PR, as I write this.)
Yes. The reason I mentioned iscHash as well is that the two hashes work together here.
Oh, right. If the iscHash matches then the GPU sets are the same and so the instance being considered must be sleeping.
Maybe this is tricky enough to warrant a comment too.
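One possible shape for such a comment, capturing the two-hash reasoning discussed above (the wording is only a sketch):

```go
// The launcher hash already pins the LauncherConfig, the node, and
// "gpus=all"; if additionally inst.InstanceID == iscHash then the GPU
// set matches too, so the instance found here must be a sleeping one
// (an awake instance on other GPUs would produce a different iscHash).
```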
```go
		newInstances[inst.InstanceID] = time.Now()
	}
}
```
If an awake instance evaporated then this should cause the relevant infSvrItem to be enqueued.
/ok-to-test

🚀 E2E tests triggered by /ok-to-test
MikeSpreitzer
left a comment
LGTM, provided the E2E test on OpenShift succeeds.
MikeSpreitzer
left a comment
This needs to be rebased onto main so that the E2E test on OpenShift can succeed. (In main the workflow invokes a script that was added recently.)
Signed-off-by: Jun Duan <jun.duan.phd@outlook.com>
/ok-to-test

🚀 E2E tests triggered by /ok-to-test

/ok-to-test

🚀 E2E tests triggered by /ok-to-test

Taints are a thing in Kubernetes; FMA should work correctly in their presence.

/ok-to-test

🚀 E2E tests triggered by /ok-to-test

The E2E test on OpenShift passed.
This PR brings unbound launcher pods into the dual-pods controller’s node-local reconciliation flow, so that launcher readiness, deletion, and restart recovery update internal instance state and re-drive affected requesters.
It also extends launcher-based E2E coverage.