Sync unbound launcher-based server-providing pods#362
MikeSpreitzer merged 9 commits into llm-d-incubation:main
Conversation
Pull request overview
This PR brings unbound launcher-based server-providing pods into the dual-pods controller’s node-local reconciliation so the controller can resync launcher instance state (e.g., after restart) and re-drive requesters affected by launcher lifecycle events. It also extends the launcher-based E2E suite to cover controller restart recovery and unbound launcher deletion cleanup.
Changes:
- Add a node-scoped `launcherPodItem` work item and a `syncLauncherInstances` path to refresh in-memory launcher instance state.
- Adjust node reconciliation to process launcher sync items before other node-local items.
- Extend `test/e2e/run-launcher-based.sh` with controller restart recovery and unbound launcher deletion scenarios.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| test/e2e/run-launcher-based.sh | Adds E2E scenarios for controller restart state recovery and unbound launcher deletion cleanup. |
| pkg/controller/dual-pods/inference-server.go | Adds launcher sync work item processing and a helper to sync launcher instances into controller state; prioritizes launcher items in node reconciliation. |
| pkg/controller/dual-pods/controller.go | Adds launcherPodItem, updates pod event handling to enqueue launcher sync, and adds enqueueRequestersOnNode / clearLauncherData. |
```diff
 }

-// Query instances from this launcher
+// Query instances from this launcher.
```
It is still spurious to call lClient.ListInstances here, right after ensuring that launcherDat holds an accurate list of instances. If we are not willing to trust what is in launcherDat.Instances, then why maintain it at all?

There is a problem in the proper maintenance of launcherDat.Instances: the true listing can change without any object in the Kube apiservers changing. So either we abandon maintaining a downstream copy, or we fix its maintenance.

- The simplest fix is to query lClient.ListInstances every time we care and write the result into launcherDat.Instances, which effectively gives up on maintaining a local copy of this answer.
- Another fix would be to introduce to the launcher a nonce (fetchable by a GET) that it changes every time the instance listing changes, and to prefix any use of launcherDat.Instances with a check that the nonce has not changed.
- A variant of that fix would be to have the launcher Pod itself maintain this nonce in a Pod annotation.
- Another possible fix would be to have the dual-pods controller periodically refresh launcherDat.Instances even without any signal that there has been a change, and accept the consequent variable latency in reactions.
This looks tricky enough that it probably deserves its own PR. Would it make sense to address it separately?
MikeSpreitzer
left a comment
I left some independent comments. Also, nodeData.Launchers is and should remain only accessed while processing a Node, so no more synchronization is needed.
Removed unnecessary synchronization for the access of
```go
InferenceServers map[apitypes.UID]*serverData

// Launchers maps name of launcher-based server-providing Pod to launcherData.
// Access only while holding controller mutex.
```
Include a statement about what does synchronize access. Perhaps something like the following:

```go
// Access only inside `nodeItem.process()`
```

Similarly, every func (below `nodeItem.process()`) that accesses this field (directly or in a called func) should have a comment stating the restriction. Perhaps something like the following:

```go
// Call this func only from within `nodeItem.process()`
```
```go
// For launcher pods, use the pod's own UID and name as the item identifier
return infSvrItem{pod.UID, pod.Name}, infSvrItemLauncherBasedProvider
```
This is hacky. The first value being returned here is not really a reference to an inference server. It would be more accurate for this func's return type to be (infSvrItem, launcherPodItem, infSvrItemType) (and, yeah, generalize that latter type name).
This is not a critical problem.
MikeSpreitzer
left a comment
I have finished another round of review.
```diff
 // Launchers maps name of launcher-based server-providing Pod to launcherData.
-// Access only while holding controller mutex.
+// Access only inside the calling hierarchy that `nodeItem.process()` is the root caller.
```
"root" is not right, since it is not at the coldest end of the stack. Maybe something like the following?

```go
// Access only while `nodeItem.process` is on the call stack.
```
A subtree has a root as well.
I think the current expression and the suggested expression are equivalent.
```go
if _, instExists := launcherDat.Instances[iscHash]; instExists {
	hasSleepingInstance := false
	for _, inst := range insts.Instances {
		if inst.InstanceID == iscHash {
```
launcherPodAnys contains all launchers made from the right LauncherConfig object, right? Including ones with an awake child whose InstanceID == iscHash, right?
Not including an awake child, because of the same logic as `pkg/controller/dual-pods/inference-server.go` lines 430 to 431 in 0583aea.
Not exactly analogous. You cited logic for the direct (milestone 2) case, in which the index value is a hash that takes the node name and the GPU list into account. In that case it is reasonable to expect the mentioned Kube Pod scheduler behavior.
But here in the launcher-based case, the index is on the hash of the LauncherConfig augmented by the node name and "gpus=all". So this index is not so discriminating, and could include other existing launcher Pods with the right LauncherConfig and node but being used for an awake vllm instance using different GPU(s).
Info for GPUs is hashed into iscHash, right?
launcherPodAnys is the launchers that match the launcher hash produced in https://github.com/waltforme/llm-d-fast-model-actuation/blob/42c4ee8d59cb35f109119e78a5035b41a03cad91/pkg/controller/dual-pods/inference-server.go#L476, extracted at https://github.com/waltforme/llm-d-fast-model-actuation/blob/42c4ee8d59cb35f109119e78a5035b41a03cad91/pkg/controller/dual-pods/inference-server.go#L480, and used to get launcherPodAnys at https://github.com/waltforme/llm-d-fast-model-actuation/blob/42c4ee8d59cb35f109119e78a5035b41a03cad91/pkg/controller/dual-pods/inference-server.go#L482 .
(The commit cited is the current rev of this PR, as I write this.)
Yes. The reason I mentioned iscHash as well is that the two hashes work together here.
Oh, right. If the iscHash matches then the GPU sets are the same and so the instance being considered must be sleeping.
Maybe this is tricky enough to warrant a comment too.
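One possible shape for such a comment, capturing the two-hash reasoning discussed above (the wording is only a sketch):

```go
// The launcher hash already pins the LauncherConfig, the node, and
// "gpus=all"; if additionally inst.InstanceID == iscHash then the GPU
// set matches too, so the instance found here must be a sleeping one
// (an awake instance on other GPUs would produce a different iscHash).
```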
```go
		newInstances[inst.InstanceID] = time.Now()
	}
}
```
If an awake instance evaporated then this should cause the relevant infSvrItem to be enqueued.
/ok-to-test

🚀 E2E tests triggered by /ok-to-test
MikeSpreitzer
left a comment
LGTM, provided the E2E test on OpenShift succeeds.
MikeSpreitzer
left a comment
This needs to be rebased onto main so that the E2E test on OpenShift can succeed. (In main the workflow invokes a script that was added recently.)
Signed-off-by: Jun Duan <jun.duan.phd@outlook.com>
/ok-to-test

🚀 E2E tests triggered by /ok-to-test

/ok-to-test

🚀 E2E tests triggered by /ok-to-test

Taints are a thing in Kubernetes; FMA should work correctly in their presence.

/ok-to-test

🚀 E2E tests triggered by /ok-to-test

The E2E test on OpenShift passed.
This PR brings unbound launcher pods into the dual-pods controller’s node-local reconciliation flow, so that launcher readiness, deletion, and restart recovery update internal instance state and re-drive affected requesters.
It also extends launcher-based E2E coverage.