Create launcher pod in bound state #443
waltforme wants to merge 11 commits into llm-d-incubation:main
Conversation
| "k8s.io/apimachinery/pkg/util/sets" | ||
| k8svalidation "k8s.io/apimachinery/pkg/util/validation" | ||
| "k8s.io/apimachinery/pkg/util/strategicpatch" | ||
| k8svalidation "k8s.io/apimachinery/pkg/util/validation" |
Some linter suggested this reordering.
I am surprised that this disorder even got merged. I thought that we had stuff both preventing and checking that. Worth an Issue.
If this fix were a separate PR then I would have already merged it.
Pull request overview
Fixes a race in the dual-pods controller where newly created launcher Pods could be temporarily “unbound” and therefore eligible for deletion by the launcher-populator before the vLLM instance is created/bound.
Changes:
- Create launcher Pods with the requester annotation (and provider finalizer) already set, so they’re considered bound immediately.
- Add reconciliation logic to detect "pre-bound but not yet fully bound" launcher Pods (no dual label, no instance ID) and create/ensure the named vLLM instance before calling bind().
- Refactor instance-ensure logic into a helper (ensureNamedLauncherInstance).
MikeSpreitzer
left a comment
I left some independent comments.
/ok-to-test
🚀 E2E tests triggered by /ok-to-test
On further thought, I think that the problem is a little bigger than fixing #421, and the solution needs to address the whole problem. Regarding the problem: note that the dual-pods controller can crash right after waking or creating a vllm instance in a launcher, in which case the following call to ctl.bind does not happen (before the crash).
Note also this unwritten invariant, designed in milestone 2 and built into the definition of launcher population control: vllm awake implies provider Pod is bound. Put another way: in a wake-up scenario the provider Pod is first bound and then the vllm instance is woken up, and in a go-to-sleep scenario the vllm instance is first put to sleep and then the provider Pod is unbound. But the current milestone 3 code violates this invariant in the cases where a suitable unbound launcher is sought and found.
Let me make a couple of observations while warming up to the solution. Note that the condition && providingPod.Status.PodIP != "" is unnecessary, because serverDat.InstanceID != "" implies providingPod.Status.PodIP != "" (in the current code). BTW, it would be nice to have invariants like that written in the code; in this case, a comment somewhere in the definition of serverData. Also, this particular invariant might go away, because of the following.
Note that the vllm instance ID is known as early as it is computed, which precedes the calls to bind.
So here are my current thoughts about the solution.
Signed-off-by: Jun Duan <jun.duan.phd@outlook.com>
- address 409 when ensuring instance inside a created-as-bound launcher
- skip waking up a freshly created instance
- don't rely on the value of the dual label

Signed-off-by: Jun Duan <jun.duan.phd@outlook.com>
…e stage

Signed-off-by: Jun Duan <jun.duan.phd@outlook.com>
8bf5078 to
83a5514
desiredLauncherPod.Annotations = utils.MapSet(desiredLauncherPod.Annotations, requesterAnnotationKey, string(requestingPod.UID)+" "+requestingPod.Name)
desiredLauncherPod.Labels = utils.MapSet(desiredLauncherPod.Labels, api.DualLabelName, requestingPod.Name)
FYI, BuildLauncherPodFromTemplate ensures that desiredLauncherPod.Annotations and desiredLauncherPod.Labels are not nil, so these statements can use bare indexing instead of utils.MapSet.
Signed-off-by: Jun Duan <jun.duan.phd@outlook.com>
Signed-off-by: Jun Duan <jun.duan.phd@outlook.com>
Tried to implement the idea from this comment by 045535f. Also created #455 for point 4 of the comment.
// IsInstanceAlreadyExistsError returns true when the launcher reports that the
// instance already exists (HTTP 409 Conflict).
func IsInstanceAlreadyExistsError(err error) bool {
Technically this func only determines whether the error is an HTTP 409 Conflict, and it is the caller that makes the deduction that a 409 implies that the vllm instance already exists.
_, deletedStopped := syncResult.deletedStoppedInstanceIDs[serverDat.InstanceID]
if deletedStopped || !instancePresent {
if serverDat.InstanceExists != nil && *serverDat.InstanceExists {
To properly resolve the create/delete ambiguity (is this instance absent because it wasn't created yet, or because it was created and then deleted?), we cannot rely on the controller's memory: the controller can crash and restart at any moment. We need to orchestrate clarity through the kube API objects. The following is the first idea that occurs to me. When reacting to a vllm instance being stopped, this controller should first relay that by deleting the server-requesting Pod, and only then command deletion of the vllm instance.
This can be addressed in a follow-on PR if you prefer.
Unless and until the above is addressed, should the condition in this if statement start with deletedStopped || ?
I tried to address this within this PR, using the suggested approach.
return fmt.Errorf("failed to delete server-requesting Pod for stopped instance: %w", err), true
// InstanceExists is nil (unknown) — instance hasn't been created yet
// (bind-first path) or controller restarted and lost tracking.
// ensureNamedLauncherInstance is idempotent: GET first, create if not found.
That is no longer necessary. We just did ctl.syncLauncherInstances above, so we know that this controller has not commanded creation of the instance. It is OK to unconditionally command creation of the instance here. In the very rare circumstance that something else concurrently commanded creation of the instance: returning (error, true) would be adequate handling; alternatively, I think that it might be OK to assume that the concurrent thing was another copy of this controller, and the concurrently-created instance is just what this copy wanted to happen.
}
cfg, iscHash, err := ctl.configInferenceServer(isc, serverDat.GPUIDs)
_, iscHash, err := ctl.configInferenceServer(isc, serverDat.GPUIDs)
Call this once, right after setting serverDat.GPUIDs, and be done with it. Save both outputs in serverDat.
Removed the redundancy.
Signed-off-by: Jun Duan <jun.duan.phd@outlook.com>
}
logger.V(5).Info("Ensured vLLM instance", "instance_id", result.InstanceID, "status", result.Status)
// If ISC tracking annotations are missing (pre-bound pod), complete the bind metadata.
if _, bound := providingPod.Annotations[iscLabelKeysAnnotationKey]; !bound {
bound is not the right name for the condition here. The launcher is certainly bound. The question is whether the ISC labels and annotations have been propagated yet.
MikeSpreitzer
left a comment
I left some individual comments.
Signed-off-by: Jun Duan <jun.duan.phd@outlook.com>
Signed-off-by: Jun Duan <jun.duan.phd@outlook.com>
Signed-off-by: Jun Duan <jun.duan.phd@outlook.com>
Signed-off-by: Jun Duan <jun.duan.phd@outlook.com>
/ok-to-test
🚀 E2E tests triggered by /ok-to-test
Signed-off-by: Jun Duan <jun.duan.phd@outlook.com>
/ok-to-test
🚀 E2E tests triggered by /ok-to-test
Fixes #421