handle LauncherConfig update events #374
osswangxining wants to merge 9 commits into llm-d-incubation:main from
Conversation
Unsigned commits detected! Please sign your commits. For instructions on how to set up GPG/SSH signing and verify your commits, please see GitHub Documentation.
@MikeSpreitzer @aavarghese @waltforme @rubambiza Please help check this logic for handling the LauncherConfig update event. BTW, we also need to handle this LauncherConfig update within the duals-pod controller; @waltforme, do you think that you would update the code, or shall I try?
Force-pushed from 8e5bcad to 43186c1
I think the handling of LauncherConfig updates in question is described in https://llm-d.slack.com/archives/C09TNPEFJUD/p1774153845072779?thread_ts=1773867574.189679&cid=C09TNPEFJUD and also in #201 (comment). Am I right? If so, IMO the handling can be done by any of the controllers, because the necessary info for it (the LauncherConfig name and hash) is contained within the launcher pods as annotations. Further, I think the launcher populator is the best candidate to do the handling, because "it is currently the only controller deleting launcher Pods and I think the system would be most stable if we keep that responsibility in one controller", as Mike said in #201 (comment).
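As a rough, hedged illustration of that point (the annotation key constants below are placeholders I invented, not the project's actual names), any controller could recover the needed info directly from a launcher Pod:

```go
package launcher

import corev1 "k8s.io/api/core/v1"

// Placeholder annotation keys for illustration only; the real project defines
// its own constants (e.g. common.LauncherConfigHashAnnotationKey).
const (
	launcherConfigNameAnnotation = "example.io/launcher-config-name"
	launcherConfigHashAnnotation = "example.io/launcher-config-hash"
)

// launcherConfigRef reads back the LauncherConfig name and spec hash that were
// stamped onto a launcher Pod when it was created.
func launcherConfigRef(pod *corev1.Pod) (name, hash string, ok bool) {
	name, nameOK := pod.Annotations[launcherConfigNameAnnotation]
	hash, hashOK := pod.Annotations[launcherConfigHashAnnotation]
	return name, hash, nameOK && hashOK
}
```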
I think that it is important for limiting memory consumption that the launcher deletions on a given Node happen before the launcher creations.
If I understand this PR correctly, its purpose is to implement A3 for Q3 in #201 (comment). I realize now that A3 does not explicitly talk about what should happen while a referenced LauncherConfig does not exist. I suspect that the best choice would be for the launcher population controller to take no action regarding policy for that LauncherConfig.
When the LauncherConfig does not exist, the processing logic will return without doing any reconciliation. Is this what you mean at the // Get the LauncherConfig comment?
You are right. I refactored the logic here for further review.
polish based on the comments; add vendor/ into gitignore
Force-pushed from 20e866e to be25996
rubambiza left a comment
Left a few clarifying questions.
Is this a good time to add some minimal tests (unit tests or function tests) for this launcher config update to our existing test suite?
```go
// buildDesiredStateFromPolicies builds the desired state map from all policies.
// If filterByConfig is provided, only policies referencing that config are considered.
func (ctl *controller) buildDesiredStateFromPolicies(ctx context.Context, filterByConfig *string) (map[NodeLauncherKey]int32, error) {
```
There could be other LauncherConfig objects that are missing (either recently deleted or long missing). It is wrong to treat any one of them specially.
Sorry, I'm not sure I follow. Could you elaborate a bit more?
This method does not need the filterByConfig *string parameter, because no LauncherConfig needs special treatment. Every LauncherConfig that does not exist should get no processing. Because we are using A3 for Q3 in #201 (comment), this method needs to read (preferably from the informer's local cache) every LauncherConfig object to get its current LauncherConfigSpec; these reads will reveal all of the non-existent LauncherConfig objects.
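For concreteness, a minimal sketch of that shape, assuming the controller holds a generated lister for LauncherConfig in a field named launcherConfigLister (the field and method names here are placeholders, not the project's):

```go
// currentLauncherConfigSpecs reads every LauncherConfig from the informer's
// local cache and returns the current Spec for each one that exists. Any name
// absent from the returned map refers to a LauncherConfig that does not exist
// right now and should get no processing.
func (ctl *controller) currentLauncherConfigSpecs() (map[string]fmav1alpha1.LauncherConfigSpec, error) {
	all, err := ctl.launcherConfigLister.List(labels.Everything()) // assumed lister field
	if err != nil {
		return nil, err
	}
	specs := make(map[string]fmav1alpha1.LauncherConfigSpec, len(all))
	for _, lc := range all {
		specs[lc.Name] = lc.Spec
	}
	return specs, nil
}
```

(labels here is k8s.io/apimachinery/pkg/labels; fmav1alpha1 is the API package already used elsewhere in this diff.)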
@osswangxining, in response to #374 (comment): I had a few reasons for making that comment. (A) To document and get explicit agreement on the idea. (B) To approve having the controller do nothing about the population of launchers associated with a LauncherConfig that does not exist. (C) To suggest that this is a pervasive concern, not just for the one LauncherConfig whose reference is being processed (as I also noted in a code comment).
This PR is still not addressing the main issue in Q3 of #374 (comment): reacting to a change in the Spec of a LauncherConfig. To do that, a launcher's metadata will need to include not only the name of the LauncherConfig but also a hash of its Spec. The decision of which launchers to delete will need to include a comparison of the hash in the launcher's metadata with the hash of the current Spec of the LauncherConfig (and here is where the general concern with absence of the LauncherConfig object comes in).
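A hedged sketch of that comparison (the name-annotation key below is a placeholder; common.LauncherConfigHashAnnotationKey is the constant that appears elsewhere in this diff):

```go
// isStaleLauncher sketches the proposed check: a launcher is stale when the hash
// recorded on it no longer matches the hash derived from the LauncherConfig's
// current Spec. nominalHashes maps LauncherConfig name to that current hash; a
// missing entry means the LauncherConfig does not exist right now, in which case
// the controller takes no action for that launcher.
func isStaleLauncher(pod *corev1.Pod, nominalHashes map[string]string) bool {
	name := pod.Annotations["example.io/launcher-config-name"] // placeholder key
	recorded := pod.Annotations[string(common.LauncherConfigHashAnnotationKey)]
	current, exists := nominalHashes[name]
	if !exists {
		return false
	}
	return recorded != current
}
```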
When handling deletions, would it be sufficient to just compare the name? It might not be necessary to compare the spec content.
@MikeSpreitzer @rubambiza @waltforme @aavarghese I rebased together with fixes for all the comments. Looking forward to your reviews again, thanks a lot.
```diff
 	ctl.Queue.Add(item)
 default:
-	ctl.enqueueLogger.V(5).Info("Notified of add of type of ignored object", "type", fmt.Sprintf("%T", obj))
+	ctl.enqueueLogger.V(5).Info("Notified of add of ignored object", "type", fmt.Sprintf("%T", obj))
```
The "type" is important to include in the log message, since the object's type is the critical factor here and appears in the key/value pairs. I think that the original is a commonly used English way of stating the intended idea, but perhaps is not strictly correct grammar. Perhaps it is better to be more expansive. Perhaps a better message would be "Notified of add of object of ignored type".
The same wording issue applies in the update and delete handlers.
Fine, I will replace with "Notified of add of object of ignored type". @rubambiza FYI.
```go
	if needsRequeue {
		return nil, true
	}
	return nil, false
```
These can be simplified to just return nil, needsRequeue.
```go
	return nil, false
}

func (item lcItem) process(ctx context.Context, ctl *controller) (error, bool) {
```
The body of this method is now exactly the same as the body of lppItem.process. Rather than maintain two copies, I think that it would be better to have both simply call one function that has the common code.
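A sketch of that refactor (the shared function's name below is mine, not the project's):

```go
// Both work-item types delegate to one shared reconciliation function instead
// of carrying duplicate copies of the same body.
func (item lcItem) process(ctx context.Context, ctl *controller) (error, bool) {
	return ctl.reconcileFromPolicies(ctx)
}

func (item lppItem) process(ctx context.Context, ctl *controller) (error, bool) {
	return ctl.reconcileFromPolicies(ctx)
}

// reconcileFromPolicies holds the body that was previously duplicated in both
// process methods.
func (ctl *controller) reconcileFromPolicies(ctx context.Context) (error, bool) {
	// ... common body previously in lcItem.process and lppItem.process ...
	return nil, false
}
```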
```go
	// Use label selector to filter nodes
	labelSelector, err := metav1.LabelSelectorAsSelector(&selector.LabelSelector)
	if err != nil {
		return nil, fmt.Errorf("failed to convert label selector: %w", err)
```
This error needs to get to the Errors of the LauncherPopulationPolicyStatus. That can be done in a later PR.
```go
// reconcileAllLaunchers adjusts all launcher pods according to final requirements.
// It returns true if a requeue is needed (deletions were performed or are in progress),
// so that creations happen only after deletions have taken effect.
func (ctl *controller) reconcileAllLaunchers(ctx context.Context, desired map[NodeLauncherKey]DesiredStateEntry) (bool, error) {
```
FYI, everywhere else that is analogous returns the two values in the opposite order, (error, bool). This is not a big deal.
```go
	if needsRequeue {
		anyRequeueNeeded = true
	}
```

Suggested change:

```go
	anyRequeueNeeded = anyRequeueNeeded || needsRequeue
```
```go
// reconcileLaunchersOnSingleNode handles all LauncherConfigs for a single node.
// For each LauncherConfig, it does deletions immediately as they are identified
// and remembers creations called for. If any deletions were performed (or are in
// progress from a previous cycle), it returns true to request a requeue so that
```
This controller should be reacting to Pod deletions and creations (#448). If it did that then it would not need to requeue because a deletion is in progress, since it would get notified when the deletion completes. Can be addressed in a later PR.
```diff
-			return fmt.Errorf("failed to create launchers: %w", err)
+			logger.Error(err, "Failed to get current launchers for config",
+				"node", nodeName, "config", key.LauncherConfigName)
+			continue
```
It might be worth a comment here documenting why it is OK to not do anything about this error. It is because the only error that can get here is from a failure of a lister, which I expect will never actually happen.
I appended the comments.
```go
	// BuildLauncherPodFromTemplate computes a hash of the fully built pod spec
	// and stores it as the LauncherConfigHashAnnotationKey annotation.
	nominalHash := ""
	nominalPod, hashErr := utils.BuildLauncherPodFromTemplate(
```
hashErr is a strange name for this error, coming from a func named "BuildLauncherPodFromTemplate". The problem is not necessarily in the hashing. Looking in that func, I see that the only error that it will return is a complaint about not finding the inference server container in PodSpec.
fine, changed the name.
```go
	if totalCreated > 0 {
		logger.Info("Completed creation of launchers",
			"node", nodeName,
			"created", totalCreated)
```
I would put this count in the unconditional log message below, rather than in a separate and conditional log message.
```go
	Preconditions: &metav1.Preconditions{
		ResourceVersion: &pod.ResourceVersion,
	},
```
I think that this belongs in the Delete calls that happen only when the Pod is unbound, because that condition can change.
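A sketch of what that could look like (the client field and helper name are my placeholders; the Preconditions usage itself is standard client-go):

```go
// deleteUnboundLauncher deletes an unbound launcher Pod with a ResourceVersion
// precondition, so the delete is rejected if the Pod changed (for example, got
// bound) after the controller last read it.
func (ctl *controller) deleteUnboundLauncher(ctx context.Context, pod *corev1.Pod) error {
	return ctl.coreClient.Pods(pod.Namespace).Delete(ctx, pod.Name, metav1.DeleteOptions{
		Preconditions: &metav1.Preconditions{
			ResourceVersion: &pod.ResourceVersion,
		},
	})
}
```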
```go
	nominalHash = nominalPod.Annotations[string(common.LauncherConfigHashAnnotationKey)]
}

// Categorize current pods: separate live unbound current-spec pods from stale/unbound ones
```
I think that the coding here for deletions is more complex than necessary. Rather than compute counts and collections up front, simply compute the target number to delete then iterate through the launchers, deleting launchers to which either criterion applies (unbound and wrong config, unbound and target not yet reached).
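A rough sketch of that single-pass shape (variable and helper names here are mine, not the project's):

```go
	// One pass over this node's launchers for the given LauncherConfig: delete a
	// launcher if it is unbound and either it was built from a stale spec or we
	// still have surplus replicas of the current spec.
	surplus := currentMatchingCount - int(entry.Count)
	for _, pod := range launchers {
		if isBound(pod) {
			continue // bound launchers are never deleted here
		}
		stale := pod.Annotations[string(common.LauncherConfigHashAnnotationKey)] != nominalHash
		if !stale && surplus <= 0 {
			continue
		}
		if err := ctl.deleteLauncher(ctx, pod); err != nil {
			return err
		}
		if !stale {
			surplus--
		}
	}
```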
MikeSpreitzer left a comment
I finished another round of review.
| "config", countRule.LauncherConfigName, "policy", lpp.Name) | ||
| continue | ||
| } | ||
| return nil, fmt.Errorf("failed to get LauncherConfig %s: %w", countRule.LauncherConfigName, err) |
I am confused why we're returning here without having processed:
- The count for each launcher
- The remaining policies
A failure to read a LauncherConfig object for any reason other than its absence is a transient failure of some basic infrastructure --- e.g., inability to even make requests to the apiserver at all. The right reaction to an infrastructure failure when trying to read one particular LauncherConfig is not to ignore that one but rather to try the whole ensemble all over again later.
```go
	// Group by node to process each node separately
	nodeGroups := make(map[string][]NodeLauncherKey)
	for key := range desired {
		nodeGroups[key.NodeName] = append(nodeGroups[key.NodeName], key)
```
What if nodeGroups[key.NodeName] does not exist yet?
In that case the read of nodeGroups[key.NodeName] will evaluate to the value nil of the right slice type, and append([]elttype(nil), something) returns []elttype{something}.
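A tiny runnable illustration of that Go behavior:

```go
package main

import "fmt"

func main() {
	groups := make(map[string][]int)
	// "node-a" has no entry yet, so the index expression yields a nil []int;
	// append on a nil slice allocates and returns a one-element slice.
	groups["node-a"] = append(groups["node-a"], 1)
	fmt.Println(groups) // map[node-a:[1]]
}
```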
| logger.Info("Stale launcher pod already deleted", "pod", pod.Name) | ||
| continue | ||
| } | ||
| return false, fmt.Errorf("failed to delete stale launcher pod %s: %w", pod.Name, err) |
I'm curious, is the design intent here to return as soon as we encounter the first failed deletion with the hopes that the requeueing will capture the other pods in the next iteration?
In other words, why not just:
- Flag that one of the deletions failed
- Delete the rest that can be deleted
- Retry the ones that do exist (but deletion failed in the current iteration) later
Same comment applies for line 465
Besides "not found", the errors that can arise here are concurrency conflicts and infrastructure failures. For both it is better to start over again later.
```diff
 	// Process each LauncherConfig on this node
 	for _, key := range keys {
-		desiredCount := desired[key]
+		entry := desired[key]
```
I can't quite put my finger on it, but it seems like there's duplication of information being passed through keys and desired, especially given the definition of NodeLauncherKey. The only reasonable assumption I could make is if one data structure is more stable than the other between function calls. Otherwise, I think passing in the nodeName and keys seem sufficient here.
I can put my finger on two bits of duplication here.

- Every key in `keys` contains the node name already given separately in `nodeName`. I pointed out this duplication earlier, and @osswangxining said that he finds that it makes the code simpler overall.
- Every `entry` contains a whole `LauncherConfig` object, including its `.Name` --- which is also in the `key`. I pointed out elsewhere that the `entry` only needs to hold the `LauncherConfigSpec`.

If this func is not given the LauncherConfigSpecs then it will have to read the LauncherConfig objects again, which would be bad because (a) it is unnecessary extra work, having just been done by the caller, and (b) requires adding complexity for dealing with the possibility that the second read returns a different result than the first (e.g., was present at first and is gone now).
Refined in the new commit.
```go
// for a (Node, LauncherConfig) pair.
type DesiredStateEntry struct {
	Count          int32
	LauncherConfig *fmav1alpha1.LauncherConfig
```
We do not need the whole LauncherConfig here. All that is really needed is the Spec.
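A sketch of the slimmed-down struct under that suggestion:

```go
// DesiredStateEntry carries only what reconciliation actually needs for a
// (Node, LauncherConfig) pair: the desired count and the current Spec. The
// LauncherConfig's name is already part of the NodeLauncherKey used to index it.
type DesiredStateEntry struct {
	Count              int32
	LauncherConfigSpec fmav1alpha1.LauncherConfigSpec
}
```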
I thought there were some conflicts before, but there is no conflict now, so I continued to refine this file.
I've made improvements based on all the current comments. Please give it another review, thanks. @MikeSpreitzer @aavarghese @waltforme @rubambiza

Previously, when a `LauncherConfig` object was updated:

- the `launcher-populator` controller only processed `LauncherPopulationPolicy` events

This Solution

Added `LauncherConfig` event handling to the controller to ensure:

- on `LauncherConfig` changes, all affected server-requesting pods are not affected
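A hedged sketch of the wiring this could amount to (the informer variable and the zero-argument lcItem{} literal are assumptions; ctl.Queue.Add and the lcItem type appear in the diff above):

```go
// Register LauncherConfig event handlers alongside the existing
// LauncherPopulationPolicy handlers, so spec changes also drive reconciliation.
launcherConfigInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
	AddFunc:    func(obj interface{}) { ctl.Queue.Add(lcItem{}) },
	UpdateFunc: func(oldObj, newObj interface{}) { ctl.Queue.Add(lcItem{}) },
	DeleteFunc: func(obj interface{}) { ctl.Queue.Add(lcItem{}) },
})
```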