
handle LauncherConfig update events#374

Open
osswangxining wants to merge 9 commits intollm-d-incubation:mainfrom
osswangxining:handleLauncherConfigUpdate

Conversation

@osswangxining
Member

Previously, when a LauncherConfig object was updated:

  • The launcher-populator controller only processed LauncherPopulationPolicy events, so LauncherConfig updates triggered no reconciliation

This Solution

Added LauncherConfig event handling to the controller to ensure:

  1. When a LauncherConfig changes, server-requesting pods that are already bound to a launcher are not affected
  2. New launcher pods are created with the updated configuration
  3. Old launcher pods (not bound to any requester) are eventually cleaned up
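As a sketch of the new behavior under stated assumptions (the type and queue names below are illustrative stand-ins, not this repository's actual code): the update handler enqueues a LauncherConfig item only when its spec actually changed, so periodic informer resyncs stay cheap.

```go
package main

import "fmt"

// lcItem and workqueue are illustrative stand-ins for the controller's real
// queue item and workqueue types.
type lcItem struct{ Name string }

type workqueue struct{ items []lcItem }

func (q *workqueue) Add(it lcItem) { q.items = append(q.items, it) }

// onLauncherConfigUpdate enqueues a LauncherConfig only when its spec hash
// changed; resync notifications with an unchanged spec are ignored.
func onLauncherConfigUpdate(q *workqueue, name, oldSpecHash, newSpecHash string) {
	if oldSpecHash == newSpecHash {
		return // periodic resync: nothing to reconcile
	}
	q.Add(lcItem{Name: name})
}

func main() {
	q := &workqueue{}
	onLauncherConfigUpdate(q, "default-config", "h1", "h2")
	fmt.Println(len(q.items)) // one item enqueued
}
```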

@osswangxining requested a review from rubambiza on March 22, 2026 05:13
@github-actions

Unsigned commits detected! Please sign your commits.

For instructions on how to set up GPG/SSH signing and verify your commits, please see GitHub Documentation.

@osswangxining
Member Author

@MikeSpreitzer @aavarghese @waltforme @rubambiza Please help check this logic for handling LauncherConfig update event.

BTW, we also need to handle this LauncherConfig update within duals-pod controller, @waltforme do you think that you would update the code or let me try?

@waltforme
Collaborator

I think the LauncherConfig update handling in question is the one described at https://llm-d.slack.com/archives/C09TNPEFJUD/p1774153845072779?thread_ts=1773867574.189679&cid=C09TNPEFJUD, and also in #201 (comment). Am I right?

If so, IMO the handling can be done by any of the controllers, because the necessary info (LauncherConfig name and hash) is carried on the launcher pods as annotations. Further, I think the launcher populator is the best candidate to do the handling, because "it is currently the only controller deleting launcher Pods and I think the system would be most stable if we keep that responsibility in one controller", as Mike said in #201 (comment).

BTW, we also need to handle this LauncherConfig update within duals-pod controller, @waltforme do you think that you would update the code or let me try?

@MikeSpreitzer
Collaborator

I think that it is important for limiting memory consumption that the launcher deletions on a given Node happen before the launcher creations.
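One way to encode that ordering, as a sketch with hypothetical names rather than the PR's actual code: issue deletions first and request a requeue, so creations happen only on a later cycle once the deletions have freed memory on the node.

```go
package main

import "fmt"

// reconcileNode applies deletions before creations. If any deletion was
// issued, it returns true to request a requeue; creations then happen on a
// later cycle, after the deletions have taken effect on the node.
func reconcileNode(deletions, creations []string, apply func(op, name string)) (requeue bool) {
	if len(deletions) > 0 {
		for _, name := range deletions {
			apply("delete", name)
		}
		return true // create only after deletions have taken effect
	}
	for _, name := range creations {
		apply("create", name)
	}
	return false
}

func main() {
	var ops []string
	requeue := reconcileNode([]string{"old-launcher"}, []string{"new-launcher"},
		func(op, name string) { ops = append(ops, op+" "+name) })
	fmt.Println(requeue, ops) // true [delete old-launcher]
}
```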

Comment thread pkg/controller/launcher-populator/populator.go Outdated
Comment thread pkg/controller/launcher-populator/populator.go Outdated
Comment thread pkg/controller/launcher-populator/populator.go Outdated
@MikeSpreitzer
Collaborator

If I understand this PR correctly, its purpose is to implement A3 for Q3 in #201 (comment) .

I realize now that A3 does not explicitly talk about what should happen while a referenced LauncherConfig does not exist. I suspect that the best choice would be for the launcher population controller to take no action regarding policy for that LauncherConfig.

@osswangxining
Member Author

If I understand this PR correctly, its purpose is to implement A3 for Q3 in #201 (comment) .

I realize now that A3 does not explicitly talk about what should happen while a referenced LauncherConfig does not exist. I suspect that the best choice would be for the launcher population controller to take no action regarding policy for that LauncherConfig.

When the LauncherConfig does not exist, the processing logic returns without requesting reconciliation. Does this address your comment?

// Get the LauncherConfig
lc, err := ctl.lcLister.LauncherConfigs(ctl.namespace).Get(item.Name)
if err != nil {
	if apierrors.IsNotFound(err) {
		logger.Info("LauncherConfig does not exist yet, skipping reconciliation", "name", item.Name)
		return nil, false
	}

@osswangxining
Member Author

I think that it is important for limiting memory consumption that the launcher deletions on a given Node happen before the launcher creations.

You are right. I refactored the logic here for further review.

polish based on the comments

polish based on the comments

add vendor/ into gitignore
@osswangxining force-pushed the handleLauncherConfigUpdate branch from 20e866e to be25996 on March 25, 2026 11:42
Collaborator

@rubambiza left a comment

Left a few clarifying questions.

Comment thread pkg/controller/launcher-populator/populator.go Outdated
Comment thread pkg/controller/launcher-populator/populator.go
Comment thread pkg/controller/launcher-populator/populator.go
@aavarghese
Collaborator

Is this a good time to add some minimal tests (unit tests or function tests) for this launcher config update to our existing test suite?

Comment thread pkg/controller/launcher-populator/populator.go Outdated
Comment thread pkg/controller/launcher-populator/populator.go Outdated
Comment thread .gitignore
Comment thread pkg/controller/launcher-populator/populator.go

// buildDesiredStateFromPolicies builds the desired state map from all policies.
// If filterByConfig is provided, only policies referencing that config are considered.
func (ctl *controller) buildDesiredStateFromPolicies(ctx context.Context, filterByConfig *string) (map[NodeLauncherKey]int32, error) {
Collaborator

There could be other LauncherConfig objects that are missing (either recently deleted or long missing). It is wrong to treat any one of them specially.

Member Author

Sorry, I'm not sure I follow. Could you elaborate a bit more?

Collaborator

This method does not need the filterByConfig *string parameter, because no LauncherConfig needs special treatment. Every LauncherConfig that does not exist should get no processing. Because we are using A3 to Q3 in #201 (comment) , this method needs to read (from the informer's local cache is preferred) every LauncherConfig object to get its current LauncherConfigSpec; these reads will reveal all of the non-existent LauncherConfig objects.

@MikeSpreitzer
Collaborator

MikeSpreitzer commented Mar 30, 2026

@osswangxining, in response to #374 (comment) : I had a few reasons for making that comment.

(A) To document and get explicit agreement on the idea.

(B) To approve having the controller do nothing about the population of launchers associated with a LauncherConfig that does not exist.

(C) To suggest that this is a pervasive concern, not just for the one LauncherConfig whose reference is being processed (as I also noted in a code comment).

@MikeSpreitzer
Collaborator

MikeSpreitzer commented Mar 30, 2026

This PR is still not addressing the main issue in Q3 of #374 (comment) : reacting to a change in the Spec of a LauncherConfig. To do that, a launcher's metadata will need to include not only the name of the LauncherConfig but also a hash of its Spec. The decision of which launchers to delete will need to include a comparison of the hash in the launcher's metadata with the hash of the current Spec of the LauncherConfig (and here is where the general concern with absence of the LauncherConfig object comes in).
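A minimal sketch of the hash mechanism described here, assuming a JSON-serializable spec and a hypothetical annotation key (the real key and hashing scheme are defined in the repository):

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// launcherConfigSpec is a simplified stand-in for the real LauncherConfigSpec.
type launcherConfigSpec struct {
	Image string `json:"image"`
}

// specHash returns a stable hash of the spec's JSON encoding.
func specHash(spec launcherConfigSpec) (string, error) {
	b, err := json.Marshal(spec)
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(b)
	return hex.EncodeToString(sum[:]), nil
}

// isStale reports whether a launcher pod was built from an older spec, by
// comparing the hash in its annotations (hypothetical key) with the hash of
// the LauncherConfig's current spec.
func isStale(annotations map[string]string, currentHash string) bool {
	return annotations["example.io/launcher-config-hash"] != currentHash
}

func main() {
	h, _ := specHash(launcherConfigSpec{Image: "vllm:v2"})
	pod := map[string]string{"example.io/launcher-config-hash": h}
	fmt.Println(isStale(pod, h)) // false: pod matches the current spec
}
```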

@osswangxining
Member Author

This PR is still not addressing the main issue in Q3 of #374 (comment) : reacting to a change in the Spec of a LauncherConfig. To do that, a launcher's metadata will need to include not only the name of the LauncherConfig but also a hash of its Spec. The decision of which launchers to delete will need to include a comparison of the hash in the launcher's metadata with the hash of the current Spec of the LauncherConfig (and here is where the general concern with absence of the LauncherConfig object comes in).

When handling deletions, would it be sufficient to just compare the name? It might not be necessary to compare the spec content.

@osswangxining
Member Author

@MikeSpreitzer @rubambiza @waltforme @aavarghese I rebased, together with fixes for all the comments. Looking forward to your reviews again, thanks a lot.

ctl.Queue.Add(item)
default:
ctl.enqueueLogger.V(5).Info("Notified of add of type of ignored object", "type", fmt.Sprintf("%T", obj))
ctl.enqueueLogger.V(5).Info("Notified of add of ignored object", "type", fmt.Sprintf("%T", obj))
Collaborator

@MikeSpreitzer Apr 22, 2026

The "type" is important to include in the log message, since the object's type is the critical factor here and appears in the key/value pairs. I think that the original is a commonly used English way of stating the intended idea, but perhaps is not strictly correct grammar. Perhaps it is better to be more expansive. Perhaps a better message would be "Notified of add of object of ignored type".

The same wording issue applies in the update and delete handlers.

Member Author

Fine, I will replace it with "Notified of add of object of ignored type". @rubambiza FYI.

Comment on lines +211 to +214
if needsRequeue {
return nil, true
}
return nil, false
Collaborator

These can be simplified to just return nil, needsRequeue.

Member Author

Sure.

return nil, false
}

func (item lcItem) process(ctx context.Context, ctl *controller) (error, bool) {
Collaborator

The body of this method is now exactly the same as the body of lppItem.process. Rather than maintain two copies, I think that it would be better to have both simply call one function that has the common code.

Member Author

Makes sense.

// Use label selector to filter nodes
labelSelector, err := metav1.LabelSelectorAsSelector(&selector.LabelSelector)
if err != nil {
return nil, fmt.Errorf("failed to convert label selector: %w", err)
Collaborator

This error needs to get to the Errors of the LauncherPopulationPolicyStatus. That can be done in a later PR.

// reconcileAllLaunchers adjusts all launcher pods according to final requirements.
// It returns true if a requeue is needed (deletions were performed or are in progress),
// so that creations happen only after deletions have taken effect.
func (ctl *controller) reconcileAllLaunchers(ctx context.Context, desired map[NodeLauncherKey]DesiredStateEntry) (bool, error) {
Collaborator

@MikeSpreitzer Apr 22, 2026

FYI, everywhere else that is analogous returns the two values in the opposite order ((error, bool)). This is not a big deal.

Comment on lines 334 to 336
if needsRequeue {
anyRequeueNeeded = true
}
Collaborator

Suggested change
if needsRequeue {
anyRequeueNeeded = true
}
anyRequeueNeeded = anyRequeueNeeded || needsRequeue

Member Author

Done, thanks.

// reconcileLaunchersOnSingleNode handles all LauncherConfigs for a single node.
// For each LauncherConfig, it does deletions immediately as they are identified
// and remembers creations called for. If any deletions were performed (or are in
// progress from a previous cycle), it returns true to request a requeue so that
Collaborator

@MikeSpreitzer Apr 22, 2026

This controller should be reacting to Pod deletions and creations (#448). If it did that then it would not need to requeue due to a deletion is in progress, because it will get notified when the deletion completes. Can be addressed in a later PR.

return fmt.Errorf("failed to create launchers: %w", err)
logger.Error(err, "Failed to get current launchers for config",
"node", nodeName, "config", key.LauncherConfigName)
continue
Collaborator

It might be worth a comment here documenting why it is OK to not do anything about this error. It is because the only error that can get here is from a failure of a lister, which I expect will never actually happen.

Member Author

Appended the comment.

// BuildLauncherPodFromTemplate computes a hash of the fully built pod spec
// and stores it as the LauncherConfigHashAnnotationKey annotation.
nominalHash := ""
nominalPod, hashErr := utils.BuildLauncherPodFromTemplate(
Collaborator

hashErr is a strange name for this error, coming from a func named "BuildLauncherPodFromTemplate". The problem is not necessarily in the hashing. Looking in that func, I see that the only error that it will return is a complaint about not finding the inference server container in PodSpec.

Member Author

fine, changed the name.

Comment thread pkg/controller/launcher-populator/populator.go Outdated
if totalCreated > 0 {
logger.Info("Completed creation of launchers",
"node", nodeName,
"created", totalCreated)
Collaborator

I would put this count in the unconditional log message below, rather than in a separate and conditional log message.

Member Author

agree, updated.

Comment on lines -383 to -385
Preconditions: &metav1.Preconditions{
ResourceVersion: &pod.ResourceVersion,
},
Collaborator

I think that this belongs in the Delete calls that happen only when the Pod is unbound, because that condition can change.

Member Author

fine

nominalHash = nominalPod.Annotations[string(common.LauncherConfigHashAnnotationKey)]
}

// Categorize current pods: separate live unbound current-spec pods from stale/unbound ones
Collaborator

I think that the coding here for deletions is more complex than necessary. Rather than compute counts and collections up front, simply compute the target number to delete then iterate through the launchers, deleting launchers to which either criterion applies (unbound and wrong config, unbound and target not yet reached).
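The suggested single pass might look roughly like this (invented names; whether bound pods count toward the desired total is an assumption here, not something the thread settles):

```go
package main

import "fmt"

// launcher is a simplified view of a launcher pod: whether a requester is
// bound to it, and the config-spec hash recorded in its annotations.
type launcher struct {
	Name  string
	Bound bool
	Hash  string
}

// selectDeletions walks the launchers once, selecting any unbound pod whose
// hash is stale, and any unbound current-spec pod beyond the desired count.
// Bound pods are never selected; they are assumed here to count toward the
// desired total.
func selectDeletions(pods []launcher, currentHash string, desired int) []string {
	var victims []string
	kept := 0
	for _, p := range pods {
		switch {
		case p.Bound:
			kept++
		case p.Hash != currentHash || kept >= desired:
			victims = append(victims, p.Name)
		default:
			kept++
		}
	}
	return victims
}

func main() {
	pods := []launcher{
		{Name: "stale-1", Bound: false, Hash: "old"},
		{Name: "busy-1", Bound: true, Hash: "old"},
		{Name: "fresh-1", Bound: false, Hash: "new"},
	}
	fmt.Println(selectDeletions(pods, "new", 2)) // [stale-1]
}
```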

Collaborator

@MikeSpreitzer left a comment

I finished another round of review.

"config", countRule.LauncherConfigName, "policy", lpp.Name)
continue
}
return nil, fmt.Errorf("failed to get LauncherConfig %s: %w", countRule.LauncherConfigName, err)
Collaborator

I am confused why we're returning here without having processed:

  • The count for each launcher
  • The remaining policies

Collaborator

@MikeSpreitzer Apr 22, 2026

A failure to read a LauncherConfig object for any reason other than its absence is a transient failure of some basic infrastructure --- e.g., inability to even make requests to the apiserver at all. The right reaction to an infrastructure failure when trying to read one particular LauncherConfig is not to ignore that one but rather to try the whole ensemble all over again later.

Member Author

Makes sense.

// Group by node to process each node separately
nodeGroups := make(map[string][]NodeLauncherKey)
for key := range desired {
nodeGroups[key.NodeName] = append(nodeGroups[key.NodeName], key)
Collaborator

What if nodeGroups[key.NodeName] does not exist yet?

Collaborator

@MikeSpreitzer Apr 22, 2026

In that case the read of nodeGroups[key.NodeName] will evaluate to the value nil of the right slice type, and append([]elttype(nil), something) returns []elttype{something}.
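This behavior is easy to confirm: indexing a missing map key yields the zero value (a nil slice), and append on a nil slice allocates a fresh one.

```go
package main

import "fmt"

// groupByNode appends a key under its node, relying on Go's append-to-nil
// semantics for first-seen nodes: no explicit initialization is needed.
func groupByNode(groups map[string][]string, node, key string) {
	groups[node] = append(groups[node], key)
}

func main() {
	nodeGroups := make(map[string][]string)
	// "node-a" is absent: the read yields a nil []string, and append
	// allocates a new slice holding the first element.
	groupByNode(nodeGroups, "node-a", "cfg-1")
	fmt.Println(nodeGroups["node-a"]) // [cfg-1]
}
```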

Member Author

agree

logger.Info("Stale launcher pod already deleted", "pod", pod.Name)
continue
}
return false, fmt.Errorf("failed to delete stale launcher pod %s: %w", pod.Name, err)
Collaborator

I'm curious, is the design intent here to return as soon as we encounter the first failed deletion with the hopes that the requeueing will capture the other pods in the next iteration?

In other words, why not just:

  • Flag that one of the deletions failed
  • Delete the rest that can be deleted
  • Retry the ones that do exist (but deletion failed in the current iteration) later

Same comment applies for line 465

Collaborator

Besides "not found", the errors that can arise here are concurrency conflicts and infrastructure failures. For both it is better to start over again later.

Member Author

Makes sense.

Comment thread pkg/controller/launcher-populator/populator.go
// Process each LauncherConfig on this node
for _, key := range keys {
desiredCount := desired[key]
entry := desired[key]
Collaborator

I can't quite put my finger on it, but it seems like there's duplication of information being passed through keys and desired, especially given the definition of NodeLauncherKey. The only reasonable assumption I could make is if one data structure is more stable than the other between function calls. Otherwise, I think passing in the nodeName and keys seem sufficient here.

Collaborator

@MikeSpreitzer Apr 22, 2026

I can put my finger on two bits of duplication here.

  1. Every key in keys contains the node name already given separately in nodeName. I pointed out this duplication earlier, and @osswangxining said that he finds that it makes the code simpler overall.
  2. Every entry contains a whole LauncherConfig object, including its .Name --- which is also in the key. I pointed out elsewhere that the entry only needs to hold the LauncherConfigSpec.

If this func is not given the LauncherConfigSpecs then it will have to read the LauncherConfig objects again, which would be bad because (a) it is unnecessary extra work, having just been done by the caller, and (b) requires adding complexity for dealing with the possibility that the second read returns a different result than the first (e.g., was present at first and is gone now).

Member Author

Refined in the new commit.

@MikeSpreitzer
Collaborator

MikeSpreitzer commented Apr 22, 2026

@osswangxining:

I rebase

I see no rebase here. The first commit contributed here is from March 22.


// for a (Node, LauncherConfig) pair.
type DesiredStateEntry struct {
Count int32
LauncherConfig *fmav1alpha1.LauncherConfig
Collaborator

We do not need the whole LauncherConfig here. All that is really needed is the Spec.

Comment thread pkg/controller/launcher-populator/populator.go Outdated
@osswangxining
Member Author

@osswangxining:

I rebase

I see no rebase here. The first commit contributed here is from March 22.

I thought there were some conflicts before, but it turned out there weren't any, so I continued refining this file without a rebase.

@osswangxining
Member Author

I've made improvements based on all the current comments. Please give it another review, thanks. @MikeSpreitzer @aavarghese @waltforme @rubambiza
