handle LauncherConfig update events #374
osswangxining wants to merge 9 commits into llm-d-incubation:main from
Conversation
Unsigned commits detected! Please sign your commits. For instructions on how to set up GPG/SSH signing and verify your commits, please see GitHub Documentation.
@MikeSpreitzer @aavarghese @waltforme @rubambiza Please help check this logic for handling the LauncherConfig update event. BTW, we also need to handle this LauncherConfig update within the duals-pod controller; @waltforme, do you think that you would update the code, or shall I try?
Force-pushed from 8e5bcad to 43186c1
I think the handling of LauncherConfig updates in question is described in https://llm-d.slack.com/archives/C09TNPEFJUD/p1774153845072779?thread_ts=1773867574.189679&cid=C09TNPEFJUD and also in #201 (comment). Am I right? If so, IMO the handling can be done by any of the controllers, because the necessary info for it (the LauncherConfig name and hash) is contained within the launcher pods as annotations. Further, I think the launcher populator is the best candidate to do the handling, because "it is currently the only controller deleting launcher Pods and I think the system would be most stable if we keep that responsibility in one controller", as Mike said in #201 (comment).
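As a rough, hedged illustration of that point (the annotation key constants below are placeholders I invented, not the project's actual names), any controller could recover the needed info directly from a launcher Pod:

```go
package launcher

import corev1 "k8s.io/api/core/v1"

// Placeholder annotation keys for illustration only; the real project defines
// its own constants (e.g. common.LauncherConfigHashAnnotationKey).
const (
	launcherConfigNameAnnotation = "example.io/launcher-config-name"
	launcherConfigHashAnnotation = "example.io/launcher-config-hash"
)

// launcherConfigRef reads back the LauncherConfig name and spec hash that were
// stamped onto a launcher Pod when it was created.
func launcherConfigRef(pod *corev1.Pod) (name, hash string, ok bool) {
	name, nameOK := pod.Annotations[launcherConfigNameAnnotation]
	hash, hashOK := pod.Annotations[launcherConfigHashAnnotation]
	return name, hash, nameOK && hashOK
}
```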
I think that it is important for limiting memory consumption that the launcher deletions on a given Node happen before the launcher creations.
If I understand this PR correctly, its purpose is to implement A3 for Q3 in #201 (comment). I realize now that A3 does not explicitly talk about what should happen while a referenced LauncherConfig does not exist. I suspect that the best choice would be for the launcher population controller to take no action regarding policy for that LauncherConfig.
When the LauncherConfig does not exist, the processing logic will return without doing any reconciliation. Is this what you mean at the // Get the LauncherConfig comment?
You are right. I refactored the logic here for further review.
polish based on the comments; add vendor/ into gitignore
Force-pushed from 20e866e to be25996
rubambiza left a comment
Left a few clarifying questions.
Is this a good time to add some minimal tests (unit tests or function tests) for this launcher config update to our existing test suite?
```go
// buildDesiredStateFromPolicies builds the desired state map from all policies.
// If filterByConfig is provided, only policies referencing that config are considered.
func (ctl *controller) buildDesiredStateFromPolicies(ctx context.Context, filterByConfig *string) (map[NodeLauncherKey]int32, error) {
```
There could be other LauncherConfig objects that are missing (either recently deleted or long missing). It is wrong to treat any one of them specially.
Sorry, I'm not sure I follow. Could you elaborate a bit more?
This method does not need the filterByConfig *string parameter, because no LauncherConfig needs special treatment. Every LauncherConfig that does not exist should get no processing. Because we are using A3 for Q3 in #201 (comment), this method needs to read (preferably from the informer's local cache) every LauncherConfig object to get its current LauncherConfigSpec; these reads will reveal all of the non-existent LauncherConfig objects.
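For concreteness, a minimal sketch of that shape, assuming the controller holds a generated lister for LauncherConfig in a field named launcherConfigLister (the field and method names here are placeholders, not the project's):

```go
// currentLauncherConfigSpecs reads every LauncherConfig from the informer's
// local cache and returns the current Spec for each one that exists. Any name
// absent from the returned map refers to a LauncherConfig that does not exist
// right now and should get no processing.
func (ctl *controller) currentLauncherConfigSpecs() (map[string]fmav1alpha1.LauncherConfigSpec, error) {
	all, err := ctl.launcherConfigLister.List(labels.Everything()) // assumed lister field
	if err != nil {
		return nil, err
	}
	specs := make(map[string]fmav1alpha1.LauncherConfigSpec, len(all))
	for _, lc := range all {
		specs[lc.Name] = lc.Spec
	}
	return specs, nil
}
```

(labels here is k8s.io/apimachinery/pkg/labels; fmav1alpha1 is the API package already used elsewhere in this diff.)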
@osswangxining, in response to #374 (comment): I had a few reasons for making that comment. (A) To document and get explicit agreement on the idea. (B) To approve having the controller do nothing about the population of launchers associated with a LauncherConfig that does not exist. (C) To suggest that this is a pervasive concern, not just for the one LauncherConfig whose reference is being processed (as I also noted in a code comment).
This PR is still not addressing the main issue in Q3 of #374 (comment): reacting to a change in the Spec of a LauncherConfig. To do that, a launcher's metadata will need to include not only the name of the LauncherConfig but also a hash of its Spec. The decision of which launchers to delete will need to include a comparison of the hash in the launcher's metadata with the hash of the current Spec of the LauncherConfig (and here is where the general concern with absence of the LauncherConfig object comes in).
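A hedged sketch of that comparison (the name-annotation key below is a placeholder; common.LauncherConfigHashAnnotationKey is the constant that appears elsewhere in this diff):

```go
// isStaleLauncher sketches the proposed check: a launcher is stale when the hash
// recorded on it no longer matches the hash derived from the LauncherConfig's
// current Spec. nominalHashes maps LauncherConfig name to that current hash; a
// missing entry means the LauncherConfig does not exist right now, in which case
// the controller takes no action for that launcher.
func isStaleLauncher(pod *corev1.Pod, nominalHashes map[string]string) bool {
	name := pod.Annotations["example.io/launcher-config-name"] // placeholder key
	recorded := pod.Annotations[string(common.LauncherConfigHashAnnotationKey)]
	current, exists := nominalHashes[name]
	if !exists {
		return false
	}
	return recorded != current
}
```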
When handling deletions, would it be sufficient to just compare the name? It might not be necessary to compare the spec content.
@MikeSpreitzer @rubambiza @waltforme @aavarghese I rebased together with fixes for all the comments. Looking forward to your reviews again, thanks a lot.
```diff
 	ctl.Queue.Add(item)
 default:
-	ctl.enqueueLogger.V(5).Info("Notified of add of type of ignored object", "type", fmt.Sprintf("%T", obj))
+	ctl.enqueueLogger.V(5).Info("Notified of add of ignored object", "type", fmt.Sprintf("%T", obj))
```
The "type" is important to include in the log message, since the object's type is the critical factor here and appears in the key/value pairs. I think that the original is a commonly used English way of stating the intended idea, but perhaps is not strictly correct grammar. Perhaps it is better to be more expansive. Perhaps a better message would be "Notified of add of object of ignored type".
The same wording issue applies in the update and delete handlers.
Fine, I will replace with "Notified of add of object of ignored type". @rubambiza FYI.
```go
	if needsRequeue {
		return nil, true
	}
	return nil, false
```
These can be simplified to just return nil, needsRequeue.
```go
	return nil, false
}

func (item lcItem) process(ctx context.Context, ctl *controller) (error, bool) {
```
The body of this method is now exactly the same as the body of lppItem.process. Rather than maintain two copies, I think that it would be better to have both simply call one function that has the common code.
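A sketch of that refactor (the shared function's name below is mine, not the project's):

```go
// Both work-item types delegate to one shared reconciliation function instead
// of carrying duplicate copies of the same body.
func (item lcItem) process(ctx context.Context, ctl *controller) (error, bool) {
	return ctl.reconcileFromPolicies(ctx)
}

func (item lppItem) process(ctx context.Context, ctl *controller) (error, bool) {
	return ctl.reconcileFromPolicies(ctx)
}

// reconcileFromPolicies holds the body that was previously duplicated in both
// process methods.
func (ctl *controller) reconcileFromPolicies(ctx context.Context) (error, bool) {
	// ... common body previously in lcItem.process and lppItem.process ...
	return nil, false
}
```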
```go
	// Use label selector to filter nodes
	labelSelector, err := metav1.LabelSelectorAsSelector(&selector.LabelSelector)
	if err != nil {
		return nil, fmt.Errorf("failed to convert label selector: %w", err)
```
This error needs to get to the Errors of the LauncherPopulationPolicyStatus. That can be done in a later PR.
```go
// reconcileAllLaunchers adjusts all launcher pods according to final requirements.
// It returns true if a requeue is needed (deletions were performed or are in progress),
// so that creations happen only after deletions have taken effect.
func (ctl *controller) reconcileAllLaunchers(ctx context.Context, desired map[NodeLauncherKey]DesiredStateEntry) (bool, error) {
```
FYI, everywhere else that is analogous returns the two values in the opposite order, (error, bool). This is not a big deal.
```go
	if needsRequeue {
		anyRequeueNeeded = true
	}
```

Suggested change:

```go
	anyRequeueNeeded = anyRequeueNeeded || needsRequeue
```
```go
// reconcileLaunchersOnSingleNode handles all LauncherConfigs for a single node.
// For each LauncherConfig, it does deletions immediately as they are identified
// and remembers creations called for. If any deletions were performed (or are in
// progress from a previous cycle), it returns true to request a requeue so that
```
This controller should be reacting to Pod deletions and creations (#448). If it did that then it would not need to requeue because a deletion is in progress, since it would get notified when the deletion completes. Can be addressed in a later PR.
```diff
-			return fmt.Errorf("failed to create launchers: %w", err)
+			logger.Error(err, "Failed to get current launchers for config",
+				"node", nodeName, "config", key.LauncherConfigName)
+			continue
```
It might be worth a comment here documenting why it is OK to not do anything about this error. It is because the only error that can get here is from a failure of a lister, which I expect will never actually happen.
I appended the comments.
```go
	// BuildLauncherPodFromTemplate computes a hash of the fully built pod spec
	// and stores it as the LauncherConfigHashAnnotationKey annotation.
	nominalHash := ""
	nominalPod, hashErr := utils.BuildLauncherPodFromTemplate(
```
hashErr is a strange name for this error, coming from a func named "BuildLauncherPodFromTemplate". The problem is not necessarily in the hashing. Looking in that func, I see that the only error that it will return is a complaint about not finding the inference server container in PodSpec.
fine, changed the name.
```go
	if totalCreated > 0 {
		logger.Info("Completed creation of launchers",
			"node", nodeName,
			"created", totalCreated)
```
I would put this count in the unconditional log message below, rather than in a separate and conditional log message.
```go
	Preconditions: &metav1.Preconditions{
		ResourceVersion: &pod.ResourceVersion,
	},
```
I think that this belongs in the Delete calls that happen only when the Pod is unbound, because that condition can change.
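A sketch of what that could look like (the client field and helper name are my placeholders; the Preconditions usage itself is standard client-go):

```go
// deleteUnboundLauncher deletes an unbound launcher Pod with a ResourceVersion
// precondition, so the delete is rejected if the Pod changed (for example, got
// bound) after the controller last read it.
func (ctl *controller) deleteUnboundLauncher(ctx context.Context, pod *corev1.Pod) error {
	return ctl.coreClient.Pods(pod.Namespace).Delete(ctx, pod.Name, metav1.DeleteOptions{
		Preconditions: &metav1.Preconditions{
			ResourceVersion: &pod.ResourceVersion,
		},
	})
}
```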
```go
	nominalHash = nominalPod.Annotations[string(common.LauncherConfigHashAnnotationKey)]
}

// Categorize current pods: separate live unbound current-spec pods from stale/unbound ones
```
I think that the coding here for deletions is more complex than necessary. Rather than compute counts and collections up front, simply compute the target number to delete then iterate through the launchers, deleting launchers to which either criterion applies (unbound and wrong config, unbound and target not yet reached).
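A rough sketch of that single-pass shape (variable and helper names here are mine, not the project's):

```go
	// One pass over this node's launchers for the given LauncherConfig: delete a
	// launcher if it is unbound and either it was built from a stale spec or we
	// still have surplus replicas of the current spec.
	surplus := currentMatchingCount - int(entry.Count)
	for _, pod := range launchers {
		if isBound(pod) {
			continue // bound launchers are never deleted here
		}
		stale := pod.Annotations[string(common.LauncherConfigHashAnnotationKey)] != nominalHash
		if !stale && surplus <= 0 {
			continue
		}
		if err := ctl.deleteLauncher(ctx, pod); err != nil {
			return err
		}
		if !stale {
			surplus--
		}
	}
```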
MikeSpreitzer left a comment
I finished another round of review.
| "config", countRule.LauncherConfigName, "policy", lpp.Name) | ||
| continue | ||
| } | ||
| return nil, fmt.Errorf("failed to get LauncherConfig %s: %w", countRule.LauncherConfigName, err) |
I am confused why we're returning here without having processed:
- The count for each launcher
- The remaining policies
A failure to read a LauncherConfig object for any reason other than its absence is a transient failure of some basic infrastructure --- e.g., inability to even make requests to the apiserver at all. The right reaction to an infrastructure failure when trying to read one particular LauncherConfig is not to ignore that one but rather to try the whole ensemble all over again later.
```go
	// Group by node to process each node separately
	nodeGroups := make(map[string][]NodeLauncherKey)
	for key := range desired {
		nodeGroups[key.NodeName] = append(nodeGroups[key.NodeName], key)
```
What if nodeGroups[key.NodeName] does not exist yet?
In that case the read of nodeGroups[key.NodeName] will evaluate to the value nil of the right slice type, and append([]elttype(nil), something) returns []elttype{something}.
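A tiny runnable illustration of that Go behavior:

```go
package main

import "fmt"

func main() {
	groups := make(map[string][]int)
	// "node-a" has no entry yet, so the index expression yields a nil []int;
	// append on a nil slice allocates and returns a one-element slice.
	groups["node-a"] = append(groups["node-a"], 1)
	fmt.Println(groups) // map[node-a:[1]]
}
```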
| logger.Info("Stale launcher pod already deleted", "pod", pod.Name) | ||
| continue | ||
| } | ||
| return false, fmt.Errorf("failed to delete stale launcher pod %s: %w", pod.Name, err) |
I'm curious, is the design intent here to return as soon as we encounter the first failed deletion with the hopes that the requeueing will capture the other pods in the next iteration?
In other words, why not just:
- Flag that one of the deletions failed
- Delete the rest that can be deleted
- Retry the ones that do exist (but deletion failed in the current iteration) later
Same comment applies for line 465
Besides "not found", the errors that can arise here are concurrency conflicts and infrastructure failures. For both it is better to start over again later.
```diff
 	// Process each LauncherConfig on this node
 	for _, key := range keys {
-		desiredCount := desired[key]
+		entry := desired[key]
```
I can't quite put my finger on it, but it seems like there's duplication of information being passed through keys and desired, especially given the definition of NodeLauncherKey. The only reasonable assumption I could make is if one data structure is more stable than the other between function calls. Otherwise, I think passing in the nodeName and keys seem sufficient here.
I can put my finger on two bits of duplication here.

- Every key in `keys` contains the node name already given separately in `nodeName`. I pointed out this duplication earlier, and @osswangxining said that he finds that it makes the code simpler overall.
- Every `entry` contains a whole `LauncherConfig` object, including its `.Name` --- which is also in the `key`. I pointed out elsewhere that the `entry` only needs to hold the `LauncherConfigSpec`.

If this func is not given the LauncherConfigSpecs then it will have to read the LauncherConfig objects again, which would be bad because (a) it is unnecessary extra work, having just been done by the caller, and (b) requires adding complexity for dealing with the possibility that the second read returns a different result than the first (e.g., was present at first and is gone now).
Refined in the new commit.
```go
// for a (Node, LauncherConfig) pair.
type DesiredStateEntry struct {
	Count          int32
	LauncherConfig *fmav1alpha1.LauncherConfig
```
We do not need the whole LauncherConfig here. All that is really needed is the Spec.
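A sketch of the slimmed-down struct under that suggestion:

```go
// DesiredStateEntry carries only what reconciliation actually needs for a
// (Node, LauncherConfig) pair: the desired count and the current Spec. The
// LauncherConfig's name is already part of the NodeLauncherKey used to index it.
type DesiredStateEntry struct {
	Count              int32
	LauncherConfigSpec fmav1alpha1.LauncherConfigSpec
}
```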
I thought there were some conflicts before, but there is no conflict now, so I continued to refine this file.
I've made improvements based on all the current comments. Please give it another review, thanks. @MikeSpreitzer @aavarghese @waltforme @rubambiza

Previously, when a `LauncherConfig` object was updated:

- the `launcher-populator` controller only processed `LauncherPopulationPolicy` events

This Solution

Added `LauncherConfig` event handling to the controller to ensure:

- on `LauncherConfig` changes, all affected server-requesting pods are not affected
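A hedged sketch of the wiring this could amount to (the informer variable and the zero-argument lcItem{} literal are assumptions; ctl.Queue.Add and the lcItem type appear in the diff above):

```go
// Register LauncherConfig event handlers alongside the existing
// LauncherPopulationPolicy handlers, so spec changes also drive reconciliation.
launcherConfigInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
	AddFunc:    func(obj interface{}) { ctl.Queue.Add(lcItem{}) },
	UpdateFunc: func(oldObj, newObj interface{}) { ctl.Queue.Add(lcItem{}) },
	DeleteFunc: func(obj interface{}) { ctl.Queue.Add(lcItem{}) },
})
```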