
Consider port when selecting launcher #396

Merged

waltforme merged 3 commits into llm-d-incubation:main from waltforme:port-collision
Apr 1, 2026

Conversation

@waltforme (Collaborator)

As shown in the title.

Copilot AI review requested due to automatic review settings April 1, 2026 17:27
Copilot AI (Contributor) left a comment

Pull request overview

This PR updates the dual-pods controller’s launcher selection logic to account for vLLM instance port usage, and extends E2E coverage to validate behavior in a same-node port-collision scenario (with platform-aware skipping on OpenShift).

Changes:

  • Pass the desired vLLM port into launcher selection and skip launchers that already have an instance using that port.
  • Add an E2E test case that forces a same-node requester “collision” and asserts a new launcher is created.
  • Introduce E2E_PLATFORM to control which launcher-based E2E cases run (defaulting to OpenShift, with Kind explicitly enabled by the Kind runner).

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

Changed files:

pkg/controller/dual-pods/inference-server.go: Incorporates the desired port into launcher selection and adds a helper to parse instance ports from launcher-reported options.
test/e2e/test-cases.sh: Adds E2E_PLATFORM and a "same-node port collision" E2E scenario; skips the remaining cases on OpenShift.
test/e2e/run-launcher-based.sh: Sets E2E_PLATFORM=kind so the full launcher-based suite runs in Kind.

Comment thread pkg/controller/dual-pods/inference-server.go Outdated
Comment thread pkg/controller/dual-pods/inference-server.go
@aavarghese (Collaborator)

Leaving an FYI: This approach will add significantly to FMA's actuation metric when a port conflict happens, especially since the launcher image is ~20 GB. But this fix is temporary: #363 is coming soon, and then this won't be a concern.

return &vllmCfg, nominalHash, nil
}

func getVLLMInstancePort(options string) (int32, error) {
@MikeSpreitzer (Collaborator), Apr 1, 2026

Ick. Parsing Python command-line options cannot be done 100% reliably without running the Python code that defines the options. Using #397 would be better.

This parsing is good enough for now. We can switch to using #397 in a later PR.

Comment thread test/e2e/test-cases.sh
if [ "$FMA_NAMESPACE" != debug ]; then
echo "Skipping the remaining test cases because of Issues 387 and 388" >&2
# TODO: stop skipping once Issues 387 is resolved
if [ "$E2E_PLATFORM" = "openshift" ]; then
Collaborator

Doesn't this PR resolve Issue #387?

@waltforme (Collaborator, Author)

No. Another PR will address the GPU allocation on OpenShift using one of the proposed ideas in #387.

Comment thread test/e2e/test-cases.sh
collision_inst="${instlb}-collision"
collision_rs="my-request-collision-$instlb"

kubectl get rs "$rslb" -n "$NS" -o json \
@MikeSpreitzer (Collaborator), Apr 1, 2026

This hackery is pretty complicated. It would be better if there were a simpler technique.

I have been thinking that mkobjs-openshift.sh should be factored or generalized so that this and the other hackery is not needed.

Why not just scale the existing ReplicaSet to 2?

@waltforme (Collaborator, Author)

Scaling to 2 was the first thing I tried, and it failed. It makes the rest of the test nondeterministic. For example, after this test case scales the ReplicaSet back down to 1, which pod stays (and which gets deleted) is uncertain.

Comment thread test/e2e/test-cases.sh
# ---------------------------------------------------------------------------
# Same-Node Port Collision
# ---------------------------------------------------------------------------

Collaborator

This test case needs comments. Either or both of (a) an outline for the test case as a whole and (b) comments on individual steps.

The outline for the whole test case might be something like the following.

Create a second ReplicaSet, specifying the same InferenceServerConfig. With both ReplicaSets scaled to 1, expect that each requester is bound with a distinct launcher. Delete the second ReplicaSet when done.

Comments on individual steps might be like in other test cases.

Comment thread test/e2e/test-cases.sh
kubectl delete rs "$collision_rs" -n "$NS" --wait=true
expect "kubectl get pods -n $NS -o name -l app=dp-example,instance=$collision_inst | wc -l | grep -w 0"
kubectl delete pod "$collision_launcher" -n "$NS" --wait=true
expect '! kubectl get pods -n '"$NS"' -o name | grep -qw pod/'"$collision_launcher"
Collaborator

Wouldn't it be simpler to check for absence by expecting failure of kubectl get pod ...?

Comment thread test/e2e/test-cases.sh

kubectl delete rs "$collision_rs" -n "$NS" --wait=true
expect "kubectl get pods -n $NS -o name -l app=dp-example,instance=$collision_inst | wc -l | grep -w 0"
kubectl delete pod "$collision_launcher" -n "$NS" --wait=true
Collaborator

Since the LPP object specifies 1 launcher and the colliding requester is gone, we can and should expect that the launcher population controller will delete this launcher.

@MikeSpreitzer (Collaborator) left a comment

This can use further work, possibly in later PR(s).
This can be merged as progress.

Comment thread test/e2e/test-cases.sh

intro_case Same-Node Port Collision Creates New Launcher

collision_inst="${instlb}-collision"
Collaborator

Add this, collision_req, and collision_launcher to the things dumped in the EXIT trap.

@waltforme waltforme merged commit 976a361 into llm-d-incubation:main Apr 1, 2026
24 checks passed
@waltforme (Collaborator, Author)

Merging; will continue this work in the next PR(s).
