
Consider port when selecting launcher #396

Merged

waltforme merged 3 commits into llm-d-incubation:main from waltforme:port-collision
Apr 1, 2026

Conversation

@waltforme (Collaborator)

As shown in the title.

Copilot AI review requested due to automatic review settings April 1, 2026 17:27
Copilot AI (Contributor) left a comment

Pull request overview

This PR updates the dual-pods controller’s launcher selection logic to account for vLLM instance port usage, and extends E2E coverage to validate behavior in a same-node port-collision scenario (with platform-aware skipping on OpenShift).

Changes:

  • Pass the desired vLLM port into launcher selection and skip launchers that already have an instance using that port.
  • Add an E2E test case that forces a same-node requester “collision” and asserts a new launcher is created.
  • Introduce E2E_PLATFORM to control which launcher-based E2E cases run (defaulting to OpenShift, with Kind explicitly enabled by the Kind runner).

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

Changed files:

pkg/controller/dual-pods/inference-server.go: Incorporates the desired port into launcher selection and adds a helper to parse instance ports from launcher-reported options.
test/e2e/test-cases.sh: Adds E2E_PLATFORM and a "same-node port collision" E2E scenario; skips the remaining cases on OpenShift.
test/e2e/run-launcher-based.sh: Sets E2E_PLATFORM=kind so the full launcher-based suite runs in Kind.

Comment thread pkg/controller/dual-pods/inference-server.go Outdated
Comment thread pkg/controller/dual-pods/inference-server.go
@aavarghese (Collaborator)

Leaving an FYI: This approach will add significantly to FMA's actuation metric when a port conflict happens, especially since the launcher image is ~20 GB. But this fix is temporary: #363 is coming soon, and then this won't be a concern.

return &vllmCfg, nominalHash, nil
}

func getVLLMInstancePort(options string) (int32, error) {
@MikeSpreitzer (Collaborator), Apr 1, 2026

Ick. Parsing Python command-line options cannot be done 100% reliably without running the Python code that defines the options. Using #397 would be better.

This parsing is good enough for now. We can switch to using #397 in a later PR.

Comment thread test/e2e/test-cases.sh
if [ "$FMA_NAMESPACE" != debug ]; then
echo "Skipping the remaining test cases because of Issues 387 and 388" >&2
# TODO: stop skipping once Issues 387 is resolved
if [ "$E2E_PLATFORM" = "openshift" ]; then
Collaborator

Doesn't this PR resolve Issue #387?

@waltforme (Collaborator, Author)

No. Another PR will address the GPU allocation on OpenShift using one of the proposed ideas in #387.

Comment thread test/e2e/test-cases.sh
collision_inst="${instlb}-collision"
collision_rs="my-request-collision-$instlb"

kubectl get rs "$rslb" -n "$NS" -o json \
@MikeSpreitzer (Collaborator), Apr 1, 2026

This hackery is pretty complicated. It would be better if there were a simpler technique.

I have been thinking that mkobjs-openshift.sh should be factored or generalized so that this and the other hackery is not needed.

Why not just scale the existing ReplicaSet to 2?

@waltforme (Collaborator, Author)

Scaling to 2 was the first thing I tried, and it failed. It makes the rest of the test nondeterministic. For example, after this test case scales the ReplicaSet back down to 1, which pod stays (and which gets deleted) is uncertain.

Comment thread test/e2e/test-cases.sh
# ---------------------------------------------------------------------------
# Same-Node Port Collision
# ---------------------------------------------------------------------------

Collaborator

This test case needs comments. Either or both of (a) an outline for the test case as a whole and (b) comments on individual steps.

The outline for the whole test case might be something like the following.

Create a second ReplicaSet, specifying the same InferenceServerConfig. With both ReplicaSets scaled to 1, expect that each requester is bound with a distinct launcher. Delete the second ReplicaSet when done.

Comments on individual steps might be like in other test cases.

Comment thread test/e2e/test-cases.sh
kubectl delete rs "$collision_rs" -n "$NS" --wait=true
expect "kubectl get pods -n $NS -o name -l app=dp-example,instance=$collision_inst | wc -l | grep -w 0"
kubectl delete pod "$collision_launcher" -n "$NS" --wait=true
expect '! kubectl get pods -n '"$NS"' -o name | grep -qw pod/'"$collision_launcher"
Collaborator

Wouldn't it be simpler to check for absence by expecting failure of kubectl get pod ...?

Comment thread test/e2e/test-cases.sh

kubectl delete rs "$collision_rs" -n "$NS" --wait=true
expect "kubectl get pods -n $NS -o name -l app=dp-example,instance=$collision_inst | wc -l | grep -w 0"
kubectl delete pod "$collision_launcher" -n "$NS" --wait=true
Collaborator

Since the LPP object specifies 1 launcher and the colliding requester is gone, we can and should expect that the launcher population controller will delete this launcher.

@MikeSpreitzer (Collaborator) left a comment

This can use further work, possibly in later PR(s).
This can be merged as progress.

Comment thread test/e2e/test-cases.sh

intro_case Same-Node Port Collision Creates New Launcher

collision_inst="${instlb}-collision"
Collaborator

Add this, collision_req, and collision_launcher to the things dumped in the EXIT trap.

@waltforme waltforme merged commit 976a361 into llm-d-incubation:main Apr 1, 2026
24 checks passed
@waltforme (Collaborator, Author)

Merging; will continue this work in the next PR(s).
