Consider port when selecting launcher #396
Pull request overview
This PR updates the dual-pods controller’s launcher selection logic to account for vLLM instance port usage, and extends E2E coverage to validate behavior in a same-node port-collision scenario (with platform-aware skipping on OpenShift).
Changes:
- Pass the desired vLLM port into launcher selection and skip launchers that already have an instance using that port.
- Add an E2E test case that forces a same-node requester “collision” and asserts a new launcher is created.
- Introduce `E2E_PLATFORM` to control which launcher-based E2E cases run (defaulting to OpenShift, with Kind explicitly enabled by the Kind runner).
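The port-aware selection in the first bullet can be sketched roughly as follows. Note this is an illustration only: `Launcher`, `UsedPorts`, and `pickLauncher` are hypothetical names, not the dual-pods controller's actual types or API.

```go
package main

import "fmt"

// Launcher is a hypothetical stand-in for the controller's view of a
// launcher pod and the ports its vLLM instances already occupy.
type Launcher struct {
	Name      string
	UsedPorts map[int32]bool
}

// pickLauncher returns the first launcher with no instance already
// listening on desiredPort, or nil when every launcher collides
// (in which case the caller creates a new launcher).
func pickLauncher(launchers []Launcher, desiredPort int32) *Launcher {
	for i := range launchers {
		if !launchers[i].UsedPorts[desiredPort] {
			return &launchers[i]
		}
	}
	return nil
}

func main() {
	launchers := []Launcher{
		{Name: "launcher-a", UsedPorts: map[int32]bool{8000: true}},
		{Name: "launcher-b"},
	}
	// launcher-a already serves port 8000, so launcher-b is chosen.
	if l := pickLauncher(launchers, 8000); l != nil {
		fmt.Println(l.Name) // prints launcher-b
	} else {
		fmt.Println("create new launcher")
	}
}
```

This is the shape that makes the same-node collision test below meaningful: when every existing launcher on the node already uses the desired port, selection falls through and a new launcher is created.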
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| pkg/controller/dual-pods/inference-server.go | Incorporates desired port into launcher selection and adds a helper to parse instance ports from launcher-reported options. |
| test/e2e/test-cases.sh | Adds `E2E_PLATFORM` and a "same-node port collision" E2E scenario; skips remaining cases on OpenShift. |
| test/e2e/run-launcher-based.sh | Sets `E2E_PLATFORM=kind` so the full launcher-based suite runs in Kind. |
Leaving an FYI: this approach will add significantly to FMA's actuation metric when a port conflict happens, especially since the launcher image is ~20 GB. But this fix is temporary: #363 is coming soon, and then this won't be a concern.
```go
	return &vllmCfg, nominalHash, nil
}

func getVLLMInstancePort(options string) (int32, error) {
```
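A plausible body for this helper, sketched under the assumption that the launcher reports options as a flat flag string and that vLLM's default port 8000 applies when no `--port` flag is present. This is a sketch of the technique, not the PR's actual implementation.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// getVLLMInstancePort extracts the --port value from a launcher-reported
// options string such as "--model m --port 8001". It handles both the
// "--port 8001" and "--port=8001" forms and falls back to vLLM's default
// of 8000 when the flag is absent.
func getVLLMInstancePort(options string) (int32, error) {
	fields := strings.Fields(options)
	for i, f := range fields {
		var raw string
		switch {
		case f == "--port":
			if i+1 >= len(fields) {
				return 0, fmt.Errorf("--port flag has no value in %q", options)
			}
			raw = fields[i+1]
		case strings.HasPrefix(f, "--port="):
			raw = strings.TrimPrefix(f, "--port=")
		default:
			continue
		}
		p, err := strconv.ParseInt(raw, 10, 32)
		if err != nil {
			return 0, fmt.Errorf("parsing port %q: %w", raw, err)
		}
		return int32(p), nil
	}
	return 8000, nil // vLLM's default port
}

func main() {
	p, _ := getVLLMInstancePort("--model m --port 8001")
	fmt.Println(p) // prints 8001
}
```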
```shell
if [ "$FMA_NAMESPACE" != debug ]; then
    echo "Skipping the remaining test cases because of Issues 387 and 388" >&2
    # TODO: stop skipping once Issue 387 is resolved
if [ "$E2E_PLATFORM" = "openshift" ]; then
```
```shell
collision_inst="${instlb}-collision"
collision_rs="my-request-collision-$instlb"

kubectl get rs "$rslb" -n "$NS" -o json \
```
This hackery is pretty complicated. It would be better if there were a simpler technique.
I have been thinking that mkobjs-openshift.sh should be factored or generalized so that this and the other hackery are not needed.
Why not just scale the existing ReplicaSet to 2?
Scaling to 2 was the first thing tried, and it failed: it makes the rest of the test nondeterministic. For example, after this test case scales the ReplicaSet back to 1, which pod stays (and which gets deleted) is uncertain.
```shell
# ---------------------------------------------------------------------------
# Same-Node Port Collision
# ---------------------------------------------------------------------------
```
This test case needs comments. Either or both of (a) an outline for the test case as a whole and (b) comments on individual steps.
The outline for the whole test case might be something like the following.
Create a second ReplicaSet, specifying the same InferenceServerConfig. With both ReplicaSets scaled to 1, expect that each requester is bound with a distinct launcher. Delete the second ReplicaSet when done.
Comments on individual steps might be like those in the other test cases.
```shell
kubectl delete rs "$collision_rs" -n "$NS" --wait=true
expect "kubectl get pods -n $NS -o name -l app=dp-example,instance=$collision_inst | wc -l | grep -w 0"
kubectl delete pod "$collision_launcher" -n "$NS" --wait=true
expect '! kubectl get pods -n '"$NS"' -o name | grep -qw pod/'"$collision_launcher"
```
Wouldn't it be simpler to check for absence by expecting failure of `kubectl get pod ...`?
```shell
kubectl delete rs "$collision_rs" -n "$NS" --wait=true
expect "kubectl get pods -n $NS -o name -l app=dp-example,instance=$collision_inst | wc -l | grep -w 0"
kubectl delete pod "$collision_launcher" -n "$NS" --wait=true
```
Since the LPP object specifies 1 launcher and the colliding requester is gone, we can and should expect that the launcher population controller will delete this launcher.
MikeSpreitzer left a comment:
This can use further work, possibly in later PR(s).
This can be merged as progress.
```shell
intro_case Same-Node Port Collision Creates New Launcher

collision_inst="${instlb}-collision"
```
Add this, `collision_req`, and `collision_launcher` to the things dumped in the EXIT trap.
Merging; will continue this work in the next PR(s).
As shown in the title.