Conversation
Signed-off-by: Jun Duan <jun.duan.phd@outlook.com>
Pull request overview
Updates the OpenShift E2E GitHub Actions workflow to avoid blocking on all launcher pods becoming Ready in shared clusters where some GPU nodes may be unschedulable/ineligible (e.g., due to taints), improving CI reliability while launcher scheduling behavior evolves.
Changes:
- Replaces `kubectl wait ... --for=condition=Ready` on all launcher pods with a polling loop that proceeds once a minimum number of launcher pods are Ready.
- Adds logging/error reporting to show how many launcher pods became Ready within the timeout window.
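The polling loop described above might be sketched roughly as follows. This is a hedged illustration, not the workflow's actual script: the label selector `app=launcher`, the function names, the 10-second poll interval, and the `GPU_NODES` clamping helper are all assumptions for the sake of the example.

```shell
# Hypothetical sketch of the relaxed wait. Assumes GPU_NODES was computed
# earlier in the step; the label selector and names are illustrative only.

# Clamp the target to min(2, GPU_NODES) with a lower bound of 1, so the
# wait can never require more ready pods than can exist.
clamp_ready_target() {
  target="$1"
  if [ "$target" -lt 1 ]; then
    target=1
  elif [ "$target" -gt 2 ]; then
    target=2
  fi
  echo "$target"
}

# Poll until at least $1 launcher pods report Ready, or $2 seconds elapse.
wait_for_launchers() {
  want="$1"
  timeout="${2:-300}"
  elapsed=0
  ready=0
  while [ "$elapsed" -lt "$timeout" ]; do
    # Count pods whose Ready condition is True.
    ready=$(kubectl get pods -l app=launcher \
      -o jsonpath='{range .items[*]}{.status.conditions[?(@.type=="Ready")].status}{"\n"}{end}' \
      | grep -c True || true)
    if [ "$ready" -ge "$want" ]; then
      echo "OK: $ready launcher pod(s) Ready (needed $want)"
      return 0
    fi
    sleep 10
    elapsed=$((elapsed + 10))
  done
  echo "ERROR: only $ready of $want launcher pods became Ready within ${timeout}s" >&2
  return 1
}
```

A usage line would then look like `wait_for_launchers "$(clamp_ready_target "$GPU_NODES")" 600`, reporting success as soon as the clamped minimum is met rather than blocking on every launcher.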
```shell
# Temporary workaround: require only two ready launchers until the
# test accounts for tainted or otherwise ineligible GPU nodes.
READY_TARGET=2
```
READY_TARGET is hard-coded to 2, which will make this workflow fail on clusters where the label selector yields fewer than 2 launcher pods (e.g., only 1 GPU node). Since you already compute GPU_NODES earlier in this step, consider setting READY_TARGET to the smaller of 2 and the expected/observed launcher count (and at least 1) so the wait can’t require more ready pods than can exist.
Suggested change:

```diff
-# Temporary workaround: require only two ready launchers until the
-# test accounts for tainted or otherwise ineligible GPU nodes.
-READY_TARGET=2
+# Temporary workaround: require only up to two ready launchers until the
+# test accounts for tainted or otherwise ineligible GPU nodes.
+# Set READY_TARGET to min(2, GPU_NODES) with a lower bound of 1 so we
+# never wait for more ready pods than can exist.
+READY_TARGET="$GPU_NODES"
+if [ "$READY_TARGET" -lt 1 ]; then
+  READY_TARGET=1
+elif [ "$READY_TARGET" -gt 2 ]; then
+  READY_TARGET=2
+fi
```
To be fixed later.
rubambiza left a comment:
Do we have an issue to track the needed undoing of the hot fix? Otherwise, LGTM.
rubambiza left a comment:
The previous comment is not a show-stopper.
@rubambiza: See #383
This PR relaxes a wait in the workflow that runs the E2E test on OpenShift. The wait currently expects all launchers to eventually become Ready, yet one Node in the testing cluster is tainted to keep ordinary workloads away. The wait succeeds today only because the launcher population controller assigns its launchers to Nodes itself, bypassing the Kubernetes Pod scheduler. We want to stop that bypass, and before such a PR can pass testing, this test needs to be relaxed.
This PR relaxes the wait so that only 2 launchers need to become Ready. This is not correct in general, but the problem is not general either: it is specific to the one cluster we use for testing. Even so, we should do better in a later PR.
This change was authored by @waltforme in #380. I am putting it in a branch in the shared repo so that it can be tested before merging.