Skip to content

🌱 Hot fix to e2e test on Openshift#382

Merged
MikeSpreitzer merged 1 commit intomainfrom
lax-launcher-wait
Mar 25, 2026
Merged

🌱 Hot fix to e2e test on Openshift#382
MikeSpreitzer merged 1 commit intomainfrom
lax-launcher-wait

Conversation

@MikeSpreitzer
Copy link
Copy Markdown
Collaborator

@MikeSpreitzer MikeSpreitzer commented Mar 25, 2026

This PR relaxes a wait in the workflow that does E2E test on OpenShift. Currently the wait expects all launchers to eventually become ready. Currently one Node in the testing cluster is tainted to keep ordinary workload away. The wait currently succeeds because the launcher population controller is itself assigning its launchers to Nodes, bypassing the Kubernetes Pod scheduler. We want to stop doing that bypass. Before such a PR can pass testing, the test needs to be relaxed.

This PR relaxes the wait so that only 2 launchers need to become ready. This is not correct in general but the problem is not general, it is about the one cluster that we use for testing. Even so, we should do better in a later PR.

This change was authored by @waltforme in #380. I am putting it in a branch in the shared repo so that it can be tested before merging.

Signed-off-by: Jun Duan <jun.duan.phd@outlook.com>
Copilot AI review requested due to automatic review settings March 25, 2026 14:22
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Updates the OpenShift E2E GitHub Actions workflow to avoid blocking on all launcher pods becoming Ready in shared clusters where some GPU nodes may be unschedulable/ineligible (e.g., due to taints), improving CI reliability while launcher scheduling behavior evolves.

Changes:

  • Replaces kubectl wait ... --for=condition=Ready on all launcher pods with a polling loop that proceeds once a minimum number of launcher pods are Ready.
  • Adds logging/error reporting to show how many launcher pods became Ready within the timeout window.

Comment on lines +680 to +682
# Temporary workaround: require only two ready launchers until the
# test accounts for tainted or otherwise ineligible GPU nodes.
READY_TARGET=2
Copy link

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

READY_TARGET is hard-coded to 2, which will make this workflow fail on clusters where the label selector yields fewer than 2 launcher pods (e.g., only 1 GPU node). Since you already compute GPU_NODES earlier in this step, consider setting READY_TARGET to the smaller of 2 and the expected/observed launcher count (and at least 1) so the wait can’t require more ready pods than can exist.

Suggested change
# Temporary workaround: require only two ready launchers until the
# test accounts for tainted or otherwise ineligible GPU nodes.
READY_TARGET=2
# Temporary workaround: require only up to two ready launchers until the
# test accounts for tainted or otherwise ineligible GPU nodes.
# Set READY_TARGET to min(2, GPU_NODES) with a lower bound of 1 so we
# never wait for more ready pods than can exist.
READY_TARGET="$GPU_NODES"
if [ "$READY_TARGET" -lt 1 ]; then
READY_TARGET=1
elif [ "$READY_TARGET" -gt 2 ]; then
READY_TARGET=2
fi

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be fixed later.

Copy link
Copy Markdown
Collaborator

@rubambiza rubambiza left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have an issue to track the needed undoing of the hot fix? Otherwise, LGTM.

@rubambiza rubambiza self-requested a review March 25, 2026 14:29
Copy link
Copy Markdown
Collaborator

@rubambiza rubambiza left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The previous comment is not a show-stopper.

@MikeSpreitzer
Copy link
Copy Markdown
Collaborator Author

@rubambiza: See #383

@MikeSpreitzer MikeSpreitzer merged commit 02705aa into main Mar 25, 2026
57 checks passed
@MikeSpreitzer MikeSpreitzer deleted the lax-launcher-wait branch March 25, 2026 14:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants