Conversation
Signed-off-by: Jun Duan <jun.duan.phd@outlook.com>
Pull request overview
Updates the OpenShift E2E GitHub Actions workflow to avoid blocking on all launcher pods becoming Ready in shared clusters where some GPU nodes may be unschedulable/ineligible (e.g., due to taints), improving CI reliability while launcher scheduling behavior evolves.
Changes:
- Replaces `kubectl wait ... --for=condition=Ready` on all launcher pods with a polling loop that proceeds once a minimum number of launcher pods are Ready.
- Adds logging/error reporting to show how many launcher pods became Ready within the timeout window.
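The polling loop described above might be sketched roughly as follows. This is a hedged illustration, not the workflow's actual script: the label selector `app=launcher`, the function names, the 10-second poll interval, and the `GPU_NODES` clamping helper are all assumptions for the sake of the example.

```shell
# Hypothetical sketch of the relaxed wait. Assumes GPU_NODES was computed
# earlier in the step; the label selector and names are illustrative only.

# Clamp the target to min(2, GPU_NODES) with a lower bound of 1, so the
# wait can never require more ready pods than can exist.
clamp_ready_target() {
  target="$1"
  if [ "$target" -lt 1 ]; then
    target=1
  elif [ "$target" -gt 2 ]; then
    target=2
  fi
  echo "$target"
}

# Poll until at least $1 launcher pods report Ready, or $2 seconds elapse.
wait_for_launchers() {
  want="$1"
  timeout="${2:-300}"
  elapsed=0
  ready=0
  while [ "$elapsed" -lt "$timeout" ]; do
    # Count pods whose Ready condition is True.
    ready=$(kubectl get pods -l app=launcher \
      -o jsonpath='{range .items[*]}{.status.conditions[?(@.type=="Ready")].status}{"\n"}{end}' \
      | grep -c True || true)
    if [ "$ready" -ge "$want" ]; then
      echo "OK: $ready launcher pod(s) Ready (needed $want)"
      return 0
    fi
    sleep 10
    elapsed=$((elapsed + 10))
  done
  echo "ERROR: only $ready of $want launcher pods became Ready within ${timeout}s" >&2
  return 1
}
```

A usage line would then look like `wait_for_launchers "$(clamp_ready_target "$GPU_NODES")" 600`, reporting success as soon as the clamped minimum is met rather than blocking on every launcher.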
```shell
# Temporary workaround: require only two ready launchers until the
# test accounts for tainted or otherwise ineligible GPU nodes.
READY_TARGET=2
```
READY_TARGET is hard-coded to 2, which will make this workflow fail on clusters where the label selector yields fewer than 2 launcher pods (e.g., only 1 GPU node). Since you already compute GPU_NODES earlier in this step, consider setting READY_TARGET to the smaller of 2 and the expected/observed launcher count (and at least 1) so the wait can’t require more ready pods than can exist.
Suggested change:

```diff
-# Temporary workaround: require only two ready launchers until the
-# test accounts for tainted or otherwise ineligible GPU nodes.
-READY_TARGET=2
+# Temporary workaround: require only up to two ready launchers until the
+# test accounts for tainted or otherwise ineligible GPU nodes.
+# Set READY_TARGET to min(2, GPU_NODES) with a lower bound of 1 so we
+# never wait for more ready pods than can exist.
+READY_TARGET="$GPU_NODES"
+if [ "$READY_TARGET" -lt 1 ]; then
+  READY_TARGET=1
+elif [ "$READY_TARGET" -gt 2 ]; then
+  READY_TARGET=2
+fi
```
To be fixed later.
rubambiza left a comment:
Do we have an issue to track the needed undoing of the hot fix? Otherwise, LGTM.
rubambiza left a comment:
The previous comment is not a show-stopper.
@rubambiza: See #383
This PR relaxes a wait in the workflow that runs the E2E test on OpenShift. The wait currently expects all launchers to eventually become Ready, yet one Node in the testing cluster is tainted to keep ordinary workloads away. The wait succeeds today only because the launcher population controller assigns its launchers to Nodes itself, bypassing the Kubernetes Pod scheduler. We want to stop that bypass, and before such a PR can pass testing, this test needs to be relaxed.
This PR relaxes the wait so that only 2 launchers need to become Ready. This is not correct in general, but the problem is not general either: it is specific to the one cluster we use for testing. Even so, we should do better in a later PR.
This change was authored by @waltforme in #380. I am putting it in a branch in the shared repo so that it can be tested before merging.