Skip to content

Commit 20cf1c8

Browse files
committed
Hot fix to e2e test on Openshift
Signed-off-by: Jun Duan <jun.duan.phd@outlook.com>
1 parent ccafbeb commit 20cf1c8

File tree

1 file changed

+27
-7
lines changed

1 file changed

+27
-7
lines changed

.github/workflows/ci-e2e-openshift.yaml

Lines changed: 27 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -668,15 +668,35 @@ jobs:
668668
ELAPSED=$((ELAPSED + 5))
669669
done
670670
671-
# 1. Wait for launcher to be Ready.
672-
# The controller will not bind until the launcher is ready, so waiting here
671+
# 1. Wait for at least two launcher pods to be Ready.
672+
# The controller will not bind until a launcher is ready, so waiting here
673673
# makes it easier to diagnose problems that prevent readiness.
674674
# Launcher image is ~20GB, so allow extra time for uncached pulls.
675-
# There may be more than one launcher pod if the controller created extras,
676-
# so wait for all of them using the label selector.
677-
echo "Waiting for launcher pod(s) to be Ready..."
678-
kubectl wait pods --for=condition=Ready -n "$FMA_NAMESPACE" \
679-
-l "dual-pods.llm-d.ai/launcher-config-name=$LC" --timeout=600s
675+
# Some GPU nodes may be ineligible for scheduling on shared clusters, so
676+
# require a smaller healthy subset instead of every created launcher pod.
677+
echo "Waiting for at least two launcher pod(s) to be Ready..."
678+
ELAPSED=0
679+
READY_LAUNCHERS=0
680+
# Temporary workaround: require only two ready launchers until the
681+
# test accounts for tainted or otherwise ineligible GPU nodes.
682+
READY_TARGET=2
683+
while true; do
684+
READY_LAUNCHERS=$(kubectl get pods -n "$FMA_NAMESPACE" \
685+
-l "dual-pods.llm-d.ai/launcher-config-name=$LC" -o json \
686+
| jq '[.items[] | select(.status.conditions[]? | select(.type == "Ready" and .status == "True"))] | length')
687+
if [ "$READY_LAUNCHERS" -ge "$READY_TARGET" ]; then
688+
echo "$READY_LAUNCHERS launcher pod(s) are Ready"
689+
kubectl get pods -n "$FMA_NAMESPACE" \
690+
-l "dual-pods.llm-d.ai/launcher-config-name=$LC" -o wide
691+
break
692+
fi
693+
if [ "$ELAPSED" -ge 600 ]; then
694+
echo "::error::Fewer than ${READY_TARGET} launcher pod(s) became Ready within 600s (ready: $READY_LAUNCHERS)"
695+
exit 1
696+
fi
697+
sleep 5
698+
ELAPSED=$((ELAPSED + 5))
699+
done
680700
681701
# 2. Verify launcher-to-requester binding.
682702
# After launcher is ready, the controller binds by setting dual labels.

0 commit comments

Comments
 (0)