E2E: probe for a 2-GPU node before running test cases #429
MikeSpreitzer merged 4 commits into llm-d-incubation:main
…ion in CI

The "Same-Node Port Collision" test requires a free GPU on the test node beyond the one already held by req1. On a shared OpenShift cluster other workloads may consume all GPUs, causing the test to time out. Check availability before running and fail immediately with an explanatory message when no GPU is free.

Also add a "Dump GPU allocation per node" debug step to the CI workflow so GPU saturation is visible in every run's logs.

Fixes llm-d-incubation#422

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Mike Spreitzer <mspreitz@us.ibm.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Mike Spreitzer <mspreitz@us.ibm.com>

It has been observed to fail at 120s but the namespace was gone when I later went looking for it.

Signed-off-by: Mike Spreitzer <mspreitz@us.ibm.com>
Unsigned commits detected! Please sign your commits. For instructions on how to set up GPG/SSH signing and verify your commits, please see GitHub Documentation.
Pull request overview
This PR makes the E2E suite more resilient on shared GPU clusters by selecting a node that currently has 2 free GPUs before creating test objects, then pinning requester workloads to that node to avoid GPU-saturation flakes (Issue #422).
Changes:
- Add a "GPU probe" Pod in `test-cases.sh` to find a node with 2 available GPUs, then pin the requester ReplicaSet to that node.
- Add `--node <node-name>` to `mkobjs.sh` and `mkobjs-openshift.sh` to inject a `nodeSelector` at creation time.
- Enhance OpenShift CI diagnostics by dumping GPU allocation per node and increasing namespace deletion timeout.
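As a rough illustration of the `--node` change listed above, the sketch below parses an optional `--node <node-name>` pair and emits a `nodeSelector` fragment only when a node was given. The function names, the indentation of the fragment, and the use of the `kubernetes.io/hostname` label as the selector key are assumptions made for this example; the actual scripts may select the node differently.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: these names are illustrative, not taken from mkobjs.sh.

node_name=""

# Parse an optional "--node <node-name>" argument pair.
parse_args() {
  node_name=""
  while [ "$#" -gt 0 ]; do
    case "$1" in
      --node)
        if [ "$#" -lt 2 ]; then
          echo "--node requires a value" >&2
          return 1
        fi
        node_name="$2"
        shift 2
        ;;
      *)
        echo "unknown argument: $1" >&2
        return 1
        ;;
    esac
  done
}

# Print a nodeSelector fragment for the Pod template, or nothing when no
# node was requested. Assumes the hostname label matches the node name.
node_selector_fragment() {
  if [ -n "$node_name" ]; then
    printf '      nodeSelector:\n        kubernetes.io/hostname: "%s"\n' "$node_name"
  fi
}
```

Emitting nothing when the flag is absent keeps the generated ReplicaSet unchanged for callers that do not pin a node.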
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| test/e2e/test-cases.sh | Adds a 2-GPU probe step, passes --node to mkobjs, and adds an early-fail check when the collision test node is GPU-saturated. |
| test/e2e/mkobjs.sh | Adds --node flag support and conditionally injects a nodeSelector into the requester ReplicaSet. |
| test/e2e/mkobjs-openshift.sh | Adds --node flag support and conditionally injects a nodeSelector into the requester ReplicaSet (alongside runtimeClass support). |
| .github/workflows/ci-e2e-openshift.yaml | Adds a GPU allocation dump step and increases namespace deletion timeout to 180s. |

    spec:
      containers:
      - name: pause
        image: registry.k8s.io/pause:3.10.2
        resources:
          limits:
            nvidia.com/gpu: "2"
      terminationGracePeriodSeconds: 0

The new gpu-probe Pod requests nvidia.com/gpu, but it doesn’t honor the existing RUNTIME_CLASS_NAME mechanism used elsewhere in E2E (e.g., mkobjs-openshift.sh injects runtimeClassName when RUNTIME_CLASS_NAME is set). On clusters where GPU workloads require a specific runtimeClass (notably some OpenShift setups), the probe may be rejected or never reach Running, blocking the entire suite. Consider conditionally adding spec.runtimeClassName: $RUNTIME_CLASS_NAME to the probe manifest when the env var is set.
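One way the suggestion above might look is sketched here: a helper that prints the probe manifest and adds `spec.runtimeClassName` only when `RUNTIME_CLASS_NAME` is set and non-empty. The function name `probe_manifest` is hypothetical, and the manifest is reconstructed from the review context rather than copied from `test-cases.sh`.

```shell
# Hypothetical helper: print a probe Pod manifest, honoring RUNTIME_CLASS_NAME.
probe_manifest() {
  cat <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: gpu-probe
spec:
EOF
  # Only emit runtimeClassName when the variable is set and non-empty.
  if [ -n "${RUNTIME_CLASS_NAME:-}" ]; then
    printf '  runtimeClassName: %s\n' "$RUNTIME_CLASS_NAME"
  fi
  cat <<'EOF'
  containers:
  - name: pause
    image: registry.k8s.io/pause:3.10.2
    resources:
      limits:
        nvidia.com/gpu: "2"
  terminationGracePeriodSeconds: 0
EOF
}
```

The output could then be piped to `kubectl apply -f -` in the target namespace.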
2eabbe8 to 35645dc
/ok-to-test
🚀 E2E tests triggered by /ok-to-test
35645dc to 0ec0d9e
/ok-to-test
🚀 E2E tests triggered by /ok-to-test
Before creating any test objects, launch a throwaway Pod that requests 2 GPUs. The scheduler places it on a node that actually has 2 GPUs free right now. Record that node, delete the probe, and pin the requester ReplicaSet to it via a new --node flag accepted by both mkobjs scripts. This prevents spurious failures on shared clusters where GPU availability is dynamic (Issue llm-d-incubation#422).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Mike Spreitzer <mspreitz@us.ibm.com>
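The probe sequence in the commit message above (create, wait, record the node, delete) could be sketched roughly as follows. This is not the code from `test-cases.sh`: the function name `find_gpu_node`, the `KUBECTL` override variable, the Pod name `gpu-probe` in the commands, and the 120s wait are all assumptions made for illustration. Note that a window remains between deleting the probe and creating the real workload in which another tenant could claim the GPUs.

```shell
# KUBECTL may be overridden (for example with "oc"); hypothetical variable.
KUBECTL="${KUBECTL:-kubectl}"

# find_gpu_node <namespace> <manifest-file>
# Creates the probe Pod, waits until it is Ready, prints the node it was
# scheduled on, and deletes the probe. Returns non-zero if the probe never
# becomes Ready within the (assumed) 120s timeout.
find_gpu_node() {
  ns="$1"
  manifest="$2"
  $KUBECTL apply -n "$ns" -f "$manifest" >/dev/null || return 1
  if ! $KUBECTL wait -n "$ns" --for=condition=Ready pod/gpu-probe --timeout=120s >/dev/null; then
    # Clean up the unschedulable probe without blocking on its deletion.
    $KUBECTL delete -n "$ns" pod/gpu-probe --ignore-not-found --wait=false >/dev/null
    return 1
  fi
  node="$($KUBECTL get -n "$ns" pod/gpu-probe -o jsonpath='{.spec.nodeName}')"
  # Wait for deletion so the probe's GPUs are released before the real test.
  $KUBECTL delete -n "$ns" pod/gpu-probe --wait=true >/dev/null
  printf '%s\n' "$node"
}
```

A caller might capture the result with `node=$(find_gpu_node "$NAMESPACE" probe.yaml)` and pass it on as `--node "$node"`, alongside whatever other arguments the mkobjs script takes.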
0ec0d9e to 55cfdcd
/ok-to-test
🚀 E2E tests triggered by /ok-to-test
Oh crud. I put this PR's branch in the wrong fork, so the modification to the test workflow was not tested here. However, you can see the result of testing that in #426.
The E2E test on OpenShift succeeded.
Summary
- Add a `--node` flag to both `mkobjs.sh` and `mkobjs-openshift.sh` so the caller can inject a `nodeSelector` into the ReplicaSet at creation time.
- Change the `expect` function to `return 99` instead of `exit 99`, so that it can be used in an `if` statement.

Fixes #422
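The `return` versus `exit` distinction matters because `exit` inside a shell function terminates the whole script, so a caller never gets the chance to branch on the result. The sketch below uses a deliberately simplified, hypothetical `expect` that runs a command once; the real function in the E2E suite likely does more (for example, polling), and only the status code 99 is taken from the PR description.

```shell
# expect <description> <command...>
# Simplified stand-in: run the command once and report failure through the
# function's return status instead of terminating the script.
expect() {
  desc="$1"
  shift
  if "$@"; then
    return 0
  fi
  echo "expectation failed: $desc" >&2
  return 99   # 'exit 99' here would end the script, even inside 'if expect ...'
}

if expect "two is greater than one" test 2 -gt 1; then
  echo "first expectation held"
fi

if ! expect "one is greater than two" test 1 -gt 2 2>/dev/null; then
  echo "second expectation failed, but the script keeps running"
fi
```

Callers that still want the old abort-on-failure behavior can write `expect ... || exit $?`.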
Test plan
… `test/e2e/run-launcher-based.sh`) and verify the GPU probe selects a node and all subsequent test cases pass

🤖 Generated with Claude Code
Note to reviewers
It will probably be easier to review each commit individually, because one of them just makes a bunch of changes in indentation and GitHub's diff display for all four together is very confused.