E2E: probe for a 2-GPU node before running test cases#429

Merged
MikeSpreitzer merged 4 commits into llm-d-incubation:main from MikeSpreitzer:fix-422-probe-node
Apr 16, 2026
Conversation

MikeSpreitzer (Collaborator) commented Apr 15, 2026

Summary

  • Before creating any test objects, launch a throwaway Pod requesting 2 GPUs so the scheduler places it on a node that actually has 2 GPUs free right now. Record that node, delete the probe, and pin the requester ReplicaSet to it.
  • Add a --node flag to both mkobjs.sh and mkobjs-openshift.sh so the caller can inject a nodeSelector into the ReplicaSet at creation time.
  • Change the expect function to return 99 instead of exit 99, so that it can be used in an if statement.
  • Includes earlier fixes from #426 ("Fail collision E2E test early when GPU saturated"): fail the collision test early when GPUs are saturated, dump GPU allocation in CI, and lengthen the namespace-deletion timeout.
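The return-vs-exit distinction in the third bullet matters because `exit 99` inside a function terminates the whole script, so the caller can never branch on the result. A minimal illustration (the predicate inside the function is a stand-in, not the suite's actual expect logic):

```shell
set -u

# A function that exits kills the caller's script; one that returns
# can be tested in an `if`. The check below is a stand-in predicate.
expect() {
  if [ "$1" = "ready" ]; then
    return 0
  fi
  return 99   # with `exit 99` here, the else branch below would never run
}

if expect "not-ready"; then
  echo "condition met"
else
  rc=$?
  echo "condition not met (status $rc)"
fi
```

Running this prints `condition not met (status 99)`; with `exit 99` the script would have died inside the function instead.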

Fixes #422
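The probe flow in the first bullet can be sketched as below. The pod name gpu-probe, the NAMESPACE variable, and the 120s wait are illustrative assumptions, not necessarily the suite's exact values:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Emit the throwaway probe manifest: a pause container requesting 2 GPUs,
# so the scheduler must pick a node with 2 GPUs free right now.
probe_manifest() {
  cat <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: gpu-probe
spec:
  containers:
  - name: pause
    image: registry.k8s.io/pause:3.10.2
    resources:
      limits:
        nvidia.com/gpu: "2"
  terminationGracePeriodSeconds: 0
EOF
}

# The cluster-side steps (require kubectl and a live cluster):
#   probe_manifest | kubectl apply -n "$NAMESPACE" -f -
#   kubectl wait -n "$NAMESPACE" --for=condition=Ready pod/gpu-probe --timeout=120s
#   node=$(kubectl get pod -n "$NAMESPACE" gpu-probe -o jsonpath='{.spec.nodeName}')
#   kubectl delete -n "$NAMESPACE" pod gpu-probe --wait=false
#   ./mkobjs.sh --node "$node" ...
probe_manifest
```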

Test plan

  • Run the kind-based E2E suite (test/e2e/run-launcher-based.sh) and verify the GPU probe selects a node and all subsequent test cases pass
  • Run the OpenShift CI E2E workflow and verify the probe succeeds on the shared cluster

🤖 Generated with Claude Code

Note to reviewers

It will probably be easier to review each commit individually, because one of them just makes a bunch of changes in indentation and GitHub's diff display for all four together is very confused.

MikeSpreitzer and others added 3 commits April 15, 2026 12:09
…ion in CI

The "Same-Node Port Collision" test requires a free GPU on the test node
beyond the one already held by req1. On a shared OpenShift cluster other
workloads may consume all GPUs, causing the test to time out. Check
availability before running and fail immediately with an explanatory
message when no GPU is free.

Also add a "Dump GPU allocation per node" debug step to the CI workflow
so GPU saturation is visible in every run's logs.

Fixes llm-d-incubation#422
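The availability check this commit describes reduces to subtracting requested GPUs from allocatable ones. A sketch under stated assumptions: the helper name free_gpus and the sample numbers are illustrative, and on a live cluster the two inputs would come from kubectl queries like those in the comments:

```shell
set -u

# free_gpus ALLOCATABLE REQUESTED -> prints the number of unclaimed GPUs
free_gpus() {
  echo $(( $1 - $2 ))
}

# On a live cluster the inputs would come from kubectl, e.g.:
#   alloc=$(kubectl get node "$node" \
#     -o jsonpath='{.status.allocatable.nvidia\.com/gpu}')
# and from summing the nvidia.com/gpu requests of pods on "$node".
alloc=2   # sample: the node advertises 2 GPUs
used=2    # sample: req1 plus another workload hold both
if [ "$(free_gpus "$alloc" "$used")" -lt 1 ]; then
  echo "FAIL: no free GPU beyond req1; collision test cannot run" >&2
else
  echo "ok to run collision test"
fi
```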

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Mike Spreitzer <mspreitz@us.ibm.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Mike Spreitzer <mspreitz@us.ibm.com>
It has been observed to fail at 120s
but the namespace was gone when I later went looking for it.

Signed-off-by: Mike Spreitzer <mspreitz@us.ibm.com>
Copilot AI review requested due to automatic review settings April 15, 2026 17:07
@github-actions

Unsigned commits detected! Please sign your commits.

For instructions on how to set up GPG/SSH signing and verify your commits, please see GitHub Documentation.

Contributor

Copilot AI left a comment


Pull request overview

This PR makes the E2E suite more resilient on shared GPU clusters by selecting a node that currently has 2 free GPUs before creating test objects, then pinning requester workloads to that node to avoid GPU-saturation flakes (Issue #422).

Changes:

  • Add a “GPU probe” Pod in test-cases.sh to find a node with 2 available GPUs, then pin the requester ReplicaSet to that node.
  • Add --node <node-name> to mkobjs.sh and mkobjs-openshift.sh to inject a nodeSelector at creation time.
  • Enhance OpenShift CI diagnostics by dumping GPU allocation per node and increasing namespace deletion timeout.
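The --node flag handling in the mkobjs scripts might look roughly like this; the argument-parsing style and the variable name NODE_NAME are assumptions about the scripts, not their exact code:

```shell
set -u

NODE_NAME=""
parse_args() {
  while [ $# -gt 0 ]; do
    case "$1" in
      --node) NODE_NAME="$2"; shift 2 ;;
      *) shift ;;
    esac
  done
}

# Emit a nodeSelector stanza for the ReplicaSet's Pod template,
# or nothing when no node was pinned.
node_selector_block() {
  if [ -n "$NODE_NAME" ]; then
    printf '      nodeSelector:\n        kubernetes.io/hostname: %s\n' "$NODE_NAME"
  fi
}

parse_args --node worker-1   # sample invocation
node_selector_block
```

Using the well-known `kubernetes.io/hostname` label keeps the pinning declarative; the stanza is simply omitted when the caller passes no --node.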

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

File Description
test/e2e/test-cases.sh Adds a 2-GPU probe step, passes --node to mkobjs, and adds an early-fail check when the collision test node is GPU-saturated.
test/e2e/mkobjs.sh Adds --node flag support and conditionally injects a nodeSelector into the requester ReplicaSet.
test/e2e/mkobjs-openshift.sh Adds --node flag support and conditionally injects a nodeSelector into the requester ReplicaSet (alongside runtimeClass support).
.github/workflows/ci-e2e-openshift.yaml Adds a GPU allocation dump step and increases namespace deletion timeout to 180s.

Comment thread test/e2e/test-cases.sh
Comment on lines +123 to +130
spec:
  containers:
  - name: pause
    image: registry.k8s.io/pause:3.10.2
    resources:
      limits:
        nvidia.com/gpu: "2"
  terminationGracePeriodSeconds: 0

Copilot AI Apr 15, 2026


The new gpu-probe Pod requests nvidia.com/gpu, but it doesn’t honor the existing RUNTIME_CLASS_NAME mechanism used elsewhere in E2E (e.g., mkobjs-openshift.sh injects runtimeClassName when RUNTIME_CLASS_NAME is set). On clusters where GPU workloads require a specific runtimeClass (notably some OpenShift setups), the probe may be rejected or never reach Running, blocking the entire suite. Consider conditionally adding spec.runtimeClassName: $RUNTIME_CLASS_NAME to the probe manifest when the env var is set.
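One way the suggested fix could be sketched: the env-var name RUNTIME_CLASS_NAME is the one the comment cites, while the helper name and everything else here are illustrative:

```shell
set -u

# Emit a runtimeClassName line for the probe's pod spec only when the
# environment variable is set, mirroring the mkobjs-openshift.sh pattern.
runtime_class_line() {
  if [ -n "${RUNTIME_CLASS_NAME:-}" ]; then
    printf '  runtimeClassName: %s\n' "$RUNTIME_CLASS_NAME"
  fi
}

RUNTIME_CLASS_NAME=nvidia   # sample value for illustration
runtime_class_line
```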

MikeSpreitzer (Collaborator, Author)

Fixed

@MikeSpreitzer (Collaborator, Author)

/ok-to-test

@github-actions

🚀 E2E tests triggered by /ok-to-test

View the OpenShift E2E workflow run

@MikeSpreitzer (Collaborator, Author)

/ok-to-test

@github-actions

🚀 E2E tests triggered by /ok-to-test

View the OpenShift E2E workflow run

Before creating any test objects, launch a throwaway Pod that requests
2 GPUs.  The scheduler places it on a node that actually has 2 GPUs
free right now.  Record that node, delete the probe, and pin the
requester ReplicaSet to it via a new --node flag accepted by both
mkobjs scripts.  This prevents spurious failures on shared clusters
where GPU availability is dynamic (Issue llm-d-incubation#422).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Mike Spreitzer <mspreitz@us.ibm.com>
@MikeSpreitzer (Collaborator, Author)

/ok-to-test

@github-actions

🚀 E2E tests triggered by /ok-to-test

View the OpenShift E2E workflow run

@MikeSpreitzer (Collaborator, Author)

Oh crud. I put this PR's branch in the wrong fork, so the modification to the test workflow was not tested here. However, you can see the result of testing that in #426.

@MikeSpreitzer (Collaborator, Author)

The E2E test on OpenShift succeeded.

@MikeSpreitzer MikeSpreitzer merged commit 750d1b2 into llm-d-incubation:main Apr 16, 2026
26 checks passed
@MikeSpreitzer MikeSpreitzer deleted the fix-422-probe-node branch April 16, 2026 18:09

Development

Successfully merging this pull request may close these issues.

[Bug]: E2E test should be smarter about GPU saturation

3 participants