E2E: probe for a 2-GPU node before running test cases#429

Merged
MikeSpreitzer merged 4 commits into llm-d-incubation:main from MikeSpreitzer:fix-422-probe-node
Apr 16, 2026
Conversation

MikeSpreitzer (Collaborator) commented Apr 15, 2026

Summary

  • Before creating any test objects, launch a throwaway Pod requesting 2 GPUs so the scheduler places it on a node that actually has 2 GPUs free right now. Record that node, delete the probe, and pin the requester ReplicaSet to it.
  • Add a --node flag to both mkobjs.sh and mkobjs-openshift.sh so the caller can inject a nodeSelector into the ReplicaSet at creation time.
  • Change the expect function to return 99 instead of exit 99, so that it can be used in an if statement.
  • Includes earlier fixes from #426 ("Fail collision E2E test early when GPU saturated"): fail the collision test early when GPUs are saturated, dump GPU allocation in CI, and lengthen the namespace-deletion timeout.
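The return-vs-exit distinction in the third bullet matters because `exit 99` inside a function terminates the whole script, so the caller can never branch on the result. A minimal illustration (the predicate inside the function is a stand-in, not the suite's actual expect logic):

```shell
set -u

# A function that exits kills the caller's script; one that returns
# can be tested in an `if`. The check below is a stand-in predicate.
expect() {
  if [ "$1" = "ready" ]; then
    return 0
  fi
  return 99   # with `exit 99` here, the else branch below would never run
}

if expect "not-ready"; then
  echo "condition met"
else
  rc=$?
  echo "condition not met (status $rc)"
fi
```

Running this prints `condition not met (status 99)`; with `exit 99` the script would have died inside the function instead.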

Fixes #422
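The probe flow in the first bullet can be sketched as below. The pod name gpu-probe, the NAMESPACE variable, and the 120s wait are illustrative assumptions, not necessarily the suite's exact values:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Emit the throwaway probe manifest: a pause container requesting 2 GPUs,
# so the scheduler must pick a node with 2 GPUs free right now.
probe_manifest() {
  cat <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: gpu-probe
spec:
  containers:
  - name: pause
    image: registry.k8s.io/pause:3.10.2
    resources:
      limits:
        nvidia.com/gpu: "2"
  terminationGracePeriodSeconds: 0
EOF
}

# The cluster-side steps (require kubectl and a live cluster):
#   probe_manifest | kubectl apply -n "$NAMESPACE" -f -
#   kubectl wait -n "$NAMESPACE" --for=condition=Ready pod/gpu-probe --timeout=120s
#   node=$(kubectl get pod -n "$NAMESPACE" gpu-probe -o jsonpath='{.spec.nodeName}')
#   kubectl delete -n "$NAMESPACE" pod gpu-probe --wait=false
#   ./mkobjs.sh --node "$node" ...
probe_manifest
```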

Test plan

  • Run the kind-based E2E suite (test/e2e/run-launcher-based.sh) and verify the GPU probe selects a node and all subsequent test cases pass
  • Run the OpenShift CI E2E workflow and verify the probe succeeds on the shared cluster

🤖 Generated with Claude Code

Note to reviewers

It will probably be easier to review each commit individually, because one of them just makes a bunch of changes in indentation and GitHub's diff display for all four together is very confused.

MikeSpreitzer and others added 3 commits April 15, 2026 12:09
…ion in CI

The "Same-Node Port Collision" test requires a free GPU on the test node
beyond the one already held by req1. On a shared OpenShift cluster other
workloads may consume all GPUs, causing the test to time out. Check
availability before running and fail immediately with an explanatory
message when no GPU is free.

Also add a "Dump GPU allocation per node" debug step to the CI workflow
so GPU saturation is visible in every run's logs.

Fixes llm-d-incubation#422
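The availability check this commit describes reduces to subtracting requested GPUs from allocatable ones. A sketch under stated assumptions: the helper name free_gpus and the sample numbers are illustrative, and on a live cluster the two inputs would come from kubectl queries like those in the comments:

```shell
set -u

# free_gpus ALLOCATABLE REQUESTED -> prints the number of unclaimed GPUs
free_gpus() {
  echo $(( $1 - $2 ))
}

# On a live cluster the inputs would come from kubectl, e.g.:
#   alloc=$(kubectl get node "$node" \
#     -o jsonpath='{.status.allocatable.nvidia\.com/gpu}')
# and from summing the nvidia.com/gpu requests of pods on "$node".
alloc=2   # sample: the node advertises 2 GPUs
used=2    # sample: req1 plus another workload hold both
if [ "$(free_gpus "$alloc" "$used")" -lt 1 ]; then
  echo "FAIL: no free GPU beyond req1; collision test cannot run" >&2
else
  echo "ok to run collision test"
fi
```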

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Mike Spreitzer <mspreitz@us.ibm.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Mike Spreitzer <mspreitz@us.ibm.com>
It has been observed to fail at 120s
but the namespace was gone when I later went looking for it.

Signed-off-by: Mike Spreitzer <mspreitz@us.ibm.com>
Copilot AI review requested due to automatic review settings April 15, 2026 17:07
@github-actions

Unsigned commits detected! Please sign your commits.

For instructions on how to set up GPG/SSH signing and verify your commits, please see GitHub Documentation.

Contributor

Copilot AI left a comment


Pull request overview

This PR makes the E2E suite more resilient on shared GPU clusters by selecting a node that currently has 2 free GPUs before creating test objects, then pinning requester workloads to that node to avoid GPU-saturation flakes (Issue #422).

Changes:

  • Add a “GPU probe” Pod in test-cases.sh to find a node with 2 available GPUs, then pin the requester ReplicaSet to that node.
  • Add --node <node-name> to mkobjs.sh and mkobjs-openshift.sh to inject a nodeSelector at creation time.
  • Enhance OpenShift CI diagnostics by dumping GPU allocation per node and increasing namespace deletion timeout.
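The --node flag handling in the mkobjs scripts might look roughly like this; the argument-parsing style and the variable name NODE_NAME are assumptions about the scripts, not their exact code:

```shell
set -u

NODE_NAME=""
parse_args() {
  while [ $# -gt 0 ]; do
    case "$1" in
      --node) NODE_NAME="$2"; shift 2 ;;
      *) shift ;;
    esac
  done
}

# Emit a nodeSelector stanza for the ReplicaSet's Pod template,
# or nothing when no node was pinned.
node_selector_block() {
  if [ -n "$NODE_NAME" ]; then
    printf '      nodeSelector:\n        kubernetes.io/hostname: %s\n' "$NODE_NAME"
  fi
}

parse_args --node worker-1   # sample invocation
node_selector_block
```

Using the well-known `kubernetes.io/hostname` label keeps the pinning declarative; the stanza is simply omitted when the caller passes no --node.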

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

File Description
test/e2e/test-cases.sh Adds a 2-GPU probe step, passes --node to mkobjs, and adds an early-fail check when the collision test node is GPU-saturated.
test/e2e/mkobjs.sh Adds --node flag support and conditionally injects a nodeSelector into the requester ReplicaSet.
test/e2e/mkobjs-openshift.sh Adds --node flag support and conditionally injects a nodeSelector into the requester ReplicaSet (alongside runtimeClass support).
.github/workflows/ci-e2e-openshift.yaml Adds a GPU allocation dump step and increases namespace deletion timeout to 180s.

Comment thread test/e2e/test-cases.sh
Comment on lines +123 to +130
spec:
  containers:
  - name: pause
    image: registry.k8s.io/pause:3.10.2
    resources:
      limits:
        nvidia.com/gpu: "2"
  terminationGracePeriodSeconds: 0

Copilot AI Apr 15, 2026


The new gpu-probe Pod requests nvidia.com/gpu, but it doesn’t honor the existing RUNTIME_CLASS_NAME mechanism used elsewhere in E2E (e.g., mkobjs-openshift.sh injects runtimeClassName when RUNTIME_CLASS_NAME is set). On clusters where GPU workloads require a specific runtimeClass (notably some OpenShift setups), the probe may be rejected or never reach Running, blocking the entire suite. Consider conditionally adding spec.runtimeClassName: $RUNTIME_CLASS_NAME to the probe manifest when the env var is set.
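One way the suggested fix could be sketched: the env-var name RUNTIME_CLASS_NAME is the one the comment cites, while the helper name and everything else here are illustrative:

```shell
set -u

# Emit a runtimeClassName line for the probe's pod spec only when the
# environment variable is set, mirroring the mkobjs-openshift.sh pattern.
runtime_class_line() {
  if [ -n "${RUNTIME_CLASS_NAME:-}" ]; then
    printf '  runtimeClassName: %s\n' "$RUNTIME_CLASS_NAME"
  fi
}

RUNTIME_CLASS_NAME=nvidia   # sample value for illustration
runtime_class_line
```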

MikeSpreitzer (Collaborator, Author)

Fixed

@MikeSpreitzer (Collaborator, Author)

/ok-to-test

@github-actions

🚀 E2E tests triggered by /ok-to-test

View the OpenShift E2E workflow run

@MikeSpreitzer (Collaborator, Author)

/ok-to-test

@github-actions

🚀 E2E tests triggered by /ok-to-test

View the OpenShift E2E workflow run

Before creating any test objects, launch a throwaway Pod that requests
2 GPUs.  The scheduler places it on a node that actually has 2 GPUs
free right now.  Record that node, delete the probe, and pin the
requester ReplicaSet to it via a new --node flag accepted by both
mkobjs scripts.  This prevents spurious failures on shared clusters
where GPU availability is dynamic (Issue llm-d-incubation#422).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Mike Spreitzer <mspreitz@us.ibm.com>
@MikeSpreitzer (Collaborator, Author)

/ok-to-test

@github-actions

🚀 E2E tests triggered by /ok-to-test

View the OpenShift E2E workflow run

@MikeSpreitzer (Collaborator, Author)

Oh crud. I put this PR's branch in the wrong fork, so the modification to the test workflow was not tested here. However, you can see the result of testing that in #426.

@MikeSpreitzer (Collaborator, Author)

The E2E test on OpenShift succeeded.

@MikeSpreitzer MikeSpreitzer merged commit 750d1b2 into llm-d-incubation:main Apr 16, 2026
26 checks passed
@MikeSpreitzer MikeSpreitzer deleted the fix-422-probe-node branch April 16, 2026 18:09

Development

Successfully merging this pull request may close these issues.

[Bug]: E2E test should be smarter about GPU saturation

3 participants