
feat: merge gpu-cluster setup into single e2e script via flag #3281

Open
aniket2405 wants to merge 5 commits into kubeflow:master from aniket2405:script/setup-gpu-cluster2

Conversation

@aniket2405

@aniket2405 aniket2405 commented Mar 4, 2026

What this PR does / why we need it:

This PR consolidates the E2E testing infrastructure by merging the GPU testing job into the main CPU testing workflow. This provides a unified view of all end-to-end results while maintaining separate setup scripts for each.

Combined test-e2e.yaml and test-e2e-gpu.yaml into a single entry point. Both CPU and GPU E2E tests are now triggered as parallel jobs within the same workflow.

Testing Performed:

Verified YAML syntax for the combined workflow.
The CI run will exercise the combined workflow end to end.

Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):
Fixes #2812

Checklist:

  • Docs included if any changes are user facing

Copilot AI review requested due to automatic review settings March 4, 2026 11:13
@google-oss-prow

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign gaocegege for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@github-actions

github-actions bot commented Mar 4, 2026

🎉 Welcome to the Kubeflow Trainer! 🎉

Thanks for opening your first PR! We're happy to have you as part of our community 🚀

Here's what happens next:

  • If you haven't already, please check out our Contributing Guide for repo-specific guidelines and the Kubeflow Contributor Guide for general community standards.
  • Our team will review your PR soon! cc @kubeflow/kubeflow-trainer-team

Join the community:

Feel free to ask questions in the comments if you need any help or clarification!
Thanks again for contributing to Kubeflow! 🙏

Contributor

Copilot AI left a comment


Pull request overview

This PR merges the standalone hack/e2e-setup-gpu-cluster.sh GPU cluster setup script into the primary hack/e2e-setup-cluster.sh by adding a --gpu-cluster flag. When the flag is passed, the script conditionally provisions an nvkind GPU cluster, installs the NVIDIA GPU Operator via Helm, builds additional initializer/trainer images, and patches CRDs with runtimeClassName: nvidia.

Changes:

  • Added --gpu-cluster flag parsing and conditional GPU logic (nvkind cluster creation, GPU Operator installation, extra image builds, CRD patching) to the unified hack/e2e-setup-cluster.sh.
  • Updated the test-e2e-setup-gpu-cluster Makefile target to pass --gpu-cluster to the setup script.
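A minimal sketch of the flag-gated structure described above (the function name and echoed messages are illustrative stand-ins, not the script's actual code):

```shell
#!/usr/bin/env bash
# Hypothetical sketch: gate GPU-specific setup behind a --gpu-cluster flag.
# Names and messages are illustrative, not copied from hack/e2e-setup-cluster.sh.

setup_cluster() {
  local gpu_cluster=false
  for arg in "$@"; do
    case "$arg" in
      --gpu-cluster) gpu_cluster=true ;;
    esac
  done

  if [ "$gpu_cluster" = true ]; then
    # GPU path: nvkind cluster, GPU Operator via Helm, extra images, CRD patching.
    echo "provisioning nvkind GPU cluster"
  else
    # CPU path: plain kind cluster; image builds and control-plane install are shared.
    echo "provisioning standard kind cluster"
  fi
}
```

Invoking `setup_cluster --gpu-cluster` takes the GPU branch; with no flag it falls through to the CPU path, so the shared steps stay in one place.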

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File                        | Description
hack/e2e-setup-cluster.sh   | Adds GPU cluster setup logic gated behind the new --gpu-cluster flag
Makefile                    | Updates the test-e2e-setup-gpu-cluster target to pass the --gpu-cluster flag

Comment on lines +184 to +194
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- ../../../manifests/overlays/runtimes
images:
- name: "${DATASET_INITIALIZER_CI_IMAGE_NAME}"
newTag: "${CONTROLLER_MANAGER_CI_IMAGE_TAG}"
- name: "${MODEL_INITIALIZER_CI_IMAGE_NAME}"
newTag: "${CONTROLLER_MANAGER_CI_IMAGE_TAG}"
- name: "${TRAINER_CI_IMAGE_NAME}"
newTag: "${CONTROLLER_MANAGER_CI_IMAGE_TAG}"

Copilot AI Mar 4, 2026


The kustomization YAML content in this heredoc has 4 spaces of leading indentation on every line, while the identical CPU-path heredoc at line 147 uses 2 spaces. Since cat <<EOF writes the content literally (including leading whitespace), the generated kustomization.yaml file will have every line prefixed with 4 extra spaces. Although YAML parsers typically accept this, it differs from both the original e2e-setup-gpu-cluster.sh (2-space indent) and the CPU-path heredoc in this same file, and could be confusing for manual inspection of generated artifacts. The indentation should be consistent with the other heredoc in this file (2 spaces).
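The literal-whitespace behavior the comment relies on is easy to confirm in isolation: `cat <<EOF` copies the heredoc body verbatim, leading spaces included (the file path below is just a scratch location for the demo):

```shell
# Write a heredoc whose body is indented by 4 spaces, as in the GPU-path heredoc.
# cat <<EOF performs variable expansion but leaves leading whitespace untouched.
cat <<EOF > /tmp/kustomization-demo.yaml
    apiVersion: kustomize.config.k8s.io/v1beta1
    kind: Kustomization
EOF

# The generated file keeps the 4 leading spaces on every line.
head -n 1 /tmp/kustomization-demo.yaml
```

(Only `<<-EOF` strips leading whitespace, and even then only tabs, so a space-indented heredoc always reaches the output file as-is.)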

Suggested change (re-indent the heredoc body from 4 spaces to 2, matching the CPU-path heredoc):

  apiVersion: kustomize.config.k8s.io/v1beta1
  kind: Kustomization
  resources:
  - ../../../manifests/overlays/runtimes
  images:
  - name: "${DATASET_INITIALIZER_CI_IMAGE_NAME}"
    newTag: "${CONTROLLER_MANAGER_CI_IMAGE_TAG}"
  - name: "${MODEL_INITIALIZER_CI_IMAGE_NAME}"
    newTag: "${CONTROLLER_MANAGER_CI_IMAGE_TAG}"
  - name: "${TRAINER_CI_IMAGE_NAME}"
    newTag: "${CONTROLLER_MANAGER_CI_IMAGE_TAG}"
Member


NA

@jaiakash
Member

jaiakash commented Mar 5, 2026

/ok-to-test

Signed-off-by: aniket2405 <aniketshaha2001@gmail.com>
Signed-off-by: aniket2405 <aniketshaha2001@gmail.com>
@aniket2405 aniket2405 force-pushed the script/setup-gpu-cluster2 branch from a9b7096 to 75cea58 on March 5, 2026 at 22:33
@jaiakash
Member

jaiakash commented Mar 7, 2026

Hi @aniket2405,

After reviewing this again, I realized that the initial approach I suggested—merging the scripts—doesn’t make much sense. Keeping the scripts separate is actually better, since it makes testing and reviewing easier.

Instead, what we can do is merge the CI workflows: test-e2e.yaml and test-e2e-gpu.yaml. The goal would be to have a single CI workflow that triggers both CPU and GPU E2E jobs.

@jaiakash
Member

jaiakash commented Mar 9, 2026

Hi @aniket2405 gentle reminder about above, let me know if I can help with anything.

Signed-off-by: aniket2405 <aniketshaha2001@gmail.com>
@google-oss-prow google-oss-prow bot added size/M and removed size/L labels Mar 9, 2026
Signed-off-by: aniket2405 <aniketshaha2001@gmail.com>
@google-oss-prow google-oss-prow bot added size/L and removed size/M labels Mar 9, 2026
@aniket2405
Author

Hi @jaiakash, testing is definitely easier and less complex with separate scripts. Agreed.
I've made the changes to merge the two CI workflows. Please review.
Thanks for the help!

Signed-off-by: aniket2405 <aniketshaha2001@gmail.com>
@jaiakash
Member

Overall it looks good to me; please check the CI failure. I will lgtm it then.

@andreyvelich
Member

@jaiakash Are there any objections to consolidating cluster creation into a single script?
Looking at this, we just need to use nvkind as the binary for the Kind cluster, install the GPU Operator, and patch the Runtimes with runtimeClassName.

Since we build Trainer images and install the control plane in the same way for the CPU/GPU cluster, we can consolidate it.

WDYT @jaiakash @aniket2405 ?

@jaiakash
Member

jaiakash commented Mar 10, 2026

My pov is that:

  • The cluster creation script for a normal or GPU cluster via nvkind should be a separate script. This makes debugging and testing easier, since we can run and iterate on the exact setup we need without modifying a shared script.
  • Also, patching the runtime is a hotfix we are doing until nvkind officially supports the newer CDI-based architecture (related: "meet a error when trying to use nvkind", NVIDIA/nvkind#61 (comment)).

Nevertheless, the GPU and CPU E2E test CI should be merged into a single workflow file, using a matrix as Aniket has done in commit a9683ef (this PR).
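A minimal sketch of what such a matrix-based merged workflow could look like (job names, make targets, and the trigger are illustrative assumptions, not the PR's actual file):

```yaml
name: E2E Test
on:
  pull_request: {}

jobs:
  e2e:
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
      matrix:
        # One job per accelerator type; both appear in the same workflow run.
        accelerator: [cpu, gpu]
    steps:
      - uses: actions/checkout@v4
      - name: Set up cluster
        # Hypothetical make targets; the real targets live in the repo's Makefile.
        run: |
          if [ "${{ matrix.accelerator }}" = "gpu" ]; then
            make test-e2e-setup-gpu-cluster
          else
            make test-e2e-setup-cluster
          fi
      - name: Run E2E tests
        run: make test-e2e
```

With `fail-fast: false`, a failure in the GPU leg does not cancel the CPU leg, so both results remain visible in the unified run.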

WDYT @aniket2405 ?

@aniket2405
Author

Hi @andreyvelich and @jaiakash,

I agree with @andreyvelich that a lot of the logic is shared, but I lean slightly toward @jaiakash's suggestion of keeping the scripts separate, for two main reasons from my pov:

  1. Keeping it in a separate GPU-specific script prevents the primary e2e-setup-cluster.sh from becoming cluttered with workarounds that might be removed once nvkind supports the newer CDI-based architecture.

  2. Modifying the script would require non-trivial gating logic, which can hamper readability.

WDYT?

Also, this is the correct commit to reference - 1640806

@andreyvelich
Member

@jaiakash Could you explain why you want two separate scripts for the GPU and CPU clusters?
Basically, we run the same commands to build images and deploy the Trainer control plane:
CPU: https://github.com/kubeflow/trainer/blob/master/hack/e2e-setup-cluster.sh#L42-L79
GPU: https://github.com/kubeflow/trainer/blob/master/hack/e2e-setup-gpu-cluster.sh#L116-L128

I think we can reduce duplication if we unify these scripts.


Development

Successfully merging this pull request may close these issues.

feat: refactor and merge setup-gpu-cluster into single script based on flag.

4 participants