
feat: merge gpu-cluster setup into single e2e script via flag #3281

Open
aniket2405 wants to merge 5 commits into kubeflow:master from aniket2405:script/setup-gpu-cluster2

Conversation

@aniket2405

@aniket2405 aniket2405 commented Mar 4, 2026

What this PR does / why we need it:

This PR consolidates the E2E testing infrastructure by merging the GPU testing job into the main CPU testing workflow. This provides a unified view of all end-to-end results while maintaining separate setup scripts for each.

Combined test-e2e.yaml and test-e2e-gpu.yaml into a single entry point. Both CPU and GPU E2E tests are now triggered as parallel jobs within the same workflow.

Testing Performed:

Verified YAML syntax for the combined workflow.
The CI run will exercise the combined workflow end to end.

Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):
Fixes #2812

Checklist:

  • Docs included if any changes are user facing

Copilot AI review requested due to automatic review settings March 4, 2026 11:13
@google-oss-prow

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign gaocegege for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@github-actions

github-actions bot commented Mar 4, 2026

🎉 Welcome to the Kubeflow Trainer! 🎉

Thanks for opening your first PR! We're happy to have you as part of our community 🚀

Here's what happens next:

  • If you haven't already, please check out our Contributing Guide for repo-specific guidelines and the Kubeflow Contributor Guide for general community standards.
  • Our team will review your PR soon! cc @kubeflow/kubeflow-trainer-team

Join the community:

Feel free to ask questions in the comments if you need any help or clarification!
Thanks again for contributing to Kubeflow! 🙏

Contributor

Copilot AI left a comment


Pull request overview

This PR merges the standalone hack/e2e-setup-gpu-cluster.sh GPU cluster setup script into the primary hack/e2e-setup-cluster.sh by adding a --gpu-cluster flag. When the flag is passed, the script conditionally provisions an nvkind GPU cluster, installs the NVIDIA GPU Operator via Helm, builds additional initializer/trainer images, and patches CRDs with runtimeClassName: nvidia.

Changes:

  • Added --gpu-cluster flag parsing and conditional GPU logic (nvkind cluster creation, GPU Operator installation, extra image builds, CRD patching) to the unified hack/e2e-setup-cluster.sh.
  • Updated the test-e2e-setup-gpu-cluster Makefile target to pass --gpu-cluster to the setup script.
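A minimal sketch of the flag-gated structure described above (the function name and echoed messages are illustrative stand-ins, not the script's actual code):

```shell
#!/usr/bin/env bash
# Hypothetical sketch: gate GPU-specific setup behind a --gpu-cluster flag.
# Names and messages are illustrative, not copied from hack/e2e-setup-cluster.sh.

setup_cluster() {
  local gpu_cluster=false
  for arg in "$@"; do
    case "$arg" in
      --gpu-cluster) gpu_cluster=true ;;
    esac
  done

  if [ "$gpu_cluster" = true ]; then
    # GPU path: nvkind cluster, GPU Operator via Helm, extra images, CRD patching.
    echo "provisioning nvkind GPU cluster"
  else
    # CPU path: plain kind cluster; image builds and control-plane install are shared.
    echo "provisioning standard kind cluster"
  fi
}
```

Invoking `setup_cluster --gpu-cluster` takes the GPU branch; with no flag it falls through to the CPU path, so the shared steps stay in one place.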

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File                        | Description
hack/e2e-setup-cluster.sh   | Adds GPU cluster setup logic gated behind the new --gpu-cluster flag
Makefile                    | Updates the test-e2e-setup-gpu-cluster target to pass the --gpu-cluster flag

Comment on lines +184 to +194
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- ../../../manifests/overlays/runtimes
images:
- name: "${DATASET_INITIALIZER_CI_IMAGE_NAME}"
newTag: "${CONTROLLER_MANAGER_CI_IMAGE_TAG}"
- name: "${MODEL_INITIALIZER_CI_IMAGE_NAME}"
newTag: "${CONTROLLER_MANAGER_CI_IMAGE_TAG}"
- name: "${TRAINER_CI_IMAGE_NAME}"
newTag: "${CONTROLLER_MANAGER_CI_IMAGE_TAG}"

Copilot AI Mar 4, 2026


The kustomization YAML content in this heredoc has 4 spaces of leading indentation on every line, while the identical CPU-path heredoc at line 147 uses 2 spaces. Since cat <<EOF writes the content literally (including leading whitespace), the generated kustomization.yaml file will have every line prefixed with 4 extra spaces. Although YAML parsers typically accept this, it differs from both the original e2e-setup-gpu-cluster.sh (2-space indent) and the CPU-path heredoc in this same file, and could be confusing for manual inspection of generated artifacts. The indentation should be consistent with the other heredoc in this file (2 spaces).
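The literal-whitespace behavior the comment relies on is easy to confirm in isolation: `cat <<EOF` copies the heredoc body verbatim, leading spaces included (the file path below is just a scratch location for the demo):

```shell
# Write a heredoc whose body is indented by 4 spaces, as in the GPU-path heredoc.
# cat <<EOF performs variable expansion but leaves leading whitespace untouched.
cat <<EOF > /tmp/kustomization-demo.yaml
    apiVersion: kustomize.config.k8s.io/v1beta1
    kind: Kustomization
EOF

# The generated file keeps the 4 leading spaces on every line.
head -n 1 /tmp/kustomization-demo.yaml
```

(Only `<<-EOF` strips leading whitespace, and even then only tabs, so a space-indented heredoc always reaches the output file as-is.)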

Suggested change (re-indent the heredoc body from 4 spaces to 2, matching the CPU-path heredoc):

  apiVersion: kustomize.config.k8s.io/v1beta1
  kind: Kustomization
  resources:
  - ../../../manifests/overlays/runtimes
  images:
  - name: "${DATASET_INITIALIZER_CI_IMAGE_NAME}"
    newTag: "${CONTROLLER_MANAGER_CI_IMAGE_TAG}"
  - name: "${MODEL_INITIALIZER_CI_IMAGE_NAME}"
    newTag: "${CONTROLLER_MANAGER_CI_IMAGE_TAG}"
  - name: "${TRAINER_CI_IMAGE_NAME}"
    newTag: "${CONTROLLER_MANAGER_CI_IMAGE_TAG}"
Member


NA

@jaiakash
Member

jaiakash commented Mar 5, 2026

/ok-to-test

Signed-off-by: aniket2405 <aniketshaha2001@gmail.com>
Signed-off-by: aniket2405 <aniketshaha2001@gmail.com>
@aniket2405 aniket2405 force-pushed the script/setup-gpu-cluster2 branch from a9b7096 to 75cea58 on March 5, 2026 at 22:33
@jaiakash
Member

jaiakash commented Mar 7, 2026

Hi @aniket2405,

After reviewing this again, I realized that the initial approach I suggested—merging the scripts—doesn’t make much sense. Keeping the scripts separate is actually better, since it makes testing and reviewing easier.

Instead, what we can do is merge the CI workflows: test-e2e.yaml and test-e2e-gpu.yaml. The goal would be to have a single CI workflow that triggers both CPU and GPU E2E jobs.

@jaiakash
Member

jaiakash commented Mar 9, 2026

Hi @aniket2405 gentle reminder about above, let me know if I can help with anything.

Signed-off-by: aniket2405 <aniketshaha2001@gmail.com>
@google-oss-prow google-oss-prow bot added size/M and removed size/L labels Mar 9, 2026
Signed-off-by: aniket2405 <aniketshaha2001@gmail.com>
@google-oss-prow google-oss-prow bot added size/L and removed size/M labels Mar 9, 2026
@aniket2405
Author

Hi @jaiakash, testing is definitely easier and less complex with separate scripts. Agreed.
I've made the changes to merge the two CI workflows. Please review.
Thanks for the help!

Signed-off-by: aniket2405 <aniketshaha2001@gmail.com>
@jaiakash
Member

Overall it looks good to me; please check the CI failure. I will lgtm it then.

@andreyvelich
Member

@jaiakash Are there any objections to consolidating cluster creation into a single script?
Looking at this, we just need to use nvkind as the binary for the Kind cluster, install the GPU Operator, and patch the Runtimes with runtimeClassName.

Since we build Trainer images and install the control plane in the same way for the CPU/GPU cluster, we can consolidate it.

WDYT @jaiakash @aniket2405 ?

@jaiakash
Member

jaiakash commented Mar 10, 2026

My pov is that:

  • The cluster creation script for a normal or GPU cluster via nvkind should be a separate script. This makes debugging and testing easier, since we can run and iterate on the exact setup we need without modifying a shared script.
  • Also, patching the runtime is a hotfix we are doing until nvkind officially supports the newer CDI-based architecture (related: "meet a error when trying to use nvkind", NVIDIA/nvkind#61 (comment)).

Nevertheless, the GPU and CPU E2E test CI should be merged into a single workflow file, using a matrix as Aniket has done in commit a9683ef (this PR).
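A minimal sketch of what such a matrix-based merged workflow could look like (job names, make targets, and the trigger are illustrative assumptions, not the PR's actual file):

```yaml
name: E2E Test
on:
  pull_request: {}

jobs:
  e2e:
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
      matrix:
        # One job per accelerator type; both appear in the same workflow run.
        accelerator: [cpu, gpu]
    steps:
      - uses: actions/checkout@v4
      - name: Set up cluster
        # Hypothetical make targets; the real targets live in the repo's Makefile.
        run: |
          if [ "${{ matrix.accelerator }}" = "gpu" ]; then
            make test-e2e-setup-gpu-cluster
          else
            make test-e2e-setup-cluster
          fi
      - name: Run E2E tests
        run: make test-e2e
```

With `fail-fast: false`, a failure in the GPU leg does not cancel the CPU leg, so both results remain visible in the unified run.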

WDYT @aniket2405 ?

@aniket2405
Author

Hi @andreyvelich and @jaiakash,

I agree with @andreyvelich that a lot of the logic is shared, but I lean slightly toward @jaiakash's suggestion of keeping the scripts separate, for two main reasons from my pov:

  1. Keeping it in a separate GPU-specific script prevents the primary e2e-setup-cluster.sh from becoming cluttered with workarounds that might be removed once nvkind supports the newer CDI-based architecture.

  2. Modifying the script would require non-trivial gating logic, which can hamper readability.

WDYT?

Also, this is the correct commit to reference - 1640806

@andreyvelich
Member

@jaiakash Could you explain why you want two separate scripts for the GPU and CPU clusters?
Basically, we run the same commands to build images and deploy the Trainer control plane:
CPU: https://github.com/kubeflow/trainer/blob/master/hack/e2e-setup-cluster.sh#L42-L79
GPU: https://github.com/kubeflow/trainer/blob/master/hack/e2e-setup-gpu-cluster.sh#L116-L128

I think we can reduce duplication if we unify these scripts.


Development

Successfully merging this pull request may close these issues.

feat: refactor and merge setup-gpu-cluster into single script based on flag.

4 participants