feat: merge gpu-cluster setup into single e2e script via flag #3281
aniket2405 wants to merge 5 commits into kubeflow:master
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has not yet been approved; it needs approval from an approver for each of the affected files. Approvers can indicate their approval by writing `/approve` in a comment. The full list of commands accepted by this bot can be found here.
🎉 Welcome to the Kubeflow Trainer! 🎉 Thanks for opening your first PR! We're happy to have you as part of our community 🚀
Feel free to ask questions in the comments if you need any help or clarification!
Pull request overview
This PR merges the standalone hack/e2e-setup-gpu-cluster.sh GPU cluster setup script into the primary hack/e2e-setup-cluster.sh by adding a --gpu-cluster flag. When the flag is passed, the script conditionally provisions an nvkind GPU cluster, installs the NVIDIA GPU Operator via Helm, builds additional initializer/trainer images, and patches CRDs with runtimeClassName: nvidia.
Changes:
- Added `--gpu-cluster` flag parsing and conditional GPU logic (nvkind cluster creation, GPU Operator installation, extra image builds, CRD patching) to the unified `hack/e2e-setup-cluster.sh`.
- Updated the `test-e2e-setup-gpu-cluster` Makefile target to pass `--gpu-cluster` to the setup script.
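The flag handling described above might look roughly like the following. This is a hedged sketch only: the function names, variable name `GPU_CLUSTER`, and echo messages are illustrative and not taken from the actual `hack/e2e-setup-cluster.sh`.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of --gpu-cluster flag parsing; names are illustrative,
# not taken from the real script.
set -eo pipefail

parse_flags() {
  GPU_CLUSTER=false
  for arg in "$@"; do
    case "$arg" in
      --gpu-cluster) GPU_CLUSTER=true ;;
    esac
  done
}

setup_cluster() {
  if [ "$GPU_CLUSTER" = true ]; then
    # GPU path: nvkind cluster, GPU Operator via Helm, extra images, CRD patch.
    echo "GPU path: nvkind cluster, GPU Operator, extra images, CRD patch"
  else
    # Default CPU path: plain kind cluster.
    echo "CPU path: plain kind cluster"
  fi
}

parse_flags --gpu-cluster
setup_cluster
```

The same script thus serves both Makefile targets, with the GPU target simply appending the flag.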
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| hack/e2e-setup-cluster.sh | Adds GPU cluster setup logic gated behind the new `--gpu-cluster` flag |
| Makefile | Updates `test-e2e-setup-gpu-cluster` target to pass the `--gpu-cluster` flag |
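The Makefile change summarized above amounts to the GPU target invoking the unified script with the flag. A hedged sketch (the real targets likely set additional variables and prerequisites):

```makefile
# Hypothetical sketch of the updated targets; recipe details may differ.
.PHONY: test-e2e-setup-cluster test-e2e-setup-gpu-cluster

test-e2e-setup-cluster:
	bash hack/e2e-setup-cluster.sh

test-e2e-setup-gpu-cluster:
	bash hack/e2e-setup-cluster.sh --gpu-cluster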
hack/e2e-setup-cluster.sh (outdated review comment)
```yaml
    apiVersion: kustomize.config.k8s.io/v1beta1
    kind: Kustomization
    resources:
    - ../../../manifests/overlays/runtimes
    images:
    - name: "${DATASET_INITIALIZER_CI_IMAGE_NAME}"
      newTag: "${CONTROLLER_MANAGER_CI_IMAGE_TAG}"
    - name: "${MODEL_INITIALIZER_CI_IMAGE_NAME}"
      newTag: "${CONTROLLER_MANAGER_CI_IMAGE_TAG}"
    - name: "${TRAINER_CI_IMAGE_NAME}"
      newTag: "${CONTROLLER_MANAGER_CI_IMAGE_TAG}"
```
The kustomization YAML content in this heredoc has 4 spaces of leading indentation on every line, while the identical CPU-path heredoc at line 147 uses 2 spaces. Since cat <<EOF writes the content literally (including leading whitespace), the generated kustomization.yaml file will have every line prefixed with 4 extra spaces. Although YAML parsers typically accept this, it differs from both the original e2e-setup-gpu-cluster.sh (2-space indent) and the CPU-path heredoc in this same file, and could be confusing for manual inspection of generated artifacts. The indentation should be consistent with the other heredoc in this file (2 spaces).
Suggested change (re-indent the heredoc body to 2 spaces, matching the CPU-path heredoc):

```yaml
  apiVersion: kustomize.config.k8s.io/v1beta1
  kind: Kustomization
  resources:
  - ../../../manifests/overlays/runtimes
  images:
  - name: "${DATASET_INITIALIZER_CI_IMAGE_NAME}"
    newTag: "${CONTROLLER_MANAGER_CI_IMAGE_TAG}"
  - name: "${MODEL_INITIALIZER_CI_IMAGE_NAME}"
    newTag: "${CONTROLLER_MANAGER_CI_IMAGE_TAG}"
  - name: "${TRAINER_CI_IMAGE_NAME}"
    newTag: "${CONTROLLER_MANAGER_CI_IMAGE_TAG}"
```
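The heredoc behavior the review comment relies on is easy to demonstrate: `cat <<EOF` copies leading spaces into the output verbatim (only `<<-EOF` strips anything, and then only leading tabs). A minimal standalone example, unrelated to the script's actual content; `/tmp/indent-demo.yaml` is a scratch path:

```shell
#!/usr/bin/env bash
set -euo pipefail

# An unquoted heredoc preserves leading whitespace exactly as written,
# so the 4 spaces before "key" end up in the generated file.
cat <<EOF > /tmp/indent-demo.yaml
    key: value
EOF

# Count lines starting with exactly four spaces followed by "key".
grep -c '^    key' /tmp/indent-demo.yaml   # prints 1
```

This is why the 4-space heredoc produces a kustomization.yaml whose every line carries 4 extra spaces.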
/ok-to-test
Signed-off-by: aniket2405 <aniketshaha2001@gmail.com>
Force-pushed a9b7096 to 75cea58.
Hi @aniket2405, after reviewing this again, I realized that the initial approach I suggested (merging the scripts) doesn't make much sense. Keeping the scripts separate is actually better, since it makes testing and reviewing easier. Instead, we can merge the CI workflows test-e2e.yaml and test-e2e-gpu.yaml. The goal would be a single CI workflow that triggers both CPU and GPU E2E jobs.
Hi @aniket2405, gentle reminder about the above; let me know if I can help with anything.
Signed-off-by: aniket2405 <aniketshaha2001@gmail.com>
Hi @jaiakash, testing is definitely easier and less complex with separate scripts. Agreed.
Signed-off-by: aniket2405 <aniketshaha2001@gmail.com>
Overall it looks good to me; check the CI failure and I will lgtm it then.
@jaiakash Are there any objections to consolidating cluster creation into a single script? Since we build the Trainer images and install the control plane in the same way for the CPU and GPU clusters, we can consolidate it. WDYT @jaiakash @aniket2405?
My pov is that -
Nevertheless, the GPU and CPU E2E test CI should be merged into a single workflow file, using a matrix as Aniket has done in his commit. WDYT @aniket2405?
Hi @andreyvelich and @jaiakash, I agree with @andreyvelich that a lot of the logic is shared, but I lean slightly toward @jaiakash's suggestion of keeping the scripts separate, for two main reasons from my pov -
WDYT? Also, this is the correct commit to reference: 1640806
@jaiakash Could you explain why you want two separate scripts for the GPU and CPU clusters? I think we can reduce duplication if we unify this script.
What this PR does / why we need it:
This PR consolidates the E2E testing infrastructure by merging the GPU testing job into the main CPU testing workflow. This provides a unified view of all end-to-end results while maintaining separate setup scripts for each.
Combined test-e2e.yaml and test-e2e-gpu.yaml into a single entry point. Both CPU and GPU E2E tests are now triggered as parallel jobs within the same workflow.
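A single workflow fanning out into CPU and GPU jobs via a matrix, as described above, might look roughly like the following. This is a sketch under stated assumptions: the job name, runner label, and make targets are illustrative, not taken from the PR's actual workflow file.

```yaml
# Hypothetical sketch of a merged E2E workflow; names and targets are illustrative.
name: E2E Test
on:
  pull_request: {}
jobs:
  e2e:
    strategy:
      fail-fast: false
      matrix:
        include:
          - mode: cpu
            setup-target: test-e2e-setup-cluster
          - mode: gpu
            setup-target: test-e2e-setup-gpu-cluster
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Each matrix entry runs its own cluster setup, then the E2E suite.
      - run: make ${{ matrix.setup-target }}
      - run: make test-e2e
```

With `fail-fast: false`, a failure in one mode still lets the other mode finish, which preserves the unified view of all E2E results the description aims for.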
Testing Performed:
- Verified YAML syntax for the combined workflow.
- The CI tests will handle the remaining verification.
Which issue(s) this PR fixes (optional, in `Fixes #<issue number>, #<issue number>, ...` format; will close the issue(s) when PR gets merged):
Fixes #2812
Checklist: