test: initial implementation of SDK e2e #488
openshift-merge-bot[bot] merged 5 commits into opendatahub-io:main from
Conversation
Walkthrough
Adds tests and helpers to run a Kubeflow/Papermill-driven distributed FashionMNIST training: a new Jupyter notebook, orchestration test functions and utilities for notebook lifecycle and cluster preparation, a sanity test entry, and a .gitignore rule ignoring `.vscode/*`.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Test as Test Runner
    participant K8s as Kubernetes API
    participant Notebook as Notebook Pod
    participant Trainer as Kubeflow Trainer
    Test->>K8s: Create namespace & Ensure ServiceAccount
    Test->>K8s: Create ConfigMap (notebook bytes) & RWX PVC
    Test->>K8s: Create Notebook CR (papermill command)
    rect rgb(235,247,255)
        Note over Notebook: Pod boots, installs deps, runs papermill (gloo, CPU)
        Notebook->>Trainer: Discover runtimes & submit CustomTrainer job
        Trainer->>Trainer: Launch distributed worker steps
        Trainer->>Notebook: Stream job logs/status
        Notebook->>Notebook: Emit completion marker in logs (NOTEBOOK_STATUS: SUCCESS/FAILURE)
    end
    Test->>Notebook: Poll logs until SUCCESS/FAILURE or timeout
    alt SUCCESS
        Test->>K8s: Delete Notebook, PVC, ConfigMap (cleanup)
    else FAILURE or timeout
        Test->>K8s: Collect logs, delete resources
    end
```
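The poll step at the bottom of the diagram reduces to scanning pod log text for the completion marker. A minimal Go sketch of that check (the marker strings follow the diagram; the function name is illustrative, not from the PR):

```go
package main

import (
	"fmt"
	"strings"
)

// notebookStatus scans a chunk of pod logs for the completion marker the
// notebook emits ("NOTEBOOK_STATUS: SUCCESS" or "NOTEBOOK_STATUS: FAILURE").
// It returns the detected status and whether any marker was found; callers
// keep polling fresh logs until a marker appears or the timeout elapses.
func notebookStatus(logs string) (status string, found bool) {
	for _, line := range strings.Split(logs, "\n") {
		switch {
		case strings.Contains(line, "NOTEBOOK_STATUS: SUCCESS"):
			return "SUCCESS", true
		case strings.Contains(line, "NOTEBOOK_STATUS: FAILURE"):
			return "FAILURE", true
		}
	}
	return "", false
}

func main() {
	logs := "installing deps...\nrunning papermill...\nNOTEBOOK_STATUS: SUCCESS\n"
	status, found := notebookStatus(logs)
	fmt.Println(status, found) // prints: SUCCESS true
}
```

In the real test this check would sit inside a retry loop against fresh log output rather than a single string.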
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Pre-merge checks and finishing touches: ✅ Passed checks (3 passed)
Actionable comments posted: 7
🧹 Nitpick comments (4)
tests/trainer/utils/utils_cluster_prep.go (2)
53-55: Consider making expected runtimes configurable. The hardcoded list of ClusterTrainingRuntimes (`torch-cuda-241`, etc.) conflicts with the PR's stated requirement for a "universal" runtime, and the TODO comment suggests this list will change. Consider accepting expected runtime names as a parameter or reading from an environment variable to make the test more flexible:

```diff
-func EnsureTrainerClusterReady(t *testing.T, test Test) {
+func EnsureTrainerClusterReady(t *testing.T, test Test, expectedRuntimes ...string) {
+	if len(expectedRuntimes) == 0 {
+		// Default runtimes if none specified
+		expectedRuntimes = []string{"torch-cuda-241", "torch-cuda-251", "torch-rocm-241", "torch-rocm-251"}
+	}
 	...
-	for _, name := range []string{"torch-cuda-241", "torch-cuda-251", "torch-rocm-241", "torch-rocm-251"} {
+	for _, name := range expectedRuntimes {
```
64-66: Hardcoded ServiceAccount name reduces reusability. The ServiceAccount name "jupyter-nb-kube-3aadmin" is hardcoded and matches `common.NOTEBOOK_CONTAINER_NAME`, but this tight coupling limits flexibility for other test scenarios. Consider accepting the SA name as a parameter:

```diff
-func EnsureNotebookRBAC(t *testing.T, test Test, namespace string) {
+func EnsureNotebookRBAC(t *testing.T, test Test, namespace string, saName string) {
 	t.Helper()
-	saName := "jupyter-nb-kube-3aadmin"
```

Alternatively, if this SA name is always tied to notebooks, accept it from `common.NOTEBOOK_CONTAINER_NAME` explicitly to make the dependency clear.

tests/trainer/resources/mnist.ipynb (1)
17-125: Consider adding training validation. The training function completes successfully but doesn't validate that the model actually learned (e.g., checking loss decreased or accuracy improved). For a sanity test, consider adding basic validation. For example, after training completes, you could add:

```python
# At the end of train_fashion_mnist, before dist.destroy_process_group()
if dist.get_rank() == 0:
    if loss.item() > initial_loss:
        print("WARNING: Loss did not decrease during training")
    print(f"Final loss: {loss.item():.6f}")
```

tests/trainer/utils/utils_notebook.go (1)
49-53: Unused helper function. `CreateNotebookFromBytes` is defined but not used anywhere in this PR. The test in `fashion_mnist_tests.go` creates the ConfigMap and Notebook separately. Consider either:
- Using this helper in `fashion_mnist_tests.go` to reduce duplication
- Removing it if it's not needed yet
- Adding a comment explaining it's for future use
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
- `.gitignore` (1 hunks)
- `tests/trainer/kubeflow_sdk_test.go` (1 hunks)
- `tests/trainer/resources/mnist.ipynb` (1 hunks)
- `tests/trainer/sdk_tests/fashion_mnist_tests.go` (1 hunks)
- `tests/trainer/utils/utils_cluster_prep.go` (1 hunks)
- `tests/trainer/utils/utils_notebook.go` (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (4)
tests/trainer/utils/utils_notebook.go (4)
- tests/common/support/test.go (1): Test (34-45)
- tests/common/notebook.go (3): ContainerSize (72-72), CreateNotebook (104-185), NOTEBOOK_CONTAINER_NAME (37-37)
- tests/common/support/core.go (4): CreateConfigMap (33-54), CreatePersistentVolumeClaim (242-276), AccessModes (235-240), GetPods (111-116)
- tests/common/support/support.go (1): TestTimeoutLong (35-35)

tests/trainer/kubeflow_sdk_test.go (2)
- tests/common/test_tag.go (2): Tags (32-40), Sanity (48-50)
- tests/trainer/sdk_tests/fashion_mnist_tests.go (1): RunFashionMnistCpuDistributedTraining (39-86)

tests/trainer/sdk_tests/fashion_mnist_tests.go (8)
- tests/common/support/test.go (2): T (90-102), With (61-63)
- tests/trainer/utils/utils_cluster_prep.go (2): EnsureTrainerClusterReady (33-56), EnsureNotebookRBAC (60-83)
- tests/common/environment.go (2): GetNotebookUserName (70-76), GetNotebookUserToken (78-84)
- tests/common/support/rbac.go (1): CreateUserRoleBindingWithClusterRole (206-238)
- tests/common/support/core.go (3): CreateConfigMap (33-54), CreatePersistentVolumeClaim (242-276), AccessModes (235-240)
- tests/trainer/utils/utils_notebook.go (3): BuildPapermillShellCmd (36-46), WaitForNotebookPodRunning (56-64), PollNotebookCompletionMarker (67-85)
- tests/common/notebook.go (4): CreateNotebook (104-185), ContainerSizeSmall (75-75), DeleteNotebook (187-190), Notebooks (192-204)
- tests/common/support/support.go (2): TestTimeoutLong (35-35), TestTimeoutDouble (36-36)

tests/trainer/utils/utils_cluster_prep.go (4)
- tests/common/support/test.go (2): T (90-102), Test (34-45)
- tests/common/support/client.go (1): Client (39-50)
- tests/common/support/core.go (2): ServiceAccount (194-200), ServiceAccounts (207-219)
- tests/common/support/rbac.go (4): CreateClusterRole (48-70), CreateClusterRoleBinding (135-169), CreateRole (27-46), CreateRoleBinding (72-102)
🪛 GitHub Actions: Verify Generated Files and Import Organization
tests/trainer/utils/utils_notebook.go
[error] 1-1: openshift-goimports check failed: file is not sorted. Run 'make imports' to fix imports.
tests/trainer/utils/utils_cluster_prep.go
[error] 1-1: openshift-goimports check failed: file is not sorted. Run 'make imports' to fix imports.
🔇 Additional comments (5)
.gitignore (1)
2-2: LGTM! Adding `.vscode/*` to gitignore is a standard best practice that prevents user-specific IDE settings and workspace configurations from being committed to version control. This is especially helpful for e2e test development where contributors may use VSCode.

tests/trainer/kubeflow_sdk_test.go (1)
26-30: LGTM! Clean test entry point that properly tags the test as Sanity and delegates to the actual test implementation. The comment on line 29 provides a clear extension point for additional tests.
tests/trainer/sdk_tests/fashion_mnist_tests.go (2)
39-86: Well-structured test implementation. The test follows good practices:
- Proper test namespace isolation
- Prerequisite checks before execution
- RBAC setup
- Resource cleanup with defer
- Clear error messages
The flow is logical and the use of helper utilities keeps the test readable.
33-36: Relative path verified: notebook file found at expected location. The notebook file exists at `tests/trainer/resources/mnist.ipynb`, confirming the relative path `"resources/"` resolves correctly when tests run from the package directory.

tests/trainer/utils/utils_notebook.go (1)
36-46: LGTM! The Papermill shell command construction is clean and handles optional extra packages appropriately. The `set -e` ensures failures are caught, and the marker file pattern is a good approach for async completion detection.
```go
if out, err := exec.Command("kubectl", "get", "crd", "jobsets.jobset.x-k8s.io").CombinedOutput(); err != nil {
	t.Fatalf("JobSet CRD missing: %v\n%s", err, string(out))
}
```
Replace kubectl shell command with Kubernetes client API.
Using exec.Command("kubectl", ...) is brittle and introduces external dependencies. The test framework already provides a Kubernetes client via test.Client().
Apply this diff to use the Kubernetes API client:

```diff
-	if out, err := exec.Command("kubectl", "get", "crd", "jobsets.jobset.x-k8s.io").CombinedOutput(); err != nil {
-		t.Fatalf("JobSet CRD missing: %v\n%s", err, string(out))
-	}
+	_, err := test.Client().Dynamic().Resource(schema.GroupVersionResource{
+		Group:    "apiextensions.k8s.io",
+		Version:  "v1",
+		Resource: "customresourcedefinitions",
+	}).Get(test.Ctx(), "jobsets.jobset.x-k8s.io", metav1.GetOptions{})
+	test.Expect(err).NotTo(HaveOccurred(), "JobSet CRD missing")
```

Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In tests/trainer/utils/utils_cluster_prep.go around lines 37 to 39, replace the
exec.Command("kubectl", ...) call with a Kubernetes API call using the provided
test.Client(): call
test.Client().ApiextensionsV1().CustomResourceDefinitions().Get(ctx,
"jobsets.jobset.x-k8s.io", metav1.GetOptions{}) (using a context, e.g.,
context.Background()), and if it returns an error fail the test with t.Fatalf
including the error; remove the exec.Command usage and its output handling and
ensure the file imports context and metav1 (and any apiextensions client types
if missing).
This is a good suggestion!
Will switch to it most likely but after comments from others.
@MStokluska, sounds good! Feel free to implement the change once you've gathered feedback from the other reviewers. I'm here if you need any help with the implementation.
This code is just temporary and will be removed once Trainer v2 is onboarded into ODH/RHOAI, right?
Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!
If so then we can keep it as it is now, not to invest much effort into something what will be deleted soon.
```go
if out, err := exec.Command("kubectl", "-n", "opendatahub", "wait", "--for=condition=available", "--timeout=180s", "deploy/kubeflow-trainer-controller-manager").CombinedOutput(); err != nil {
	t.Fatalf("Trainer controller not available: %v\n%s", err, string(out))
}
```
Replace kubectl shell command with Kubernetes client API and remove hardcoded namespace.
Using exec.Command("kubectl", ...) is brittle. Additionally, the namespace "opendatahub" is hardcoded, reducing reusability.
Apply this diff to use the Kubernetes API client with Eventually for polling:

```diff
-	if out, err := exec.Command("kubectl", "-n", "opendatahub", "wait", "--for=condition=available", "--timeout=180s", "deploy/kubeflow-trainer-controller-manager").CombinedOutput(); err != nil {
-		t.Fatalf("Trainer controller not available: %v\n%s", err, string(out))
-	}
+	odhNamespace := "opendatahub" // Consider making this configurable
+	test.Eventually(func(g Gomega) {
+		deploy, err := test.Client().Core().AppsV1().Deployments(odhNamespace).Get(test.Ctx(), "kubeflow-trainer-controller-manager", metav1.GetOptions{})
+		g.Expect(err).NotTo(HaveOccurred())
+		g.Expect(deploy.Status.Conditions).To(ContainElement(MatchFields(IgnoreExtras, Fields{
+			"Type":   Equal(appsv1.DeploymentAvailable),
+			"Status": Equal(corev1.ConditionTrue),
+		})))
+	}, 180*time.Second).Should(Succeed(), "Trainer controller not available")
```

Committable suggestion skipped: line range outside the PR's diff.
🧩 Analysis chain
🏁 Script executed:

```shell
#!/bin/bash
# Fetch all review comments from this PR to find the "above" comment being referenced
gh pr view 488 --repo opendatahub-io/distributed-workloads --json comments,reviews --jq '.reviews[].comments[] | select(.author.login == "coderabbitai" or .author.login == "MStokluska") | {path: .path, line: .line, body: .body, author: .author.login}' | head -50
```

Length of output: 25

🏁 Script executed:

```shell
#!/bin/bash
# Try fetching review comments using the GitHub API directly
gh api repos/opendatahub-io/distributed-workloads/pulls/488/comments --jq '.[] | select(.user.login == "coderabbitai[bot]" or .user.login == "MStokluska") | {path: .path, line: .line, body: .body, author: .user.login}' | head -100
```

Length of output: 18212
@MStokluska, understood! I'll wait for your updates after you've gathered feedback from the team. Feel free to reach out if you need any assistance with the implementation.
🐰✨
Actionable comments posted: 0
♻️ Duplicate comments (2)
tests/trainer/utils/utils_notebook.go (2)
17-32: Fix import organization per pipeline failure. The pipeline check failed because imports are not properly sorted. Run `make imports` to fix the import organization.

66-85: Replace kubectl shell command with Kubernetes client API. Using `exec.Command("kubectl", ...)` is brittle and requires kubectl in PATH. This should use the Kubernetes client API via the pod exec subresource as detailed in the previous review comment.
🧹 Nitpick comments (1)
tests/trainer/utils/utils_notebook.go (1)
55-64: Eliminate redundant GetPods call. Lines 62-63 repeat the same `GetPods` call that was already executed in the `Eventually` block (lines 58-60), which is inefficient. Apply this diff to store and reuse the pods result:

```diff
 func WaitForNotebookPodRunning(test Test, namespace string) (string, string) {
 	labelSelector := fmt.Sprintf("notebook-name=%s", common.NOTEBOOK_CONTAINER_NAME)
+	var pods []corev1.Pod
 	test.Eventually(func() []corev1.Pod {
-		return GetPods(test, namespace, metav1.ListOptions{LabelSelector: labelSelector, FieldSelector: "status.phase=Running"})
+		pods = GetPods(test, namespace, metav1.ListOptions{LabelSelector: labelSelector, FieldSelector: "status.phase=Running"})
+		return pods
 	}, TestTimeoutLong).Should(HaveLen(1), "Expected exactly one notebook pod")
-	pods := GetPods(test, namespace, metav1.ListOptions{LabelSelector: labelSelector, FieldSelector: "status.phase=Running"})
 	return pods[0].Name, pods[0].Spec.Containers[0].Name
 }
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- `tests/trainer/utils/utils_cluster_prep.go` (1 hunks)
- `tests/trainer/utils/utils_notebook.go` (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- tests/trainer/utils/utils_cluster_prep.go
🧰 Additional context used
🧬 Code graph analysis (1)
tests/trainer/utils/utils_notebook.go (4)
- tests/common/support/test.go (1): Test (34-45)
- tests/common/notebook.go (3): ContainerSize (72-72), CreateNotebook (104-185), NOTEBOOK_CONTAINER_NAME (37-37)
- tests/common/support/core.go (4): CreateConfigMap (33-54), CreatePersistentVolumeClaim (242-276), AccessModes (235-240), GetPods (111-116)
- tests/common/support/support.go (1): TestTimeoutLong (35-35)
🔇 Additional comments (2)
tests/trainer/utils/utils_notebook.go (2)
34-46: LGTM! Papermill command construction is sound. The shell command properly installs dependencies, executes the notebook with papermill, writes a completion marker, and uses `sleep infinity` to keep the container running for inspection. The use of `set -e` ensures proper error propagation.

48-53: LGTM! Notebook setup orchestration is correct. The function properly orchestrates the creation of a ConfigMap, PVC (10Gi is reasonable for test scenarios), and Notebook CR, passing through all necessary parameters.
```python
dataset = datasets.FashionMNIST(
    "./data",
    train=True,
    download=True,
```
This will become an issue for running tests on disconnected clusters.
In Trainer v1 tests we uploaded dataset on AWS S3, it is downloaded from there if AWS env variables are declared - https://github.com/opendatahub-io/distributed-workloads/blob/main/tests/kfto/resources/kfto_sdk_mnist.py#L67
```python
for runtime in client.list_runtimes():
    print(runtime)
    if runtime.name == "universal":  # Update to actual universal image runtime once available
        torch_runtime = runtime
```
Why this approach instead of getting universal runtime by client.get_runtime("universal")?
That notebook is basically a copy of the upstream test that the SDK upstream relies on. I wanted to keep it as close as possible to the original, but I guess there's no harm in moving to get_runtime. Thanks Karel!
```python
    "cpu": 2,
    "memory": "8Gi",
},
packages_to_install=["torchvision"],
```
I think it would be good to install specific version, to make sure that a future upgrade doesn't break test.
yes I think it makes sense. Thanks.
```go
}
// TODO: Extend / tweak with universal image runtime once available
for _, name := range []string{"torch-cuda-241", "torch-cuda-251", "torch-rocm-241", "torch-rocm-251"} {
	test.Expect(found[name]).To(BeTrue(), fmt.Sprintf("Expected ClusterTrainingRuntime '%s' not found", name))
```
Test verifying what TrainingRuntimes are available is already implemented in https://github.com/opendatahub-io/distributed-workloads/blob/main/tests/trainer/custom_training_runtimes_test.go#L63
Is there any specific reason to check it here too?
Initially I was planning on making this test a smoke test; given that your runtime check is also smoke, I wasn't sure which one would run first, so I decided to add the additional check to capture the missing-runtime error if it happens.
But since I've moved to sanity, and I'm assuming sanity will run after smoke - I guess it's fine to remove this check? WDYT?
```go
// EnsureNotebookRBAC sets up the Notebook ServiceAccount and RBAC so that notebooks can
// read ClusterTrainingRuntimes (cluster-scoped), and create/read TrainJobs and pod logs in the namespace.
func EnsureNotebookRBAC(t *testing.T, test Test, namespace string) {
```
I would rather prefer passing user token into SDK client to provide needed access.
IMHO that is the approach which majority of customers would take.
Agree, thanks Karel
```go
// PollNotebookCompletionMarker polls the given marker file inside the notebook pod until SUCCESS/FAILURE or timeout.
func PollNotebookCompletionMarker(test Test, namespace, podName, containerName, marker string, timeout time.Duration) error {
	var finalErr error
	test.Eventually(func() bool {
```
Thinking about whether it would be simpler to print status into Workbench log directly.
It could potentially mix with other log messages produced by Workbench, but it would avoid having to access Pod filesystem.
WDYT?
```go
// Wait for the Notebook Pod and get pod/container names
podName, containerName := trainerutils.WaitForNotebookPodRunning(test, namespace.Name)
```
Would it have sense to add assertion checking that TrainJob is created and successfully finished?
Yes, I think this is very important :)
Thanks
Hey, as agreed, we are going to leave it out.
hey @sutaakar
Actionable comments posted: 1
♻️ Duplicate comments (1)
tests/trainer/utils/utils_notebook.go (1)
34-39: Helpers are solid; consider eventually replacing `kubectl logs` with Kubernetes client calls. The three helpers fit together nicely:
- `CreateNotebookFromBytes` abstracts ConfigMap + Notebook + PVC creation.
- `WaitForNotebookPodRunning` correctly waits for a single Running notebook pod using the shared label/constant.
- `PollNotebookLogsForStatus` gives a clean SUCCESS/FAILURE signal based on log markers.

One medium-term improvement (already raised earlier) is the reliance on `exec.Command("kubectl", ...)` in `PollNotebookLogsForStatus`. It couples the tests to an external binary and kubeconfig, even though you already have a Kubernetes client in `test.Client()` that can stream logs directly. When these tests move from "temporary" to more permanent, it would be worth switching to a client-go-based log reader (or pod exec) to remove the `kubectl` dependency and keep everything within the API client.

```shell
#!/bin/bash
# Inventory remaining direct kubectl usages to clean up once the approach is finalized.
rg -n 'exec.Command\("kubectl"' tests -C2 || true
```

Also applies to: 41-50, 52-71
🧹 Nitpick comments (2)
tests/trainer/utils/utils_cluster_prep.go (1)
31-38: ServiceAccount helper looks good; consider reusing shared constant for the name. The creation logic and `IsAlreadyExists` handling are solid. To avoid future drift with other helpers (e.g., `common.NOTEBOOK_CONTAINER_NAME` used for notebook naming and labels), consider sourcing `saName` from that shared constant instead of duplicating the literal "jupyter-nb-kube-3aadmin". This keeps the SA name aligned across notebook creation, deletion, and pod selection helpers.
tests/trainer/resources/mnist.ipynb (1)
47-79: Align comment with current behavior and tidy unused unpacked variable. Two small polish items in the training cell:

- The comment

```python
# Prefer shared PVC if present; else fallback to internet download (rank 0 only)
```

no longer matches the logic:

```python
use_pvc = os.path.isdir(pvc_raw) and any(os.scandir(pvc_raw))
if not use_pvc:
    raise RuntimeError("Shared PVC not mounted or empty at /mnt/shared/FashionMNIST/raw; this test requires a pre-populated RWX PVC")
```

Since the function now strictly requires a pre-populated PVC (and upstream cells handle S3/HTTP fallback), consider updating the comment to reflect that requirement to avoid confusion for future readers.

- In `_read_idx_labels`, `num` is unpacked but unused:

```python
magic, num = struct.unpack(">II", f.read(8))
```

Changing this to `magic, _ = struct.unpack(">II", f.read(8))` (or similar) makes the intent explicit and keeps linters quiet.

These are non-functional cleanups but will make the notebook easier to maintain.

Also applies to: 83-97
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
- `tests/trainer/kubeflow_sdk_test.go` (1 hunks)
- `tests/trainer/resources/mnist.ipynb` (1 hunks)
- `tests/trainer/sdk_tests/fashion_mnist_tests.go` (1 hunks)
- `tests/trainer/utils/utils_cluster_prep.go` (1 hunks)
- `tests/trainer/utils/utils_notebook.go` (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- tests/trainer/kubeflow_sdk_test.go
🧰 Additional context used
🧬 Code graph analysis (3)
tests/trainer/sdk_tests/fashion_mnist_tests.go (11)
- tests/common/support/test.go (2): T (90-102), With (61-63)
- tests/trainer/utils/utils_cluster_prep.go (1): EnsureNotebookServiceAccount (31-39)
- tests/common/environment.go (2): GetNotebookUserName (70-76), GetNotebookUserToken (78-84)
- tests/common/support/rbac.go (1): CreateUserRoleBindingWithClusterRole (206-238)
- tests/common/support/core.go (4): CreateConfigMap (33-54), CreatePersistentVolumeClaim (242-276), AccessModes (235-240), StorageClassName (228-233)
- tests/common/support/environment.go (5): GetStorageBucketDefaultEndpoint (152-155), GetStorageBucketAccessKeyId (162-165), GetStorageBucketSecretKey (167-170), GetStorageBucketName (172-175), GetStorageBucketMnistDir (177-180)
- tests/common/support/storage.go (1): GetRWXStorageClass (23-36)
- tests/common/support/config.go (1): GetOpenShiftApiUrl (44-56)
- tests/common/notebook.go (4): CreateNotebook (104-185), ContainerSizeSmall (75-75), DeleteNotebook (187-190), Notebooks (192-204)
- tests/common/support/support.go (2): TestTimeoutLong (35-35), TestTimeoutDouble (36-36)
- tests/trainer/utils/utils_notebook.go (2): WaitForNotebookPodRunning (42-50), PollNotebookLogsForStatus (53-72)

tests/trainer/utils/utils_notebook.go (4)
- tests/common/support/test.go (1): Test (34-45)
- tests/common/notebook.go (3): ContainerSize (72-72), CreateNotebook (104-185), NOTEBOOK_CONTAINER_NAME (37-37)
- tests/common/support/core.go (4): CreateConfigMap (33-54), CreatePersistentVolumeClaim (242-276), AccessModes (235-240), GetPods (111-116)
- tests/common/support/support.go (1): TestTimeoutLong (35-35)

tests/trainer/utils/utils_cluster_prep.go (3)
- tests/common/support/test.go (2): T (90-102), Test (34-45)
- tests/common/support/core.go (2): ServiceAccount (194-200), ServiceAccounts (207-219)
- tests/common/support/client.go (1): Client (39-50)
🪛 Ruff (0.14.5)
tests/trainer/resources/mnist.ipynb
62-62: Avoid specifying long messages outside the exception class (TRY003)
70-70: Avoid specifying long messages outside the exception class (TRY003)
76-76: Unpacked variable `num` is never used; prefix it with an underscore or any other dummy variable pattern (RUF059)
78-78: Avoid specifying long messages outside the exception class (TRY003)
87-87: Avoid specifying long messages outside the exception class (TRY003)
190-190: Do not catch blind exception: Exception (BLE001)
207-207: Do not catch blind exception: Exception (BLE001)
298-298: Do not catch blind exception: Exception (BLE001)
303-305: try-except-pass detected, consider logging the exception (S110)
303-303: Do not catch blind exception: Exception (BLE001)
311-311: Do not catch blind exception: Exception (BLE001)
348-348: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected. (S310)
358-358: Do not catch blind exception: Exception (BLE001)
392-392: Do not catch blind exception: Exception (BLE001)
398-398: Within an except clause, raise exceptions with `raise ... from err` or `raise ... from None` to distinguish them from errors in exception handling (B904)
398-398: Avoid specifying long messages outside the exception class (TRY003)
@MStokluska Instead of copying the notebook, would it be possible to reference it, in a similar way to what we do for PR testing here?

With the amount of edits required - not really. For example, the shared PV between the notebook and trainer workers is needed due to how we download the dataset to prep for isolated envs (and there are more subtle differences..)
```python
    print("[notebook] Internet download completed successfully")
except Exception as e:
    print(f"[notebook] Internet download failed: {e}")
    print("[notebook] WARNING: Dataset may be incomplete!")
```
I would expect the test to fail in this case, WDYT?
It doesn't make sense to proceed without the dataset.
True, will fix it up. Thanks Karel
```python
    torch_runtime = client.get_runtime("torch-distributed")
except Exception:
    torch_runtime = next(
        (r for r in client.list_runtimes() if r.name == "torch-distributed"),
```
What is the purpose of this?
I would expect that if `client.get_runtime("torch-distributed")` fails then torch-distributed doesn't exist.
Thinking whether it would have sense to pass runtime as parameter/env variable for the notebook.
Will this notebook run for both CUDA and ROCm runtimes?
Edit - the notebook doesn't use GPU, so no reason to check specific runtimes.
So the expectation here is that "torch-distributed" will be available.
The additional check is just in case there's something wrong with the get_runtime vs list_runtimes - but now that you mentioned it, I think it might make more sense to just get_runtime and fail if not there.
```go
func PollNotebookLogsForStatus(test Test, namespace, podName, containerName string, timeout time.Duration) error {
	var finalErr error
	test.Eventually(func() bool {
		out, err := exec.Command("kubectl", "-n", namespace, "logs", podName, "-c", containerName, "--tail=2000").CombinedOutput()
```
You can reuse PodLog function here - https://github.com/opendatahub-io/distributed-workloads/blob/main/tests/common/support/core.go#L118
Awesome, thank you.
Force-pushed from 70f3c3e to 8ae0dee
@sutaakar I think all comments are addressed, let me know if you think there's anything else that I should improve. Thanks!
Actionable comments posted: 0
♻️ Duplicate comments (1)
tests/trainer/sdk_tests/fashion_mnist_tests.go (1)
112-114: Refactor to idiomatic Gomega assertion. The current pattern is non-idiomatic: checking `if err != nil` and then expecting it `To(Succeed())` will always fail when err is non-nil. Apply this refactor for clarity:

```diff
-	if err := trainerutils.PollNotebookLogsForStatus(test, namespace.Name, podName, containerName, support.TestTimeoutDouble); err != nil {
-		test.Expect(err).To(Succeed(), "Notebook execution reported FAILURE")
-	}
+	err := trainerutils.PollNotebookLogsForStatus(test, namespace.Name, podName, containerName, support.TestTimeoutDouble)
+	test.Expect(err).NotTo(HaveOccurred(), "Notebook execution reported FAILURE")
```
🧹 Nitpick comments (1)
tests/trainer/resources/mnist.ipynb (1)
401-421: Consider pinning the Kubeflow Trainer SDK version. For test reproducibility, you may want to pin the `kubeflow-trainer` package version in the notebook environment or shell command (similar to how `boto3==1.34.162` is pinned on line 90 of `fashion_mnist_tests.go`). This ensures future SDK changes don't unexpectedly break the test. If testing against the latest SDK version is intentional, you can safely ignore this suggestion.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
- `.gitignore` (1 hunks)
- `tests/trainer/kubeflow_sdk_test.go` (1 hunks)
- `tests/trainer/resources/mnist.ipynb` (1 hunks)
- `tests/trainer/sdk_tests/fashion_mnist_tests.go` (1 hunks)
- `tests/trainer/utils/utils_cluster_prep.go` (1 hunks)
- `tests/trainer/utils/utils_notebook.go` (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
- .gitignore
- tests/trainer/kubeflow_sdk_test.go
- tests/trainer/utils/utils_notebook.go
🧰 Additional context used
🧬 Code graph analysis (2)
tests/trainer/sdk_tests/fashion_mnist_tests.go (11)
- tests/common/support/test.go (2): T (90-102), With (61-63)
- tests/trainer/utils/utils_cluster_prep.go (1): EnsureNotebookServiceAccount (31-39)
- tests/common/environment.go (2): GetNotebookUserName (70-76), GetNotebookUserToken (78-84)
- tests/common/support/rbac.go (1): CreateUserRoleBindingWithClusterRole (206-238)
- tests/common/support/core.go (4): CreateConfigMap (33-54), CreatePersistentVolumeClaim (242-276), AccessModes (235-240), StorageClassName (228-233)
- tests/common/support/environment.go (5): GetStorageBucketDefaultEndpoint (152-155), GetStorageBucketAccessKeyId (162-165), GetStorageBucketSecretKey (167-170), GetStorageBucketName (172-175), GetStorageBucketMnistDir (177-180)
- tests/common/support/storage.go (1): GetRWXStorageClass (23-36)
- tests/common/support/config.go (1): GetOpenShiftApiUrl (44-56)
- tests/common/notebook.go (4): CreateNotebook (104-185), ContainerSizeSmall (75-75), DeleteNotebook (187-190), Notebooks (192-204)
- tests/common/support/support.go (2): TestTimeoutLong (35-35), TestTimeoutDouble (36-36)
- tests/trainer/utils/utils_notebook.go (2): WaitForNotebookPodRunning (41-49), PollNotebookLogsForStatus (52-81)

tests/trainer/utils/utils_cluster_prep.go (3)
- tests/common/support/test.go (2): T (90-102), Test (34-45)
- tests/common/support/core.go (2): ServiceAccount (194-200), ServiceAccounts (207-219)
- tests/common/support/client.go (1): Client (40-52)
🪛 Ruff (0.14.5)
tests/trainer/resources/mnist.ipynb
62-62: Avoid specifying long messages outside the exception class
(TRY003)
70-70: Avoid specifying long messages outside the exception class
(TRY003)
76-76: Unpacked variable num is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
78-78: Avoid specifying long messages outside the exception class
(TRY003)
87-87: Avoid specifying long messages outside the exception class
(TRY003)
190-190: Do not catch blind exception: Exception
(BLE001)
207-207: Do not catch blind exception: Exception
(BLE001)
298-298: Do not catch blind exception: Exception
(BLE001)
303-305: try-except-pass detected, consider logging the exception
(S110)
303-303: Do not catch blind exception: Exception
(BLE001)
311-311: Do not catch blind exception: Exception
(BLE001)
348-348: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.
(S310)
358-358: Do not catch blind exception: Exception
(BLE001)
360-360: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
360-360: Avoid specifying long messages outside the exception class
(TRY003)
393-393: Avoid specifying long messages outside the exception class
(TRY003)
🔇 Additional comments (1)
tests/trainer/utils/utils_cluster_prep.go (1)
29-39: LGTM! Clean helper implementation. The function correctly ensures the ServiceAccount exists, properly handles `AlreadyExists` errors, and fails with a clear message on genuine creation failures. The implementation is straightforward and follows Go testing best practices.
```go
// Current
if err := trainerutils.PollNotebookLogsForStatus(test, namespace.Name, podName, containerName, support.TestTimeoutDouble); err != nil {
	test.Expect(err).To(Succeed(), "Notebook execution reported FAILURE")
}

// Suggested change
err = trainerutils.PollNotebookLogsForStatus(test, namespace.Name, podName, containerName, support.TestTimeoutDouble)
test.Expect(err).ShouldNot(HaveOccurred(), "Notebook execution reported FAILURE")
```
```go
	sdktests "github.com/opendatahub-io/distributed-workloads/tests/trainer/sdk_tests"
)

func TestKubeflowSDK_Sanity(t *testing.T) {
```

Suggested change:

```go
// Current
func TestKubeflowSDK_Sanity(t *testing.T) {

// Suggested
func TestKubeflowSdkSanity(t *testing.T) {
```
@MStokluska I ran the test using

@sutaakar please run it with: quay.io/opendatahub/odh-training-th03-cuda128-torch28-py312-rhel9:odh-training-th03-157dbcf42c0dee17933360e343bfafe6-build-images or :latest. Brian's image might be outdated.
Actionable comments posted: 1
🧹 Nitpick comments (3)
tests/trainer/resources/mnist.ipynb (3)
17-155: Distributed training function and IDX dataset reader look good; minor cleanups possible

The CPU-only DDP setup, strict PVC requirement at `/mnt/shared/FashionMNIST/raw`, and IDX dataset handling all look consistent with the e2e test design and the shared PVC contract. Two small nits:
- In `_read_idx_labels`, the unpacked `num` header value is never used; consider renaming it to `_`/`_num` or adding a simple integrity check using it to satisfy linters.
- The imports `from urllib.parse import urlparse` and `import gzip, shutil` in this cell are now unused and can be dropped to keep the cell minimal.
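The integrity-check suggestion could look like the following sketch of an IDX1 label reader (a hypothetical rewrite for illustration, not the notebook's actual `_read_idx_labels`):

```python
import struct

def read_idx_labels(data: bytes) -> list[int]:
    # IDX1 layout: 4-byte magic (0x00000801), 4-byte item count, then one byte per label
    magic, num = struct.unpack(">II", data[:8])
    if magic != 0x00000801:
        raise ValueError(f"bad IDX1 magic: {magic:#x}")
    labels = list(data[8:])
    # Use the otherwise-unused header count as an integrity check
    if len(labels) != num:
        raise ValueError(f"header promises {num} labels, file holds {len(labels)}")
    return labels
```

This both silences the RUF059 "unused variable" warning and catches truncated downloads before training starts.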
163-379: S3/HTTP dataset provisioning flow is reasonable; exception handling could be slightly tightened

The dataset prep logic (writing into `FASHION_RAW_DIR` on the notebook's PVC, streaming S3/MinIO downloads with short timeouts, decompressing `.gz` files, and finally falling back to the Fashion-MNIST HTTP endpoint only if needed) provides a good balance between robustness and test friendliness. The final `RuntimeError("Internet download failed; aborting test")` ensures the test fails rather than training without data, which aligns with earlier feedback.

Given the static analysis hints, the only tweaks you might consider (non-blocking):
- Several `except Exception as e` handlers (S3 fetch, pagination, decompression, HTTP download) are intentionally broad. For clarity and to quiet BLE001/S110, you could either narrow them to the specific client/library error types you expect or at least log `type(e)`/`repr(e)` before returning/raising, so failures are more diagnosable.
- Where you intentionally swallow errors (e.g., failing to `os.remove(dst)` after successful decompression), a brief comment as you already have ("Not critical...") is enough; no change needed there.

Overall, the behavior is fine for test code; these are just lint-driven niceties.

Based on static analysis hints.
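A minimal sketch of the narrowing-plus-logging idea (the function and source names here are invented for illustration, not taken from the notebook):

```python
import logging

log = logging.getLogger("dataset-prep")

def fetch_with_fallback(sources):
    """Try each (name, fetch) pair in order; narrow except clauses keep real bugs visible."""
    for name, fetch in sources:
        try:
            return fetch()
        except (OSError, ValueError) as e:  # expected I/O or parse failures only
            log.warning("source %s failed: %r", name, e)  # records type + message
    raise RuntimeError("all dataset sources failed")

attempts = []

def s3_fetch():
    attempts.append("s3")
    raise OSError("connect timeout")

def http_fetch():
    attempts.append("http")
    return b"fashion-mnist bytes"

data = fetch_with_fallback([("s3", s3_fetch), ("http", http_fetch)])
```

An unexpected exception type (say, a `TypeError` from a programming bug) would now propagate immediately instead of being silently treated as "source unavailable".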
399-556: Kubeflow SDK integration and job lifecycle handling align with the test's intent; consider future assertions

The SDK client setup with `OPENSHIFT_API_URL`/`NOTEBOOK_TOKEN`, runtime lookup for `"torch-distributed"` via `client.get_runtime(...)`, and submission of a 2-node `CustomTrainer` job with the RWX PVC mounted at `/mnt/shared` all look coherent with the notebook's `train_fashion_mnist()` expectations.

Job lifecycle cells (wait for `Running`/`Complete`, print step summaries, stream logs, then `delete_job`) are appropriate for a sanity-level e2e. If you later want stronger validation, a possible follow-up would be to assert on step/job statuses in the notebook itself (rather than only printing) to make failures more explicit, but that's optional given the current log-driven Go test harness.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- `tests/trainer/kubeflow_sdk_test.go` (1 hunks)
- `tests/trainer/resources/mnist.ipynb` (1 hunks)
- `tests/trainer/sdk_tests/fashion_mnist_tests.go` (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- tests/trainer/kubeflow_sdk_test.go
```go
const (
	notebookName = "mnist.ipynb"
	notebookPath = "resources/" + notebookName
)

// CPU Only - Distributed Training
func RunFashionMnistCpuDistributedTraining(t *testing.T) {
	test := support.With(t)

	// Create a new test namespace
	namespace := test.NewTestNamespace()

	// Ensure Notebook ServiceAccount exists (no extra RBAC)
	trainerutils.EnsureNotebookServiceAccount(t, test, namespace.Name)

	// RBACs setup
	userName := common.GetNotebookUserName(test)
	userToken := common.GetNotebookUserToken(test)
	support.CreateUserRoleBindingWithClusterRole(test, userName, namespace.Name, "admin")

	// Create ConfigMap with notebook
	localPath := notebookPath
	nb, err := os.ReadFile(localPath)
	test.Expect(err).NotTo(HaveOccurred(), fmt.Sprintf("failed to read notebook: %s", localPath))
	cm := support.CreateConfigMap(test, namespace.Name, map[string][]byte{notebookName: nb})

	// Build command with parameters and pinned deps, and print definitive status line to logs
	endpoint, endpointOK := support.GetStorageBucketDefaultEndpoint()
	accessKey, _ := support.GetStorageBucketAccessKeyId()
	secretKey, _ := support.GetStorageBucketSecretKey()
	bucket, bucketOK := support.GetStorageBucketName()
	prefix, _ := support.GetStorageBucketMnistDir()
	if !endpointOK {
		endpoint = ""
	}
	if !bucketOK {
		bucket = ""
	}

	// Create RWX PVC for shared dataset and pass the claim name to the notebook
	storageClass, err := support.GetRWXStorageClass(test)
	test.Expect(err).NotTo(HaveOccurred(), "Failed to find an RWX supporting StorageClass")
	rwxPvc := support.CreatePersistentVolumeClaim(
		test,
		namespace.Name,
		"20Gi",
		support.AccessModes(corev1.ReadWriteMany),
		support.StorageClassName(storageClass.Name),
	)

	shellCmd := fmt.Sprintf(
		"set -e; "+
			"export OPENSHIFT_API_URL='%s'; export NOTEBOOK_TOKEN='%s'; "+
			"export NOTEBOOK_NAMESPACE='%s'; "+
			"export SHARED_PVC_NAME='%s'; "+
			"export AWS_DEFAULT_ENDPOINT='%s'; export AWS_ACCESS_KEY_ID='%s'; "+
			"export AWS_SECRET_ACCESS_KEY='%s'; export AWS_STORAGE_BUCKET='%s'; "+
			"export AWS_STORAGE_BUCKET_MNIST_DIR='%s'; "+
			"python -m pip install --quiet --no-cache-dir papermill boto3==1.34.162 && "+
			"if python -m papermill -k python3 /opt/app-root/notebooks/%s /opt/app-root/src/out.ipynb --log-output; "+
			"then echo 'NOTEBOOK_STATUS: SUCCESS'; else echo 'NOTEBOOK_STATUS: FAILURE'; fi; sleep infinity",
		support.GetOpenShiftApiUrl(test), userToken, namespace.Name, rwxPvc.Name,
		endpoint, accessKey, secretKey, bucket, prefix,
		notebookName,
	)
	command := []string{"/bin/sh", "-c", shellCmd}

	// Create Notebook CR using the RWX PVC
	common.CreateNotebook(test, namespace, userToken, command, cm.Name, notebookName, 0, rwxPvc, common.ContainerSizeSmall)

	// Cleanup
	defer func() {
		common.DeleteNotebook(test, namespace)
		test.Eventually(common.Notebooks(test, namespace), support.TestTimeoutLong).Should(HaveLen(0))
	}()

	// Wait for the Notebook Pod and get pod/container names
	podName, containerName := trainerutils.WaitForNotebookPodRunning(test, namespace.Name)

	// Poll logs to check if the notebook execution completed successfully
	err = trainerutils.PollNotebookLogsForStatus(test, namespace.Name, podName, containerName, support.TestTimeoutDouble)
	test.Expect(err).ShouldNot(HaveOccurred(), "Notebook execution reported FAILURE")
}
```
🧩 Analysis chain
End‑to‑end test wiring looks good; please verify notebook path and consider a couple of small refinements
The overall flow of RunFashionMnistCpuDistributedTraining is solid: new namespace, Notebook SA and RBAC, RWX PVC for shared data, ConfigMap‑backed notebook, papermill execution with NOTEBOOK_STATUS markers, and cleanup with a final assertion that no notebooks remain.
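The NOTEBOOK_STATUS contract the Go poller relies on can be expressed compactly; the following is a Python sketch of the idea, not the actual `PollNotebookLogsForStatus` implementation:

```python
import time

def poll_for_status(read_logs, timeout_s=300, interval_s=5):
    """Poll a log source until a definitive NOTEBOOK_STATUS marker appears."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        logs = read_logs()
        if "NOTEBOOK_STATUS: SUCCESS" in logs:
            return True
        if "NOTEBOOK_STATUS: FAILURE" in logs:
            return False
        time.sleep(interval_s)
    raise TimeoutError("no NOTEBOOK_STATUS marker before timeout")
```

Because the notebook pod's command ends with `sleep infinity` after printing the marker, the status line stays retrievable in the logs until the test deletes the Notebook.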
A few details to double‑check/tidy up:
- Notebook path resolution: `notebookPath` is defined as

  const (
      notebookName = "mnist.ipynb"
      notebookPath = "resources/" + notebookName
  )

  while the notebook added in this PR lives at `tests/trainer/resources/mnist.ipynb`. If there isn't also a `tests/trainer/sdk_tests/resources/mnist.ipynb` (or symlink) in the repo, running the tests from the `sdk_tests` package will fail with "failed to read notebook: resources/mnist.ipynb". Please confirm the actual layout and either adjust `notebookPath` (e.g. `"../resources/" + notebookName`) or move/link the file so the relative path is correct.

- `SHARED_PVC_NAME` env: you export `SHARED_PVC_NAME` into the notebook container, but the current `mnist.ipynb` doesn't read this variable (it uses the fixed mount points `/opt/app-root/src` and `/mnt/shared`). That env var is therefore redundant right now; feel free to drop it unless you plan to consume it in the notebook later.

- Shell command robustness: the `shellCmd` uses single-quoted `export VAR='%s'` substitutions. If any of the substituted values ever contains a single quote, the shell will misparse the command. It's unlikely for most of these (endpoint, bucket, access key), but not impossible for tokens. Longer term, it might be more robust to either pass these values via the Notebook CR's `env` fields instead of through the shell, or add proper shell escaping around the interpolated values. Not urgent, but worth considering if this test is run in more varied environments.
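On the escaping option, Python's stdlib offers `shlex.quote`; this sketch shows the idea (the Go test would need an equivalent quoting helper, or the `env`-field route instead):

```python
import shlex

def export_line(name: str, value: str) -> str:
    # shlex.quote keeps the value intact even when it contains single quotes,
    # semicolons, or other shell metacharacters.
    return f"export {name}={shlex.quote(value)}"

# A token with an embedded quote and a command separator stays inert:
line = export_line("NOTEBOOK_TOKEN", "tok'en; echo pwned")
```

The naive `export VAR='%s'` interpolation would terminate the quoting at the embedded `'` and execute the rest as shell code; the quoted form assigns the whole string verbatim.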
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "Searching for mnist.ipynb locations under the repo root..."
fd -a 'mnist.ipynb' .
echo
echo "Checking path resolution from tests/trainer/sdk_tests..."
cd tests/trainer/sdk_tests
if [[ -f resources/mnist.ipynb ]]; then
echo "OK: resources/mnist.ipynb exists relative to sdk_tests (notebookPath is consistent)."
else
echo "WARNING: resources/mnist.ipynb not found relative to sdk_tests."
echo "RunFashionMnistCpuDistributedTraining may fail to read the notebook; consider adjusting notebookPath (e.g., \"../resources/mnist.ipynb\") or moving/linking the file."
fi

Length of output: 399
Fix notebook path resolution to unblock test execution
Verification confirms the notebook path issue is blocking. The only mnist.ipynb in the repo is at tests/trainer/resources/mnist.ipynb, but notebookPath resolves to resources/mnist.ipynb relative to the test package directory (tests/trainer/sdk_tests/). This will cause os.ReadFile(localPath) to fail.
Required fix:
- Adjust `notebookPath` to `"../resources/" + notebookName` (line ~35), or
- Move/symlink the notebook to `tests/trainer/sdk_tests/resources/mnist.ipynb`.
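The underlying pitfall is resolving a resource path against the process working directory instead of the source file's location. A Python illustration of the contrast (paths and helper names are illustrative, not from the PR):

```python
from pathlib import Path

def cwd_relative(name: str) -> Path:
    # Fragile: resolves against whatever directory the test runner started in
    return Path("resources") / name

def file_anchored(name: str) -> Path:
    # Stable: resolves against this source file's own directory,
    # mirroring the suggested "../resources/" + notebookName fix
    return (Path(__file__).resolve().parent / ".." / "resources" / name).resolve()
```

The Go equivalent is the suggested relative-path constant, since `go test` runs each package's tests with the package directory as the working directory.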
Minor observations (non-blocking):
- `SHARED_PVC_NAME` is exported but unused by `mnist.ipynb`; drop if not planned for later.
- Shell command uses single-quoted string interpolation; consider env field injection on the Notebook CR to avoid quote-escaping issues.
🤖 Prompt for AI Agents
In tests/trainer/sdk_tests/fashion_mnist_tests.go around lines 33 to 115, the
notebookPath is wrong (it points to "resources/mnist.ipynb" relative to
tests/trainer/sdk_tests/) which causes os.ReadFile to fail; update the path to
refer to the actual file location by changing notebookPath to
"../resources/"+notebookName (or alternatively move/symlink the notebook into
tests/trainer/sdk_tests/resources/), then run the test to confirm os.ReadFile
succeeds.
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: sutaakar

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Details: needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
Addition of initial e2e sanity test for kubeflow sdk
Description
As part of this work we are adding the initial e2e for kubeflow SDK (2 nodes cpu mnist fashion training).
The goal of it is to make it as easy as possible to extend with further notebook based tests.
How Has This Been Tested?
- "TEST_TIER": "Sanity"
- "ODH_NAMESPACE": "opendatahub"
- "NOTEBOOK_USER_NAME": "xyz"
- "NOTEBOOK_USER_TOKEN": "some_token"
- "NOTEBOOK_IMAGE": "quay.io/bgallagher/universal-image:v2"

Note: The PR will not work without the trainer v2 controller, CRDs and a custom clusterTrainingRuntime that uses the universal image under the hood (our dev image: quay.io/bgallagher/universal-image:v2).
Merge criteria:
Summary by CodeRabbit
Tests
Chores