
test: initial implementation of SDK e2e#488

Merged
openshift-merge-bot[bot] merged 5 commits into opendatahub-io:main from MStokluska:sdk_tests
Nov 25, 2025

Conversation


@MStokluska MStokluska commented Nov 6, 2025

Addition of an initial e2e sanity test for the Kubeflow SDK

Description

As part of this work we are adding the initial e2e test for the Kubeflow SDK (a two-node, CPU-only Fashion-MNIST training run).
The goal is to make it as easy as possible to extend with further notebook-based tests.

How Has This Been Tested?

  • Provisioned my cluster and installed Trainer v2 with its CRDs
  • Created a ClusterTrainingRuntime CR called "universal" that relies on an early build of the universal image
  • Ran the tests following the trainer README with the following environment variables:
    -- "TEST_TIER": "Sanity"
    -- "ODH_NAMESPACE": "opendatahub"
    -- "NOTEBOOK_USER_NAME": "xyz"
    -- "NOTEBOOK_USER_TOKEN": "some_token"
    -- "NOTEBOOK_IMAGE": "quay.io/bgallagher/universal-image:v2"

Note: The tests will not pass without the Trainer v2 controller, its CRDs, and a custom ClusterTrainingRuntime that uses the universal image under the hood (our dev image: quay.io/bgallagher/universal-image:v2)

Merge criteria:

  • The commits are squashed in a cohesive manner and have meaningful messages.
  • Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
  • The developer has manually tested the changes and verified that the changes work.

Summary by CodeRabbit

  • Tests

    • Added Kubeflow trainer sanity tests and an end-to-end Fashion‑MNIST distributed training validation that runs a notebook workflow, submits a distributed training job, streams logs, and verifies success.
    • Added helpers to prepare cluster resources, ensure notebook service accounts, provision PVC-backed notebook workloads, wait for notebook pods, and poll logs for execution status.
  • Chores

    • Updated .gitignore to exclude VSCode configuration directory.


@openshift-ci openshift-ci Bot requested review from kapil27 and pawelpaszki November 6, 2025 09:30

coderabbitai Bot commented Nov 6, 2025

Walkthrough

Adds tests and helpers to run a Kubeflow/Papermill-driven distributed FashionMNIST training: a new Jupyter notebook, orchestration test functions and utilities for notebook lifecycle and cluster preparation, a sanity test entry, and a .gitignore rule to ignore .vscode/.

Changes

Cohort / File(s) Summary
Configuration
\.gitignore
Added VSCode directory ignore pattern (.vscode/*).
Test Entry
tests/trainer/kubeflow_sdk_test.go
New Go test TestKubeflowSdkSanity that marks the test Sanity and invokes the FashionMNIST distributed training flow.
Notebook Resource
tests/trainer/resources/mnist.ipynb
New Jupyter notebook implementing end-to-end distributed FashionMNIST training (PyTorch CNN, CPU-only gloo distributed init, DDP, PVC/S3 data staging, IDX reader, 2-epoch training loop, papermill execution, and Kubeflow trainer client submission/monitoring).
SDK Test Implementation
tests/trainer/sdk_tests/fashion_mnist_tests.go
New orchestration function RunFashionMnistCpuDistributedTraining: namespace creation, ensure SA/RBAC, ConfigMap from notebook bytes, RWX PVC provisioning, papermill execution command, Notebook CR creation, wait for notebook pod, poll logs for success marker, and cleanup.
Cluster Prep Utilities
tests/trainer/utils/utils_cluster_prep.go
Added EnsureNotebookServiceAccount(t *testing.T, test Test, namespace string) to ensure a Notebook ServiceAccount exists (creates one if missing).
Notebook Utilities
tests/trainer/utils/utils_notebook.go
Added helpers: CreateNotebookFromBytes, WaitForNotebookPodRunning, and PollNotebookLogsForStatus to create a ConfigMap+Notebook CR, wait for the notebook pod to be Running, and poll container logs for a `NOTEBOOK_STATUS: SUCCESS/FAILURE` marker.
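The log-marker detection these helpers rely on can be sketched as a small pure function. This is a sketch with hypothetical names; the actual helper in utils_notebook.go may be structured differently:

```go
package main

import (
	"fmt"
	"strings"
)

// notebookStatus scans notebook container logs for a completion marker.
// It returns "SUCCESS", "FAILURE", or "" if no marker has been emitted yet.
func notebookStatus(logs string) string {
	for _, line := range strings.Split(logs, "\n") {
		switch {
		case strings.Contains(line, "NOTEBOOK_STATUS: SUCCESS"):
			return "SUCCESS"
		case strings.Contains(line, "NOTEBOOK_STATUS: FAILURE"):
			return "FAILURE"
		}
	}
	return ""
}

func main() {
	logs := "installing deps...\npapermill running\nNOTEBOOK_STATUS: SUCCESS\n"
	fmt.Println(notebookStatus(logs)) // SUCCESS
}
```

Keeping the marker parsing separate from the log-fetching transport makes the success/failure decision trivially unit-testable.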

Sequence Diagram(s)

sequenceDiagram
    participant Test as Test Runner
    participant K8s as Kubernetes API
    participant Notebook as Notebook Pod
    participant Trainer as Kubeflow Trainer

    Test->>K8s: Create namespace & Ensure ServiceAccount
    Test->>K8s: Create ConfigMap (notebook bytes) & RWX PVC
    Test->>K8s: Create Notebook CR (papermill command)

    rect rgb(235,247,255)
    Note over Notebook: Pod boots, installs deps, runs papermill (gloo, CPU)
    Notebook->>Trainer: Discover runtimes & submit CustomTrainer job
    Trainer->>Trainer: Launch distributed worker steps
    Trainer->>Notebook: Stream job logs/status
    Notebook->>Notebook: Emit completion marker in logs (NOTEBOOK_STATUS: SUCCESS/FAILURE)
    end

    Test->>Notebook: Poll logs until SUCCESS/FAILURE or timeout
    alt SUCCESS
        Test->>K8s: Delete Notebook, PVC, ConfigMap (cleanup)
    else FAILURE or timeout
        Test->>K8s: Collect logs, delete resources
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

  • Pay extra attention to:
    • tests/trainer/resources/mnist.ipynb — distributed PyTorch initialization (gloo), IDX reader and PVC/S3 staging, papermill flow, and trainer client integration.
    • tests/trainer/sdk_tests/fashion_mnist_tests.go — namespace/RBAC handling, long waits/timeouts, pod/log polling, and cleanup correctness.
    • tests/trainer/utils/utils_notebook.go and tests/trainer/utils/utils_cluster_prep.go — Kubernetes API usage, ServiceAccount creation semantics, and robust log-based status detection.

Poem

🐰 I hop through pods with a curious cheer,
Papermill hums as notebooks appear,
Workers sync on CPUs, tiny and bright,
PVCs share pixels through day and night,
A rabbit applauds these tests with delight.

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'test: initial implementation of SDK e2e' accurately describes the main change—adding the first end-to-end sanity test for the Kubeflow SDK, which aligns with the changeset's primary objective of introducing SDK e2e testing infrastructure.
Docstring Coverage ✅ Passed Docstring coverage is 83.33% which is sufficient. The required threshold is 80.00%.



@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 7

🧹 Nitpick comments (4)
tests/trainer/utils/utils_cluster_prep.go (2)

53-55: Consider making expected runtimes configurable.

The hardcoded list of ClusterTrainingRuntimes (torch-cuda-241, etc.) conflicts with the PR's stated requirement for a "universal" runtime, and the TODO comment suggests this list will change.

Consider accepting expected runtime names as a parameter or reading from an environment variable to make the test more flexible:

-func EnsureTrainerClusterReady(t *testing.T, test Test) {
+func EnsureTrainerClusterReady(t *testing.T, test Test, expectedRuntimes ...string) {
+	if len(expectedRuntimes) == 0 {
+		// Default runtimes if none specified
+		expectedRuntimes = []string{"torch-cuda-241", "torch-cuda-251", "torch-rocm-241", "torch-rocm-251"}
+	}
 	...
-	for _, name := range []string{"torch-cuda-241", "torch-cuda-251", "torch-rocm-241", "torch-rocm-251"} {
+	for _, name := range expectedRuntimes {

64-66: Hardcoded ServiceAccount name reduces reusability.

The ServiceAccount name "jupyter-nb-kube-3aadmin" is hardcoded and matches common.NOTEBOOK_CONTAINER_NAME, but this tight coupling limits flexibility for other test scenarios.

Consider accepting the SA name as a parameter:

-func EnsureNotebookRBAC(t *testing.T, test Test, namespace string) {
+func EnsureNotebookRBAC(t *testing.T, test Test, namespace string, saName string) {
 	t.Helper()
-	saName := "jupyter-nb-kube-3aadmin"

Alternatively, if this SA name is always tied to notebooks, accept it from common.NOTEBOOK_CONTAINER_NAME explicitly to make the dependency clear.

tests/trainer/resources/mnist.ipynb (1)

17-125: Consider adding training validation.

The training function completes successfully but doesn't validate that the model actually learned (e.g., checking loss decreased or accuracy improved). For a sanity test, consider adding basic validation.

For example, after training completes, you could add:

# At the end of train_fashion_mnist, before dist.destroy_process_group()
if dist.get_rank() == 0:
    if loss.item() > initial_loss:
        print("WARNING: Loss did not decrease during training")
    print(f"Final loss: {loss.item():.6f}")
tests/trainer/utils/utils_notebook.go (1)

49-53: Unused helper function.

CreateNotebookFromBytes is defined but not used anywhere in this PR. The test in fashion_mnist_tests.go creates the ConfigMap and Notebook separately.

Consider either:

  1. Using this helper in fashion_mnist_tests.go to reduce duplication
  2. Removing it if it's not needed yet
  3. Adding a comment explaining it's for future use
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2595866 and 796d164.

📒 Files selected for processing (6)
  • .gitignore (1 hunks)
  • tests/trainer/kubeflow_sdk_test.go (1 hunks)
  • tests/trainer/resources/mnist.ipynb (1 hunks)
  • tests/trainer/sdk_tests/fashion_mnist_tests.go (1 hunks)
  • tests/trainer/utils/utils_cluster_prep.go (1 hunks)
  • tests/trainer/utils/utils_notebook.go (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (4)
tests/trainer/utils/utils_notebook.go (4)
tests/common/support/test.go (1)
  • Test (34-45)
tests/common/notebook.go (3)
  • ContainerSize (72-72)
  • CreateNotebook (104-185)
  • NOTEBOOK_CONTAINER_NAME (37-37)
tests/common/support/core.go (4)
  • CreateConfigMap (33-54)
  • CreatePersistentVolumeClaim (242-276)
  • AccessModes (235-240)
  • GetPods (111-116)
tests/common/support/support.go (1)
  • TestTimeoutLong (35-35)
tests/trainer/kubeflow_sdk_test.go (2)
tests/common/test_tag.go (2)
  • Tags (32-40)
  • Sanity (48-50)
tests/trainer/sdk_tests/fashion_mnist_tests.go (1)
  • RunFashionMnistCpuDistributedTraining (39-86)
tests/trainer/sdk_tests/fashion_mnist_tests.go (8)
tests/common/support/test.go (2)
  • T (90-102)
  • With (61-63)
tests/trainer/utils/utils_cluster_prep.go (2)
  • EnsureTrainerClusterReady (33-56)
  • EnsureNotebookRBAC (60-83)
tests/common/environment.go (2)
  • GetNotebookUserName (70-76)
  • GetNotebookUserToken (78-84)
tests/common/support/rbac.go (1)
  • CreateUserRoleBindingWithClusterRole (206-238)
tests/common/support/core.go (3)
  • CreateConfigMap (33-54)
  • CreatePersistentVolumeClaim (242-276)
  • AccessModes (235-240)
tests/trainer/utils/utils_notebook.go (3)
  • BuildPapermillShellCmd (36-46)
  • WaitForNotebookPodRunning (56-64)
  • PollNotebookCompletionMarker (67-85)
tests/common/notebook.go (4)
  • CreateNotebook (104-185)
  • ContainerSizeSmall (75-75)
  • DeleteNotebook (187-190)
  • Notebooks (192-204)
tests/common/support/support.go (2)
  • TestTimeoutLong (35-35)
  • TestTimeoutDouble (36-36)
tests/trainer/utils/utils_cluster_prep.go (4)
tests/common/support/test.go (2)
  • T (90-102)
  • Test (34-45)
tests/common/support/client.go (1)
  • Client (39-50)
tests/common/support/core.go (2)
  • ServiceAccount (194-200)
  • ServiceAccounts (207-219)
tests/common/support/rbac.go (4)
  • CreateClusterRole (48-70)
  • CreateClusterRoleBinding (135-169)
  • CreateRole (27-46)
  • CreateRoleBinding (72-102)
🪛 GitHub Actions: Verify Generated Files and Import Organization
tests/trainer/utils/utils_notebook.go

[error] 1-1: openshift-goimports check failed: file is not sorted. Run 'make imports' to fix imports.

tests/trainer/utils/utils_cluster_prep.go

[error] 1-1: openshift-goimports check failed: file is not sorted. Run 'make imports' to fix imports.

🔇 Additional comments (5)
.gitignore (1)

2-2: LGTM!

Adding .vscode/* to gitignore is a standard best practice that prevents user-specific IDE settings and workspace configurations from being committed to version control. This is especially helpful for e2e test development where contributors may use VSCode.

tests/trainer/kubeflow_sdk_test.go (1)

26-30: LGTM!

Clean test entry point that properly tags the test as Sanity and delegates to the actual test implementation. The comment on line 29 provides a clear extension point for additional tests.

tests/trainer/sdk_tests/fashion_mnist_tests.go (2)

39-86: Well-structured test implementation.

The test follows good practices:

  • Proper test namespace isolation
  • Prerequisite checks before execution
  • RBAC setup
  • Resource cleanup with defer
  • Clear error messages

The flow is logical and the use of helper utilities keeps the test readable.


33-36: Relative path verified—notebook file found at expected location.

The notebook file exists at tests/trainer/resources/mnist.ipynb, confirming the relative path "resources/" resolves correctly when tests run from the package directory.

tests/trainer/utils/utils_notebook.go (1)

36-46: LGTM!

The Papermill shell command construction is clean and handles optional extra packages appropriately. The set -e ensures failures are caught, and the marker file pattern is a good approach for async completion detection.

Comment thread tests/trainer/resources/mnist.ipynb Outdated
Comment thread tests/trainer/utils/utils_cluster_prep.go
Comment on lines +37 to +39
if out, err := exec.Command("kubectl", "get", "crd", "jobsets.jobset.x-k8s.io").CombinedOutput(); err != nil {
t.Fatalf("JobSet CRD missing: %v\n%s", err, string(out))
}

@coderabbitai coderabbitai Bot Nov 6, 2025


⚠️ Potential issue | 🟠 Major

Replace kubectl shell command with Kubernetes client API.

Using exec.Command("kubectl", ...) is brittle and introduces external dependencies. The test framework already provides a Kubernetes client via test.Client().

Apply this diff to use the Kubernetes API client:

-	if out, err := exec.Command("kubectl", "get", "crd", "jobsets.jobset.x-k8s.io").CombinedOutput(); err != nil {
-		t.Fatalf("JobSet CRD missing: %v\n%s", err, string(out))
-	}
+	_, err := test.Client().Dynamic().Resource(schema.GroupVersionResource{
+		Group:    "apiextensions.k8s.io",
+		Version:  "v1",
+		Resource: "customresourcedefinitions",
+	}).Get(test.Ctx(), "jobsets.jobset.x-k8s.io", metav1.GetOptions{})
+	test.Expect(err).NotTo(HaveOccurred(), "JobSet CRD missing")

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In tests/trainer/utils/utils_cluster_prep.go around lines 37 to 39, replace the
exec.Command("kubectl", ...) call with a Kubernetes API call using the provided
test.Client(): call
test.Client().ApiextensionsV1().CustomResourceDefinitions().Get(ctx,
"jobsets.jobset.x-k8s.io", metav1.GetOptions{}) (using a context, e.g.,
context.Background()), and if it returns an error fail the test with t.Fatalf
including the error; remove the exec.Command usage and its output handling and
ensure the file imports context and metav1 (and any apiextensions client types
if missing).

Contributor Author


This is a good suggestion!
Will switch to it most likely but after comments from others.


@MStokluska, sounds good! Feel free to implement the change once you've gathered feedback from the other reviewers. I'm here if you need any help with the implementation.

Contributor


This code is just temporary and will be removed once Trainer v2 is onboarded into ODH/RHOAI, right?


Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!

Contributor


If so, then we can keep it as it is now, rather than investing much effort into something that will be deleted soon.

Comment on lines +42 to +44
if out, err := exec.Command("kubectl", "-n", "opendatahub", "wait", "--for=condition=available", "--timeout=180s", "deploy/kubeflow-trainer-controller-manager").CombinedOutput(); err != nil {
t.Fatalf("Trainer controller not available: %v\n%s", err, string(out))
}

@coderabbitai coderabbitai Bot Nov 6, 2025


⚠️ Potential issue | 🟠 Major

Replace kubectl shell command with Kubernetes client API and remove hardcoded namespace.

Using exec.Command("kubectl", ...) is brittle. Additionally, the namespace "opendatahub" is hardcoded, reducing reusability.

Apply this diff to use the Kubernetes API client with Eventually for polling:

-	if out, err := exec.Command("kubectl", "-n", "opendatahub", "wait", "--for=condition=available", "--timeout=180s", "deploy/kubeflow-trainer-controller-manager").CombinedOutput(); err != nil {
-		t.Fatalf("Trainer controller not available: %v\n%s", err, string(out))
-	}
+	odhNamespace := "opendatahub" // Consider making this configurable
+	test.Eventually(func(g Gomega) {
+		deploy, err := test.Client().Core().AppsV1().Deployments(odhNamespace).Get(test.Ctx(), "kubeflow-trainer-controller-manager", metav1.GetOptions{})
+		g.Expect(err).NotTo(HaveOccurred())
+		g.Expect(deploy.Status.Conditions).To(ContainElement(MatchFields(IgnoreExtras, Fields{
+			"Type":   Equal(appsv1.DeploymentAvailable),
+			"Status": Equal(corev1.ConditionTrue),
+		})))
+	}, 180*time.Second).Should(Succeed(), "Trainer controller not available")

Committable suggestion skipped: line range outside the PR's diff.

Contributor Author


same as above


🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Fetch all review comments from this PR to find the "above" comment being referenced
gh pr view 488 --repo opendatahub-io/distributed-workloads --json comments,reviews --jq '.reviews[].comments[] | select(.author.login == "coderabbitai" or .author.login == "MStokluska") | {path: .path, line: .line, body: .body, author: .author.login}' | head -50

Length of output: 25


🏁 Script executed:

#!/bin/bash
# Try fetching review comments using the GitHub API directly
gh api repos/opendatahub-io/distributed-workloads/pulls/488/comments --jq '.[] | select(.user.login == "coderabbitai[bot]" or .user.login == "MStokluska") | {path: .path, line: .line, body: .body, author: .user.login}' | head -100

Length of output: 18212


@MStokluska, understood! I'll wait for your updates after you've gathered feedback from the team. Feel free to reach out if you need any assistance with the implementation.

🐰✨

Comment thread tests/trainer/utils/utils_cluster_prep.go Outdated
Comment thread tests/trainer/utils/utils_notebook.go
Comment thread tests/trainer/utils/utils_notebook.go Outdated

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (2)
tests/trainer/utils/utils_notebook.go (2)

17-32: Fix import organization per pipeline failure.

The pipeline check failed because imports are not properly sorted. Run make imports to fix the import organization.


66-85: Replace kubectl shell command with Kubernetes client API.

Using exec.Command("kubectl", ...) is brittle and requires kubectl in PATH. This should use the Kubernetes client API via the pod exec subresource as detailed in the previous review comment.

🧹 Nitpick comments (1)
tests/trainer/utils/utils_notebook.go (1)

55-64: Eliminate redundant GetPods call.

Lines 62-63 repeat the same GetPods call that was already executed in the Eventually block (lines 58-60), which is inefficient.

Apply this diff to store and reuse the pods result:

 func WaitForNotebookPodRunning(test Test, namespace string) (string, string) {
 	labelSelector := fmt.Sprintf("notebook-name=%s", common.NOTEBOOK_CONTAINER_NAME)
+	var pods []corev1.Pod
 	test.Eventually(func() []corev1.Pod {
-		return GetPods(test, namespace, metav1.ListOptions{LabelSelector: labelSelector, FieldSelector: "status.phase=Running"})
+		pods = GetPods(test, namespace, metav1.ListOptions{LabelSelector: labelSelector, FieldSelector: "status.phase=Running"})
+		return pods
 	}, TestTimeoutLong).Should(HaveLen(1), "Expected exactly one notebook pod")
 
-	pods := GetPods(test, namespace, metav1.ListOptions{LabelSelector: labelSelector, FieldSelector: "status.phase=Running"})
 	return pods[0].Name, pods[0].Spec.Containers[0].Name
 }
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 796d164 and 4e71919.

📒 Files selected for processing (2)
  • tests/trainer/utils/utils_cluster_prep.go (1 hunks)
  • tests/trainer/utils/utils_notebook.go (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/trainer/utils/utils_cluster_prep.go
🧰 Additional context used
🧬 Code graph analysis (1)
tests/trainer/utils/utils_notebook.go (4)
tests/common/support/test.go (1)
  • Test (34-45)
tests/common/notebook.go (3)
  • ContainerSize (72-72)
  • CreateNotebook (104-185)
  • NOTEBOOK_CONTAINER_NAME (37-37)
tests/common/support/core.go (4)
  • CreateConfigMap (33-54)
  • CreatePersistentVolumeClaim (242-276)
  • AccessModes (235-240)
  • GetPods (111-116)
tests/common/support/support.go (1)
  • TestTimeoutLong (35-35)
🔇 Additional comments (2)
tests/trainer/utils/utils_notebook.go (2)

34-46: LGTM! Papermill command construction is sound.

The shell command properly installs dependencies, executes the notebook with papermill, writes a completion marker, and uses sleep infinity to keep the container running for inspection. The use of set -e ensures proper error propagation.


48-53: LGTM! Notebook setup orchestration is correct.

The function properly orchestrates the creation of a ConfigMap, PVC (10Gi is reasonable for test scenarios), and Notebook CR, passing through all necessary parameters.

Comment thread tests/trainer/resources/mnist.ipynb Outdated
" dataset = datasets.FashionMNIST(\n",
" \"./data\",\n",
" train=True,\n",
" download=True,\n",
Contributor


This will become an issue for running tests on disconnected clusters.
In Trainer v1 tests we uploaded dataset on AWS S3, it is downloaded from there if AWS env variables are declared - https://github.com/opendatahub-io/distributed-workloads/blob/main/tests/kfto/resources/kfto_sdk_mnist.py#L67

Comment thread tests/trainer/resources/mnist.ipynb Outdated
"for runtime in client.list_runtimes():\n",
" print(runtime)\n",
" if runtime.name == \"universal\": # Update to actual universal image runtime once available\n",
" torch_runtime = runtime"
Contributor


Why this approach instead of getting universal runtime by client.get_runtime("universal")?

Contributor Author


That notebook is basically a copy of the upstream test that the SDK relies on. I wanted to keep it as close as possible to the original, but I guess there's no harm in moving to get_runtime. Thanks Karel!

Comment thread tests/trainer/resources/mnist.ipynb Outdated
" \"cpu\": 2,\n",
" \"memory\": \"8Gi\",\n",
" },\n",
" packages_to_install=[\"torchvision\"],\n",
Contributor


I think it would be good to install specific version, to make sure that a future upgrade doesn't break test.

Contributor Author


yes I think it makes sense. Thanks.

}
// TODO: Extend / tweak with universal image runtime once available
for _, name := range []string{"torch-cuda-241", "torch-cuda-251", "torch-rocm-241", "torch-rocm-251"} {
test.Expect(found[name]).To(BeTrue(), fmt.Sprintf("Expected ClusterTrainingRuntime '%s' not found", name))
Contributor


Test verifying what TrainingRuntimes are available is already implemented in https://github.com/opendatahub-io/distributed-workloads/blob/main/tests/trainer/custom_training_runtimes_test.go#L63
Is there any specific reason to check it here too?

Contributor Author


Initially I was planning on making this test a smoke test. Given that your runtime check is also smoke, I wasn't sure which one would run first, so I decided to add the additional check to capture the missing-runtime error if it happens.
But since I've moved it to sanity, and I'm assuming sanity runs after smoke, I guess it's fine to remove this check? WDYT?

Contributor


Yes, can be removed


// EnsureNotebookRBAC sets up the Notebook ServiceAccount and RBAC so that notebooks can
// read ClusterTrainingRuntimes (cluster-scoped), and create/read TrainJobs and pod logs in the namespace.
func EnsureNotebookRBAC(t *testing.T, test Test, namespace string) {
Contributor


I would rather prefer passing user token into SDK client to provide needed access.
IMHO that is the approach which majority of customers would take.

Contributor Author


Agree, thanks Karel

// PollNotebookCompletionMarker polls the given marker file inside the notebook pod until SUCCESS/FAILURE or timeout.
func PollNotebookCompletionMarker(test Test, namespace, podName, containerName, marker string, timeout time.Duration) error {
var finalErr error
test.Eventually(func() bool {
Contributor


Thinking about whether it would be simpler to print status into Workbench log directly.

It could potentially mix with other log messages produced by Workbench, but it would avoid having to access Pod filesystem.
WDYT?


// Wait for the Notebook Pod and get pod/container names
podName, containerName := trainerutils.WaitForNotebookPodRunning(test, namespace.Name)

Contributor


Would it have sense to add assertion checking that TrainJob is created and successfully finished?

Contributor Author


Yes, I think this is very important :)
Thanks

Contributor Author


Hey, as agreed, we are going to leave it out.


MStokluska commented Nov 18, 2025

hey @sutaakar
I've addressed the comments and had to implement a couple of workarounds...


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (1)
tests/trainer/utils/utils_notebook.go (1)

34-39: Helpers are solid; consider eventually replacing kubectl logs with Kubernetes client calls.

The three helpers fit together nicely:

  • CreateNotebookFromBytes abstracts ConfigMap + Notebook + PVC creation.
  • WaitForNotebookPodRunning correctly waits for a single Running notebook pod using the shared label/constant.
  • PollNotebookLogsForStatus gives a clean SUCCESS/FAILURE signal based on log markers.

One medium-term improvement (already raised earlier) is the reliance on exec.Command("kubectl", ...) in PollNotebookLogsForStatus. It couples the tests to an external binary and kubeconfig, even though you already have a Kubernetes client in test.Client() that can stream logs directly.

When these tests move from “temporary” to more permanent, it would be worth switching to a client-go–based log reader (or pod exec) to remove the kubectl dependency and keep everything within the API client.

#!/bin/bash
# Inventory remaining direct kubectl usages to clean up once the approach is finalized.
rg -n 'exec.Command\("kubectl"' tests -C2 || true

Also applies to: 41-50, 52-71
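Whether the logs come from kubectl today or a client-go stream later, the SUCCESS/FAILURE decision itself is transport-independent. A minimal sketch of that marker-scanning logic, written in Python for brevity (the function name is hypothetical; the NOTEBOOK_STATUS markers match the ones the test's shell wrapper prints):

```python
def notebook_status(logs: str) -> tuple[bool, bool]:
    """Return (done, succeeded) from captured notebook pod logs.

    The test's shell wrapper prints a definitive 'NOTEBOOK_STATUS: ...'
    marker once papermill finishes, so polling only needs to look for it.
    """
    if "NOTEBOOK_STATUS: SUCCESS" in logs:
        return True, True
    if "NOTEBOOK_STATUS: FAILURE" in logs:
        return True, False
    return False, False  # not finished yet; keep polling
```

Keeping this decision pure makes it trivial to unit-test, regardless of whether the transport is exec.Command or a client-go log reader.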

🧹 Nitpick comments (2)
tests/trainer/utils/utils_cluster_prep.go (1)

31-38: ServiceAccount helper looks good; consider reusing shared constant for the name.

The creation logic and IsAlreadyExists handling are solid. To avoid future drift with other helpers (e.g., common.NOTEBOOK_CONTAINER_NAME used for notebook naming and labels), consider sourcing saName from that shared constant instead of duplicating the literal "jupyter-nb-kube-3aadmin".

This keeps the SA name aligned across notebook creation, deletion, and pod selection helpers.

tests/trainer/resources/mnist.ipynb (1)

47-79: Align comment with current behavior and tidy unused unpacked variable.

Two small polish items in the training cell:

  1. The comment:
# Prefer shared PVC if present; else fallback to internet download (rank 0 only)

no longer matches the logic:

use_pvc = os.path.isdir(pvc_raw) and any(os.scandir(pvc_raw))

if not use_pvc:
    raise RuntimeError("Shared PVC not mounted or empty at /mnt/shared/FashionMNIST/raw; this test requires a pre-populated RWX PVC")

Since the function now strictly requires a pre-populated PVC (and upstream cells handle S3/HTTP fallback), consider updating the comment to reflect that requirement to avoid confusion for future readers.

  1. In _read_idx_labels, num is unpacked but unused:
magic, num = struct.unpack(">II", f.read(8))

Changing this to

magic, _ = struct.unpack(">II", f.read(8))

(or similar) makes the intent explicit and keeps linters quiet.

These are non-functional cleanups but will make the notebook easier to maintain.

Also applies to: 83-97
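For context, the IDX1 label format the notebook reads is a big-endian magic number (0x00000801 for labels), a count, and then one byte per label. A compact sketch with the unused count discarded as suggested (helper name hypothetical, mirroring the notebook's _read_idx_labels):

```python
import io
import struct

def read_idx_labels(data: bytes) -> list[int]:
    """Parse an IDX1 label file (the Fashion-MNIST labels format)."""
    f = io.BytesIO(data)
    magic, _ = struct.unpack(">II", f.read(8))  # count header is unused
    if magic != 0x00000801:
        raise ValueError(f"unexpected IDX1 magic: {magic:#010x}")
    return list(f.read())
```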

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4e71919 and 5f22561.

📒 Files selected for processing (5)
  • tests/trainer/kubeflow_sdk_test.go (1 hunks)
  • tests/trainer/resources/mnist.ipynb (1 hunks)
  • tests/trainer/sdk_tests/fashion_mnist_tests.go (1 hunks)
  • tests/trainer/utils/utils_cluster_prep.go (1 hunks)
  • tests/trainer/utils/utils_notebook.go (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/trainer/kubeflow_sdk_test.go
🧰 Additional context used
🧬 Code graph analysis (3)
tests/trainer/sdk_tests/fashion_mnist_tests.go (11)
tests/common/support/test.go (2)
  • T (90-102)
  • With (61-63)
tests/trainer/utils/utils_cluster_prep.go (1)
  • EnsureNotebookServiceAccount (31-39)
tests/common/environment.go (2)
  • GetNotebookUserName (70-76)
  • GetNotebookUserToken (78-84)
tests/common/support/rbac.go (1)
  • CreateUserRoleBindingWithClusterRole (206-238)
tests/common/support/core.go (4)
  • CreateConfigMap (33-54)
  • CreatePersistentVolumeClaim (242-276)
  • AccessModes (235-240)
  • StorageClassName (228-233)
tests/common/support/environment.go (5)
  • GetStorageBucketDefaultEndpoint (152-155)
  • GetStorageBucketAccessKeyId (162-165)
  • GetStorageBucketSecretKey (167-170)
  • GetStorageBucketName (172-175)
  • GetStorageBucketMnistDir (177-180)
tests/common/support/storage.go (1)
  • GetRWXStorageClass (23-36)
tests/common/support/config.go (1)
  • GetOpenShiftApiUrl (44-56)
tests/common/notebook.go (4)
  • CreateNotebook (104-185)
  • ContainerSizeSmall (75-75)
  • DeleteNotebook (187-190)
  • Notebooks (192-204)
tests/common/support/support.go (2)
  • TestTimeoutLong (35-35)
  • TestTimeoutDouble (36-36)
tests/trainer/utils/utils_notebook.go (2)
  • WaitForNotebookPodRunning (42-50)
  • PollNotebookLogsForStatus (53-72)
tests/trainer/utils/utils_notebook.go (4)
tests/common/support/test.go (1)
  • Test (34-45)
tests/common/notebook.go (3)
  • ContainerSize (72-72)
  • CreateNotebook (104-185)
  • NOTEBOOK_CONTAINER_NAME (37-37)
tests/common/support/core.go (4)
  • CreateConfigMap (33-54)
  • CreatePersistentVolumeClaim (242-276)
  • AccessModes (235-240)
  • GetPods (111-116)
tests/common/support/support.go (1)
  • TestTimeoutLong (35-35)
tests/trainer/utils/utils_cluster_prep.go (3)
tests/common/support/test.go (2)
  • T (90-102)
  • Test (34-45)
tests/common/support/core.go (2)
  • ServiceAccount (194-200)
  • ServiceAccounts (207-219)
tests/common/support/client.go (1)
  • Client (39-50)
🪛 Ruff (0.14.5)
tests/trainer/resources/mnist.ipynb

62-62: Avoid specifying long messages outside the exception class

(TRY003)


70-70: Avoid specifying long messages outside the exception class

(TRY003)


76-76: Unpacked variable num is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


78-78: Avoid specifying long messages outside the exception class

(TRY003)


87-87: Avoid specifying long messages outside the exception class

(TRY003)


190-190: Do not catch blind exception: Exception

(BLE001)


207-207: Do not catch blind exception: Exception

(BLE001)


298-298: Do not catch blind exception: Exception

(BLE001)


303-305: try-except-pass detected, consider logging the exception

(S110)


303-303: Do not catch blind exception: Exception

(BLE001)


311-311: Do not catch blind exception: Exception

(BLE001)


348-348: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.

(S310)


358-358: Do not catch blind exception: Exception

(BLE001)


392-392: Do not catch blind exception: Exception

(BLE001)


398-398: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


398-398: Avoid specifying long messages outside the exception class

(TRY003)

Comment thread tests/trainer/sdk_tests/fashion_mnist_tests.go Outdated
@briangallagher
Contributor

@MStokluska Instead of copying the notebook, would it be possible to reference it, in a similar way to what we do for PR testing here?

@MStokluska
Contributor Author

@MStokluska Instead of copying the notebook, would it be possible to reference it, in a similar way to what we do for PR testing here?

With the amount of edits required, not really. For example, the shared PV between the notebook and trainer workers is needed because of how we download the dataset to prepare isolated envs (and there are more subtle differences...)

Comment thread tests/trainer/resources/mnist.ipynb Outdated
" print(\"[notebook] Internet download completed successfully\")\n",
" except Exception as e:\n",
" print(f\"[notebook] Internet download failed: {e}\")\n",
" print(\"[notebook] WARNING: Dataset may be incomplete!\")\n",
Contributor
I would expect the test to fail in this case, WDYT?
It doesn't make sense to proceed without the dataset.

Contributor Author

True, will fix it up. Thanks Karel

Comment thread tests/trainer/resources/mnist.ipynb Outdated
" torch_runtime = client.get_runtime(\"torch-distributed\")\n",
"except Exception:\n",
" torch_runtime = next(\n",
" (r for r in client.list_runtimes() if r.name == \"torch-distributed\"),\n",
Contributor
What is the purpose of this?
I would expect that if client.get_runtime(\"torch-distributed\") fails then torch-distributed doesn't exist.

Contributor

@sutaakar sutaakar Nov 19, 2025


Thinking whether it would make sense to pass the runtime as a parameter/env variable to the notebook.
Will this notebook run for both CUDA and ROCm runtimes?

Edit - the notebook doesn't use GPU, so no reason to check specific runtimes.

Contributor Author

So the expectation here is that "torch-distributed" will be available.
The additional check is just in case there's something wrong with get_runtime vs list_runtimes, but now that you mention it, I think it makes more sense to just call get_runtime and fail if it's not there.
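The fail-fast version discussed above could look like this: a thin wrapper around get_runtime that surfaces a clear error instead of silently falling back to list_runtimes (the wrapper name is hypothetical; TrainerClient.get_runtime is the SDK call the notebook already uses):

```python
def require_runtime(client, name: str = "torch-distributed"):
    """Return the named runtime, failing fast with a clear message
    instead of masking the problem with a list_runtimes fallback."""
    try:
        return client.get_runtime(name)
    except Exception as e:
        raise RuntimeError(f"required training runtime {name!r} not found") from e
```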

Comment thread tests/trainer/utils/utils_notebook.go Outdated
func PollNotebookLogsForStatus(test Test, namespace, podName, containerName string, timeout time.Duration) error {
	var finalErr error
	test.Eventually(func() bool {
		out, err := exec.Command("kubectl", "-n", namespace, "logs", podName, "-c", containerName, "--tail=2000").CombinedOutput()
Contributor Author

Awesome, thank you.

@MStokluska
Contributor Author

@sutaakar I think all comments are addressed, let me know if you think there's anything else that I should improve. Thanks!

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (1)
tests/trainer/sdk_tests/fashion_mnist_tests.go (1)

112-114: Refactor to idiomatic Gomega assertion.

The current pattern is non-idiomatic: checking if err != nil and then expecting it To(Succeed()) will always fail when err is non-nil.

Apply this refactor for clarity:

-	if err := trainerutils.PollNotebookLogsForStatus(test, namespace.Name, podName, containerName, support.TestTimeoutDouble); err != nil {
-		test.Expect(err).To(Succeed(), "Notebook execution reported FAILURE")
-	}
+	err := trainerutils.PollNotebookLogsForStatus(test, namespace.Name, podName, containerName, support.TestTimeoutDouble)
+	test.Expect(err).NotTo(HaveOccurred(), "Notebook execution reported FAILURE")
🧹 Nitpick comments (1)
tests/trainer/resources/mnist.ipynb (1)

401-421: Consider pinning the Kubeflow Trainer SDK version.

For test reproducibility, you may want to pin the kubeflow-trainer package version in the notebook environment or shell command (similar to how boto3==1.34.162 is pinned on line 90 of fashion_mnist_tests.go). This ensures future SDK changes don't unexpectedly break the test.

If testing against the latest SDK version is intentional, you can safely ignore this suggestion.

📜 Review details

📥 Commits

Reviewing files that changed from the base of the PR and between 5f22561 and 8ae0dee.

📒 Files selected for processing (6)
  • .gitignore (1 hunks)
  • tests/trainer/kubeflow_sdk_test.go (1 hunks)
  • tests/trainer/resources/mnist.ipynb (1 hunks)
  • tests/trainer/sdk_tests/fashion_mnist_tests.go (1 hunks)
  • tests/trainer/utils/utils_cluster_prep.go (1 hunks)
  • tests/trainer/utils/utils_notebook.go (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
  • .gitignore
  • tests/trainer/kubeflow_sdk_test.go
  • tests/trainer/utils/utils_notebook.go
🧰 Additional context used
🧬 Code graph analysis (2)
tests/trainer/sdk_tests/fashion_mnist_tests.go (11)
tests/common/support/test.go (2)
  • T (90-102)
  • With (61-63)
tests/trainer/utils/utils_cluster_prep.go (1)
  • EnsureNotebookServiceAccount (31-39)
tests/common/environment.go (2)
  • GetNotebookUserName (70-76)
  • GetNotebookUserToken (78-84)
tests/common/support/rbac.go (1)
  • CreateUserRoleBindingWithClusterRole (206-238)
tests/common/support/core.go (4)
  • CreateConfigMap (33-54)
  • CreatePersistentVolumeClaim (242-276)
  • AccessModes (235-240)
  • StorageClassName (228-233)
tests/common/support/environment.go (5)
  • GetStorageBucketDefaultEndpoint (152-155)
  • GetStorageBucketAccessKeyId (162-165)
  • GetStorageBucketSecretKey (167-170)
  • GetStorageBucketName (172-175)
  • GetStorageBucketMnistDir (177-180)
tests/common/support/storage.go (1)
  • GetRWXStorageClass (23-36)
tests/common/support/config.go (1)
  • GetOpenShiftApiUrl (44-56)
tests/common/notebook.go (4)
  • CreateNotebook (104-185)
  • ContainerSizeSmall (75-75)
  • DeleteNotebook (187-190)
  • Notebooks (192-204)
tests/common/support/support.go (2)
  • TestTimeoutLong (35-35)
  • TestTimeoutDouble (36-36)
tests/trainer/utils/utils_notebook.go (2)
  • WaitForNotebookPodRunning (41-49)
  • PollNotebookLogsForStatus (52-81)
tests/trainer/utils/utils_cluster_prep.go (3)
tests/common/support/test.go (2)
  • T (90-102)
  • Test (34-45)
tests/common/support/core.go (2)
  • ServiceAccount (194-200)
  • ServiceAccounts (207-219)
tests/common/support/client.go (1)
  • Client (40-52)
🪛 Ruff (0.14.5)
tests/trainer/resources/mnist.ipynb

62-62: Avoid specifying long messages outside the exception class

(TRY003)


70-70: Avoid specifying long messages outside the exception class

(TRY003)


76-76: Unpacked variable num is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


78-78: Avoid specifying long messages outside the exception class

(TRY003)


87-87: Avoid specifying long messages outside the exception class

(TRY003)


190-190: Do not catch blind exception: Exception

(BLE001)


207-207: Do not catch blind exception: Exception

(BLE001)


298-298: Do not catch blind exception: Exception

(BLE001)


303-305: try-except-pass detected, consider logging the exception

(S110)


303-303: Do not catch blind exception: Exception

(BLE001)


311-311: Do not catch blind exception: Exception

(BLE001)


348-348: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.

(S310)


358-358: Do not catch blind exception: Exception

(BLE001)


360-360: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


360-360: Avoid specifying long messages outside the exception class

(TRY003)


393-393: Avoid specifying long messages outside the exception class

(TRY003)

🔇 Additional comments (1)
tests/trainer/utils/utils_cluster_prep.go (1)

29-39: LGTM! Clean helper implementation.

The function correctly ensures the ServiceAccount exists, properly handles AlreadyExists errors, and fails with a clear message on genuine creation failures. The implementation is straightforward and follows Go testing best practices.

Comment on lines +112 to +114
if err := trainerutils.PollNotebookLogsForStatus(test, namespace.Name, podName, containerName, support.TestTimeoutDouble); err != nil {
	test.Expect(err).To(Succeed(), "Notebook execution reported FAILURE")
}
Contributor
Suggested change
-	if err := trainerutils.PollNotebookLogsForStatus(test, namespace.Name, podName, containerName, support.TestTimeoutDouble); err != nil {
-		test.Expect(err).To(Succeed(), "Notebook execution reported FAILURE")
-	}
+	err = trainerutils.PollNotebookLogsForStatus(test, namespace.Name, podName, containerName, support.TestTimeoutDouble)
+	test.Expect(err).ShouldNot(HaveOccurred(), "Notebook execution reported FAILURE")

Comment thread tests/trainer/kubeflow_sdk_test.go Outdated
sdktests "github.com/opendatahub-io/distributed-workloads/tests/trainer/sdk_tests"
)

func TestKubeflowSDK_Sanity(t *testing.T) {
Contributor
Suggested change
-func TestKubeflowSDK_Sanity(t *testing.T) {
+func TestKubeflowSdkSanity(t *testing.T) {

@sutaakar
Contributor

@MStokluska I ran the test using NOTEBOOK_IMAGE=quay.io/bgallagher/universal-image:v2 and got error:

ModuleNotFoundError                       Traceback (most recent call last)
Cell In[3], line 5
      3 from kubernetes import client as k8s_client
      4 from kubeflow.trainer import TrainerClient
----> 5 from kubeflow.common.types import KubernetesBackendConfig
      7 openshift_api_url = os.getenv("OPENSHIFT_API_URL", "")
      8 token = os.getenv("NOTEBOOK_TOKEN", "")

ModuleNotFoundError: No module named 'kubeflow.common'

NOTEBOOK_STATUS: FAILURE

@MStokluska
Contributor Author

MStokluska commented Nov 24, 2025

@MStokluska I ran the test using NOTEBOOK_IMAGE=quay.io/bgallagher/universal-image:v2 and got error:

ModuleNotFoundError                       Traceback (most recent call last)
Cell In[3], line 5
      3 from kubernetes import client as k8s_client
      4 from kubeflow.trainer import TrainerClient
----> 5 from kubeflow.common.types import KubernetesBackendConfig
      7 openshift_api_url = os.getenv("OPENSHIFT_API_URL", "")
      8 token = os.getenv("NOTEBOOK_TOKEN", "")

ModuleNotFoundError: No module named 'kubeflow.common'

NOTEBOOK_STATUS: FAILURE

@sutaakar please run it with: quay.io/opendatahub/odh-training-th03-cuda128-torch28-py312-rhel9:odh-training-th03-157dbcf42c0dee17933360e343bfafe6-build-images or :latest

Brian's image might be outdated.

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (3)
tests/trainer/resources/mnist.ipynb (3)

17-155: Distributed training function and IDX dataset reader look good; minor cleanups possible

The CPU‑only DDP setup, strict PVC requirement at /mnt/shared/FashionMNIST/raw, and IDX dataset handling all look consistent with the e2e test design and the shared PVC contract.

Two small nits:

  • In _read_idx_labels, the unpacked num header value is never used; consider renaming it to _/_num or adding a simple integrity check using it to satisfy linters.
  • The imports from urllib.parse import urlparse and import gzip, shutil in this cell are now unused and can be dropped to keep the cell minimal.

163-379: S3/HTTP dataset provisioning flow is reasonable; exception handling could be slightly tightened

The dataset prep logic—writing into FASHION_RAW_DIR on the notebook’s PVC, streaming S3/MinIO downloads with short timeouts, decompressing .gz files, and finally falling back to the Fashion‑MNIST HTTP endpoint only if needed—provides a good balance between robustness and test friendliness. The final RuntimeError("Internet download failed; aborting test") ensures the test fails rather than training without data, which aligns with earlier feedback.

Given the static analysis hints, the only tweaks you might consider (non‑blocking):

  • Several except Exception as e handlers (S3 fetch, pagination, decompression, HTTP download) are intentionally broad. For clarity and to quiet BLE001/S110, you could either narrow them to the specific client/library error types you expect or at least log type(e)/repr(e) before returning/raising, so failures are more diagnosable.
  • Where you intentionally swallow errors (e.g., failing to os.remove(dst) after successful decompression), a brief comment as you already have (“Not critical...”) is enough; no change needed there.

Overall, the behavior is fine for test code; these are just lint‑driven niceties.
Based on static analysis hints.


399-556: Kubeflow SDK integration and job lifecycle handling align with the test’s intent; consider future assertions

The SDK client setup with OPENSHIFT_API_URL/NOTEBOOK_TOKEN, runtime lookup for "torch-distributed" via client.get_runtime(...), and submission of a 2‑node CustomTrainer job with the RWX PVC mounted at /mnt/shared all look coherent with the notebook’s train_fashion_mnist() expectations.

Job lifecycle cells (wait for Running/Complete, print step summaries, stream logs, then delete_job) are appropriate for a sanity‑level e2e. If you later want stronger validation, a possible follow‑up would be to assert on step/job statuses in the notebook itself (rather than only printing) to make failures more explicit, but that’s optional given the current log‑driven Go test harness.
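If the notebook later asserts on statuses rather than only printing them, a small gate could do it (hypothetical helper; the step-name/status pairs and the "Succeeded" status string are assumptions about what the SDK reports):

```python
def assert_all_steps_succeeded(steps):
    """steps: iterable of (name, status) pairs from the training job.
    Raise AssertionError listing any step that did not succeed."""
    failed = [name for name, status in steps if status != "Succeeded"]
    if failed:
        raise AssertionError(f"training steps did not succeed: {failed}")
```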

📜 Review details

📥 Commits

Reviewing files that changed from the base of the PR and between 8ae0dee and 098deb6.

📒 Files selected for processing (3)
  • tests/trainer/kubeflow_sdk_test.go (1 hunks)
  • tests/trainer/resources/mnist.ipynb (1 hunks)
  • tests/trainer/sdk_tests/fashion_mnist_tests.go (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/trainer/kubeflow_sdk_test.go
🧰 Additional context used
🪛 Ruff (0.14.5)
tests/trainer/resources/mnist.ipynb

62-62: Avoid specifying long messages outside the exception class

(TRY003)


70-70: Avoid specifying long messages outside the exception class

(TRY003)


76-76: Unpacked variable num is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


78-78: Avoid specifying long messages outside the exception class

(TRY003)


87-87: Avoid specifying long messages outside the exception class

(TRY003)


190-190: Do not catch blind exception: Exception

(BLE001)


207-207: Do not catch blind exception: Exception

(BLE001)


298-298: Do not catch blind exception: Exception

(BLE001)


303-305: try-except-pass detected, consider logging the exception

(S110)


303-303: Do not catch blind exception: Exception

(BLE001)


311-311: Do not catch blind exception: Exception

(BLE001)


348-348: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.

(S310)


358-358: Do not catch blind exception: Exception

(BLE001)


360-360: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


360-360: Avoid specifying long messages outside the exception class

(TRY003)


393-393: Avoid specifying long messages outside the exception class

(TRY003)

Comment on lines +33 to +115
const (
notebookName = "mnist.ipynb"
notebookPath = "resources/" + notebookName
)

// CPU Only - Distributed Training
func RunFashionMnistCpuDistributedTraining(t *testing.T) {
test := support.With(t)

// Create a new test namespace
namespace := test.NewTestNamespace()

// Ensure Notebook ServiceAccount exists (no extra RBAC)
trainerutils.EnsureNotebookServiceAccount(t, test, namespace.Name)

// RBACs setup
userName := common.GetNotebookUserName(test)
userToken := common.GetNotebookUserToken(test)
support.CreateUserRoleBindingWithClusterRole(test, userName, namespace.Name, "admin")

// Create ConfigMap with notebook
localPath := notebookPath
nb, err := os.ReadFile(localPath)
test.Expect(err).NotTo(HaveOccurred(), fmt.Sprintf("failed to read notebook: %s", localPath))
cm := support.CreateConfigMap(test, namespace.Name, map[string][]byte{notebookName: nb})

// Build command with parameters and pinned deps, and print definitive status line to logs
endpoint, endpointOK := support.GetStorageBucketDefaultEndpoint()
accessKey, _ := support.GetStorageBucketAccessKeyId()
secretKey, _ := support.GetStorageBucketSecretKey()
bucket, bucketOK := support.GetStorageBucketName()
prefix, _ := support.GetStorageBucketMnistDir()
if !endpointOK {
endpoint = ""
}
if !bucketOK {
bucket = ""
}
// Create RWX PVC for shared dataset and pass the claim name to the notebook
storageClass, err := support.GetRWXStorageClass(test)
test.Expect(err).NotTo(HaveOccurred(), "Failed to find an RWX supporting StorageClass")
rwxPvc := support.CreatePersistentVolumeClaim(
test,
namespace.Name,
"20Gi",
support.AccessModes(corev1.ReadWriteMany),
support.StorageClassName(storageClass.Name),
)

shellCmd := fmt.Sprintf(
"set -e; "+
"export OPENSHIFT_API_URL='%s'; export NOTEBOOK_TOKEN='%s'; "+
"export NOTEBOOK_NAMESPACE='%s'; "+
"export SHARED_PVC_NAME='%s'; "+
"export AWS_DEFAULT_ENDPOINT='%s'; export AWS_ACCESS_KEY_ID='%s'; "+
"export AWS_SECRET_ACCESS_KEY='%s'; export AWS_STORAGE_BUCKET='%s'; "+
"export AWS_STORAGE_BUCKET_MNIST_DIR='%s'; "+
"python -m pip install --quiet --no-cache-dir papermill boto3==1.34.162 && "+
"if python -m papermill -k python3 /opt/app-root/notebooks/%s /opt/app-root/src/out.ipynb --log-output; "+
"then echo 'NOTEBOOK_STATUS: SUCCESS'; else echo 'NOTEBOOK_STATUS: FAILURE'; fi; sleep infinity",
support.GetOpenShiftApiUrl(test), userToken, namespace.Name, rwxPvc.Name,
endpoint, accessKey, secretKey, bucket, prefix,
notebookName,
)
command := []string{"/bin/sh", "-c", shellCmd}

// Create Notebook CR using the RWX PVC
common.CreateNotebook(test, namespace, userToken, command, cm.Name, notebookName, 0, rwxPvc, common.ContainerSizeSmall)

// Cleanup
defer func() {
common.DeleteNotebook(test, namespace)
test.Eventually(common.Notebooks(test, namespace), support.TestTimeoutLong).Should(HaveLen(0))
}()

// Wait for the Notebook Pod and get pod/container names
podName, containerName := trainerutils.WaitForNotebookPodRunning(test, namespace.Name)

// Poll logs to check if the notebook execution completed successfully
err = trainerutils.PollNotebookLogsForStatus(test, namespace.Name, podName, containerName, support.TestTimeoutDouble)
test.Expect(err).ShouldNot(HaveOccurred(), "Notebook execution reported FAILURE")

}

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

End‑to‑end test wiring looks good; please verify notebook path and consider a couple of small refinements

The overall flow of RunFashionMnistCpuDistributedTraining is solid: new namespace, Notebook SA and RBAC, RWX PVC for shared data, ConfigMap‑backed notebook, papermill execution with NOTEBOOK_STATUS markers, and cleanup with a final assertion that no notebooks remain.

A few details to double‑check/tidy up:

  1. Notebook path resolution

    notebookPath is defined as:

    const (
    	notebookName = "mnist.ipynb"
    	notebookPath = "resources/" + notebookName
    )

    while the notebook added in this PR lives at tests/trainer/resources/mnist.ipynb. If there isn’t also a tests/trainer/sdk_tests/resources/mnist.ipynb (or symlink) in the repo, running the tests from the sdk_tests package will fail with “failed to read notebook: resources/mnist.ipynb”. Please confirm the actual layout and either adjust notebookPath (e.g. "../resources/"+notebookName) or move/link the file so the relative path is correct.

  2. SHARED_PVC_NAME env

    You export SHARED_PVC_NAME into the notebook container, but the current mnist.ipynb doesn’t read this variable (it uses fixed mount points /opt/app-root/src and /mnt/shared). That env var is therefore redundant right now; feel free to drop it unless you plan to consume it in the notebook later.

  3. Shell command robustness

    The shellCmd uses single‑quoted export VAR='%s' substitutions. If any of the substituted values ever contains a single quote, the shell will misparse the command. It’s unlikely for most of these (endpoint, bucket, access key), but not impossible for tokens. Longer term, it might be more robust to:

    • Pass these values via the Notebook CR’s env fields instead of through the shell, or
    • Add proper shell escaping around the interpolated values.

    Not urgent, but worth considering if this test is run in more varied environments.
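Until the values move onto the Notebook CR's env fields, the standard POSIX trick is to close the quote, emit an escaped quote, and reopen, i.e. replace every ' with '\''. A sketch in Python (the Go version would use strings.ReplaceAll; the helper name is hypothetical):

```python
def shell_single_quote(value: str) -> str:
    """Quote value for safe interpolation into a POSIX shell command,
    escaping embedded single quotes as '\\'' (close, escape, reopen)."""
    return "'" + value.replace("'", "'\\''") + "'"

# A token containing a single quote no longer breaks the export line:
print("export NOTEBOOK_TOKEN=" + shell_single_quote("abc'def"))
# → export NOTEBOOK_TOKEN='abc'\''def'
```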


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "Searching for mnist.ipynb locations under the repo root..."
fd -a 'mnist.ipynb' .

echo
echo "Checking path resolution from tests/trainer/sdk_tests..."
cd tests/trainer/sdk_tests

if [[ -f resources/mnist.ipynb ]]; then
  echo "OK: resources/mnist.ipynb exists relative to sdk_tests (notebookPath is consistent)."
else
  echo "WARNING: resources/mnist.ipynb not found relative to sdk_tests."
  echo "RunFashionMnistCpuDistributedTraining may fail to read the notebook; consider adjusting notebookPath (e.g., \"../resources/mnist.ipynb\") or moving/linking the file."
fi

Length of output: 399


Fix notebook path resolution to unblock test execution

Verification confirms the notebook path issue is blocking. The only mnist.ipynb in the repo is at tests/trainer/resources/mnist.ipynb, but notebookPath resolves to resources/mnist.ipynb relative to the test package directory (tests/trainer/sdk_tests/). This will cause os.ReadFile(localPath) to fail.

Required fix:

  • Adjust notebookPath to "../resources/" + notebookName (line ~35), or
  • Move/symlink the notebook to tests/trainer/sdk_tests/resources/mnist.ipynb.

Minor observations (non-blocking):

  • SHARED_PVC_NAME is exported but unused by mnist.ipynb; drop if not planned for later.
  • Shell command uses single-quoted string interpolation; consider env field injection on the Notebook CR to avoid quote-escaping issues.
🤖 Prompt for AI Agents
In tests/trainer/sdk_tests/fashion_mnist_tests.go around lines 33 to 115, the
notebookPath is wrong (it points to "resources/mnist.ipynb" relative to
tests/trainer/sdk_tests/) which causes os.ReadFile to fail; update the path to
refer to the actual file location by changing notebookPath to
"../resources/"+notebookName (or alternatively move/symlink the notebook into
tests/trainer/sdk_tests/resources/), then run the test to confirm os.ReadFile
succeeds.

Contributor

@sutaakar sutaakar left a comment


lgtm, thanks

@openshift-ci

openshift-ci Bot commented Nov 25, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: sutaakar

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details: Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-bot openshift-merge-bot Bot merged commit 39f5ccd into opendatahub-io:main Nov 25, 2025
5 checks passed