
test: initial implementation of SDK e2e#488

Merged
openshift-merge-bot[bot] merged 5 commits into opendatahub-io:main from MStokluska:sdk_tests
Nov 25, 2025

Conversation


@MStokluska MStokluska commented Nov 6, 2025

Addition of an initial e2e sanity test for the Kubeflow SDK

Description

As part of this work we are adding the initial e2e test for the Kubeflow SDK (a two-node, CPU-only Fashion-MNIST training run).
The goal is to make it as easy as possible to extend with further notebook-based tests.

How Has This Been Tested?

  • Provisioned my cluster and installed Trainer v2 with its CRDs
  • Created a ClusterTrainingRuntime CR called "universal" that relies on an early build of the universal image
  • Ran the tests following the trainer README with the following environment variables:
    -- "TEST_TIER": "Sanity"
    -- "ODH_NAMESPACE": "opendatahub"
    -- "NOTEBOOK_USER_NAME": "xyz"
    -- "NOTEBOOK_USER_TOKEN": "some_token"
    -- "NOTEBOOK_IMAGE": "quay.io/bgallagher/universal-image:v2"

Note: The tests will not pass without the Trainer v2 controller, its CRDs, and a custom ClusterTrainingRuntime that uses the universal image under the hood (our dev image: quay.io/bgallagher/universal-image:v2)

Merge criteria:

  • The commits are squashed in a cohesive manner and have meaningful messages.
  • Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
  • The developer has manually tested the changes and verified that the changes work.

Summary by CodeRabbit

  • Tests

    • Added Kubeflow trainer sanity tests and an end-to-end Fashion‑MNIST distributed training validation that runs a notebook workflow, submits a distributed training job, streams logs, and verifies success.
    • Added helpers to prepare cluster resources, ensure notebook service accounts, provision PVC-backed notebook workloads, wait for notebook pods, and poll logs for execution status.
  • Chores

    • Updated .gitignore to exclude VSCode configuration directory.


@openshift-ci openshift-ci Bot requested review from kapil27 and pawelpaszki November 6, 2025 09:30

coderabbitai Bot commented Nov 6, 2025

Walkthrough

Adds tests and helpers to run a Kubeflow/Papermill-driven distributed FashionMNIST training: a new Jupyter notebook, orchestration test functions and utilities for notebook lifecycle and cluster preparation, a sanity test entry, and a .gitignore rule to ignore .vscode/.

Changes

Cohort / File(s) Summary
Configuration
\.gitignore
Added VSCode directory ignore pattern (.vscode/*).
Test Entry
tests/trainer/kubeflow_sdk_test.go
New Go test TestKubeflowSdkSanity that marks the test Sanity and invokes the FashionMNIST distributed training flow.
Notebook Resource
tests/trainer/resources/mnist.ipynb
New Jupyter notebook implementing end-to-end distributed FashionMNIST training (PyTorch CNN, CPU-only gloo distributed init, DDP, PVC/S3 data staging, IDX reader, 2-epoch training loop, papermill execution, and Kubeflow trainer client submission/monitoring).
SDK Test Implementation
tests/trainer/sdk_tests/fashion_mnist_tests.go
New orchestration function RunFashionMnistCpuDistributedTraining: namespace creation, ensure SA/RBAC, ConfigMap from notebook bytes, RWX PVC provisioning, papermill execution command, Notebook CR creation, wait for notebook pod, poll logs for success marker, and cleanup.
Cluster Prep Utilities
tests/trainer/utils/utils_cluster_prep.go
Added EnsureNotebookServiceAccount(t *testing.T, test Test, namespace string) to ensure a Notebook ServiceAccount exists (creates one if missing).
Notebook Utilities
tests/trainer/utils/utils_notebook.go
Added helpers: CreateNotebookFromBytes, WaitForNotebookPodRunning, and PollNotebookLogsForStatus to create a ConfigMap+Notebook CR, wait for the notebook pod to be Running, and poll container logs for a `NOTEBOOK_STATUS: SUCCESS/FAILURE` marker.
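The log-marker detection these helpers rely on can be sketched as a small pure function. This is a sketch with hypothetical names; the actual helper in utils_notebook.go may be structured differently:

```go
package main

import (
	"fmt"
	"strings"
)

// notebookStatus scans notebook container logs for a completion marker.
// It returns "SUCCESS", "FAILURE", or "" if no marker has been emitted yet.
func notebookStatus(logs string) string {
	for _, line := range strings.Split(logs, "\n") {
		switch {
		case strings.Contains(line, "NOTEBOOK_STATUS: SUCCESS"):
			return "SUCCESS"
		case strings.Contains(line, "NOTEBOOK_STATUS: FAILURE"):
			return "FAILURE"
		}
	}
	return ""
}

func main() {
	logs := "installing deps...\npapermill running\nNOTEBOOK_STATUS: SUCCESS\n"
	fmt.Println(notebookStatus(logs)) // SUCCESS
}
```

Keeping the marker parsing separate from the log-fetching transport makes the success/failure decision trivially unit-testable.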

Sequence Diagram(s)

sequenceDiagram
    participant Test as Test Runner
    participant K8s as Kubernetes API
    participant Notebook as Notebook Pod
    participant Trainer as Kubeflow Trainer

    Test->>K8s: Create namespace & Ensure ServiceAccount
    Test->>K8s: Create ConfigMap (notebook bytes) & RWX PVC
    Test->>K8s: Create Notebook CR (papermill command)

    rect rgb(235,247,255)
    Note over Notebook: Pod boots, installs deps, runs papermill (gloo, CPU)
    Notebook->>Trainer: Discover runtimes & submit CustomTrainer job
    Trainer->>Trainer: Launch distributed worker steps
    Trainer->>Notebook: Stream job logs/status
    Notebook->>Notebook: Emit completion marker in logs (NOTEBOOK_STATUS: SUCCESS/FAILURE)
    end

    Test->>Notebook: Poll logs until SUCCESS/FAILURE or timeout
    alt SUCCESS
        Test->>K8s: Delete Notebook, PVC, ConfigMap (cleanup)
    else FAILURE or timeout
        Test->>K8s: Collect logs, delete resources
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

  • Pay extra attention to:
    • tests/trainer/resources/mnist.ipynb — distributed PyTorch initialization (gloo), IDX reader and PVC/S3 staging, papermill flow, and trainer client integration.
    • tests/trainer/sdk_tests/fashion_mnist_tests.go — namespace/RBAC handling, long waits/timeouts, pod/log polling, and cleanup correctness.
    • tests/trainer/utils/utils_notebook.go and tests/trainer/utils/utils_cluster_prep.go — Kubernetes API usage, ServiceAccount creation semantics, and robust log-based status detection.

Poem

🐰 I hop through pods with a curious cheer,
Papermill hums as notebooks appear,
Workers sync on CPUs, tiny and bright,
PVCs share pixels through day and night,
A rabbit applauds these tests with delight.

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'test: initial implementation of SDK e2e' accurately describes the main change—adding the first end-to-end sanity test for the Kubeflow SDK, which aligns with the changeset's primary objective of introducing SDK e2e testing infrastructure.
Docstring Coverage ✅ Passed Docstring coverage is 83.33% which is sufficient. The required threshold is 80.00%.



@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 7

🧹 Nitpick comments (4)
tests/trainer/utils/utils_cluster_prep.go (2)

53-55: Consider making expected runtimes configurable.

The hardcoded list of ClusterTrainingRuntimes (torch-cuda-241, etc.) conflicts with the PR's stated requirement for a "universal" runtime, and the TODO comment suggests this list will change.

Consider accepting expected runtime names as a parameter or reading from an environment variable to make the test more flexible:

-func EnsureTrainerClusterReady(t *testing.T, test Test) {
+func EnsureTrainerClusterReady(t *testing.T, test Test, expectedRuntimes ...string) {
+	if len(expectedRuntimes) == 0 {
+		// Default runtimes if none specified
+		expectedRuntimes = []string{"torch-cuda-241", "torch-cuda-251", "torch-rocm-241", "torch-rocm-251"}
+	}
 	...
-	for _, name := range []string{"torch-cuda-241", "torch-cuda-251", "torch-rocm-241", "torch-rocm-251"} {
+	for _, name := range expectedRuntimes {

64-66: Hardcoded ServiceAccount name reduces reusability.

The ServiceAccount name "jupyter-nb-kube-3aadmin" is hardcoded and matches common.NOTEBOOK_CONTAINER_NAME, but this tight coupling limits flexibility for other test scenarios.

Consider accepting the SA name as a parameter:

-func EnsureNotebookRBAC(t *testing.T, test Test, namespace string) {
+func EnsureNotebookRBAC(t *testing.T, test Test, namespace string, saName string) {
 	t.Helper()
-	saName := "jupyter-nb-kube-3aadmin"

Alternatively, if this SA name is always tied to notebooks, accept it from common.NOTEBOOK_CONTAINER_NAME explicitly to make the dependency clear.

tests/trainer/resources/mnist.ipynb (1)

17-125: Consider adding training validation.

The training function completes successfully but doesn't validate that the model actually learned (e.g., checking loss decreased or accuracy improved). For a sanity test, consider adding basic validation.

For example, after training completes, you could add:

# At the end of train_fashion_mnist, before dist.destroy_process_group()
if dist.get_rank() == 0:
    if loss.item() > initial_loss:
        print("WARNING: Loss did not decrease during training")
    print(f"Final loss: {loss.item():.6f}")
tests/trainer/utils/utils_notebook.go (1)

49-53: Unused helper function.

CreateNotebookFromBytes is defined but not used anywhere in this PR. The test in fashion_mnist_tests.go creates the ConfigMap and Notebook separately.

Consider either:

  1. Using this helper in fashion_mnist_tests.go to reduce duplication
  2. Removing it if it's not needed yet
  3. Adding a comment explaining it's for future use
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2595866 and 796d164.

📒 Files selected for processing (6)
  • .gitignore (1 hunks)
  • tests/trainer/kubeflow_sdk_test.go (1 hunks)
  • tests/trainer/resources/mnist.ipynb (1 hunks)
  • tests/trainer/sdk_tests/fashion_mnist_tests.go (1 hunks)
  • tests/trainer/utils/utils_cluster_prep.go (1 hunks)
  • tests/trainer/utils/utils_notebook.go (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (4)
tests/trainer/utils/utils_notebook.go (4)
tests/common/support/test.go (1)
  • Test (34-45)
tests/common/notebook.go (3)
  • ContainerSize (72-72)
  • CreateNotebook (104-185)
  • NOTEBOOK_CONTAINER_NAME (37-37)
tests/common/support/core.go (4)
  • CreateConfigMap (33-54)
  • CreatePersistentVolumeClaim (242-276)
  • AccessModes (235-240)
  • GetPods (111-116)
tests/common/support/support.go (1)
  • TestTimeoutLong (35-35)
tests/trainer/kubeflow_sdk_test.go (2)
tests/common/test_tag.go (2)
  • Tags (32-40)
  • Sanity (48-50)
tests/trainer/sdk_tests/fashion_mnist_tests.go (1)
  • RunFashionMnistCpuDistributedTraining (39-86)
tests/trainer/sdk_tests/fashion_mnist_tests.go (8)
tests/common/support/test.go (2)
  • T (90-102)
  • With (61-63)
tests/trainer/utils/utils_cluster_prep.go (2)
  • EnsureTrainerClusterReady (33-56)
  • EnsureNotebookRBAC (60-83)
tests/common/environment.go (2)
  • GetNotebookUserName (70-76)
  • GetNotebookUserToken (78-84)
tests/common/support/rbac.go (1)
  • CreateUserRoleBindingWithClusterRole (206-238)
tests/common/support/core.go (3)
  • CreateConfigMap (33-54)
  • CreatePersistentVolumeClaim (242-276)
  • AccessModes (235-240)
tests/trainer/utils/utils_notebook.go (3)
  • BuildPapermillShellCmd (36-46)
  • WaitForNotebookPodRunning (56-64)
  • PollNotebookCompletionMarker (67-85)
tests/common/notebook.go (4)
  • CreateNotebook (104-185)
  • ContainerSizeSmall (75-75)
  • DeleteNotebook (187-190)
  • Notebooks (192-204)
tests/common/support/support.go (2)
  • TestTimeoutLong (35-35)
  • TestTimeoutDouble (36-36)
tests/trainer/utils/utils_cluster_prep.go (4)
tests/common/support/test.go (2)
  • T (90-102)
  • Test (34-45)
tests/common/support/client.go (1)
  • Client (39-50)
tests/common/support/core.go (2)
  • ServiceAccount (194-200)
  • ServiceAccounts (207-219)
tests/common/support/rbac.go (4)
  • CreateClusterRole (48-70)
  • CreateClusterRoleBinding (135-169)
  • CreateRole (27-46)
  • CreateRoleBinding (72-102)
🪛 GitHub Actions: Verify Generated Files and Import Organization
tests/trainer/utils/utils_notebook.go

[error] 1-1: openshift-goimports check failed: file is not sorted. Run 'make imports' to fix imports.

tests/trainer/utils/utils_cluster_prep.go

[error] 1-1: openshift-goimports check failed: file is not sorted. Run 'make imports' to fix imports.

🔇 Additional comments (5)
.gitignore (1)

2-2: LGTM!

Adding .vscode/* to gitignore is a standard best practice that prevents user-specific IDE settings and workspace configurations from being committed to version control. This is especially helpful for e2e test development where contributors may use VSCode.

tests/trainer/kubeflow_sdk_test.go (1)

26-30: LGTM!

Clean test entry point that properly tags the test as Sanity and delegates to the actual test implementation. The comment on line 29 provides a clear extension point for additional tests.

tests/trainer/sdk_tests/fashion_mnist_tests.go (2)

39-86: Well-structured test implementation.

The test follows good practices:

  • Proper test namespace isolation
  • Prerequisite checks before execution
  • RBAC setup
  • Resource cleanup with defer
  • Clear error messages

The flow is logical and the use of helper utilities keeps the test readable.


33-36: Relative path verified—notebook file found at expected location.

The notebook file exists at tests/trainer/resources/mnist.ipynb, confirming the relative path "resources/" resolves correctly when tests run from the package directory.

tests/trainer/utils/utils_notebook.go (1)

36-46: LGTM!

The Papermill shell command construction is clean and handles optional extra packages appropriately. The set -e ensures failures are caught, and the marker file pattern is a good approach for async completion detection.

Comment thread tests/trainer/resources/mnist.ipynb Outdated
Comment thread tests/trainer/utils/utils_cluster_prep.go
Comment on lines +37 to +39
if out, err := exec.Command("kubectl", "get", "crd", "jobsets.jobset.x-k8s.io").CombinedOutput(); err != nil {
t.Fatalf("JobSet CRD missing: %v\n%s", err, string(out))
}

@coderabbitai coderabbitai Bot Nov 6, 2025


⚠️ Potential issue | 🟠 Major

Replace kubectl shell command with Kubernetes client API.

Using exec.Command("kubectl", ...) is brittle and introduces external dependencies. The test framework already provides a Kubernetes client via test.Client().

Apply this diff to use the Kubernetes API client:

-	if out, err := exec.Command("kubectl", "get", "crd", "jobsets.jobset.x-k8s.io").CombinedOutput(); err != nil {
-		t.Fatalf("JobSet CRD missing: %v\n%s", err, string(out))
-	}
+	_, err := test.Client().Dynamic().Resource(schema.GroupVersionResource{
+		Group:    "apiextensions.k8s.io",
+		Version:  "v1",
+		Resource: "customresourcedefinitions",
+	}).Get(test.Ctx(), "jobsets.jobset.x-k8s.io", metav1.GetOptions{})
+	test.Expect(err).NotTo(HaveOccurred(), "JobSet CRD missing")

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In tests/trainer/utils/utils_cluster_prep.go around lines 37 to 39, replace the
exec.Command("kubectl", ...) call with a Kubernetes API call using the provided
test.Client(): call
test.Client().ApiextensionsV1().CustomResourceDefinitions().Get(ctx,
"jobsets.jobset.x-k8s.io", metav1.GetOptions{}) (using a context, e.g.,
context.Background()), and if it returns an error fail the test with t.Fatalf
including the error; remove the exec.Command usage and its output handling and
ensure the file imports context and metav1 (and any apiextensions client types
if missing).

Contributor Author


This is a good suggestion!
Will switch to it most likely but after comments from others.


@MStokluska, sounds good! Feel free to implement the change once you've gathered feedback from the other reviewers. I'm here if you need any help with the implementation.

Contributor


This code is just temporary and will be removed once Trainer v2 is onboarded into ODH/RHOAI, right?


Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!

Contributor


If so, then we can keep it as it is now, rather than investing much effort into something that will be deleted soon.

Comment on lines +42 to +44
if out, err := exec.Command("kubectl", "-n", "opendatahub", "wait", "--for=condition=available", "--timeout=180s", "deploy/kubeflow-trainer-controller-manager").CombinedOutput(); err != nil {
t.Fatalf("Trainer controller not available: %v\n%s", err, string(out))
}

@coderabbitai coderabbitai Bot Nov 6, 2025


⚠️ Potential issue | 🟠 Major

Replace kubectl shell command with Kubernetes client API and remove hardcoded namespace.

Using exec.Command("kubectl", ...) is brittle. Additionally, the namespace "opendatahub" is hardcoded, reducing reusability.

Apply this diff to use the Kubernetes API client with Eventually for polling:

-	if out, err := exec.Command("kubectl", "-n", "opendatahub", "wait", "--for=condition=available", "--timeout=180s", "deploy/kubeflow-trainer-controller-manager").CombinedOutput(); err != nil {
-		t.Fatalf("Trainer controller not available: %v\n%s", err, string(out))
-	}
+	odhNamespace := "opendatahub" // Consider making this configurable
+	test.Eventually(func(g Gomega) {
+		deploy, err := test.Client().Core().AppsV1().Deployments(odhNamespace).Get(test.Ctx(), "kubeflow-trainer-controller-manager", metav1.GetOptions{})
+		g.Expect(err).NotTo(HaveOccurred())
+		g.Expect(deploy.Status.Conditions).To(ContainElement(MatchFields(IgnoreExtras, Fields{
+			"Type":   Equal(appsv1.DeploymentAvailable),
+			"Status": Equal(corev1.ConditionTrue),
+		})))
+	}, 180*time.Second).Should(Succeed(), "Trainer controller not available")

Committable suggestion skipped: line range outside the PR's diff.

Contributor Author


same as above


🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Fetch all review comments from this PR to find the "above" comment being referenced
gh pr view 488 --repo opendatahub-io/distributed-workloads --json comments,reviews --jq '.reviews[].comments[] | select(.author.login == "coderabbitai" or .author.login == "MStokluska") | {path: .path, line: .line, body: .body, author: .author.login}' | head -50

Length of output: 25


🏁 Script executed:

#!/bin/bash
# Try fetching review comments using the GitHub API directly
gh api repos/opendatahub-io/distributed-workloads/pulls/488/comments --jq '.[] | select(.user.login == "coderabbitai[bot]" or .user.login == "MStokluska") | {path: .path, line: .line, body: .body, author: .user.login}' | head -100

Length of output: 18212


@MStokluska, understood! I'll wait for your updates after you've gathered feedback from the team. Feel free to reach out if you need any assistance with the implementation.

🐰✨

Comment thread tests/trainer/utils/utils_cluster_prep.go Outdated
Comment thread tests/trainer/utils/utils_notebook.go
Comment thread tests/trainer/utils/utils_notebook.go Outdated

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (2)
tests/trainer/utils/utils_notebook.go (2)

17-32: Fix import organization per pipeline failure.

The pipeline check failed because imports are not properly sorted. Run make imports to fix the import organization.


66-85: Replace kubectl shell command with Kubernetes client API.

Using exec.Command("kubectl", ...) is brittle and requires kubectl in PATH. This should use the Kubernetes client API via the pod exec subresource as detailed in the previous review comment.

🧹 Nitpick comments (1)
tests/trainer/utils/utils_notebook.go (1)

55-64: Eliminate redundant GetPods call.

Lines 62-63 repeat the same GetPods call that was already executed in the Eventually block (lines 58-60), which is inefficient.

Apply this diff to store and reuse the pods result:

 func WaitForNotebookPodRunning(test Test, namespace string) (string, string) {
 	labelSelector := fmt.Sprintf("notebook-name=%s", common.NOTEBOOK_CONTAINER_NAME)
+	var pods []corev1.Pod
 	test.Eventually(func() []corev1.Pod {
-		return GetPods(test, namespace, metav1.ListOptions{LabelSelector: labelSelector, FieldSelector: "status.phase=Running"})
+		pods = GetPods(test, namespace, metav1.ListOptions{LabelSelector: labelSelector, FieldSelector: "status.phase=Running"})
+		return pods
 	}, TestTimeoutLong).Should(HaveLen(1), "Expected exactly one notebook pod")
 
-	pods := GetPods(test, namespace, metav1.ListOptions{LabelSelector: labelSelector, FieldSelector: "status.phase=Running"})
 	return pods[0].Name, pods[0].Spec.Containers[0].Name
 }
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 796d164 and 4e71919.

📒 Files selected for processing (2)
  • tests/trainer/utils/utils_cluster_prep.go (1 hunks)
  • tests/trainer/utils/utils_notebook.go (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/trainer/utils/utils_cluster_prep.go
🧰 Additional context used
🧬 Code graph analysis (1)
tests/trainer/utils/utils_notebook.go (4)
tests/common/support/test.go (1)
  • Test (34-45)
tests/common/notebook.go (3)
  • ContainerSize (72-72)
  • CreateNotebook (104-185)
  • NOTEBOOK_CONTAINER_NAME (37-37)
tests/common/support/core.go (4)
  • CreateConfigMap (33-54)
  • CreatePersistentVolumeClaim (242-276)
  • AccessModes (235-240)
  • GetPods (111-116)
tests/common/support/support.go (1)
  • TestTimeoutLong (35-35)
🔇 Additional comments (2)
tests/trainer/utils/utils_notebook.go (2)

34-46: LGTM! Papermill command construction is sound.

The shell command properly installs dependencies, executes the notebook with papermill, writes a completion marker, and uses sleep infinity to keep the container running for inspection. The use of set -e ensures proper error propagation.


48-53: LGTM! Notebook setup orchestration is correct.

The function properly orchestrates the creation of a ConfigMap, PVC (10Gi is reasonable for test scenarios), and Notebook CR, passing through all necessary parameters.

Comment thread tests/trainer/resources/mnist.ipynb Outdated
" dataset = datasets.FashionMNIST(\n",
" \"./data\",\n",
" train=True,\n",
" download=True,\n",
Contributor


This will become an issue for running tests on disconnected clusters.
In Trainer v1 tests we uploaded dataset on AWS S3, it is downloaded from there if AWS env variables are declared - https://github.com/opendatahub-io/distributed-workloads/blob/main/tests/kfto/resources/kfto_sdk_mnist.py#L67

Comment thread tests/trainer/resources/mnist.ipynb Outdated
"for runtime in client.list_runtimes():\n",
" print(runtime)\n",
" if runtime.name == \"universal\": # Update to actual universal image runtime once available\n",
" torch_runtime = runtime"
Contributor


Why this approach instead of getting universal runtime by client.get_runtime("universal")?

Contributor Author


That notebook is basically a copy of the upstream test that the SDK relies on. I wanted to keep it as close as possible to the original, but I guess there's no harm in moving to get_runtime. Thanks Karel!

Comment thread tests/trainer/resources/mnist.ipynb Outdated
" \"cpu\": 2,\n",
" \"memory\": \"8Gi\",\n",
" },\n",
" packages_to_install=[\"torchvision\"],\n",
Contributor


I think it would be good to install specific version, to make sure that a future upgrade doesn't break test.

Contributor Author


yes I think it makes sense. Thanks.

}
// TODO: Extend / tweak with universal image runtime once available
for _, name := range []string{"torch-cuda-241", "torch-cuda-251", "torch-rocm-241", "torch-rocm-251"} {
test.Expect(found[name]).To(BeTrue(), fmt.Sprintf("Expected ClusterTrainingRuntime '%s' not found", name))
Contributor


Test verifying what TrainingRuntimes are available is already implemented in https://github.com/opendatahub-io/distributed-workloads/blob/main/tests/trainer/custom_training_runtimes_test.go#L63
Is there any specific reason to check it here too?

Contributor Author


Initially I was planning on making this test a smoke test. Given that your runtime check is also smoke, I wasn't sure which one would run first, so I decided to add the additional check to capture the missing-runtime error if it happens.
But since I've moved it to sanity, and I'm assuming sanity runs after smoke, I guess it's fine to remove this check? WDYT?

Contributor


Yes, can be removed


// EnsureNotebookRBAC sets up the Notebook ServiceAccount and RBAC so that notebooks can
// read ClusterTrainingRuntimes (cluster-scoped), and create/read TrainJobs and pod logs in the namespace.
func EnsureNotebookRBAC(t *testing.T, test Test, namespace string) {
Contributor


I would rather prefer passing user token into SDK client to provide needed access.
IMHO that is the approach which majority of customers would take.

Contributor Author


Agree, thanks Karel

// PollNotebookCompletionMarker polls the given marker file inside the notebook pod until SUCCESS/FAILURE or timeout.
func PollNotebookCompletionMarker(test Test, namespace, podName, containerName, marker string, timeout time.Duration) error {
var finalErr error
test.Eventually(func() bool {
Contributor


Thinking about whether it would be simpler to print status into Workbench log directly.

It could potentially mix with other log messages produced by Workbench, but it would avoid having to access Pod filesystem.
WDYT?


// Wait for the Notebook Pod and get pod/container names
podName, containerName := trainerutils.WaitForNotebookPodRunning(test, namespace.Name)

Contributor


Would it have sense to add assertion checking that TrainJob is created and successfully finished?

Contributor Author


Yes, I think this is very important :)
Thanks

Contributor Author


Hey, as agreed, we are going to leave it out.


MStokluska commented Nov 18, 2025

hey @sutaakar
I've addressed the comments and had to implement a couple of workarounds...


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (1)
tests/trainer/utils/utils_notebook.go (1)

34-39: Helpers are solid; consider eventually replacing kubectl logs with Kubernetes client calls.

The three helpers fit together nicely:

  • CreateNotebookFromBytes abstracts ConfigMap + Notebook + PVC creation.
  • WaitForNotebookPodRunning correctly waits for a single Running notebook pod using the shared label/constant.
  • PollNotebookLogsForStatus gives a clean SUCCESS/FAILURE signal based on log markers.

One medium-term improvement (already raised earlier) is the reliance on exec.Command("kubectl", ...) in PollNotebookLogsForStatus. It couples the tests to an external binary and kubeconfig, even though you already have a Kubernetes client in test.Client() that can stream logs directly.

When these tests move from “temporary” to more permanent, it would be worth switching to a client-go–based log reader (or pod exec) to remove the kubectl dependency and keep everything within the API client.

#!/bin/bash
# Inventory remaining direct kubectl usages to clean up once the approach is finalized.
rg -n 'exec.Command\("kubectl"' tests -C2 || true

Also applies to: 41-50, 52-71
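Whether the logs come from kubectl today or a client-go stream later, the SUCCESS/FAILURE decision itself is transport-independent. A minimal sketch of that marker-scanning logic, written in Python for brevity (the function name is hypothetical; the NOTEBOOK_STATUS markers match the ones the test's shell wrapper prints):

```python
def notebook_status(logs: str) -> tuple[bool, bool]:
    """Return (done, succeeded) from captured notebook pod logs.

    The test's shell wrapper prints a definitive 'NOTEBOOK_STATUS: ...'
    marker once papermill finishes, so polling only needs to look for it.
    """
    if "NOTEBOOK_STATUS: SUCCESS" in logs:
        return True, True
    if "NOTEBOOK_STATUS: FAILURE" in logs:
        return True, False
    return False, False  # not finished yet; keep polling
```

Keeping this decision pure makes it trivial to unit-test, regardless of whether the transport is exec.Command or a client-go log reader.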

🧹 Nitpick comments (2)
tests/trainer/utils/utils_cluster_prep.go (1)

31-38: ServiceAccount helper looks good; consider reusing shared constant for the name.

The creation logic and IsAlreadyExists handling are solid. To avoid future drift with other helpers (e.g., common.NOTEBOOK_CONTAINER_NAME used for notebook naming and labels), consider sourcing saName from that shared constant instead of duplicating the literal "jupyter-nb-kube-3aadmin".

This keeps the SA name aligned across notebook creation, deletion, and pod selection helpers.

tests/trainer/resources/mnist.ipynb (1)

47-79: Align comment with current behavior and tidy unused unpacked variable.

Two small polish items in the training cell:

  1. The comment:
# Prefer shared PVC if present; else fallback to internet download (rank 0 only)

no longer matches the logic:

use_pvc = os.path.isdir(pvc_raw) and any(os.scandir(pvc_raw))

if not use_pvc:
    raise RuntimeError("Shared PVC not mounted or empty at /mnt/shared/FashionMNIST/raw; this test requires a pre-populated RWX PVC")

Since the function now strictly requires a pre-populated PVC (and upstream cells handle S3/HTTP fallback), consider updating the comment to reflect that requirement to avoid confusion for future readers.

  1. In _read_idx_labels, num is unpacked but unused:
magic, num = struct.unpack(">II", f.read(8))

Changing this to

magic, _ = struct.unpack(">II", f.read(8))

(or similar) makes the intent explicit and keeps linters quiet.

These are non-functional cleanups but will make the notebook easier to maintain.

Also applies to: 83-97
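For context, the IDX1 label format the notebook reads is a big-endian magic number (0x00000801 for labels), a count, and then one byte per label. A compact sketch with the unused count discarded as suggested (helper name hypothetical, mirroring the notebook's _read_idx_labels):

```python
import io
import struct

def read_idx_labels(data: bytes) -> list[int]:
    """Parse an IDX1 label file (the Fashion-MNIST labels format)."""
    f = io.BytesIO(data)
    magic, _ = struct.unpack(">II", f.read(8))  # count header is unused
    if magic != 0x00000801:
        raise ValueError(f"unexpected IDX1 magic: {magic:#010x}")
    return list(f.read())
```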

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4e71919 and 5f22561.

📒 Files selected for processing (5)
  • tests/trainer/kubeflow_sdk_test.go (1 hunks)
  • tests/trainer/resources/mnist.ipynb (1 hunks)
  • tests/trainer/sdk_tests/fashion_mnist_tests.go (1 hunks)
  • tests/trainer/utils/utils_cluster_prep.go (1 hunks)
  • tests/trainer/utils/utils_notebook.go (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/trainer/kubeflow_sdk_test.go
🧰 Additional context used
🧬 Code graph analysis (3)
tests/trainer/sdk_tests/fashion_mnist_tests.go (11)
tests/common/support/test.go (2)
  • T (90-102)
  • With (61-63)
tests/trainer/utils/utils_cluster_prep.go (1)
  • EnsureNotebookServiceAccount (31-39)
tests/common/environment.go (2)
  • GetNotebookUserName (70-76)
  • GetNotebookUserToken (78-84)
tests/common/support/rbac.go (1)
  • CreateUserRoleBindingWithClusterRole (206-238)
tests/common/support/core.go (4)
  • CreateConfigMap (33-54)
  • CreatePersistentVolumeClaim (242-276)
  • AccessModes (235-240)
  • StorageClassName (228-233)
tests/common/support/environment.go (5)
  • GetStorageBucketDefaultEndpoint (152-155)
  • GetStorageBucketAccessKeyId (162-165)
  • GetStorageBucketSecretKey (167-170)
  • GetStorageBucketName (172-175)
  • GetStorageBucketMnistDir (177-180)
tests/common/support/storage.go (1)
  • GetRWXStorageClass (23-36)
tests/common/support/config.go (1)
  • GetOpenShiftApiUrl (44-56)
tests/common/notebook.go (4)
  • CreateNotebook (104-185)
  • ContainerSizeSmall (75-75)
  • DeleteNotebook (187-190)
  • Notebooks (192-204)
tests/common/support/support.go (2)
  • TestTimeoutLong (35-35)
  • TestTimeoutDouble (36-36)
tests/trainer/utils/utils_notebook.go (2)
  • WaitForNotebookPodRunning (42-50)
  • PollNotebookLogsForStatus (53-72)
tests/trainer/utils/utils_notebook.go (4)
tests/common/support/test.go (1)
  • Test (34-45)
tests/common/notebook.go (3)
  • ContainerSize (72-72)
  • CreateNotebook (104-185)
  • NOTEBOOK_CONTAINER_NAME (37-37)
tests/common/support/core.go (4)
  • CreateConfigMap (33-54)
  • CreatePersistentVolumeClaim (242-276)
  • AccessModes (235-240)
  • GetPods (111-116)
tests/common/support/support.go (1)
  • TestTimeoutLong (35-35)
tests/trainer/utils/utils_cluster_prep.go (3)
tests/common/support/test.go (2)
  • T (90-102)
  • Test (34-45)
tests/common/support/core.go (2)
  • ServiceAccount (194-200)
  • ServiceAccounts (207-219)
tests/common/support/client.go (1)
  • Client (39-50)
🪛 Ruff (0.14.5)
tests/trainer/resources/mnist.ipynb

62-62: Avoid specifying long messages outside the exception class

(TRY003)


70-70: Avoid specifying long messages outside the exception class

(TRY003)


76-76: Unpacked variable num is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


78-78: Avoid specifying long messages outside the exception class

(TRY003)


87-87: Avoid specifying long messages outside the exception class

(TRY003)


190-190: Do not catch blind exception: Exception

(BLE001)


207-207: Do not catch blind exception: Exception

(BLE001)


298-298: Do not catch blind exception: Exception

(BLE001)


303-305: try-except-pass detected, consider logging the exception

(S110)


303-303: Do not catch blind exception: Exception

(BLE001)


311-311: Do not catch blind exception: Exception

(BLE001)


348-348: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.

(S310)


358-358: Do not catch blind exception: Exception

(BLE001)


392-392: Do not catch blind exception: Exception

(BLE001)


398-398: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


398-398: Avoid specifying long messages outside the exception class

(TRY003)

Comment thread tests/trainer/sdk_tests/fashion_mnist_tests.go Outdated
@briangallagher
Contributor

@MStokluska Instead of copying the notebook, would it be possible to reference it, in a similar way to what we do for PR testing here?

@MStokluska
Contributor Author

@MStokluska Instead of copying the notebook, would it be possible to reference it, in a similar way to what we do for PR testing here?

With the amount of edits required, not really. For example, the shared PV between the notebook and trainer workers is needed because of how we download the dataset to prepare isolated envs (and there are more subtle differences...)

Comment thread tests/trainer/resources/mnist.ipynb Outdated
" print(\"[notebook] Internet download completed successfully\")\n",
" except Exception as e:\n",
" print(f\"[notebook] Internet download failed: {e}\")\n",
" print(\"[notebook] WARNING: Dataset may be incomplete!\")\n",
Contributor
I would expect the test to fail in this case, WDYT?
It doesn't make sense to proceed without the dataset.

Contributor Author

True, will fix it up. Thanks Karel

Comment thread tests/trainer/resources/mnist.ipynb Outdated
" torch_runtime = client.get_runtime(\"torch-distributed\")\n",
"except Exception:\n",
" torch_runtime = next(\n",
" (r for r in client.list_runtimes() if r.name == \"torch-distributed\"),\n",
Contributor
What is the purpose of this?
I would expect that if client.get_runtime(\"torch-distributed\") fails then torch-distributed doesn't exist.

Contributor

@sutaakar sutaakar Nov 19, 2025


Thinking whether it would make sense to pass the runtime as a parameter/env variable to the notebook.
Will this notebook run for both CUDA and ROCm runtimes?

Edit - the notebook doesn't use GPU, so no reason to check specific runtimes.

Contributor Author

So the expectation here is that "torch-distributed" will be available.
The additional check is just in case there's something wrong with get_runtime vs list_runtimes, but now that you mention it, I think it makes more sense to just call get_runtime and fail if it's not there.
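The fail-fast version discussed above could look like this: a thin wrapper around get_runtime that surfaces a clear error instead of silently falling back to list_runtimes (the wrapper name is hypothetical; TrainerClient.get_runtime is the SDK call the notebook already uses):

```python
def require_runtime(client, name: str = "torch-distributed"):
    """Return the named runtime, failing fast with a clear message
    instead of masking the problem with a list_runtimes fallback."""
    try:
        return client.get_runtime(name)
    except Exception as e:
        raise RuntimeError(f"required training runtime {name!r} not found") from e
```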

Comment thread tests/trainer/utils/utils_notebook.go Outdated
func PollNotebookLogsForStatus(test Test, namespace, podName, containerName string, timeout time.Duration) error {
	var finalErr error
	test.Eventually(func() bool {
		out, err := exec.Command("kubectl", "-n", namespace, "logs", podName, "-c", containerName, "--tail=2000").CombinedOutput()
Contributor Author

Awesome, thank you.

@MStokluska
Contributor Author

@sutaakar I think all comments are addressed, let me know if you think there's anything else that I should improve. Thanks!

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (1)
tests/trainer/sdk_tests/fashion_mnist_tests.go (1)

112-114: Refactor to idiomatic Gomega assertion.

The current pattern is non-idiomatic: checking if err != nil and then expecting it To(Succeed()) will always fail when err is non-nil.

Apply this refactor for clarity:

-	if err := trainerutils.PollNotebookLogsForStatus(test, namespace.Name, podName, containerName, support.TestTimeoutDouble); err != nil {
-		test.Expect(err).To(Succeed(), "Notebook execution reported FAILURE")
-	}
+	err := trainerutils.PollNotebookLogsForStatus(test, namespace.Name, podName, containerName, support.TestTimeoutDouble)
+	test.Expect(err).NotTo(HaveOccurred(), "Notebook execution reported FAILURE")
🧹 Nitpick comments (1)
tests/trainer/resources/mnist.ipynb (1)

401-421: Consider pinning the Kubeflow Trainer SDK version.

For test reproducibility, you may want to pin the kubeflow-trainer package version in the notebook environment or shell command (similar to how boto3==1.34.162 is pinned on line 90 of fashion_mnist_tests.go). This ensures future SDK changes don't unexpectedly break the test.

If testing against the latest SDK version is intentional, you can safely ignore this suggestion.

📜 Review details

📥 Commits

Reviewing files that changed from the base of the PR and between 5f22561 and 8ae0dee.

📒 Files selected for processing (6)
  • .gitignore (1 hunks)
  • tests/trainer/kubeflow_sdk_test.go (1 hunks)
  • tests/trainer/resources/mnist.ipynb (1 hunks)
  • tests/trainer/sdk_tests/fashion_mnist_tests.go (1 hunks)
  • tests/trainer/utils/utils_cluster_prep.go (1 hunks)
  • tests/trainer/utils/utils_notebook.go (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
  • .gitignore
  • tests/trainer/kubeflow_sdk_test.go
  • tests/trainer/utils/utils_notebook.go
🧰 Additional context used
🧬 Code graph analysis (2)
tests/trainer/sdk_tests/fashion_mnist_tests.go (11)
tests/common/support/test.go (2)
  • T (90-102)
  • With (61-63)
tests/trainer/utils/utils_cluster_prep.go (1)
  • EnsureNotebookServiceAccount (31-39)
tests/common/environment.go (2)
  • GetNotebookUserName (70-76)
  • GetNotebookUserToken (78-84)
tests/common/support/rbac.go (1)
  • CreateUserRoleBindingWithClusterRole (206-238)
tests/common/support/core.go (4)
  • CreateConfigMap (33-54)
  • CreatePersistentVolumeClaim (242-276)
  • AccessModes (235-240)
  • StorageClassName (228-233)
tests/common/support/environment.go (5)
  • GetStorageBucketDefaultEndpoint (152-155)
  • GetStorageBucketAccessKeyId (162-165)
  • GetStorageBucketSecretKey (167-170)
  • GetStorageBucketName (172-175)
  • GetStorageBucketMnistDir (177-180)
tests/common/support/storage.go (1)
  • GetRWXStorageClass (23-36)
tests/common/support/config.go (1)
  • GetOpenShiftApiUrl (44-56)
tests/common/notebook.go (4)
  • CreateNotebook (104-185)
  • ContainerSizeSmall (75-75)
  • DeleteNotebook (187-190)
  • Notebooks (192-204)
tests/common/support/support.go (2)
  • TestTimeoutLong (35-35)
  • TestTimeoutDouble (36-36)
tests/trainer/utils/utils_notebook.go (2)
  • WaitForNotebookPodRunning (41-49)
  • PollNotebookLogsForStatus (52-81)
tests/trainer/utils/utils_cluster_prep.go (3)
tests/common/support/test.go (2)
  • T (90-102)
  • Test (34-45)
tests/common/support/core.go (2)
  • ServiceAccount (194-200)
  • ServiceAccounts (207-219)
tests/common/support/client.go (1)
  • Client (40-52)
🪛 Ruff (0.14.5)
tests/trainer/resources/mnist.ipynb

62-62: Avoid specifying long messages outside the exception class

(TRY003)


70-70: Avoid specifying long messages outside the exception class

(TRY003)


76-76: Unpacked variable num is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


78-78: Avoid specifying long messages outside the exception class

(TRY003)


87-87: Avoid specifying long messages outside the exception class

(TRY003)


190-190: Do not catch blind exception: Exception

(BLE001)


207-207: Do not catch blind exception: Exception

(BLE001)


298-298: Do not catch blind exception: Exception

(BLE001)


303-305: try-except-pass detected, consider logging the exception

(S110)


303-303: Do not catch blind exception: Exception

(BLE001)


311-311: Do not catch blind exception: Exception

(BLE001)


348-348: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.

(S310)


358-358: Do not catch blind exception: Exception

(BLE001)


360-360: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


360-360: Avoid specifying long messages outside the exception class

(TRY003)


393-393: Avoid specifying long messages outside the exception class

(TRY003)

🔇 Additional comments (1)
tests/trainer/utils/utils_cluster_prep.go (1)

29-39: LGTM! Clean helper implementation.

The function correctly ensures the ServiceAccount exists, properly handles AlreadyExists errors, and fails with a clear message on genuine creation failures. The implementation is straightforward and follows Go testing best practices.

Comment on lines +112 to +114
if err := trainerutils.PollNotebookLogsForStatus(test, namespace.Name, podName, containerName, support.TestTimeoutDouble); err != nil {
	test.Expect(err).To(Succeed(), "Notebook execution reported FAILURE")
}
Contributor
Suggested change
-	if err := trainerutils.PollNotebookLogsForStatus(test, namespace.Name, podName, containerName, support.TestTimeoutDouble); err != nil {
-		test.Expect(err).To(Succeed(), "Notebook execution reported FAILURE")
-	}
+	err = trainerutils.PollNotebookLogsForStatus(test, namespace.Name, podName, containerName, support.TestTimeoutDouble)
+	test.Expect(err).ShouldNot(HaveOccurred(), "Notebook execution reported FAILURE")

Comment thread tests/trainer/kubeflow_sdk_test.go Outdated
sdktests "github.com/opendatahub-io/distributed-workloads/tests/trainer/sdk_tests"
)

func TestKubeflowSDK_Sanity(t *testing.T) {
Contributor
Suggested change
-func TestKubeflowSDK_Sanity(t *testing.T) {
+func TestKubeflowSdkSanity(t *testing.T) {

@sutaakar
Contributor

@MStokluska I ran the test using NOTEBOOK_IMAGE=quay.io/bgallagher/universal-image:v2 and got error:

ModuleNotFoundError                       Traceback (most recent call last)
Cell In[3], line 5
      3 from kubernetes import client as k8s_client
      4 from kubeflow.trainer import TrainerClient
----> 5 from kubeflow.common.types import KubernetesBackendConfig
      7 openshift_api_url = os.getenv("OPENSHIFT_API_URL", "")
      8 token = os.getenv("NOTEBOOK_TOKEN", "")

ModuleNotFoundError: No module named 'kubeflow.common'

NOTEBOOK_STATUS: FAILURE

@MStokluska
Contributor Author

MStokluska commented Nov 24, 2025

@MStokluska I ran the test using NOTEBOOK_IMAGE=quay.io/bgallagher/universal-image:v2 and got error:

ModuleNotFoundError                       Traceback (most recent call last)
Cell In[3], line 5
      3 from kubernetes import client as k8s_client
      4 from kubeflow.trainer import TrainerClient
----> 5 from kubeflow.common.types import KubernetesBackendConfig
      7 openshift_api_url = os.getenv("OPENSHIFT_API_URL", "")
      8 token = os.getenv("NOTEBOOK_TOKEN", "")

ModuleNotFoundError: No module named 'kubeflow.common'

NOTEBOOK_STATUS: FAILURE

@sutaakar please run it with: quay.io/opendatahub/odh-training-th03-cuda128-torch28-py312-rhel9:odh-training-th03-157dbcf42c0dee17933360e343bfafe6-build-images or :latest

Brian's image might be outdated.

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (3)
tests/trainer/resources/mnist.ipynb (3)

17-155: Distributed training function and IDX dataset reader look good; minor cleanups possible

The CPU‑only DDP setup, strict PVC requirement at /mnt/shared/FashionMNIST/raw, and IDX dataset handling all look consistent with the e2e test design and the shared PVC contract.

Two small nits:

  • In _read_idx_labels, the unpacked num header value is never used; consider renaming it to _/_num or adding a simple integrity check using it to satisfy linters.
  • The imports from urllib.parse import urlparse and import gzip, shutil in this cell are now unused and can be dropped to keep the cell minimal.

163-379: S3/HTTP dataset provisioning flow is reasonable; exception handling could be slightly tightened

The dataset prep logic—writing into FASHION_RAW_DIR on the notebook’s PVC, streaming S3/MinIO downloads with short timeouts, decompressing .gz files, and finally falling back to the Fashion‑MNIST HTTP endpoint only if needed—provides a good balance between robustness and test friendliness. The final RuntimeError("Internet download failed; aborting test") ensures the test fails rather than training without data, which aligns with earlier feedback.

Given the static analysis hints, the only tweaks you might consider (non‑blocking):

  • Several except Exception as e handlers (S3 fetch, pagination, decompression, HTTP download) are intentionally broad. For clarity and to quiet BLE001/S110, you could either narrow them to the specific client/library error types you expect or at least log type(e)/repr(e) before returning/raising, so failures are more diagnosable.
  • Where you intentionally swallow errors (e.g., failing to os.remove(dst) after successful decompression), a brief comment as you already have (“Not critical...”) is enough; no change needed there.

Overall, the behavior is fine for test code; these are just lint‑driven niceties.
Based on static analysis hints.


399-556: Kubeflow SDK integration and job lifecycle handling align with the test’s intent; consider future assertions

The SDK client setup with OPENSHIFT_API_URL/NOTEBOOK_TOKEN, runtime lookup for "torch-distributed" via client.get_runtime(...), and submission of a 2‑node CustomTrainer job with the RWX PVC mounted at /mnt/shared all look coherent with the notebook’s train_fashion_mnist() expectations.

Job lifecycle cells (wait for Running/Complete, print step summaries, stream logs, then delete_job) are appropriate for a sanity‑level e2e. If you later want stronger validation, a possible follow‑up would be to assert on step/job statuses in the notebook itself (rather than only printing) to make failures more explicit, but that’s optional given the current log‑driven Go test harness.
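If the notebook later asserts on statuses rather than only printing them, a small gate could do it (hypothetical helper; the step-name/status pairs and the "Succeeded" status string are assumptions about what the SDK reports):

```python
def assert_all_steps_succeeded(steps):
    """steps: iterable of (name, status) pairs from the training job.
    Raise AssertionError listing any step that did not succeed."""
    failed = [name for name, status in steps if status != "Succeeded"]
    if failed:
        raise AssertionError(f"training steps did not succeed: {failed}")
```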

📜 Review details

📥 Commits

Reviewing files that changed from the base of the PR and between 8ae0dee and 098deb6.

📒 Files selected for processing (3)
  • tests/trainer/kubeflow_sdk_test.go (1 hunks)
  • tests/trainer/resources/mnist.ipynb (1 hunks)
  • tests/trainer/sdk_tests/fashion_mnist_tests.go (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/trainer/kubeflow_sdk_test.go
🧰 Additional context used
🪛 Ruff (0.14.5)
tests/trainer/resources/mnist.ipynb

62-62: Avoid specifying long messages outside the exception class

(TRY003)


70-70: Avoid specifying long messages outside the exception class

(TRY003)


76-76: Unpacked variable num is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


78-78: Avoid specifying long messages outside the exception class

(TRY003)


87-87: Avoid specifying long messages outside the exception class

(TRY003)


190-190: Do not catch blind exception: Exception

(BLE001)


207-207: Do not catch blind exception: Exception

(BLE001)


298-298: Do not catch blind exception: Exception

(BLE001)


303-305: try-except-pass detected, consider logging the exception

(S110)


303-303: Do not catch blind exception: Exception

(BLE001)


311-311: Do not catch blind exception: Exception

(BLE001)


348-348: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.

(S310)


358-358: Do not catch blind exception: Exception

(BLE001)


360-360: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


360-360: Avoid specifying long messages outside the exception class

(TRY003)


393-393: Avoid specifying long messages outside the exception class

(TRY003)

Comment on lines +33 to +115
const (
notebookName = "mnist.ipynb"
notebookPath = "resources/" + notebookName
)

// CPU Only - Distributed Training
func RunFashionMnistCpuDistributedTraining(t *testing.T) {
test := support.With(t)

// Create a new test namespace
namespace := test.NewTestNamespace()

// Ensure Notebook ServiceAccount exists (no extra RBAC)
trainerutils.EnsureNotebookServiceAccount(t, test, namespace.Name)

// RBACs setup
userName := common.GetNotebookUserName(test)
userToken := common.GetNotebookUserToken(test)
support.CreateUserRoleBindingWithClusterRole(test, userName, namespace.Name, "admin")

// Create ConfigMap with notebook
localPath := notebookPath
nb, err := os.ReadFile(localPath)
test.Expect(err).NotTo(HaveOccurred(), fmt.Sprintf("failed to read notebook: %s", localPath))
cm := support.CreateConfigMap(test, namespace.Name, map[string][]byte{notebookName: nb})

// Build command with parameters and pinned deps, and print definitive status line to logs
endpoint, endpointOK := support.GetStorageBucketDefaultEndpoint()
accessKey, _ := support.GetStorageBucketAccessKeyId()
secretKey, _ := support.GetStorageBucketSecretKey()
bucket, bucketOK := support.GetStorageBucketName()
prefix, _ := support.GetStorageBucketMnistDir()
if !endpointOK {
endpoint = ""
}
if !bucketOK {
bucket = ""
}
// Create RWX PVC for shared dataset and pass the claim name to the notebook
storageClass, err := support.GetRWXStorageClass(test)
test.Expect(err).NotTo(HaveOccurred(), "Failed to find an RWX supporting StorageClass")
rwxPvc := support.CreatePersistentVolumeClaim(
test,
namespace.Name,
"20Gi",
support.AccessModes(corev1.ReadWriteMany),
support.StorageClassName(storageClass.Name),
)

shellCmd := fmt.Sprintf(
"set -e; "+
"export OPENSHIFT_API_URL='%s'; export NOTEBOOK_TOKEN='%s'; "+
"export NOTEBOOK_NAMESPACE='%s'; "+
"export SHARED_PVC_NAME='%s'; "+
"export AWS_DEFAULT_ENDPOINT='%s'; export AWS_ACCESS_KEY_ID='%s'; "+
"export AWS_SECRET_ACCESS_KEY='%s'; export AWS_STORAGE_BUCKET='%s'; "+
"export AWS_STORAGE_BUCKET_MNIST_DIR='%s'; "+
"python -m pip install --quiet --no-cache-dir papermill boto3==1.34.162 && "+
"if python -m papermill -k python3 /opt/app-root/notebooks/%s /opt/app-root/src/out.ipynb --log-output; "+
"then echo 'NOTEBOOK_STATUS: SUCCESS'; else echo 'NOTEBOOK_STATUS: FAILURE'; fi; sleep infinity",
support.GetOpenShiftApiUrl(test), userToken, namespace.Name, rwxPvc.Name,
endpoint, accessKey, secretKey, bucket, prefix,
notebookName,
)
command := []string{"/bin/sh", "-c", shellCmd}

// Create Notebook CR using the RWX PVC
common.CreateNotebook(test, namespace, userToken, command, cm.Name, notebookName, 0, rwxPvc, common.ContainerSizeSmall)

// Cleanup
defer func() {
common.DeleteNotebook(test, namespace)
test.Eventually(common.Notebooks(test, namespace), support.TestTimeoutLong).Should(HaveLen(0))
}()

// Wait for the Notebook Pod and get pod/container names
podName, containerName := trainerutils.WaitForNotebookPodRunning(test, namespace.Name)

// Poll logs to check if the notebook execution completed successfully
err = trainerutils.PollNotebookLogsForStatus(test, namespace.Name, podName, containerName, support.TestTimeoutDouble)
test.Expect(err).ShouldNot(HaveOccurred(), "Notebook execution reported FAILURE")

}

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

End‑to‑end test wiring looks good; please verify notebook path and consider a couple of small refinements

The overall flow of RunFashionMnistCpuDistributedTraining is solid: new namespace, Notebook SA and RBAC, RWX PVC for shared data, ConfigMap‑backed notebook, papermill execution with NOTEBOOK_STATUS markers, and cleanup with a final assertion that no notebooks remain.

A few details to double‑check/tidy up:

  1. Notebook path resolution

    notebookPath is defined as:

    const (
    	notebookName = "mnist.ipynb"
    	notebookPath = "resources/" + notebookName
    )

    while the notebook added in this PR lives at tests/trainer/resources/mnist.ipynb. If there isn’t also a tests/trainer/sdk_tests/resources/mnist.ipynb (or symlink) in the repo, running the tests from the sdk_tests package will fail with “failed to read notebook: resources/mnist.ipynb”. Please confirm the actual layout and either adjust notebookPath (e.g. "../resources/"+notebookName) or move/link the file so the relative path is correct.

  2. SHARED_PVC_NAME env

    You export SHARED_PVC_NAME into the notebook container, but the current mnist.ipynb doesn’t read this variable (it uses fixed mount points /opt/app-root/src and /mnt/shared). That env var is therefore redundant right now; feel free to drop it unless you plan to consume it in the notebook later.

  3. Shell command robustness

    The shellCmd uses single‑quoted export VAR='%s' substitutions. If any of the substituted values ever contains a single quote, the shell will misparse the command. It’s unlikely for most of these (endpoint, bucket, access key), but not impossible for tokens. Longer term, it might be more robust to:

    • Pass these values via the Notebook CR’s env fields instead of through the shell, or
    • Add proper shell escaping around the interpolated values.

    Not urgent, but worth considering if this test is run in more varied environments.
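Until the values move onto the Notebook CR's env fields, the standard POSIX trick is to close the quote, emit an escaped quote, and reopen, i.e. replace every ' with '\''. A sketch in Python (the Go version would use strings.ReplaceAll; the helper name is hypothetical):

```python
def shell_single_quote(value: str) -> str:
    """Quote value for safe interpolation into a POSIX shell command,
    escaping embedded single quotes as '\\'' (close, escape, reopen)."""
    return "'" + value.replace("'", "'\\''") + "'"

# A token containing a single quote no longer breaks the export line:
print("export NOTEBOOK_TOKEN=" + shell_single_quote("abc'def"))
# → export NOTEBOOK_TOKEN='abc'\''def'
```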


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "Searching for mnist.ipynb locations under the repo root..."
fd -a 'mnist.ipynb' .

echo
echo "Checking path resolution from tests/trainer/sdk_tests..."
cd tests/trainer/sdk_tests

if [[ -f resources/mnist.ipynb ]]; then
  echo "OK: resources/mnist.ipynb exists relative to sdk_tests (notebookPath is consistent)."
else
  echo "WARNING: resources/mnist.ipynb not found relative to sdk_tests."
  echo "RunFashionMnistCpuDistributedTraining may fail to read the notebook; consider adjusting notebookPath (e.g., \"../resources/mnist.ipynb\") or moving/linking the file."
fi

Length of output: 399


Fix notebook path resolution to unblock test execution

Verification confirms the notebook path issue is blocking. The only mnist.ipynb in the repo is at tests/trainer/resources/mnist.ipynb, but notebookPath resolves to resources/mnist.ipynb relative to the test package directory (tests/trainer/sdk_tests/). This will cause os.ReadFile(localPath) to fail.

Required fix:

  • Adjust notebookPath to "../resources/" + notebookName (line ~35), or
  • Move/symlink the notebook to tests/trainer/sdk_tests/resources/mnist.ipynb.

Minor observations (non-blocking):

  • SHARED_PVC_NAME is exported but unused by mnist.ipynb; drop if not planned for later.
  • Shell command uses single-quoted string interpolation; consider env field injection on the Notebook CR to avoid quote-escaping issues.
🤖 Prompt for AI Agents
In tests/trainer/sdk_tests/fashion_mnist_tests.go around lines 33 to 115, the
notebookPath is wrong (it points to "resources/mnist.ipynb" relative to
tests/trainer/sdk_tests/) which causes os.ReadFile to fail; update the path to
refer to the actual file location by changing notebookPath to
"../resources/"+notebookName (or alternatively move/symlink the notebook into
tests/trainer/sdk_tests/resources/), then run the test to confirm os.ReadFile
succeeds.

Contributor

@sutaakar sutaakar left a comment


lgtm, thanks

@openshift-ci

openshift-ci Bot commented Nov 25, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: sutaakar

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details: Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-bot openshift-merge-bot Bot merged commit 39f5ccd into opendatahub-io:main Nov 25, 2025
5 checks passed