opendatahub-io · h0pers · Jun 25, 2026 · Jul 2, 2026 · coderabbitai · Jun 25, 2026
diff --git a/.claude/settings.json b/.claude/settings.json
@@ -0,0 +1,15 @@
+{
+  "hooks": {
+    "PostToolUse": [
+      {
+        "matcher": "Edit|Write",
+        "hooks": [
+          {
+            "type": "command",
+            "command": "if [[ \"$TOOL_INPUT\" == *\".go\"* ]]; then ./bin/openshift-goimports 2>/dev/null; fi"
-        "matcher": "Edit|Write",
-        "hooks": [
-          {
-            "type": "command",
-            "command": "if [[ \"$TOOL_INPUT\" == *\".go\"* ]]; then ./bin/openshift-goimports 2>/dev/null; fi"
-        "matcher": "Edit|Write",
-        "hooks": [
-          {
-            "type": "command",
-            "command": "if [[ \"$TOOL_INPUT\" == *\".go\"* ]]; then ./bin/openshift-goimports 2>/dev/null; fi"
+          }
+        ]
+      }
+    ]
+  }
+}
diff --git a/.claude/skills/add-e2e-test/SKILL.md b/.claude/skills/add-e2e-test/SKILL.md
@@ -0,0 +1,106 @@
+# Add E2E Test
+
+Guide for adding a new end-to-end test to the distributed-workloads repo.
+
+## Test structure
+
+```go
+func TestMyFeature(t *testing.T) {
+    Tags(t, Tier1)         // 1. tag / skip checks
+    test := With(t)        // 2. create test context
+
+    namespace := test.NewTestNamespace().Name  // 3. isolated namespace
+
+    // 4. create resources with GenerateName
+    // 5. ensure cleanup of cluster-scoped resources
+    // 6. assert with test.Eventually(...)
+}
+```
+
+## Namespace isolation
+
+Every test must operate in its own dedicated namespace. Use `test.NewTestNamespace()` — it creates a uniquely named namespace and registers automatic cleanup (log collection + deletion) via `t.Cleanup`:
+
+```go
+namespace := test.NewTestNamespace().Name
+```
+
+Never use a fixed namespace name unless driven by an env var for a specific scenario (e.g., pre-upgrade/post-upgrade tests). Shared namespaces cause interference between tests.
+
+## Resource naming
+
+All Kubernetes resources must use `GenerateName` instead of a fixed `Name` to avoid collisions:
+
+```go
+// Good
+ObjectMeta: metav1.ObjectMeta{GenerateName: "test-trainjob-"}
+
+// Bad
+ObjectMeta: metav1.ObjectMeta{Name: "my-trainjob"}
+```
+
+## Cleanup
+
+Namespace-scoped resources are deleted automatically when the test namespace is cleaned up. Cluster-scoped resources (e.g., `ClusterRole`, `ClusterRoleBinding`) are not namespace-bound and may need to be explicitly cleaned up if the helper creating them does not already register a cleanup hook via `t.T().Cleanup(...)`.
+
+## Tags
+
+Tests in `tests/trainer/` **must** declare a tag — this is mandatory. Apply it as the first statement so tests are skipped early when `TEST_TIER` is set:
+
+| Tag | When to use |
+|-----|-------------|
+| `Smoke` | Minimal deployment verification |
+| `Tier1`–`Tier3` | Progressively deeper coverage |
+| `Gpu(accelerator)` | Requires at least one GPU node |
+| `MultiGpu(accelerator, n)` | Requires n GPUs per node |
+| `MultiNode(n)` | Requires n worker nodes |
+| `MultiNodeGpu(n, accelerator)` | Requires n nodes each with at least one GPU |
+| `MultiNodeMultiGpu(n, accelerator, gpus)` | Requires n nodes each with at least gpus GPUs |
+
-Tests in `tests/trainer/` **must** declare a tag — this is mandatory. Apply it as the first statement so tests are skipped early when `TEST_TIER` is set:
-
-| Tag | When to use |
-|-----|-------------|
-| `Smoke` | Minimal deployment verification |
-| `Tier1`–`Tier3` | Progressively deeper coverage |
-| `Gpu(accelerator)` | Requires at least one GPU node |
-| `MultiGpu(accelerator, n)` | Requires n GPUs per node |
-| `MultiNode(n)` | Requires n worker nodes |
-| `MultiNodeGpu(n, accelerator)` | Requires n nodes each with at least one GPU |
-| `MultiNodeMultiGpu(n, accelerator, gpus)` | Requires n nodes each with at least gpus GPUs |
+All E2E tests **must** declare a tag — this is mandatory. Apply it as the first statement so tests are skipped early when `TEST_TIER` is set:
+
+| Tag | When to use |
+|-----|-------------|
+| `Smoke` | Minimal deployment verification |
+| `Tier1`–`Tier3` | Progressively deeper coverage |
+| `Gpu(accelerator)` | Requires at least one GPU node |
+| `MultiGpu(accelerator, n)` | Requires n GPUs per node |
+| `MultiNode(n)` | Requires n worker nodes |
+| `MultiNodeGpu(n, accelerator)` | Requires n nodes each with at least one GPU |
+| `MultiNodeMultiGpu(n, accelerator, gpus)` | Requires n nodes each with at least gpus GPUs |
-Tests in `tests/trainer/` **must** declare a tag — this is mandatory. Apply it as the first statement so tests are skipped early when `TEST_TIER` is set:
-
-| Tag | When to use |
-|-----|-------------|
-| `Smoke` | Minimal deployment verification |
-| `Tier1`–`Tier3` | Progressively deeper coverage |
-| `Gpu(accelerator)` | Requires at least one GPU node |
-| `MultiGpu(accelerator, n)` | Requires n GPUs per node |
-| `MultiNode(n)` | Requires n worker nodes |
-| `MultiNodeGpu(n, accelerator)` | Requires n nodes each with at least one GPU |
-| `MultiNodeMultiGpu(n, accelerator, gpus)` | Requires n nodes each with at least gpus GPUs |
+All E2E tests **must** declare a tag — this is mandatory. Apply it as the first statement so tests are skipped early when `TEST_TIER` is set:
+
+| Tag | When to use |
+|-----|-------------|
+| `Smoke` | Minimal deployment verification |
+| `Tier1`–`Tier3` | Progressively deeper coverage |
+| `Gpu(accelerator)` | Requires at least one GPU node |
+| `MultiGpu(accelerator, n)` | Requires n GPUs per node |
+| `MultiNode(n)` | Requires n worker nodes |
+| `MultiNodeGpu(n, accelerator)` | Requires n nodes each with at least one GPU |
+| `MultiNodeMultiGpu(n, accelerator, gpus)` | Requires n nodes each with at least gpus GPUs |
+## Environment variables
+
+Declare env var constants and getter functions in `tests/common/support/environment.go`. Never use `os.Getenv` directly in test files — always go through a getter.
+
+## Editing notebooks
+
+Test notebooks (`tests/**/resources/*.ipynb`) use 1-space JSON indentation with no trailing newline. When editing notebook cells, preserve the array-of-lines source format — do not collapse source arrays into single strings:
+
+```json
+// Good — array of lines, readable in raw JSON
+"source": [
+ "import os\n",
+ "print('hello')"
+]
+
+// Bad — single string, hard to read in raw JSON
+"source": "import os\nprint('hello')"
+```
+
+If a tool (e.g. `NotebookEdit`) converts the edited cell's source to a single string, convert it back to array-of-lines before committing. You can use a Python script:
+
+```python
+import json
+with open(path, encoding="utf-8") as f:
+    nb = json.load(f)
+for cell in nb["cells"]:
+    if isinstance(cell["source"], str):
+        cell["source"] = cell["source"].splitlines(True)
+        # Ensure last line has no trailing newline (notebook convention)
+        if cell["source"] and cell["source"][-1].endswith("\n"):
+            cell["source"][-1] = cell["source"][-1][:-1]
+with open(path, "w", encoding="utf-8") as f:
+    json.dump(nb, f, indent=1, ensure_ascii=False)
+```
+
+## Key support library files
+
+| File | Purpose |
+|------|---------|
+| `tests/common/support/test.go` | `Test` interface — context, namespace helpers, gomega assertions |
+| `tests/common/support/client.go` | Multi-client accessor (Kubernetes, Trainer, Kubeflow, Ray, Kueue, JobSet) |
+| `tests/common/support/pytorchjob.go` | PyTorchJob getters and condition checkers |
+| `tests/common/support/trainjob.go` | TrainJob getters and condition checkers |
+| `tests/common/support/ray.go` | RayJob/RayCluster helpers |
+| `tests/common/support/kueue.go` | Kueue resource helpers (ResourceFlavor, ClusterQueue, LocalQueue) |
+| `tests/common/support/environment.go` | Environment variable getters |
+| `tests/common/test_tag.go` | Tag functions (Smoke, Tier1–3, Gpu, MultiNode, etc.) |
diff --git a/AGENTS.md b/AGENTS.md
@@ -66,95 +66,7 @@ pre-commit run --files path/to/file.py                    # Run all hooks on a s
 
 ## Writing Tests
 
-### Namespace isolation
-
-Every test must operate in its own dedicated namespace. Use `test.NewTestNamespace()` — it creates a uniquely named namespace and registers automatic cleanup (log collection + deletion) via `t.Cleanup`:
-
-```go
-namespace := test.NewTestNamespace().Name
-```
-
-Never use a fixed namespace name unless driven by an env var for a specific scenario (e.g., pre-upgrade/post-upgrade tests). Shared namespaces cause interference between tests.
-
-### Resource naming
-
-All Kubernetes resources must use `GenerateName` instead of a fixed `Name` to avoid collisions:
-
-```go
-// Good
-ObjectMeta: metav1.ObjectMeta{GenerateName: "test-trainjob-"}
-
-// Bad
-ObjectMeta: metav1.ObjectMeta{Name: "my-trainjob"}
-```
-
-### Cleanup
-
-Namespace-scoped resources are deleted automatically when the test namespace is cleaned up. Cluster-scoped resources (e.g., `ClusterRole`, `ClusterRoleBinding`) are not namespace-bound and may need to be explicitly cleaned up if the helper creating them does not already register a cleanup hook via `t.T().Cleanup(...)`.
-
-### Test structure
-
-```go
-func TestMyFeature(t *testing.T) {
-    Tags(t, Tier1)         // 1. tag / skip checks
-    test := With(t)        // 2. create test context
-
-    namespace := test.NewTestNamespace().Name  // 3. isolated namespace
-
-    // 4. create resources with GenerateName
-    // 5. ensure cleanup of cluster-scoped resources
-    // 6. assert with test.Eventually(...)
-}
-```
-
-### Editing notebooks
-
-Test notebooks (`tests/**/resources/*.ipynb`) use 1-space JSON indentation with no trailing newline. When editing notebook cells, preserve the array-of-lines source format — do not collapse source arrays into single strings:
-
-```json
-// Good — array of lines, readable in raw JSON
-"source": [
- "import os\n",
- "print('hello')"
-]
-
-// Bad — single string, hard to read in raw JSON
-"source": "import os\nprint('hello')"
-```
-
-If a tool (e.g. `NotebookEdit`) converts the edited cell's source to a single string, convert it back to array-of-lines before committing. You can use a Python script:
-
-```python
-import json
-with open(path, encoding="utf-8") as f:
-    nb = json.load(f)
-for cell in nb["cells"]:
-    if isinstance(cell["source"], str):
-        cell["source"] = cell["source"].splitlines(True)
-        # Ensure last line has no trailing newline (notebook convention)
-        if cell["source"] and cell["source"][-1].endswith("\n"):
-            cell["source"][-1] = cell["source"][-1][:-1]
-with open(path, "w", encoding="utf-8") as f:
-    json.dump(nb, f, indent=1, ensure_ascii=False)
-```
-
-### Environment variables
-
-Declare env var constants and getter functions in `tests/common/support/environment.go`. Never use `os.Getenv` directly in test files — always go through a getter.
-
-### Tags
-
-Tests in `tests/trainer/` **must** declare a tag — this is mandatory. Apply it as the first statement so tests are skipped early when `TEST_TIER` is set:
-
-| Tag | When to use |
-|-----|-------------|
-| `Smoke` | Minimal deployment verification |
-| `Tier1`–`Tier3` | Progressively deeper coverage |
-| `Gpu(accelerator)` | Requires at least one GPU node |
-| `MultiGpu(accelerator, n)` | Requires n GPUs per node |
-| `MultiNode(n)` | Requires n worker nodes |
-| `MultiNodeGpu(n, accelerator)` | Requires n nodes each with at least one GPU |
-| `MultiNodeMultiGpu(n, accelerator, gpus)` | Requires n nodes each with at least gpus GPUs |
+See [`.claude/skills/add-e2e-test/SKILL.md`](.claude/skills/add-e2e-test/SKILL.md) for the full guide on writing E2E tests (namespace isolation, resource naming, cleanup, tags, notebook editing, environment variables).
 
 ## CVE Fixes — Python dependency updates
 

diff --git a/ARCHITECTURE.md b/ARCHITECTURE.md
@@ -0,0 +1,98 @@
+# Architecture
+
+E2E test suite for distributed workloads on Red Hat OpenShift AI (RHOAI), covering Kubeflow Training Operator v1 (KFTO), Kubeflow Trainer v2, and KubeRay.
+
+## Test suites
+
+```
+tests/
+├── kfto/           KFTO v1 — PyTorchJob-based distributed training
+├── trainer/        Kubeflow Trainer v2 — TrainJob / JobSet-based training
+├── odh/            KubeRay — Ray cluster and RayJob-based training
+├── fms/            Foundation model fine-tuning (fms-hf-tuning)
+│   ├── kfto/         via KFTO PyTorchJob
+│   └── trainer/      via Trainer v2 TrainJob
+└── common/         Shared test infrastructure
+    └── support/      Client abstractions, resource helpers, test lifecycle
+```
+
+### kfto — Kubeflow Training Operator v1
+
+Tests PyTorchJob-based distributed training using the legacy Kubeflow Training Operator. Covers MNIST training (single/multi-node, single/multi-GPU), LLM supervised fine-tuning (SFT), Kueue integration, SDK usage, and upgrade scenarios.
+
+### trainer — Kubeflow Trainer v2
+
+Tests TrainJob-based distributed training using the modern Kubeflow Trainer v2. Covers PyTorch DDP (Fashion MNIST, multi-node/multi-GPU), MPI jobs (OpenMPI), Kueue integration, TrainingRuntime/ClusterTrainingRuntime, Kubeflow SDK, and upgrade scenarios. This is the primary and most actively developed test suite.
+
+### odh — KubeRay
+
+Tests Ray-based distributed training via RayCluster and RayJob. Covers MNIST training, Ray Tune hyperparameter optimization, and LLM fine-tuning with DeepSpeed.
+
+### fms — Foundation model fine-tuning
+
+Tests the fms-hf-tuning container image for LLM fine-tuning (SFT, LoRA, QLoRA) through two parallel orchestration paths: KFTO PyTorchJob (`fms/kfto/`) and Trainer v2 TrainJob (`fms/trainer/`). Both paths test the same training workload with different orchestration, validating that fms-hf-tuning works correctly under each framework. Includes S3 data staging via batch jobs.
+
+## Suite relationships
+
+- **kfto** is the legacy operator; **trainer** is its modern replacement. Both test PyTorch distributed training but via different CRDs (PyTorchJob vs TrainJob).
+- **fms** tests the same fms-hf-tuning workload via both kfto and trainer, ensuring parity across orchestration frameworks.
+- **odh** covers Ray-based parallelism, complementing the PyTorch-based kfto and trainer suites.
+
+## Shared support library
+
+`tests/common/support/` provides the test infrastructure used by all suites (~40 files).
+
+### Test lifecycle
+
+- **`test.go`** — `Test` interface: wraps `*testing.T` with gomega assertions (`Eventually`, `Expect`), context management, and namespace helpers.
+- **`namespace.go`** — `NewTestNamespace()`: creates an isolated namespace per test with automatic cleanup (pod log collection, event capture, namespace deletion) via `t.Cleanup`.
+
+### Client abstraction
+
+- **`client.go`** — `Client` interface: lazy-initialized accessor for multiple Kubernetes API clients:
+  - Core Kubernetes, Dynamic, Storage
+  - Kubeflow Training Operator (`kubeflowclient`)
+  - Kubeflow Trainer v2 (`trainerclient`)
+  - KubeRay (`rayclient`)
+  - Kueue (`kueueclient`), Kueue Operator
+  - JobSet (`jobsetclient`)
+  - OpenShift Machine API, Routes, ImageStreams
+  - OLM (Operator Lifecycle Manager)
+
+### Per-API resource helpers
+
+Each distributed workload API has a dedicated helper file with getters, condition checkers, and builders:
+
+| File | API |
+|------|-----|
+| `pytorchjob.go` | PyTorchJob (Running, Succeeded, Failed, Suspended) |
+| `trainjob.go` | TrainJob (Complete, Failed, Suspended) |
+| `ray.go` | RayJob, RayCluster (status, logs) |
+| `kueue.go` | ResourceFlavor, ClusterQueue, LocalQueue, workload admission |
+| `jobset.go` | JobSet resources |
+
+### Other shared utilities
+
+| File | Purpose |
+|------|---------|
+| `environment.go` | Environment variable getters (never use `os.Getenv` directly) |
+| `core.go` | Pod, ConfigMap, Secret helpers |
+| `rbac.go` | Role / RoleBinding creation for test isolation |
+| `conditions.go` | Kubernetes condition evaluation |
+| `events.go` | Event capture for debugging |
+| `accelerator.go` | GPU node detection |
+
+### Per-suite extensions
+
+Each suite has a `support.go` that imports `tests/common/support` and adds suite-specific utilities (e.g., embedded test resource files via `//go:embed`, Prometheus queries for GPU utilization). Per-suite files extend — never wrap — the common `Test` interface.
+
+### Common utilities outside support/
+
+`tests/common/` (outside `support/`) provides cross-suite utilities:
+
+| File | Purpose |
+|------|---------|
+| `test_tag.go` | Tag functions (`Smoke`, `Tier1`–`Tier3`, `Gpu`, `MultiNode`, etc.) and `Tags()` helper for test filtering |
+| `environment.go` | Shared env var getters (test tier, notebook config, HuggingFace token) |
+| `notebook.go` | Notebook creation with GPU allocation and Kueue integration |
+| `template.go` | Go template parsing for dynamic Kubernetes manifests |