opendatahub-io · h0pers · Jun 25, 2026 · Jul 2, 2026 · coderabbitai · Jun 25, 2026
diff --git a/.claude/settings.json b/.claude/settings.json
@@ -0,0 +1,15 @@
+{
+  "hooks": {
+    "PostToolUse": [
+      {
+        "matcher": "Edit|Write",
+        "hooks": [
+          {
+            "type": "command",
+            "command": "if [[ \"$TOOL_INPUT\" == *\".go\"* ]]; then ./bin/openshift-goimports 2>/dev/null; fi"
+          }
+        ]
+      }
+    ]
+  }
+}
diff --git a/.claude/skills/add-benchmark/SKILL.md b/.claude/skills/add-benchmark/SKILL.md
@@ -0,0 +1,145 @@
+# Add Benchmark
+
+Guide for adding a new benchmark to `benchmarks/` in the distributed-workloads repo.
+
+## Directory layout
+
+Each benchmark lives in its own subdirectory under `benchmarks/`:
+
+```
+benchmarks/<benchmark-name>/
+  Dockerfile              # Multi-stage build for the benchmark image
+  Dockerfile.cuda         # (optional) CUDA variant
+  mpi-runtime.yaml        # ClusterTrainingRuntime defining the MPI execution environment
+  trainjob.yaml           # TrainJob manifest to submit the benchmark
+  README.md               # Documentation (what, files, quick start, parameters, output)
+  <scripts>               # (optional) Training/benchmark scripts mounted via ConfigMap
+```
+
+See `benchmarks/osu-benchmarks/` and `benchmarks/kftv2-mpi-ddp-sft/` as reference implementations.
+
+## Dockerfile conventions
+
+Follow the multi-stage build pattern used in `benchmarks/osu-benchmarks/Dockerfile`:
+
+1. **Stage 1 (builder)** - compile dependencies from source (e.g., OpenMPI, benchmark binaries)
+2. **Stage 2 (runtime)** - copy built artifacts, configure SSH for MPI, set up the runtime environment
+
+Key requirements:
+- Base image from `quay.io/opendatahub/` or `quay.io/modh/`
+- `USER 0` only during build stages; final image must use `USER 1001`
+- OpenShift GID 0 pattern: `chgrp -R 0 <dir> && chmod -R g=u <dir>`
+- Allow random UID: `chmod g=u /etc/passwd`
+- SSH setup with keys baked into `/tmp/ssh/` (Training Operator does not auto-inject SSH keys)
+- For CUDA variants, create a separate `Dockerfile.cuda` extending the base
+
+## ClusterTrainingRuntime
+
+Define a `ClusterTrainingRuntime` resource with MPI configuration. Key fields:
+
+```yaml
+apiVersion: trainer.kubeflow.org/v1alpha1
+kind: ClusterTrainingRuntime
+metadata:
+  name: <runtime-name>
+spec:
+  mlPolicy:
+    mpi:
+      mpiImplementation: OpenMPI
+      sshAuthMountPath: /tmp/ssh
+  template:
+    spec:
+      replicatedJobs:
+        - name: launcher
+          replicas: 1
+          template: ...
+        - name: worker
+          replicas: <N>
+          template: ...
+```
+
+- Launcher: runs the benchmark command (mpirun/mpiexec)
+- Workers: run sshd and wait for MPI connections
+- Both need the SSH setup commands in their entrypoints
+
+See `benchmarks/osu-benchmarks/mpi-runtime-cpu.yaml` for a complete example.
+
+## TrainJob
+
+Submit benchmarks using a `TrainJob` with `generateName` (not fixed `name`):
+
+```yaml
+apiVersion: trainer.kubeflow.org/v1alpha1
+kind: TrainJob
+metadata:
+  generateName: <benchmark-name>-
+  namespace: <namespace>
+spec:
+  runtimeRef:
+    name: <runtime-name>
+  trainer:
+    numNodes: 2
+    resourcesPerNode:
+      requests:
+        nvidia.com/gpu: "2"
+    env:
+      - name: PARAM_NAME
+        value: "value"
+```
+
+Use `trainer.env` for benchmark parameters - the controller injects them into all pod containers.
+
+See `benchmarks/kftv2-mpi-ddp-sft/trainjob.yaml` for a complete example.
+
+## Makefile targets
+
+Add build/push targets to the root `Makefile` following the existing pattern:
+
+```makefile
+BENCHMARK_VERSION ?= latest
+
+.PHONY: build-<name>-benchmark-image
+build-<name>-benchmark-image:
+	$(CONTAINER_ENGINE) build -t quay.io/modh/distributed-workloads-benchmark:trainer-mpi-<name>-$(BENCHMARK_VERSION) \
+	  -f benchmarks/<name>/Dockerfile benchmarks/<name>/
+
+.PHONY: push-<name>-benchmark-image
+push-<name>-benchmark-image:
+	$(CONTAINER_ENGINE) push quay.io/modh/distributed-workloads-benchmark:trainer-mpi-<name>-$(BENCHMARK_VERSION)
+```
+
+Registry: `quay.io/modh/distributed-workloads-benchmark`
+Tag format: `trainer-mpi-<name>-<version>`
+
+## CI workflow
+
+Create `.github/workflows/build-and-push-<name>-benchmark.yml` matching the structure in `build-and-push-osu-benchmark.yml`:
+
+- Trigger on push/PR when files under `benchmarks/<name>/` change
+- Build on all branches, push only on `main`
+- Use `docker/build-push-action` with appropriate Dockerfile path
+
+## README
+
+Every benchmark must include a `README.md` with these sections (see `benchmarks/kftv2-mpi-ddp-sft/README.md`):
+
+| Section | Content |
+|---------|---------|
+| Title + summary | One-line description of what the benchmark measures |
+| What this benchmark does | Table with algorithm, model, dataset, backend, runtime, image |
+| Files | Table mapping each file to its purpose |
+| Quick start | Numbered steps: deploy runtime, create namespace/ConfigMap, submit TrainJob, monitor |
+| Scaling | Table showing node/GPU configurations |
+| Benchmark parameters | Tables for training and infrastructure parameters with defaults and impact |
+| Expected output | Example benchmark summary output |
+| Known issues | Documented limitations and workarounds |
+| Cleanup | Commands to remove all created resources |
+
+## Checklist
+
+- [ ] Dockerfile builds successfully: `make build-<name>-benchmark-image`
+- [ ] ClusterTrainingRuntime applies: `oc apply -f benchmarks/<name>/mpi-runtime.yaml`
+- [ ] TrainJob submits and runs: `oc create -f benchmarks/<name>/trainjob.yaml`
+- [ ] README has all required sections
+- [ ] Makefile targets added for build and push
+- [ ] CI workflow triggers on path changes to `benchmarks/<name>/`
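Review note: the directory layout and checklist in this skill lend themselves to a quick local pre-check before opening a PR. The following is a minimal sketch and not part of this change. It assumes the required file names listed under "Directory layout" and the tag format listed under "Makefile targets"; the helper names `missing_benchmark_files` and `benchmark_image` are hypothetical. The reference implementation uses `mpi-runtime-cpu.yaml`, so the exact-name check may need loosening in practice.

```python
from pathlib import Path

# Required files, taken from the "Directory layout" section of the skill.
# Optional entries (Dockerfile.cuda, scripts) are deliberately not checked.
REQUIRED_FILES = ("Dockerfile", "mpi-runtime.yaml", "trainjob.yaml", "README.md")

# Registry stated in the "Makefile targets" section.
REGISTRY = "quay.io/modh/distributed-workloads-benchmark"


def missing_benchmark_files(benchmark_dir):
    """Return the required files absent from benchmark_dir, in layout order."""
    root = Path(benchmark_dir)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]


def benchmark_image(name, version="latest"):
    """Compose the image reference in the trainer-mpi-<name>-<version> tag format."""
    return f"{REGISTRY}:trainer-mpi-{name}-{version}"
```

Calling `missing_benchmark_files("benchmarks/<name>")` returns an empty list when the directory is complete, which makes it easy to wire into a pre-commit step if that turns out to be useful.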
diff --git a/.claude/skills/add-e2e-test/SKILL.md b/.claude/skills/add-e2e-test/SKILL.md
@@ -0,0 +1,106 @@
+# Add E2E Test
+
+Guide for adding a new end-to-end test to the distributed-workloads repo.
+
+## Test structure
+
+```go
+func TestMyFeature(t *testing.T) {
+    Tags(t, Tier1)         // 1. tag / skip checks
+    test := With(t)        // 2. create test context
+
+    namespace := test.NewTestNamespace().Name  // 3. isolated namespace
+
+    // 4. create resources with GenerateName
+    // 5. ensure cleanup of cluster-scoped resources
+    // 6. assert with test.Eventually(...)
+}
+```
+
+## Namespace isolation
+
+Every test must operate in its own dedicated namespace. Use `test.NewTestNamespace()` — it creates a uniquely named namespace and registers automatic cleanup (log collection + deletion) via `t.Cleanup`:
+
+```go
+namespace := test.NewTestNamespace().Name
+```
+
+Never use a fixed namespace name unless driven by an env var for a specific scenario (e.g., pre-upgrade/post-upgrade tests). Shared namespaces cause interference between tests.
+
+## Resource naming
+
+All Kubernetes resources must use `GenerateName` instead of a fixed `Name` to avoid collisions:
+
+```go
+// Good
+ObjectMeta: metav1.ObjectMeta{GenerateName: "test-trainjob-"}
+
+// Bad
+ObjectMeta: metav1.ObjectMeta{Name: "my-trainjob"}
+```
+
+## Cleanup
+
+Namespace-scoped resources are deleted automatically when the test namespace is cleaned up. Cluster-scoped resources (e.g., `ClusterRole`, `ClusterRoleBinding`) are not namespace-bound and may need to be explicitly cleaned up if the helper creating them does not already register a cleanup hook via `t.T().Cleanup(...)`.
+
+## Tags
+
+Tests in `tests/trainer/` **must** declare a tag — this is mandatory. Apply it as the first statement so tests are skipped early when `TEST_TIER` is set:
+
+| Tag | When to use |
+|-----|-------------|
+| `Smoke` | Minimal deployment verification |
+| `Tier1`–`Tier3` | Progressively deeper coverage |
+| `Gpu(accelerator)` | Requires at least one GPU node |
+| `MultiGpu(accelerator, n)` | Requires n GPUs per node |
+| `MultiNode(n)` | Requires n worker nodes |
+| `MultiNodeGpu(n, accelerator)` | Requires n nodes each with at least one GPU |
+| `MultiNodeMultiGpu(n, accelerator, gpus)` | Requires n nodes each with at least gpus GPUs |
+
+## Environment variables
+
+Declare env var constants and getter functions in `tests/common/support/environment.go`. Never use `os.Getenv` directly in test files — always go through a getter.
+
+## Editing notebooks
+
+Test notebooks (`tests/**/resources/*.ipynb`) use 1-space JSON indentation with no trailing newline. When editing notebook cells, preserve the array-of-lines source format — do not collapse source arrays into single strings:
+
+```json
+// Good — array of lines, readable in raw JSON
+"source": [
+ "import os\n",
+ "print('hello')"
+]
+
+// Bad — single string, hard to read in raw JSON
+"source": "import os\nprint('hello')"
+```
+
+If a tool (e.g. `NotebookEdit`) converts the edited cell's source to a single string, convert it back to array-of-lines before committing. You can use a Python script:
+
+```python
+import json
+with open(path, encoding="utf-8") as f:
+    nb = json.load(f)
+for cell in nb["cells"]:
+    if isinstance(cell["source"], str):
+        cell["source"] = cell["source"].splitlines(True)
+        # Ensure last line has no trailing newline (notebook convention)
+        if cell["source"] and cell["source"][-1].endswith("\n"):
+            cell["source"][-1] = cell["source"][-1][:-1]
+with open(path, "w", encoding="utf-8") as f:
+    json.dump(nb, f, indent=1, ensure_ascii=False)
+```
+
+## Key support library files
+
+| File | Purpose |
+|------|---------|
+| `tests/common/support/test.go` | `Test` interface — context, namespace helpers, gomega assertions |
+| `tests/common/support/client.go` | Multi-client accessor (Kubernetes, Trainer, Kubeflow, Ray, Kueue, JobSet) |
+| `tests/common/support/pytorchjob.go` | PyTorchJob getters and condition checkers |
+| `tests/common/support/trainjob.go` | TrainJob getters and condition checkers |
+| `tests/common/support/ray.go` | RayJob/RayCluster helpers |
+| `tests/common/support/kueue.go` | Kueue resource helpers (ResourceFlavor, ClusterQueue, LocalQueue) |
+| `tests/common/support/environment.go` | Environment variable getters |
+| `tests/common/test_tag.go` | Tag functions (Smoke, Tier1–3, Gpu, MultiNode, etc.) |
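Review note: the notebook conversion in "Editing notebooks" leaves `path` undefined, so it may be easier to reuse as a pair of functions. This is a minimal sketch under that assumption and not part of this change; the names `source_to_lines` and `normalize_notebook` are hypothetical. It keeps the same behavior as the snippet in the skill: sources become arrays of lines, the last line carries no trailing newline, and the file is written with 1-space indentation and no trailing newline.

```python
import json


def source_to_lines(source):
    """Convert a cell source string to array-of-lines form.

    Sources that are already lists are returned unchanged. The final line
    keeps no trailing newline, matching the convention stated in the skill.
    """
    if not isinstance(source, str):
        return source
    lines = source.splitlines(True)  # True keeps the line endings
    if lines and lines[-1].endswith("\n"):
        lines[-1] = lines[-1][:-1]
    return lines


def normalize_notebook(path):
    """Rewrite the notebook at path so every cell source is an array of lines."""
    with open(path, encoding="utf-8") as f:
        nb = json.load(f)
    for cell in nb["cells"]:
        cell["source"] = source_to_lines(cell["source"])
    with open(path, "w", encoding="utf-8") as f:
        # indent=1 and no final newline, as the skill describes for test notebooks
        json.dump(nb, f, indent=1, ensure_ascii=False)
```

Splitting the string handling from the file I/O keeps `source_to_lines` easy to check in isolation, for example against the "Good" and "Bad" shapes shown in the skill.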