Skip to content
Open
Show file tree
Hide file tree
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
15 changes: 15 additions & 0 deletions .claude/settings.json
Original file line number Diff line number Diff line change
@@ -0,0 +1,15 @@
{
"hooks": {
"PostToolUse": [
{
"matcher": "Edit|Write",
"hooks": [
{
"type": "command",
"command": "if [[ \"$TOOL_INPUT\" == *\".go\"* ]]; then ./bin/openshift-goimports 2>/dev/null; fi"
Comment on lines +5 to +9

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔒 Security & Privacy | 🔴 Critical | ⚡ Quick win

Remove the auto-executing shell hook.

This command runs shell code from an auto-loaded .claude config on every matching edit/write, which is a supply-chain execution surface (CWE-94 / CWE-78). Keep formatter runs explicit, or move this into a non-shell integration that cannot execute arbitrary commands on repository open/use.

As per path instructions, **/.claude/**: .claude/settings.json with any "command" field is a critical supply-chain issue.

Suggested fix
-      {
-        "matcher": "Edit|Write",
-        "hooks": [
-          {
-            "type": "command",
-            "command": "if [[ \"$TOOL_INPUT\" == *\".go\"* ]]; then ./bin/openshift-goimports 2>/dev/null; fi"
-          }
-        ]
-      }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
"matcher": "Edit|Write",
"hooks": [
{
"type": "command",
"command": "if [[ \"$TOOL_INPUT\" == *\".go\"* ]]; then ./bin/openshift-goimports 2>/dev/null; fi"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.claude/settings.json around lines 5 - 9, Remove the auto-executing shell
hook from the `.claude/settings.json` hook entry so the Edit/Write matcher no
longer runs arbitrary commands through the command field. Update the settings
around the hooks configuration to eliminate the shell-based `command` execution,
and if formatting is still needed, make it an explicit manual step or use a
non-shell integration that cannot execute repository-controlled code on load.
Use the `.claude/settings.json` hooks configuration as the target area.

Source: Path instructions

}
]
}
]
}
}
106 changes: 106 additions & 0 deletions .claude/skills/add-e2e-test/SKILL.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,106 @@
# Add E2E Test

Guide for adding a new end-to-end test to the distributed-workloads repo.

## Test structure

```go
func TestMyFeature(t *testing.T) {
Tags(t, Tier1) // 1. tag / skip checks
test := With(t) // 2. create test context

namespace := test.NewTestNamespace().Name // 3. isolated namespace

// 4. create resources with GenerateName
// 5. ensure cleanup of cluster-scoped resources
// 6. assert with test.Eventually(...)
}
```

## Namespace isolation

Every test must operate in its own dedicated namespace. Use `test.NewTestNamespace()` — it creates a uniquely named namespace and registers automatic cleanup (log collection + deletion) via `t.Cleanup`:

```go
namespace := test.NewTestNamespace().Name
```

Never use a fixed namespace name unless driven by an env var for a specific scenario (e.g., pre-upgrade/post-upgrade tests). Shared namespaces cause interference between tests.

## Resource naming

All Kubernetes resources must use `GenerateName` instead of a fixed `Name` to avoid collisions:

```go
// Good
ObjectMeta: metav1.ObjectMeta{GenerateName: "test-trainjob-"}

// Bad
ObjectMeta: metav1.ObjectMeta{Name: "my-trainjob"}
```

## Cleanup

Namespace-scoped resources are deleted automatically when the test namespace is cleaned up. Cluster-scoped resources (e.g., `ClusterRole`, `ClusterRoleBinding`) are not namespace-bound and may need to be explicitly cleaned up if the helper creating them does not already register a cleanup hook via `t.T().Cleanup(...)`.

## Tags

Tests in `tests/trainer/` **must** declare a tag — this is mandatory. Apply it as the first statement so tests are skipped early when `TEST_TIER` is set:

| Tag | When to use |
|-----|-------------|
| `Smoke` | Minimal deployment verification |
| `Tier1`–`Tier3` | Progressively deeper coverage |
| `Gpu(accelerator)` | Requires at least one GPU node |
| `MultiGpu(accelerator, n)` | Requires n GPUs per node |
| `MultiNode(n)` | Requires n worker nodes |
| `MultiNodeGpu(n, accelerator)` | Requires n nodes each with at least one GPU |
| `MultiNodeMultiGpu(n, accelerator, gpus)` | Requires n nodes each with at least gpus GPUs |

Comment on lines +48 to +59

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

Broaden the tag rule beyond tests/trainer/.

The shared tests/common/test_tag.go contract and ARCHITECTURE.md make tag gating look repo-wide. Limiting the rule to trainer will let new kfto/odh/fms tests miss the mandatory tier gate.

Fix
-Tests in `tests/trainer/` **must** declare a tag — this is mandatory. Apply it as the first statement so tests are skipped early when `TEST_TIER` is set:
+All E2E tests **must** declare a tag — this is mandatory. Apply it as the first statement so tests are skipped early when `TEST_TIER` is set:
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
Tests in `tests/trainer/` **must** declare a tag — this is mandatory. Apply it as the first statement so tests are skipped early when `TEST_TIER` is set:
| Tag | When to use |
|-----|-------------|
| `Smoke` | Minimal deployment verification |
| `Tier1``Tier3` | Progressively deeper coverage |
| `Gpu(accelerator)` | Requires at least one GPU node |
| `MultiGpu(accelerator, n)` | Requires n GPUs per node |
| `MultiNode(n)` | Requires n worker nodes |
| `MultiNodeGpu(n, accelerator)` | Requires n nodes each with at least one GPU |
| `MultiNodeMultiGpu(n, accelerator, gpus)` | Requires n nodes each with at least gpus GPUs |
All E2E tests **must** declare a tag — this is mandatory. Apply it as the first statement so tests are skipped early when `TEST_TIER` is set:
| Tag | When to use |
|-----|-------------|
| `Smoke` | Minimal deployment verification |
| `Tier1``Tier3` | Progressively deeper coverage |
| `Gpu(accelerator)` | Requires at least one GPU node |
| `MultiGpu(accelerator, n)` | Requires n GPUs per node |
| `MultiNode(n)` | Requires n worker nodes |
| `MultiNodeGpu(n, accelerator)` | Requires n nodes each with at least one GPU |
| `MultiNodeMultiGpu(n, accelerator, gpus)` | Requires n nodes each with at least gpus GPUs |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.claude/skills/add-e2e-test/SKILL.md around lines 48 - 59, The tag
requirement is too narrowly scoped to tests/trainer/ even though the shared
test-tag contract is repo-wide. Update the guidance in add-e2e-test/SKILL.md to
require a tag for all new e2e tests across the repo, aligning with
tests/common/test_tag.go and ARCHITECTURE.md, and keep the instruction that the
tag must be the first statement so TEST_TIER gating happens early; reference the
tag examples and the test_tag.go contract when revising the wording.

## Environment variables

Declare env var constants and getter functions in `tests/common/support/environment.go`. Never use `os.Getenv` directly in test files — always go through a getter.

## Editing notebooks

Test notebooks (`tests/**/resources/*.ipynb`) use 1-space JSON indentation with no trailing newline. When editing notebook cells, preserve the array-of-lines source format — do not collapse source arrays into single strings:

```json
// Good — array of lines, readable in raw JSON
"source": [
"import os\n",
"print('hello')"
]

// Bad — single string, hard to read in raw JSON
"source": "import os\nprint('hello')"
```

If a tool (e.g. `NotebookEdit`) converts the edited cell's source to a single string, convert it back to array-of-lines before committing. You can use a Python script:

```python
import json
with open(path, encoding="utf-8") as f:
nb = json.load(f)
for cell in nb["cells"]:
if isinstance(cell["source"], str):
cell["source"] = cell["source"].splitlines(True)
# Ensure last line has no trailing newline (notebook convention)
if cell["source"] and cell["source"][-1].endswith("\n"):
cell["source"][-1] = cell["source"][-1][:-1]
with open(path, "w", encoding="utf-8") as f:
json.dump(nb, f, indent=1, ensure_ascii=False)
```

## Key support library files

| File | Purpose |
|------|---------|
| `tests/common/support/test.go` | `Test` interface — context, namespace helpers, gomega assertions |
| `tests/common/support/client.go` | Multi-client accessor (Kubernetes, Trainer, Kubeflow, Ray, Kueue, JobSet) |
| `tests/common/support/pytorchjob.go` | PyTorchJob getters and condition checkers |
| `tests/common/support/trainjob.go` | TrainJob getters and condition checkers |
| `tests/common/support/ray.go` | RayJob/RayCluster helpers |
| `tests/common/support/kueue.go` | Kueue resource helpers (ResourceFlavor, ClusterQueue, LocalQueue) |
| `tests/common/support/environment.go` | Environment variable getters |
| `tests/common/test_tag.go` | Tag functions (Smoke, Tier1–3, Gpu, MultiNode, etc.) |
90 changes: 1 addition & 89 deletions AGENTS.md
Original file line number Diff line number Diff line change
Expand Up @@ -66,95 +66,7 @@ pre-commit run --files path/to/file.py # Run all hooks on a s

## Writing Tests

### Namespace isolation

Every test must operate in its own dedicated namespace. Use `test.NewTestNamespace()` — it creates a uniquely named namespace and registers automatic cleanup (log collection + deletion) via `t.Cleanup`:

```go
namespace := test.NewTestNamespace().Name
```

Never use a fixed namespace name unless driven by an env var for a specific scenario (e.g., pre-upgrade/post-upgrade tests). Shared namespaces cause interference between tests.

### Resource naming

All Kubernetes resources must use `GenerateName` instead of a fixed `Name` to avoid collisions:

```go
// Good
ObjectMeta: metav1.ObjectMeta{GenerateName: "test-trainjob-"}

// Bad
ObjectMeta: metav1.ObjectMeta{Name: "my-trainjob"}
```

### Cleanup

Namespace-scoped resources are deleted automatically when the test namespace is cleaned up. Cluster-scoped resources (e.g., `ClusterRole`, `ClusterRoleBinding`) are not namespace-bound and may need to be explicitly cleaned up if the helper creating them does not already register a cleanup hook via `t.T().Cleanup(...)`.

### Test structure

```go
func TestMyFeature(t *testing.T) {
Tags(t, Tier1) // 1. tag / skip checks
test := With(t) // 2. create test context

namespace := test.NewTestNamespace().Name // 3. isolated namespace

// 4. create resources with GenerateName
// 5. ensure cleanup of cluster-scoped resources
// 6. assert with test.Eventually(...)
}
```

### Editing notebooks

Test notebooks (`tests/**/resources/*.ipynb`) use 1-space JSON indentation with no trailing newline. When editing notebook cells, preserve the array-of-lines source format — do not collapse source arrays into single strings:

```json
// Good — array of lines, readable in raw JSON
"source": [
"import os\n",
"print('hello')"
]

// Bad — single string, hard to read in raw JSON
"source": "import os\nprint('hello')"
```

If a tool (e.g. `NotebookEdit`) converts the edited cell's source to a single string, convert it back to array-of-lines before committing. You can use a Python script:

```python
import json
with open(path, encoding="utf-8") as f:
nb = json.load(f)
for cell in nb["cells"]:
if isinstance(cell["source"], str):
cell["source"] = cell["source"].splitlines(True)
# Ensure last line has no trailing newline (notebook convention)
if cell["source"] and cell["source"][-1].endswith("\n"):
cell["source"][-1] = cell["source"][-1][:-1]
with open(path, "w", encoding="utf-8") as f:
json.dump(nb, f, indent=1, ensure_ascii=False)
```

### Environment variables

Declare env var constants and getter functions in `tests/common/support/environment.go`. Never use `os.Getenv` directly in test files — always go through a getter.

### Tags

Tests in `tests/trainer/` **must** declare a tag — this is mandatory. Apply it as the first statement so tests are skipped early when `TEST_TIER` is set:

| Tag | When to use |
|-----|-------------|
| `Smoke` | Minimal deployment verification |
| `Tier1`–`Tier3` | Progressively deeper coverage |
| `Gpu(accelerator)` | Requires at least one GPU node |
| `MultiGpu(accelerator, n)` | Requires n GPUs per node |
| `MultiNode(n)` | Requires n worker nodes |
| `MultiNodeGpu(n, accelerator)` | Requires n nodes each with at least one GPU |
| `MultiNodeMultiGpu(n, accelerator, gpus)` | Requires n nodes each with at least gpus GPUs |
See [`.claude/skills/add-e2e-test/SKILL.md`](.claude/skills/add-e2e-test/SKILL.md) for the full guide on writing E2E tests (namespace isolation, resource naming, cleanup, tags, notebook editing, environment variables).

## CVE Fixes — Python dependency updates

Expand Down
98 changes: 98 additions & 0 deletions ARCHITECTURE.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,98 @@
# Architecture

E2E test suite for distributed workloads on Red Hat OpenShift AI (RHOAI), covering Kubeflow Training Operator v1 (KFTO), Kubeflow Trainer v2, and KubeRay.

## Test suites

```
tests/
├── kfto/ KFTO v1 — PyTorchJob-based distributed training
├── trainer/ Kubeflow Trainer v2 — TrainJob / JobSet-based training
├── odh/ KubeRay — Ray cluster and RayJob-based training
├── fms/ Foundation model fine-tuning (fms-hf-tuning)
│ ├── kfto/ via KFTO PyTorchJob
│ └── trainer/ via Trainer v2 TrainJob
└── common/ Shared test infrastructure
└── support/ Client abstractions, resource helpers, test lifecycle
```
Comment on lines +7 to +17

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

Label the tree fence.

Bare fences trigger markdownlint MD040, so this doc can fail CI as-is. Use an explicit language like text on the directory-tree block.

Fix
-```
+```text
 tests/
 ├── kfto/           KFTO v1 — PyTorchJob-based distributed training
 ├── trainer/        Kubeflow Trainer v2 — TrainJob / JobSet-based training
 ├── odh/            KubeRay — Ray cluster and RayJob-based training
 ├── fms/            Foundation model fine-tuning (fms-hf-tuning)
 │   ├── kfto/         via KFTO PyTorchJob
 │   └── trainer/      via Trainer v2 TrainJob
 └── common/         Shared test infrastructure
     └── support/      Client abstractions, resource helpers, test lifecycle
</details>

<!-- suggestion_start -->

<details>
<summary>📝 Committable suggestion</summary>

> ‼️ **IMPORTANT**
> Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```suggestion

🧰 Tools
🪛 markdownlint-cli2 (0.22.1)

[warning] 7-7: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ARCHITECTURE.md` around lines 7 - 17, The directory-tree block in
ARCHITECTURE.md uses a bare fence, which can trigger markdownlint MD040. Update
the fenced block near the tests tree to include an explicit language label such
as text so the doc renders cleanly and passes CI; use the existing tree section
as the target for the fence change.

Source: Linters/SAST tools


### kfto — Kubeflow Training Operator v1

Tests PyTorchJob-based distributed training using the legacy Kubeflow Training Operator. Covers MNIST training (single/multi-node, single/multi-GPU), LLM supervised fine-tuning (SFT), Kueue integration, SDK usage, and upgrade scenarios.

### trainer — Kubeflow Trainer v2

Tests TrainJob-based distributed training using the modern Kubeflow Trainer v2. Covers PyTorch DDP (Fashion MNIST, multi-node/multi-GPU), MPI jobs (OpenMPI), Kueue integration, TrainingRuntime/ClusterTrainingRuntime, Kubeflow SDK, and upgrade scenarios. This is the primary and most actively developed test suite.

### odh — KubeRay

Tests Ray-based distributed training via RayCluster and RayJob. Covers MNIST training, Ray Tune hyperparameter optimization, and LLM fine-tuning with DeepSpeed.

### fms — Foundation model fine-tuning

Tests the fms-hf-tuning container image for LLM fine-tuning (SFT, LoRA, QLoRA) through two parallel orchestration paths: KFTO PyTorchJob (`fms/kfto/`) and Trainer v2 TrainJob (`fms/trainer/`). Both paths test the same training workload with different orchestration, validating that fms-hf-tuning works correctly under each framework. Includes S3 data staging via batch jobs.

## Suite relationships

- **kfto** is the legacy operator; **trainer** is its modern replacement. Both test PyTorch distributed training but via different CRDs (PyTorchJob vs TrainJob).
- **fms** tests the same fms-hf-tuning workload via both kfto and trainer, ensuring parity across orchestration frameworks.
- **odh** covers Ray-based parallelism, complementing the PyTorch-based kfto and trainer suites.

## Shared support library

`tests/common/support/` provides the test infrastructure used by all suites (~40 files).

### Test lifecycle

- **`test.go`** — `Test` interface: wraps `*testing.T` with gomega assertions (`Eventually`, `Expect`), context management, and namespace helpers.
- **`namespace.go`** — `NewTestNamespace()`: creates an isolated namespace per test with automatic cleanup (pod log collection, event capture, namespace deletion) via `t.Cleanup`.

### Client abstraction

- **`client.go`** — `Client` interface: lazy-initialized accessor for multiple Kubernetes API clients:
- Core Kubernetes, Dynamic, Storage
- Kubeflow Training Operator (`kubeflowclient`)
- Kubeflow Trainer v2 (`trainerclient`)
- KubeRay (`rayclient`)
- Kueue (`kueueclient`), Kueue Operator
- JobSet (`jobsetclient`)
- OpenShift Machine API, Routes, ImageStreams
- OLM (Operator Lifecycle Manager)

### Per-API resource helpers

Each distributed workload API has a dedicated helper file with getters, condition checkers, and builders:

| File | API |
|------|-----|
| `pytorchjob.go` | PyTorchJob (Running, Succeeded, Failed, Suspended) |
| `trainjob.go` | TrainJob (Complete, Failed, Suspended) |
| `ray.go` | RayJob, RayCluster (status, logs) |
| `kueue.go` | ResourceFlavor, ClusterQueue, LocalQueue, workload admission |
| `jobset.go` | JobSet resources |

### Other shared utilities

| File | Purpose |
|------|---------|
| `environment.go` | Environment variable getters (never use `os.Getenv` directly) |
| `core.go` | Pod, ConfigMap, Secret helpers |
| `rbac.go` | Role / RoleBinding creation for test isolation |
| `conditions.go` | Kubernetes condition evaluation |
| `events.go` | Event capture for debugging |
| `accelerator.go` | GPU node detection |

### Per-suite extensions

Each suite has a `support.go` that imports `tests/common/support` and adds suite-specific utilities (e.g., embedded test resource files via `//go:embed`, Prometheus queries for GPU utilization). Per-suite files extend — never wrap — the common `Test` interface.

### Common utilities outside support/

`tests/common/` (outside `support/`) provides cross-suite utilities:

| File | Purpose |
|------|---------|
| `test_tag.go` | Tag functions (`Smoke`, `Tier1`–`Tier3`, `Gpu`, `MultiNode`, etc.) and `Tags()` helper for test filtering |
| `environment.go` | Shared env var getters (test tier, notebook config, HuggingFace token) |
| `notebook.go` | Notebook creation with GPU allocation and Kueue integration |
| `template.go` | Go template parsing for dynamic Kubernetes manifests |