Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
15 changes: 15 additions & 0 deletions .claude/settings.json
Original file line number Diff line number Diff line change
@@ -0,0 +1,15 @@
{
"hooks": {
"PostToolUse": [
{
"matcher": "Edit|Write",
"hooks": [
{
"type": "command",
"command": "if [[ \"$TOOL_INPUT\" == *\".go\"* ]]; then ./bin/openshift-goimports 2>/dev/null; fi"
Comment on lines +5 to +9

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔒 Security & Privacy | 🔴 Critical | ⚡ Quick win

Remove the auto-executing shell hook.

This command runs shell code from an auto-loaded .claude config on every matching edit/write, which is a supply-chain execution surface (CWE-94 / CWE-78). Keep formatter runs explicit, or move this into a non-shell integration that cannot execute arbitrary commands on repository open/use.

As per path instructions, **/.claude/**: .claude/settings.json with any "command" field is a critical supply-chain issue.

Suggested fix
-      {
-        "matcher": "Edit|Write",
-        "hooks": [
-          {
-            "type": "command",
-            "command": "if [[ \"$TOOL_INPUT\" == *\".go\"* ]]; then ./bin/openshift-goimports 2>/dev/null; fi"
-          }
-        ]
-      }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
"matcher": "Edit|Write",
"hooks": [
{
"type": "command",
"command": "if [[ \"$TOOL_INPUT\" == *\".go\"* ]]; then ./bin/openshift-goimports 2>/dev/null; fi"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.claude/settings.json around lines 5 - 9, Remove the auto-executing shell
hook from the `.claude/settings.json` hook entry so the Edit/Write matcher no
longer runs arbitrary commands through the command field. Update the settings
around the hooks configuration to eliminate the shell-based `command` execution,
and if formatting is still needed, make it an explicit manual step or use a
non-shell integration that cannot execute repository-controlled code on load.
Use the `.claude/settings.json` hooks configuration as the target area.

Source: Path instructions

}
]
}
]
}
}
145 changes: 145 additions & 0 deletions .claude/skills/add-benchmark/SKILL.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,145 @@
# Add Benchmark

Guide for adding a new benchmark to `benchmarks/` in the distributed-workloads repo.

## Directory layout

Each benchmark lives in its own subdirectory under `benchmarks/`:

```
benchmarks/<benchmark-name>/
Dockerfile # Multi-stage build for the benchmark image
Dockerfile.cuda # (optional) CUDA variant
mpi-runtime.yaml # ClusterTrainingRuntime defining the MPI execution environment
trainjob.yaml # TrainJob manifest to submit the benchmark
README.md # Documentation (what, files, quick start, parameters, output)
<scripts> # (optional) Training/benchmark scripts mounted via ConfigMap
```

See `benchmarks/osu-benchmarks/` and `benchmarks/kftv2-mpi-ddp-sft/` as reference implementations.

## Dockerfile conventions

Follow the multi-stage build pattern used in `benchmarks/osu-benchmarks/Dockerfile`:

1. **Stage 1 (builder)** - compile dependencies from source (e.g., OpenMPI, benchmark binaries)
2. **Stage 2 (runtime)** - copy built artifacts, configure SSH for MPI, set up the runtime environment

Key requirements:
- Base image from `quay.io/opendatahub/` or `quay.io/modh/`
- `USER 0` only during build stages; final image must use `USER 1001`
- OpenShift GID 0 pattern: `chgrp -R 0 <dir> && chmod -R g=u <dir>`
- Allow random UID: `chmod g=u /etc/passwd`
- SSH setup with keys baked into `/tmp/ssh/` (Training Operator does not auto-inject SSH keys)
- For CUDA variants, create a separate `Dockerfile.cuda` extending the base
Comment on lines +23 to +34

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔒 Security & Privacy | 🟠 Major | ⚡ Quick win

Don't normalize baked SSH credentials.

This tells contributors to bake SSH keys into the image and hard-codes /tmp/ssh, which is a hard-coded credential pattern (CWE-798) and conflicts with the repo's existing MPI runtime example under benchmarks/kftv2-mpi-ddp-sft/mpi-runtime.yaml, which mounts SSH auth at /home/mpiuser/.ssh. Replace this with a mounted Secret or runtime-generated keypair, and keep the example path aligned.

♻️ Proposed fix
- SSH setup with keys baked into `/tmp/ssh/` (Training Operator does not auto-inject SSH keys)
+ SSH setup with a mounted Secret or runtime-generated keypair; never bake private SSH keys into the image.

As per path instructions, .claude/ content is a supply-chain surface.

Also applies to: 36-65

🧰 Tools
🪛 SkillSpector (2.3.7)

[error] 31: [TM2] Chaining Abuse: Tool calls are chained to bypass individual safety checks or escalate capabilities beyond what any single tool call would allow.

Remediation: Limit tool chaining depth and validate the output of each tool before passing it to the next. Require explicit user approval for multi-step chains.

(Tool Misuse (TM2))

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.claude/skills/add-benchmark/SKILL.md around lines 23 - 34, The SSH setup
guidance in the add-benchmark skill should not instruct contributors to bake
credentials into the image or use a fixed /tmp/ssh location; update the
Dockerfile guidance to use runtime-provided SSH material instead. Align the MPI
runtime example with the existing benchmark pattern by referring to the mounted
Secret approach used in the mpi-runtime setup under the benchmark helpers, and
keep the multi-stage Dockerfile instructions focused on copying artifacts,
setting OpenShift-friendly permissions, and using USER 1001 in the final stage.
Ensure the guidance in the skill document clearly references the Dockerfile and
runtime example sections so contributors implement mounted or generated keys
rather than embedded ones.

Source: Path instructions


## ClusterTrainingRuntime

Define a `ClusterTrainingRuntime` resource with MPI configuration. Key fields:

```yaml
apiVersion: trainer.kubeflow.org/v1alpha1
kind: ClusterTrainingRuntime
metadata:
name: <runtime-name>
spec:
mlPolicy:
mpi:
mpiImplementation: OpenMPI
sshAuthMountPath: /tmp/ssh
template:
spec:
replicatedJobs:
- name: launcher
replicas: 1
template: ...
- name: worker
replicas: <N>
template: ...
```

- Launcher: runs the benchmark command (mpirun/mpiexec)
- Workers: run sshd and wait for MPI connections
- Both need the SSH setup commands in their entrypoints

See `benchmarks/osu-benchmarks/mpi-runtime-cpu.yaml` for a complete example.

## TrainJob

Submit benchmarks using a `TrainJob` with `generateName` (not fixed `name`):

```yaml
apiVersion: trainer.kubeflow.org/v1alpha1
kind: TrainJob
metadata:
generateName: <benchmark-name>-
namespace: <namespace>
spec:
runtimeRef:
name: <runtime-name>
trainer:
Comment on lines +71 to +80

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Make the runtimeRef example schema-complete.

The TrainJob example only shows runtimeRef.name, but the repo's benchmarks/kftv2-mpi-ddp-sft/trainjob.yaml uses apiGroup and kind as well; copying this snippet as-is will fail validation (CWE-20). Expand the example or mark it as pseudocode.

♻️ Proposed fix
 runtimeRef:
+  apiGroup: trainer.kubeflow.org
+  kind: ClusterTrainingRuntime
   name: <runtime-name>

As per the provided benchmarks/kftv2-mpi-ddp-sft/trainjob.yaml example, runtimeRef is a structured reference, not a name-only alias.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
```yaml
apiVersion: trainer.kubeflow.org/v1alpha1
kind: TrainJob
metadata:
generateName: <benchmark-name>-
namespace: <namespace>
spec:
runtimeRef:
name: <runtime-name>
trainer:
🧰 Tools
🪛 SkillSpector (2.3.7)

[error] 31: [TM2] Chaining Abuse: Tool calls are chained to bypass individual safety checks or escalate capabilities beyond what any single tool call would allow.

Remediation: Limit tool chaining depth and validate the output of each tool before passing it to the next. Require explicit user approval for multi-step chains.

(Tool Misuse (TM2))

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.claude/skills/add-benchmark/SKILL.md around lines 71 - 80, The TrainJob
example is incomplete because `runtimeRef` is shown as name-only, but the actual
schema requires the full structured reference. Update the `TrainJob` example in
the skill doc to include the same `runtimeRef` fields used by
`benchmarks/kftv2-mpi-ddp-sft/trainjob.yaml` (`apiGroup`, `kind`, and `name`),
or explicitly label the snippet as pseudocode so it is not treated as valid
YAML.

numNodes: 2
resourcesPerNode:
requests:
nvidia.com/gpu: "2"
env:
- name: PARAM_NAME
value: "value"
```

Use `trainer.env` for benchmark parameters - the controller injects them into all pod containers.

See `benchmarks/kftv2-mpi-ddp-sft/trainjob.yaml` for a complete example.

## Makefile targets

Add build/push targets to the root `Makefile` following the existing pattern:

```makefile
BENCHMARK_VERSION ?= latest

.PHONY: build-<name>-benchmark-image
build-<name>-benchmark-image:
$(CONTAINER_ENGINE) build -t quay.io/modh/distributed-workloads-benchmark:trainer-mpi-<name>-$(BENCHMARK_VERSION) \
-f benchmarks/<name>/Dockerfile benchmarks/<name>/

.PHONY: push-<name>-benchmark-image
push-<name>-benchmark-image:
$(CONTAINER_ENGINE) push quay.io/modh/distributed-workloads-benchmark:trainer-mpi-<name>-$(BENCHMARK_VERSION)
```

Registry: `quay.io/modh/distributed-workloads-benchmark`
Tag format: `trainer-mpi-<name>-<version>`

## CI workflow

Create `.github/workflows/build-and-push-<name>-benchmark.yml` matching the structure in `build-and-push-osu-benchmark.yml`:

- Trigger on push/PR when files under `benchmarks/<name>/` change
- Build on all branches, push only on `main`
- Use `docker/build-push-action` with appropriate Dockerfile path

## README

Every benchmark must include a `README.md` with these sections (see `benchmarks/kftv2-mpi-ddp-sft/README.md`):

| Section | Content |
|---------|---------|
| Title + summary | One-line description of what the benchmark measures |
| What this benchmark does | Table with algorithm, model, dataset, backend, runtime, image |
| Files | Table mapping each file to its purpose |
| Quick start | Numbered steps: deploy runtime, create namespace/ConfigMap, submit TrainJob, monitor |
| Scaling | Table showing node/GPU configurations |
| Benchmark parameters | Tables for training and infrastructure parameters with defaults and impact |
| Expected output | Example benchmark summary output |
| Known issues | Documented limitations and workarounds |
| Cleanup | Commands to remove all created resources |

## Checklist

- [ ] Dockerfile builds successfully: `make build-<name>-benchmark-image`
- [ ] ClusterTrainingRuntime applies: `oc apply -f benchmarks/<name>/mpi-runtime.yaml`
- [ ] TrainJob submits and runs: `oc create -f benchmarks/<name>/trainjob.yaml`
- [ ] README has all required sections
- [ ] Makefile targets added for build and push
- [ ] CI workflow triggers on path changes to `benchmarks/<name>/`
106 changes: 106 additions & 0 deletions .claude/skills/add-e2e-test/SKILL.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,106 @@
# Add E2E Test

Guide for adding a new end-to-end test to the distributed-workloads repo.

## Test structure

```go
func TestMyFeature(t *testing.T) {
Tags(t, Tier1) // 1. tag / skip checks
test := With(t) // 2. create test context

namespace := test.NewTestNamespace().Name // 3. isolated namespace

// 4. create resources with GenerateName
// 5. ensure cleanup of cluster-scoped resources
// 6. assert with test.Eventually(...)
}
```

## Namespace isolation

Every test must operate in its own dedicated namespace. Use `test.NewTestNamespace()` — it creates a uniquely named namespace and registers automatic cleanup (log collection + deletion) via `t.Cleanup`:

```go
namespace := test.NewTestNamespace().Name
```

Never use a fixed namespace name unless driven by an env var for a specific scenario (e.g., pre-upgrade/post-upgrade tests). Shared namespaces cause interference between tests.

## Resource naming

All Kubernetes resources must use `GenerateName` instead of a fixed `Name` to avoid collisions:

```go
// Good
ObjectMeta: metav1.ObjectMeta{GenerateName: "test-trainjob-"}

// Bad
ObjectMeta: metav1.ObjectMeta{Name: "my-trainjob"}
```

## Cleanup

Namespace-scoped resources are deleted automatically when the test namespace is cleaned up. Cluster-scoped resources (e.g., `ClusterRole`, `ClusterRoleBinding`) are not namespace-bound and may need to be explicitly cleaned up if the helper creating them does not already register a cleanup hook via `t.T().Cleanup(...)`.

## Tags

Tests in `tests/trainer/` **must** declare a tag — this is mandatory. Apply it as the first statement so tests are skipped early when `TEST_TIER` is set:

| Tag | When to use |
|-----|-------------|
| `Smoke` | Minimal deployment verification |
| `Tier1`–`Tier3` | Progressively deeper coverage |
| `Gpu(accelerator)` | Requires at least one GPU node |
| `MultiGpu(accelerator, n)` | Requires n GPUs per node |
| `MultiNode(n)` | Requires n worker nodes |
| `MultiNodeGpu(n, accelerator)` | Requires n nodes each with at least one GPU |
| `MultiNodeMultiGpu(n, accelerator, gpus)` | Requires n nodes each with at least gpus GPUs |

Comment on lines +48 to +59

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

Broaden the tag rule beyond tests/trainer/.

The shared tests/common/test_tag.go contract and ARCHITECTURE.md make tag gating look repo-wide. Limiting the rule to trainer will let new kfto/odh/fms tests miss the mandatory tier gate.

Fix
-Tests in `tests/trainer/` **must** declare a tag — this is mandatory. Apply it as the first statement so tests are skipped early when `TEST_TIER` is set:
+All E2E tests **must** declare a tag — this is mandatory. Apply it as the first statement so tests are skipped early when `TEST_TIER` is set:
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
Tests in `tests/trainer/` **must** declare a tag — this is mandatory. Apply it as the first statement so tests are skipped early when `TEST_TIER` is set:
| Tag | When to use |
|-----|-------------|
| `Smoke` | Minimal deployment verification |
| `Tier1``Tier3` | Progressively deeper coverage |
| `Gpu(accelerator)` | Requires at least one GPU node |
| `MultiGpu(accelerator, n)` | Requires n GPUs per node |
| `MultiNode(n)` | Requires n worker nodes |
| `MultiNodeGpu(n, accelerator)` | Requires n nodes each with at least one GPU |
| `MultiNodeMultiGpu(n, accelerator, gpus)` | Requires n nodes each with at least gpus GPUs |
All E2E tests **must** declare a tag — this is mandatory. Apply it as the first statement so tests are skipped early when `TEST_TIER` is set:
| Tag | When to use |
|-----|-------------|
| `Smoke` | Minimal deployment verification |
| `Tier1``Tier3` | Progressively deeper coverage |
| `Gpu(accelerator)` | Requires at least one GPU node |
| `MultiGpu(accelerator, n)` | Requires n GPUs per node |
| `MultiNode(n)` | Requires n worker nodes |
| `MultiNodeGpu(n, accelerator)` | Requires n nodes each with at least one GPU |
| `MultiNodeMultiGpu(n, accelerator, gpus)` | Requires n nodes each with at least gpus GPUs |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.claude/skills/add-e2e-test/SKILL.md around lines 48 - 59, The tag
requirement is too narrowly scoped to tests/trainer/ even though the shared
test-tag contract is repo-wide. Update the guidance in add-e2e-test/SKILL.md to
require a tag for all new e2e tests across the repo, aligning with
tests/common/test_tag.go and ARCHITECTURE.md, and keep the instruction that the
tag must be the first statement so TEST_TIER gating happens early; reference the
tag examples and the test_tag.go contract when revising the wording.

## Environment variables

Declare env var constants and getter functions in `tests/common/support/environment.go`. Never use `os.Getenv` directly in test files — always go through a getter.

## Editing notebooks

Test notebooks (`tests/**/resources/*.ipynb`) use 1-space JSON indentation with no trailing newline. When editing notebook cells, preserve the array-of-lines source format — do not collapse source arrays into single strings:

```json
// Good — array of lines, readable in raw JSON
"source": [
"import os\n",
"print('hello')"
]

// Bad — single string, hard to read in raw JSON
"source": "import os\nprint('hello')"
```

If a tool (e.g. `NotebookEdit`) converts the edited cell's source to a single string, convert it back to array-of-lines before committing. You can use a Python script:

```python
import json
with open(path, encoding="utf-8") as f:
nb = json.load(f)
for cell in nb["cells"]:
if isinstance(cell["source"], str):
cell["source"] = cell["source"].splitlines(True)
# Ensure last line has no trailing newline (notebook convention)
if cell["source"] and cell["source"][-1].endswith("\n"):
cell["source"][-1] = cell["source"][-1][:-1]
with open(path, "w", encoding="utf-8") as f:
json.dump(nb, f, indent=1, ensure_ascii=False)
```

## Key support library files

| File | Purpose |
|------|---------|
| `tests/common/support/test.go` | `Test` interface — context, namespace helpers, gomega assertions |
| `tests/common/support/client.go` | Multi-client accessor (Kubernetes, Trainer, Kubeflow, Ray, Kueue, JobSet) |
| `tests/common/support/pytorchjob.go` | PyTorchJob getters and condition checkers |
| `tests/common/support/trainjob.go` | TrainJob getters and condition checkers |
| `tests/common/support/ray.go` | RayJob/RayCluster helpers |
| `tests/common/support/kueue.go` | Kueue resource helpers (ResourceFlavor, ClusterQueue, LocalQueue) |
| `tests/common/support/environment.go` | Environment variable getters |
| `tests/common/test_tag.go` | Tag functions (Smoke, Tier1–3, Gpu, MultiNode, etc.) |
Loading