docs: add AI scaffolding tier 2 and tier 4 for distributed-workloads#933
docs: add AI scaffolding tier 2 and tier 4 for distributed-workloads#933h0pers wants to merge 2 commits into
Conversation
Add pattern references, agent hooks, and design intent documentation to satisfy AI Scaffolding Tier 2 checks (2.1, 2.2, 2.3). - Add .claude/skills/add-e2e-test/SKILL.md with test-writing guide - Add .claude/settings.json with PostToolUse hook for openshift-goimports - Add ARCHITECTURE.md documenting test suites and shared support library - Slim down AGENTS.md "Writing Tests" section to reference the skill
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
📝 WalkthroughWalkthroughAdds a Estimated code review effort: 2 (Simple) | ~12 minutes 🚥 Pre-merge checks | ✅ 9 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (9 passed)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In @.claude/settings.json:
- Around line 5-9: Remove the auto-executing shell hook from the
`.claude/settings.json` hook entry so the Edit/Write matcher no longer runs
arbitrary commands through the command field. Update the settings around the
hooks configuration to eliminate the shell-based `command` execution, and if
formatting is still needed, make it an explicit manual step or use a non-shell
integration that cannot execute repository-controlled code on load. Use the
`.claude/settings.json` hooks configuration as the target area.
In @.claude/skills/add-e2e-test/SKILL.md:
- Around line 48-59: The tag requirement is too narrowly scoped to
tests/trainer/ even though the shared test-tag contract is repo-wide. Update the
guidance in add-e2e-test/SKILL.md to require a tag for all new e2e tests across
the repo, aligning with tests/common/test_tag.go and ARCHITECTURE.md, and keep
the instruction that the tag must be the first statement so TEST_TIER gating
happens early; reference the tag examples and the test_tag.go contract when
revising the wording.
In `@ARCHITECTURE.md`:
- Around line 7-17: The directory-tree block in ARCHITECTURE.md uses a bare
fence, which can trigger markdownlint MD040. Update the fenced block near the
tests tree to include an explicit language label such as text so the doc renders
cleanly and passes CI; use the existing tree section as the target for the fence
change.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Central YAML (base), Organization UI (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 3e9d477f-96d6-43e5-a5c4-1d0bb2e80650
📒 Files selected for processing (4)
.claude/settings.json.claude/skills/add-e2e-test/SKILL.mdAGENTS.mdARCHITECTURE.md
| "matcher": "Edit|Write", | ||
| "hooks": [ | ||
| { | ||
| "type": "command", | ||
| "command": "if [[ \"$TOOL_INPUT\" == *\".go\"* ]]; then ./bin/openshift-goimports 2>/dev/null; fi" |
There was a problem hiding this comment.
🔒 Security & Privacy | 🔴 Critical | ⚡ Quick win
Remove the auto-executing shell hook.
This command runs shell code from an auto-loaded .claude config on every matching edit/write, which is a supply-chain execution surface (CWE-94 / CWE-78). Keep formatter runs explicit, or move this into a non-shell integration that cannot execute arbitrary commands on repository open/use.
As per path instructions, **/.claude/**: .claude/settings.json with any "command" field is a critical supply-chain issue.
Suggested fix
- {
- "matcher": "Edit|Write",
- "hooks": [
- {
- "type": "command",
- "command": "if [[ \"$TOOL_INPUT\" == *\".go\"* ]]; then ./bin/openshift-goimports 2>/dev/null; fi"
- }
- ]
- }📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| "matcher": "Edit|Write", | |
| "hooks": [ | |
| { | |
| "type": "command", | |
| "command": "if [[ \"$TOOL_INPUT\" == *\".go\"* ]]; then ./bin/openshift-goimports 2>/dev/null; fi" |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In @.claude/settings.json around lines 5 - 9, Remove the auto-executing shell
hook from the `.claude/settings.json` hook entry so the Edit/Write matcher no
longer runs arbitrary commands through the command field. Update the settings
around the hooks configuration to eliminate the shell-based `command` execution,
and if formatting is still needed, make it an explicit manual step or use a
non-shell integration that cannot execute repository-controlled code on load.
Use the `.claude/settings.json` hooks configuration as the target area.
Source: Path instructions
| Tests in `tests/trainer/` **must** declare a tag — this is mandatory. Apply it as the first statement so tests are skipped early when `TEST_TIER` is set: | ||
|
|
||
| | Tag | When to use | | ||
| |-----|-------------| | ||
| | `Smoke` | Minimal deployment verification | | ||
| | `Tier1`–`Tier3` | Progressively deeper coverage | | ||
| | `Gpu(accelerator)` | Requires at least one GPU node | | ||
| | `MultiGpu(accelerator, n)` | Requires n GPUs per node | | ||
| | `MultiNode(n)` | Requires n worker nodes | | ||
| | `MultiNodeGpu(n, accelerator)` | Requires n nodes each with at least one GPU | | ||
| | `MultiNodeMultiGpu(n, accelerator, gpus)` | Requires n nodes each with at least gpus GPUs | | ||
|
|
There was a problem hiding this comment.
📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win
Broaden the tag rule beyond tests/trainer/.
The shared tests/common/test_tag.go contract and ARCHITECTURE.md make tag gating look repo-wide. Limiting the rule to trainer will let new kfto/odh/fms tests miss the mandatory tier gate.
Fix
-Tests in `tests/trainer/` **must** declare a tag — this is mandatory. Apply it as the first statement so tests are skipped early when `TEST_TIER` is set:
+All E2E tests **must** declare a tag — this is mandatory. Apply it as the first statement so tests are skipped early when `TEST_TIER` is set:📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| Tests in `tests/trainer/` **must** declare a tag — this is mandatory. Apply it as the first statement so tests are skipped early when `TEST_TIER` is set: | |
| | Tag | When to use | | |
| |-----|-------------| | |
| | `Smoke` | Minimal deployment verification | | |
| | `Tier1`–`Tier3` | Progressively deeper coverage | | |
| | `Gpu(accelerator)` | Requires at least one GPU node | | |
| | `MultiGpu(accelerator, n)` | Requires n GPUs per node | | |
| | `MultiNode(n)` | Requires n worker nodes | | |
| | `MultiNodeGpu(n, accelerator)` | Requires n nodes each with at least one GPU | | |
| | `MultiNodeMultiGpu(n, accelerator, gpus)` | Requires n nodes each with at least gpus GPUs | | |
| All E2E tests **must** declare a tag — this is mandatory. Apply it as the first statement so tests are skipped early when `TEST_TIER` is set: | |
| | Tag | When to use | | |
| |-----|-------------| | |
| | `Smoke` | Minimal deployment verification | | |
| | `Tier1`–`Tier3` | Progressively deeper coverage | | |
| | `Gpu(accelerator)` | Requires at least one GPU node | | |
| | `MultiGpu(accelerator, n)` | Requires n GPUs per node | | |
| | `MultiNode(n)` | Requires n worker nodes | | |
| | `MultiNodeGpu(n, accelerator)` | Requires n nodes each with at least one GPU | | |
| | `MultiNodeMultiGpu(n, accelerator, gpus)` | Requires n nodes each with at least gpus GPUs | |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In @.claude/skills/add-e2e-test/SKILL.md around lines 48 - 59, The tag
requirement is too narrowly scoped to tests/trainer/ even though the shared
test-tag contract is repo-wide. Update the guidance in add-e2e-test/SKILL.md to
require a tag for all new e2e tests across the repo, aligning with
tests/common/test_tag.go and ARCHITECTURE.md, and keep the instruction that the
tag must be the first statement so TEST_TIER gating happens early; reference the
tag examples and the test_tag.go contract when revising the wording.
| ``` | ||
| tests/ | ||
| ├── kfto/ KFTO v1 — PyTorchJob-based distributed training | ||
| ├── trainer/ Kubeflow Trainer v2 — TrainJob / JobSet-based training | ||
| ├── odh/ KubeRay — Ray cluster and RayJob-based training | ||
| ├── fms/ Foundation model fine-tuning (fms-hf-tuning) | ||
| │ ├── kfto/ via KFTO PyTorchJob | ||
| │ └── trainer/ via Trainer v2 TrainJob | ||
| └── common/ Shared test infrastructure | ||
| └── support/ Client abstractions, resource helpers, test lifecycle | ||
| ``` |
There was a problem hiding this comment.
📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win
Label the tree fence.
Bare fences trigger markdownlint MD040, so this doc can fail CI as-is. Use an explicit language like text on the directory-tree block.
Fix
-```
+```text
tests/
├── kfto/ KFTO v1 — PyTorchJob-based distributed training
├── trainer/ Kubeflow Trainer v2 — TrainJob / JobSet-based training
├── odh/ KubeRay — Ray cluster and RayJob-based training
├── fms/ Foundation model fine-tuning (fms-hf-tuning)
│ ├── kfto/ via KFTO PyTorchJob
│ └── trainer/ via Trainer v2 TrainJob
└── common/ Shared test infrastructure
└── support/ Client abstractions, resource helpers, test lifecycle</details>
<!-- suggestion_start -->
<details>
<summary>📝 Committable suggestion</summary>
> ‼️ **IMPORTANT**
> Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```suggestion
🧰 Tools
🪛 markdownlint-cli2 (0.22.1)
[warning] 7-7: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@ARCHITECTURE.md` around lines 7 - 17, The directory-tree block in
ARCHITECTURE.md uses a bare fence, which can trigger markdownlint MD040. Update
the fenced block near the tests tree to include an explicit language label such
as text so the doc renders cleanly and passes CI; use the existing tree section
as the target for the fence change.
Source: Linters/SAST tools
Add two new skills to reach the 3+ threshold required by Tier 4: - add-benchmark: guide for adding MPI benchmarks (Dockerfile, ClusterTrainingRuntime, TrainJob, CI workflow, README) - update-support-lib: guide for modifying the shared test support library (getters, condition checkers, client abstraction, options) Update AGENTS.md with references to both new skills. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In @.claude/skills/add-benchmark/SKILL.md:
- Around line 23-34: The SSH setup guidance in the add-benchmark skill should
not instruct contributors to bake credentials into the image or use a fixed
/tmp/ssh location; update the Dockerfile guidance to use runtime-provided SSH
material instead. Align the MPI runtime example with the existing benchmark
pattern by referring to the mounted Secret approach used in the mpi-runtime
setup under the benchmark helpers, and keep the multi-stage Dockerfile
instructions focused on copying artifacts, setting OpenShift-friendly
permissions, and using USER 1001 in the final stage. Ensure the guidance in the
skill document clearly references the Dockerfile and runtime example sections so
contributors implement mounted or generated keys rather than embedded ones.
- Around line 71-80: The TrainJob example is incomplete because `runtimeRef` is
shown as name-only, but the actual schema requires the full structured
reference. Update the `TrainJob` example in the skill doc to include the same
`runtimeRef` fields used by `benchmarks/kftv2-mpi-ddp-sft/trainjob.yaml`
(`apiGroup`, `kind`, and `name`), or explicitly label the snippet as pseudocode
so it is not treated as valid YAML.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Central YAML (base), Organization UI (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 9810a22c-c86f-4ec9-a993-10e0de081f91
📒 Files selected for processing (3)
.claude/skills/add-benchmark/SKILL.md.claude/skills/update-support-lib/SKILL.mdAGENTS.md
✅ Files skipped from review due to trivial changes (2)
- AGENTS.md
- .claude/skills/update-support-lib/SKILL.md
| Follow the multi-stage build pattern used in `benchmarks/osu-benchmarks/Dockerfile`: | ||
|
|
||
| 1. **Stage 1 (builder)** - compile dependencies from source (e.g., OpenMPI, benchmark binaries) | ||
| 2. **Stage 2 (runtime)** - copy built artifacts, configure SSH for MPI, set up the runtime environment | ||
|
|
||
| Key requirements: | ||
| - Base image from `quay.io/opendatahub/` or `quay.io/modh/` | ||
| - `USER 0` only during build stages; final image must use `USER 1001` | ||
| - OpenShift GID 0 pattern: `chgrp -R 0 <dir> && chmod -R g=u <dir>` | ||
| - Allow random UID: `chmod g=u /etc/passwd` | ||
| - SSH setup with keys baked into `/tmp/ssh/` (Training Operator does not auto-inject SSH keys) | ||
| - For CUDA variants, create a separate `Dockerfile.cuda` extending the base |
There was a problem hiding this comment.
🔒 Security & Privacy | 🟠 Major | ⚡ Quick win
Don't normalize baked SSH credentials.
This tells contributors to bake SSH keys into the image and hard-codes /tmp/ssh, which is a hard-coded credential pattern (CWE-798) and conflicts with the repo's existing MPI runtime example under benchmarks/kftv2-mpi-ddp-sft/mpi-runtime.yaml, which mounts SSH auth at /home/mpiuser/.ssh. Replace this with a mounted Secret or runtime-generated keypair, and keep the example path aligned.
♻️ Proposed fix
- SSH setup with keys baked into `/tmp/ssh/` (Training Operator does not auto-inject SSH keys)
+ SSH setup with a mounted Secret or runtime-generated keypair; never bake private SSH keys into the image.As per path instructions, .claude/ content is a supply-chain surface.
Also applies to: 36-65
🧰 Tools
🪛 SkillSpector (2.3.7)
[error] 31: [TM2] Chaining Abuse: Tool calls are chained to bypass individual safety checks or escalate capabilities beyond what any single tool call would allow.
Remediation: Limit tool chaining depth and validate the output of each tool before passing it to the next. Require explicit user approval for multi-step chains.
(Tool Misuse (TM2))
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In @.claude/skills/add-benchmark/SKILL.md around lines 23 - 34, The SSH setup
guidance in the add-benchmark skill should not instruct contributors to bake
credentials into the image or use a fixed /tmp/ssh location; update the
Dockerfile guidance to use runtime-provided SSH material instead. Align the MPI
runtime example with the existing benchmark pattern by referring to the mounted
Secret approach used in the mpi-runtime setup under the benchmark helpers, and
keep the multi-stage Dockerfile instructions focused on copying artifacts,
setting OpenShift-friendly permissions, and using USER 1001 in the final stage.
Ensure the guidance in the skill document clearly references the Dockerfile and
runtime example sections so contributors implement mounted or generated keys
rather than embedded ones.
Source: Path instructions
| ```yaml | ||
| apiVersion: trainer.kubeflow.org/v1alpha1 | ||
| kind: TrainJob | ||
| metadata: | ||
| generateName: <benchmark-name>- | ||
| namespace: <namespace> | ||
| spec: | ||
| runtimeRef: | ||
| name: <runtime-name> | ||
| trainer: |
There was a problem hiding this comment.
🎯 Functional Correctness | 🟠 Major | ⚡ Quick win
Make the runtimeRef example schema-complete.
The TrainJob example only shows runtimeRef.name, but the repo's benchmarks/kftv2-mpi-ddp-sft/trainjob.yaml uses apiGroup and kind as well; copying this snippet as-is will fail validation (CWE-20). Expand the example or mark it as pseudocode.
♻️ Proposed fix
runtimeRef:
+ apiGroup: trainer.kubeflow.org
+ kind: ClusterTrainingRuntime
name: <runtime-name>As per the provided benchmarks/kftv2-mpi-ddp-sft/trainjob.yaml example, runtimeRef is a structured reference, not a name-only alias.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| ```yaml | |
| apiVersion: trainer.kubeflow.org/v1alpha1 | |
| kind: TrainJob | |
| metadata: | |
| generateName: <benchmark-name>- | |
| namespace: <namespace> | |
| spec: | |
| runtimeRef: | |
| name: <runtime-name> | |
| trainer: |
🧰 Tools
🪛 SkillSpector (2.3.7)
[error] 31: [TM2] Chaining Abuse: Tool calls are chained to bypass individual safety checks or escalate capabilities beyond what any single tool call would allow.
Remediation: Limit tool chaining depth and validate the output of each tool before passing it to the next. Require explicit user approval for multi-step chains.
(Tool Misuse (TM2))
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In @.claude/skills/add-benchmark/SKILL.md around lines 71 - 80, The TrainJob
example is incomplete because `runtimeRef` is shown as name-only, but the actual
schema requires the full structured reference. Update the `TrainJob` example in
the skill doc to include the same `runtimeRef` fields used by
`benchmarks/kftv2-mpi-ddp-sft/trainjob.yaml` (`apiGroup`, `kind`, and `name`),
or explicitly label the snippet as pseudocode so it is not treated as valid
YAML.
|
Thanks for the PR @h0pers. Overall it looks good, I have a few general suggestions:
|
What this PR does / why we need it:
Adds AI scaffolding infrastructure for the distributed-workloads repo, covering Tier 2 checks (2.1, 2.2, 2.3) and Tier 4 (4.1 - on-demand skills system):
.claude/skills/add-e2e-test/SKILL.md- test-writing guide extracted from AGENTS.md into a callable skill.claude/skills/add-benchmark/SKILL.md- guide for adding new benchmarks (Dockerfile, runtime, TrainJob, CI).claude/skills/update-support-lib/SKILL.md- guide for modifying the shared test support library.claude/settings.json- PostToolUse hook to auto-runopenshift-goimportsafter Go file editsARCHITECTURE.md- documents test suite structure, suite relationships, and shared support library designWhich issue(s) this PR fixes:
Fixes #942
Checklist:
settings.jsonis valid JSONSummary by CodeRabbit
Documentation
Chores