Refactor MPI DDP SFT benchmark and update to midstream runtime image#796
Conversation
|
Warning Rate limit exceeded
Your organization is not enrolled in usage-based pricing. Contact your admin to enable usage-based pricing to continue reviews beyond the rate limit, or try again in 46 minutes and 41 seconds. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: Central YAML (base), Organization UI (inherited) Review profile: CHILL Plan: Pro Plus Run ID: 📒 Files selected for processing (4)
📝 WalkthroughWalkthroughThe changes update benchmark configuration files to reference a different MPI runtime and container image. The training runtime is renamed from Estimated code review effort🎯 1 (Trivial) | ⏱️ ~4 minutes 🚥 Pre-merge checks | ✅ 4✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
benchmarks/kftv2-mpi-ddp-sft/mpi-runtime.yaml (1)
36-58:⚠️ Potential issue | 🟠 MajorMutable
:odh-stabletag undermines benchmark reproducibility and integrity (CWE-494 / CWE-1357).Both container specs pin the image by a floating tag (
odh-stable) rather than an immutable digest. Two concrete consequences:
- Supply-chain integrity: a compromised or accidentally-overwritten
odh-stabletag would silently propagate to everyTrainJobthat applies this runtime, with no way for consumers to detect drift (CWE-494: Download of Code Without Integrity Check).- Benchmark validity: performance numbers produced against
odh-stabletoday are not comparable to runs next week — the CUDA/Torch/OpenMPI stack can change under the same tag, which is exactly the kind of variance a benchmark must hold constant.Pin by digest (and optionally keep the tag as a comment for humans):
🔒 Proposed fix
- - image: quay.io/opendatahub/odh-training-cuda130-torch210-py312-openmpi41:odh-stable + # Tag at time of pinning: odh-stable + - image: quay.io/opendatahub/odh-training-cuda130-torch210-py312-openmpi41@sha256:<DIGEST>- image: quay.io/opendatahub/odh-training-cuda130-torch210-py312-openmpi41:odh-stable + # Tag at time of pinning: odh-stable + image: quay.io/opendatahub/odh-training-cuda130-torch210-py312-openmpi41@sha256:<DIGEST>🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@benchmarks/kftv2-mpi-ddp-sft/mpi-runtime.yaml` around lines 36 - 58, The YAML uses a mutable image tag "quay.io/opendatahub/odh-training-cuda130-torch210-py312-openmpi41:odh-stable" in both container specs (groupName/name "node" template spec and the earlier container entry); replace each image tag with the corresponding immutable digest form (quay.io/...@sha256:<digest>) so the runtime is pinned, and optionally retain the human-readable ":odh-stable" as a comment next to the image entries for clarity; update both occurrences in the container specs under the template->spec->template->spec->containers and the earlier container block so every container uses the digest-pinned image.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@benchmarks/kftv2-mpi-ddp-sft/mpi-runtime.yaml`:
- Around line 36-58: The YAML uses a mutable image tag
"quay.io/opendatahub/odh-training-cuda130-torch210-py312-openmpi41:odh-stable"
in both container specs (groupName/name "node" template spec and the earlier
container entry); replace each image tag with the corresponding immutable digest
form (quay.io/...@sha256:<digest>) so the runtime is pinned, and optionally
retain the human-readable ":odh-stable" as a comment next to the image entries
for clarity; update both occurrences in the container specs under the
template->spec->template->spec->containers and the earlier container block so
every container uses the digest-pinned image.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Central YAML (base), Organization UI (inherited)
Review profile: CHILL
Plan: Pro Plus
Run ID: 308152c3-c296-405b-8f6e-44ed1f9c20c4
📒 Files selected for processing (4)
benchmarks/kftv2-mpi-ddp-sft/README.mdbenchmarks/kftv2-mpi-ddp-sft/mpi-runtime.yamlbenchmarks/kftv2-mpi-ddp-sft/train_sft_ddp.pybenchmarks/kftv2-mpi-ddp-sft/trainjob.yaml
1c5288f to
aa37de7
Compare
aa37de7 to
957268e
Compare
|
/lgtm |
957268e to
9ecc374
Compare
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: kapil27, sutaakar The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Description
This PR involves
kftv2-mpi-ddp-sftbenchmark folder by removing the nested 1.5b/ subdirectory to simplify the project structure.openmpi-cuda-benchmarkto avoid redundant "mpi" terminology.quay.io/opendatahub/odh-training-cuda130-torch210-py312-openmpi41:odh-stableHow Has This Been Tested?
Tested locally on a cluster.
Merge criteria:
Summary by CodeRabbit
Documentation
Chores