Commit 1c5288f

Updated runtime to use midstream openmpi-cuda image
1 parent 894c006 commit 1c5288f

4 files changed

Lines changed: 6 additions & 6 deletions

benchmarks/kftv2-mpi-ddp-sft-ddp/1.5b/README.md renamed to benchmarks/kftv2-mpi-ddp-sft/README.md

Lines changed: 2 additions & 2 deletions
@@ -11,8 +11,8 @@ Distributed Supervised Fine-Tuning benchmark using **PyTorch DDP** with **MPI**
 | Dataset | openai/gsm8k (~7.5 K grade-school math) |
 | Comm backend | **MPI** (`torch.distributed.init_process_group(backend="mpi")`) |
 | Gradient sync | DDP automatic allreduce via MPI |
-| Runtime | `mpi-cuda-openmpi-benchmark` ClusterTrainingRuntime |
-| Image | `quay.io/ksuta/odh-mpi-cuda:0.0.14` |
+| Runtime | `openmpi-cuda-benchmark` ClusterTrainingRuntime |
+| Image | `quay.io/opendatahub/odh-training-cuda130-torch210-py312-openmpi41:odh-stable` |
 
 ### MPI communication patterns exercised
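
Note on the README table above: the "Comm backend" and "Gradient sync" rows describe the MPI-backed DDP setup that the new image has to support. A minimal sketch of that setup follows. It is not taken from the benchmark's training script; the helper names `open_mpi_env` and `wrap_with_mpi_ddp` are hypothetical, and it assumes a PyTorch build compiled with MPI support, a CUDA-aware Open MPI, and a launch through `mpirun` (which exports the `OMPI_COMM_WORLD_*` variables).

```python
import os


def open_mpi_env(environ=os.environ):
    """Read the rank layout that Open MPI's mpirun exports to each process."""
    return {
        "rank": int(environ.get("OMPI_COMM_WORLD_RANK", 0)),
        "world_size": int(environ.get("OMPI_COMM_WORLD_SIZE", 1)),
        "local_rank": int(environ.get("OMPI_COMM_WORLD_LOCAL_RANK", 0)),
    }


def wrap_with_mpi_ddp(model):
    """Initialise the MPI process group and wrap `model` in DDP.

    Assumption: torch was built with MPI support and the MPI library is
    CUDA-aware; otherwise backend="mpi" is unavailable or CPU-only.
    """
    # Imported lazily so the helper above stays usable without torch installed.
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    layout = open_mpi_env()
    # With backend="mpi", rank and world size are taken from the MPI runtime.
    dist.init_process_group(backend="mpi")
    torch.cuda.set_device(layout["local_rank"])
    model = model.to(f"cuda:{layout['local_rank']}")
    # DDP registers hooks that allreduce gradients during backward().
    return DDP(model, device_ids=[layout["local_rank"]])
```

If the replacement image ships a PyTorch build without MPI support, `init_process_group(backend="mpi")` is the call that would be expected to fail, so it is a reasonable first thing to smoke-test after the image swap.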

benchmarks/kftv2-mpi-ddp-sft-ddp/1.5b/mpi-runtime.yaml renamed to benchmarks/kftv2-mpi-ddp-sft/mpi-runtime.yaml

Lines changed: 3 additions & 3 deletions
@@ -6,7 +6,7 @@
 apiVersion: trainer.kubeflow.org/v1alpha1
 kind: ClusterTrainingRuntime
 metadata:
-name: mpi-cuda-openmpi-benchmark
+name: openmpi-cuda-benchmark
 labels:
 trainer.kubeflow.org/framework: openmpi
 spec:
@@ -33,7 +33,7 @@ spec:
 template:
 spec:
 containers:
-- image: quay.io/ksuta/odh-mpi-cuda:0.0.14
+- image: quay.io/opendatahub/odh-training-cuda130-torch210-py312-openmpi41:odh-stable
 name: node
 resources:
 limits:
@@ -55,7 +55,7 @@ spec:
 command:
 - /usr/local/bin/uid_entrypoint.sh
 - /usr/sbin/sshd
-image: quay.io/ksuta/odh-mpi-cuda:0.0.14
+image: quay.io/opendatahub/odh-training-cuda130-torch210-py312-openmpi41:odh-stable
 name: node
 readinessProbe:
 initialDelaySeconds: 3
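
Note on mpi-runtime.yaml: the image is swapped in two container specs, so a quick check that no old reference remains can be useful. The sketch below is a hypothetical helper, not part of the repository. It scans manifest text with a regular expression instead of a YAML parser to stay stdlib-only, and it does not handle digest references (`@sha256:...`).

```python
import re

# Matches "image: <ref>" with or without a leading list dash.
IMAGE_LINE = re.compile(r"^[ \t]*(?:-[ \t]+)?image:[ \t]*(\S+)[ \t]*$", re.MULTILINE)


def image_references(manifest_text):
    """Return the distinct container image references found in a manifest."""
    return set(IMAGE_LINE.findall(manifest_text))


def split_tag(reference):
    """Split 'repo:tag' into (repo, tag); tag is None when no tag is present."""
    repo, sep, tag = reference.rpartition(":")
    # A '/' after the last colon means the colon belongs to a registry port.
    if not sep or "/" in tag:
        return reference, None
    return repo, tag
```

Calling `image_references` on the updated manifest should yield a single entry if both container specs were changed. Note also that the old reference used a version-like tag (`0.0.14`) while the new one uses `odh-stable`; if that tag is moved over time (an assumption based on its name), benchmark runs on different dates may not use identical images.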
File renamed without changes.

benchmarks/kftv2-mpi-ddp-sft-ddp/1.5b/trainjob.yaml renamed to benchmarks/kftv2-mpi-ddp-sft/trainjob.yaml

Lines changed: 1 addition & 1 deletion
@@ -11,7 +11,7 @@ spec:
 runtimeRef:
 apiGroup: trainer.kubeflow.org
 kind: ClusterTrainingRuntime
-name: mpi-cuda-openmpi-benchmark
+name: openmpi-cuda-benchmark
 trainer:
 command:
 - /usr/local/bin/uid_entrypoint.sh
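
Note on the rename: the TrainJob's `runtimeRef.name` has to match the ClusterTrainingRuntime's `metadata.name`, which is why both files change together in this commit. The following sketch shows one way to check that before applying the manifests. The function names are hypothetical, and the indentation-based scan is a stdlib-only stand-in for a real YAML parser.

```python
def first_name_under(manifest_text, parent_key):
    """Return the `name:` value that is a direct child of `parent_key:`."""
    lines = manifest_text.splitlines()
    for i, line in enumerate(lines):
        if line.strip() != f"{parent_key}:":
            continue
        parent_indent = len(line) - len(line.lstrip())
        child_indent = None
        for child in lines[i + 1:]:
            if not child.strip():
                continue
            indent = len(child) - len(child.lstrip())
            if indent <= parent_indent:
                break  # left the parent's block
            if child_indent is None:
                child_indent = indent  # indentation of direct children
            if indent == child_indent and child.strip().startswith("name:"):
                return child.split(":", 1)[1].strip()
    return None


def runtime_ref_matches(trainjob_text, runtime_text):
    """True when the TrainJob references the runtime by its current name."""
    wanted = first_name_under(trainjob_text, "runtimeRef")
    actual = first_name_under(runtime_text, "metadata")
    return wanted is not None and wanted == actual
```

A stale reference such as `mpi-cuda-openmpi-benchmark` in trainjob.yaml would make `runtime_ref_matches` return `False` against the renamed runtime.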
