benchmarks/kftv2-mpi-ddp-sft
@@ -11,8 +11,8 @@ Distributed Supervised Fine-Tuning benchmark using **PyTorch DDP** with **MPI**
 | Dataset | openai/gsm8k (~7.5K grade-school math) |
 | Comm backend | **MPI** (`torch.distributed.init_process_group(backend="mpi")`) |
 | Gradient sync | DDP automatic allreduce via MPI |
-| Runtime | `mpi-cuda-openmpi-benchmark` ClusterTrainingRuntime |
-| Image | `quay.io/ksuta/odh-mpi-cuda:0.0.14` |
+| Runtime | `openmpi-cuda-benchmark` ClusterTrainingRuntime |
+| Image | `quay.io/opendatahub/odh-training-cuda130-torch210-py312-openmpi41:odh-stable` |
 
 ### MPI communication patterns exercised
 
@@ -6,7 +6,7 @@
 apiVersion: trainer.kubeflow.org/v1alpha1
 kind: ClusterTrainingRuntime
 metadata:
-  name: mpi-cuda-openmpi-benchmark
+  name: openmpi-cuda-benchmark
   labels:
     trainer.kubeflow.org/framework: openmpi
 spec:
@@ -33,7 +33,7 @@
 template:
   spec:
     containers:
-      - image: quay.io/ksuta/odh-mpi-cuda:0.0.14
+      - image: quay.io/opendatahub/odh-training-cuda130-torch210-py312-openmpi41:odh-stable
         name: node
         resources:
           limits:
@@ -55,7 +55,7 @@
 command:
   - /usr/local/bin/uid_entrypoint.sh
   - /usr/sbin/sshd
-image: quay.io/ksuta/odh-mpi-cuda:0.0.14
+image: quay.io/opendatahub/odh-training-cuda130-torch210-py312-openmpi41:odh-stable
 name: node
 readinessProbe:
   initialDelaySeconds: 3
File renamed without changes.
@@ -11,7 +11,7 @@
   runtimeRef:
     apiGroup: trainer.kubeflow.org
     kind: ClusterTrainingRuntime
-    name: mpi-cuda-openmpi-benchmark
+    name: openmpi-cuda-benchmark
   trainer:
     command:
       - /usr/local/bin/uid_entrypoint.sh
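The rename touches the README table, the ClusterTrainingRuntime manifest, and the `runtimeRef` of the job manifest, so a leftover reference to the old runtime name or image is easy to miss. Below is a minimal sketch of a check for such leftovers. It is not part of this change: the helper name `find_stale_references` and the `OLD_NAMES` tuple are hypothetical, and it does a plain substring scan over text rather than parsing YAML.

```python
from typing import Iterable

# Old identifiers replaced in this change (copied from the removed lines).
OLD_NAMES = (
    "mpi-cuda-openmpi-benchmark",
    "quay.io/ksuta/odh-mpi-cuda:0.0.14",
)


def find_stale_references(
    text: str, old_names: Iterable[str] = OLD_NAMES
) -> list[tuple[int, str]]:
    """Return (line_number, stripped_line) pairs that still mention an old identifier."""
    names = tuple(old_names)
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        # Plain substring match; assumes identifiers are not split across lines.
        if any(name in line for name in names):
            hits.append((number, line.strip()))
    return hits


if __name__ == "__main__":
    sample = (
        "runtimeRef:\n"
        "  kind: ClusterTrainingRuntime\n"
        "  name: mpi-cuda-openmpi-benchmark\n"
    )
    for number, line in find_stale_references(sample):
        print(f"line {number}: {line}")
```

Running the helper over each file under `benchmarks/kftv2-mpi-ddp-sft` after the rename should yield an empty list. A non-empty result points at a manifest or doc line that still names the old runtime or image.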