Hello,
we noticed that mpirun does not run correctly when the MPIJob name contains a dot.
For example, this manifest
apiVersion: kubeflow.org/v1
kind: MPIJob
metadata:
  name: myjob.1
spec:
  slotsPerWorker: 1
  runPolicy:
    cleanPodPolicy: Running
  mpiReplicaSpecs:
    Launcher:
      restartPolicy: OnFailure
      replicas: 1
      template:
        spec:
          containers:
          - image: IMAGE
            name: launcher
            imagePullPolicy: Always
            command:
            - mpirun
            - --allow-run-as-root
            - -np
            - "2"
    Worker:
      replicas: 2
      template:
        spec:
          containers:
          - image: IMAGE
            name: worker
            imagePullPolicy: Always
leads to this error message when mpirun is executed:
A hostfile was provided that contains multiple definitions
of the slot count for at least one node:
hostfile: hosts
node: mpi-worker
You can either list a node multiple times, once for each slot,
or you can provide a single line that contains "slot=N". Mixing
the two methods is not supported.
Please correct the hostfile and try again.
The image has Open MPI v4 installed.
I assume this is caused by how Open MPI interprets the hostnames, which include the dot from the MPIJob name. Also see open-mpi/ompi#4732 (comment) for a related discussion.
The same issue can also make the mpirun command finish successfully while using only one worker.
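To illustrate the suspected cause, here is a minimal Python sketch. It assumes (this is not confirmed from the Open MPI source here) that each hostfile entry is shortened to the part before its first dot, and that the worker entries look like `myjob.1-worker-0 slots=1`. The helper names `short_hostname` and `colliding_nodes` are hypothetical.

```python
from collections import Counter


def short_hostname(host: str) -> str:
    # Assumption: the hostname is cut at the first dot, as if it were an FQDN.
    return host.split(".", 1)[0]


def colliding_nodes(hostfile_text: str) -> list[str]:
    """Return short node names that appear on more than one hostfile line."""
    names = []
    for line in hostfile_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        # The first whitespace-separated token is the host; the rest is e.g. slots=1.
        names.append(short_hostname(line.split()[0]))
    counts = Counter(names)
    return sorted(name for name, count in counts.items() if count > 1)


# Hypothetical hostfile contents for a job named "myjob.1" and one named "myjob-1".
dotted = "myjob.1-worker-0 slots=1\nmyjob.1-worker-1 slots=1\n"
dashed = "myjob-1-worker-0 slots=1\nmyjob-1-worker-1 slots=1\n"

print(colliding_nodes(dotted))
print(colliding_nodes(dashed))
```

Under that assumption both workers of the dotted job collapse into one node name with two slot definitions, which would match the error above, while the dashed job keeps two distinct nodes.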
Just mentioning it here in case someone else stumbles upon this. We will probably either validate the MPIJob name on creation or look into MPI settings such as -mca orte_keep_fqdn_hostnames t.
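A name check on creation could look roughly like the following Python sketch. It is only an illustration, not the operator's actual validation: the function name `is_safe_mpijob_name` is hypothetical, and it assumes that restricting names to the Kubernetes DNS-1123 label form (lowercase alphanumerics and `-`, at most 63 characters, no dots) is strict enough.

```python
import re

# DNS-1123 label: lowercase alphanumerics or '-', starting and ending
# with an alphanumeric character. Dots are not allowed in a label.
_DNS1123_LABEL = re.compile(r"[a-z0-9]([-a-z0-9]*[a-z0-9])?")
_MAX_LABEL_LENGTH = 63


def is_safe_mpijob_name(name: str) -> bool:
    """Return True if the name is a DNS-1123 label (and so contains no dot)."""
    if len(name) > _MAX_LABEL_LENGTH:
        return False
    return _DNS1123_LABEL.fullmatch(name) is not None


for candidate in ("myjob.1", "myjob-1"):
    verdict = "ok" if is_safe_mpijob_name(candidate) else "rejected"
    print(f"{candidate}: {verdict}")
```

With this check, `myjob.1` would be rejected before any pods are created, and `myjob-1` would be accepted.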