Description
What happened?
When an MPIJob is created with .spec.runPolicy.suspend=true and later updated (e.g., via kubectl patch) to modify .spec.mpiReplicaSpecs["Launcher"].template fields alongside setting suspend=false, the changes are not propagated to the already-created batch/v1 Job.
The MPIJob spec correctly reflects all updates, but the launcher Job retains its original pod template — only Job.Spec.Suspend is toggled to false.
This affects all fields in the Launcher's PodTemplateSpec, including but not limited to:
- .template.metadata.annotations
- .template.metadata.labels
- .template.spec.containers[*].image
- .template.spec.containers[*].command/args
- .template.spec.containers[*].resources
- .template.spec.containers[*].env
- .template.spec.volumes
What did you expect to happen?
Updates to .spec.mpiReplicaSpecs["Launcher"].template should be reflected in the owned launcher batch/v1 Job's .spec.template when the MPIJob is resumed.
How to reproduce
The following steps use annotations as a concrete example, but the same behavior applies to any Launcher template field.
Environment
- MPI Operator: v0.7.0
- Kubernetes: v1.35.0 (kind v0.31.0)
Steps
1. Install MPI Operator
kubectl apply --server-side -f https://raw.githubusercontent.com/kubeflow/mpi-operator/v0.7.0/deploy/v2beta1/mpi-operator.yaml
kubectl -n mpi-operator wait --for=condition=available deployment/mpi-operator --timeout=120s
2. Deploy the pi example with suspend=true
curl -sL https://raw.githubusercontent.com/kubeflow/mpi-operator/refs/heads/master/examples/v2beta1/pi/pi.yaml | yq '.spec.runPolicy.suspend = true' | kubectl apply -f -
Verify the launcher Job is created in suspended state:
$ kubectl get job pi-launcher -o jsonpath='{.spec.suspend}'
true
3. Confirm the Launcher template has no annotations
$ kubectl get mpijob pi -o jsonpath='{.spec.mpiReplicaSpecs.Launcher.template.metadata.annotations}'
# (empty)
4. Patch MPIJob: unsuspend + add annotation (single command)
kubectl patch mpijob pi --type=merge -p '{"spec":{"runPolicy":{"suspend":false},"mpiReplicaSpecs":{"Launcher":{"template":{"metadata":{"annotations":{"alpha":"beta"}}}}}}}'
5. Verify the MPIJob spec was updated
$ kubectl get mpijob pi -o jsonpath='{.spec.runPolicy.suspend}'
false
$ kubectl get mpijob pi -o jsonpath='{.spec.mpiReplicaSpecs.Launcher.template.metadata.annotations}'
{"alpha":"beta"}
Both fields are correctly updated on the MPIJob.
6. Check the launcher Job
$ kubectl get job pi-launcher -o jsonpath='{.spec.suspend}'
false
$ kubectl get job pi-launcher -o jsonpath='{.spec.template.metadata.annotations}'
# (empty)
The Job's suspend field was correctly set to false, but the alpha: "beta" annotation is missing from the Job's pod template. The same would occur for any other Launcher template field change.
Root cause
In pkg/controller/mpi_job_controller.go, when the launcher Job already exists, the controller only syncs the suspension state:
if launcher != nil {
	if isMPIJobSuspended(mpiJob) != isJobSuspended(launcher) {
		launcher.Spec.Suspend = ptr.To(isMPIJobSuspended(mpiJob))
		if _, err := c.kubeClient.BatchV1().Jobs(namespace).Update(context.TODO(), launcher, metav1.UpdateOptions{}); err != nil {
			return err
		}
	}
}
There is no reconciliation of the launcher Job's pod template (.spec.template) against the desired state from mpiJob.Spec.MPIReplicaSpecs["Launcher"].Template. Changes made to the MPIJob's Launcher template after initial Job creation are silently ignored.
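One possible shape for a fix, sketched with simplified stand-in types rather than the real batchv1.Job and corev1.PodTemplateSpec: while the launcher Job is still suspended, compare its template with the desired one and copy it over before flipping the suspend flag, so that a single Update carries both changes. The names PodTemplate, LauncherJob and syncLauncher are hypothetical, and the sketch assumes Go 1.21+ for the maps package. A real implementation would also have to account for which pod template fields the API server allows to change on an existing Job, which this sketch does not model.

```go
package main

import "maps"

// PodTemplate is a simplified, hypothetical stand-in for corev1.PodTemplateSpec.
type PodTemplate struct {
	Labels      map[string]string
	Annotations map[string]string
	Image       string
}

// LauncherJob is a simplified, hypothetical stand-in for batchv1.Job.
type LauncherJob struct {
	Suspend  bool
	Template PodTemplate
}

// templatesEqual treats nil and empty maps as equal to avoid spurious updates.
func templatesEqual(a, b PodTemplate) bool {
	return a.Image == b.Image &&
		maps.Equal(a.Labels, b.Labels) &&
		maps.Equal(a.Annotations, b.Annotations)
}

// syncLauncher updates job in place and reports whether an Update call
// against the API server would be needed.
func syncLauncher(job *LauncherJob, desiredSuspend bool, desired PodTemplate) bool {
	changed := false
	// Sync the template first, and only while the Job is still suspended,
	// so that resuming picks up the new template in the same update.
	if job.Suspend && !templatesEqual(job.Template, desired) {
		job.Template = PodTemplate{
			Labels:      maps.Clone(desired.Labels),
			Annotations: maps.Clone(desired.Annotations),
			Image:       desired.Image,
		}
		changed = true
	}
	// Existing behavior: keep the suspend flag in sync.
	if job.Suspend != desiredSuspend {
		job.Suspend = desiredSuspend
		changed = true
	}
	return changed
}
```

With this ordering, the reproduction above (unsuspend plus a new annotation in one patch) would result in one update that carries both the annotation and Suspend=false, and a second reconcile with the same inputs would be a no-op.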
/kind bug