Launcher PodSpec updates are not propagated to the underlying batch/v1 Job on MPIJob resume #770

@tenzen-y

Description

What happened?

When an MPIJob is created with .spec.runPolicy.suspend=true and later updated (e.g., via kubectl patch) to modify .spec.mpiReplicaSpecs["Launcher"].template fields alongside setting suspend=false, the changes are not propagated to the already-created batch/v1 Job.

The MPIJob spec correctly reflects all updates, but the launcher Job retains its original pod template. Only Job.Spec.Suspend is toggled to false.

This affects all fields in the Launcher's PodTemplateSpec, including but not limited to:

  • .template.metadata.annotations
  • .template.metadata.labels
  • .template.spec.containers[*].image
  • .template.spec.containers[*].command / args
  • .template.spec.containers[*].resources
  • .template.spec.containers[*].env
  • .template.spec.volumes

What did you expect to happen?

Updates to .spec.mpiReplicaSpecs["Launcher"].template should be reflected in the owned launcher batch/v1 Job's .spec.template when the MPIJob is resumed.

How to reproduce

The following steps use annotations as a concrete example, but the same behavior applies to any Launcher template field.

Environment

  • MPI Operator: v0.7.0
  • Kubernetes: v1.35.0 (kind v0.31.0)

Steps

1. Install MPI Operator

kubectl apply --server-side -f https://raw.githubusercontent.com/kubeflow/mpi-operator/v0.7.0/deploy/v2beta1/mpi-operator.yaml
kubectl -n mpi-operator wait --for=condition=available deployment/mpi-operator --timeout=120s

2. Deploy the pi example with suspend=true

curl -sL https://raw.githubusercontent.com/kubeflow/mpi-operator/refs/heads/master/examples/v2beta1/pi/pi.yaml | yq '.spec.runPolicy.suspend = true' | kubectl apply -f -

Verify the launcher Job is created in suspended state:

$ kubectl get job pi-launcher -o jsonpath='{.spec.suspend}'
true

3. Confirm the Launcher template has no annotations

$ kubectl get mpijob pi -o jsonpath='{.spec.mpiReplicaSpecs.Launcher.template.metadata.annotations}'
# (empty)

4. Patch MPIJob: unsuspend + add annotation (single command)

kubectl patch mpijob pi --type=merge -p '{"spec":{"runPolicy":{"suspend":false},"mpiReplicaSpecs":{"Launcher":{"template":{"metadata":{"annotations":{"alpha":"beta"}}}}}}}'

5. Verify the MPIJob spec was updated

$ kubectl get mpijob pi -o jsonpath='{.spec.runPolicy.suspend}'
false

$ kubectl get mpijob pi -o jsonpath='{.spec.mpiReplicaSpecs.Launcher.template.metadata.annotations}'
{"alpha":"beta"}

Both fields are correctly updated on the MPIJob.

6. Check the launcher Job

$ kubectl get job pi-launcher -o jsonpath='{.spec.suspend}'
false

$ kubectl get job pi-launcher -o jsonpath='{.spec.template.metadata.annotations}'
# (empty)

The Job's suspend field was correctly set to false, but the alpha: "beta" annotation is missing from the Job's pod template. The same would occur for any other Launcher template field change.
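The manual comparison in steps 5 and 6 could be automated. The following is a minimal sketch, assuming the two objects have been saved with `kubectl get ... -o json`; the helper names (`launcherAnnotations`, `jobAnnotations`, `missingAnnotations`) are hypothetical and not part of the MPI Operator. It only reads the annotation paths used in this reproduction.

```go
package main

import (
	"encoding/json"
	"sort"
)

// metadata mirrors only the part of .template.metadata this check needs.
type metadata struct {
	Annotations map[string]string `json:"annotations"`
}

// launcherAnnotations reads
// .spec.mpiReplicaSpecs.Launcher.template.metadata.annotations from MPIJob JSON.
func launcherAnnotations(mpiJobJSON []byte) (map[string]string, error) {
	var doc struct {
		Spec struct {
			MPIReplicaSpecs map[string]struct {
				Template struct {
					Metadata metadata `json:"metadata"`
				} `json:"template"`
			} `json:"mpiReplicaSpecs"`
		} `json:"spec"`
	}
	if err := json.Unmarshal(mpiJobJSON, &doc); err != nil {
		return nil, err
	}
	return doc.Spec.MPIReplicaSpecs["Launcher"].Template.Metadata.Annotations, nil
}

// jobAnnotations reads .spec.template.metadata.annotations from batch/v1 Job JSON.
func jobAnnotations(jobJSON []byte) (map[string]string, error) {
	var doc struct {
		Spec struct {
			Template struct {
				Metadata metadata `json:"metadata"`
			} `json:"template"`
		} `json:"spec"`
	}
	if err := json.Unmarshal(jobJSON, &doc); err != nil {
		return nil, err
	}
	return doc.Spec.Template.Metadata.Annotations, nil
}

// missingAnnotations returns, in sorted order, the keys of want that are
// absent from got or carry a different value there.
func missingAnnotations(want, got map[string]string) []string {
	var missing []string
	for k, v := range want {
		if g, ok := got[k]; !ok || g != v {
			missing = append(missing, k)
		}
	}
	sort.Strings(missing)
	return missing
}
```

With the objects from this reproduction, `missingAnnotations` would report `alpha`, since the MPIJob carries the annotation and the Job's pod template does not.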

Root cause

In pkg/controller/mpi_job_controller.go, when the launcher Job already exists, the controller only syncs the suspension state:

if launcher != nil {
    // Only the suspend flag is compared and synced here.
    if isMPIJobSuspended(mpiJob) != isJobSuspended(launcher) {
        launcher.Spec.Suspend = ptr.To(isMPIJobSuspended(mpiJob))
        // launcher.Spec.Template is sent back unchanged.
        if _, err := c.kubeClient.BatchV1().Jobs(namespace).Update(context.TODO(), launcher, metav1.UpdateOptions{}); err != nil {
            return err
        }
    }
}

There is no reconciliation of the launcher Job's pod template (.spec.template) against the desired state from mpiJob.Spec.MPIReplicaSpecs["Launcher"].Template. Changes made to the MPIJob's Launcher template after initial Job creation are silently ignored.
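One possible shape for the missing reconciliation is sketched below. This is not the controller's code and not a proposed patch: `podTemplate`, `syncAction`, and `planLauncherSync` are hypothetical names, and the template is reduced to three fields so the example stays standalone. It also rests on an assumption that should be checked against the Kubernetes Job documentation: the API server accepts only a limited set of pod template changes on a suspended Job (metadata such as labels and annotations, plus scheduling directives), so changes to container fields like image or env would likely require deleting and recreating the launcher Job rather than updating it in place.

```go
package main

// podTemplate is a hypothetical, heavily simplified stand-in for
// corev1.PodTemplateSpec. Image stands in for every container-level field.
type podTemplate struct {
	Annotations map[string]string
	Labels      map[string]string
	Image       string
}

// syncAction names what the controller would do with the existing launcher Job.
type syncAction string

const (
	actionNone     syncAction = "none"     // templates already match, or Job is running
	actionUpdate   syncAction = "update"   // update the existing Job in place
	actionRecreate syncAction = "recreate" // delete the Job and create it again
)

// sameMap reports whether two string maps hold the same entries.
// A nil map and an empty map are treated as equal.
func sameMap(a, b map[string]string) bool {
	if len(a) != len(b) {
		return false
	}
	for k, v := range a {
		if w, ok := b[k]; !ok || w != v {
			return false
		}
	}
	return true
}

// planLauncherSync decides how to bring the Job's template (current) in line
// with the MPIJob's Launcher template (desired). jobSuspended is the Job's
// state before the suspend flag itself is synced.
func planLauncherSync(jobSuspended bool, current, desired podTemplate) syncAction {
	metaSame := sameMap(current.Annotations, desired.Annotations) &&
		sameMap(current.Labels, desired.Labels)
	specSame := current.Image == desired.Image

	switch {
	case metaSame && specSame:
		return actionNone
	case !jobSuspended:
		// Assumption: once the launcher is running, its template is left alone.
		return actionNone
	case specSame:
		// Only metadata differs; assumed to be updatable on a suspended Job.
		return actionUpdate
	default:
		// Container-level fields differ; assumed to require a new Job.
		return actionRecreate
	}
}
```

In the scenario from this issue (suspended Job, empty template, desired annotation `alpha: beta`), `planLauncherSync` returns `actionUpdate`. The decision would have to be taken before `Spec.Suspend` is flipped to false, ideally within the same update call, so that the launcher pod is created from the new template.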

/kind bug
