Fix Running condition being re-emitted when pod and job informers are out-of-sync #787
Conversation
… out-of-sync
Signed-off-by: Gonzalo Saez <11050889+GonzaloSaez@users.noreply.github.com>
cc: @tenzen-y
Sorry for the delayed review. I only found this notification just now...
tenzen-y left a comment
Thank you for fixing that!
I left some comments.
if isMPIJobSuspended(mpiJob) {
    msg := fmt.Sprintf("MPIJob %s/%s is suspended.", mpiJob.Namespace, mpiJob.Name)
    updateMPIJobConditions(mpiJob, kubeflow.JobRunning, corev1.ConditionFalse, mpiJobSuspendedReason, msg)
} else if isFinished(mpiJob.Status) {
Suggested change:
-   } else if isFinished(mpiJob.Status) {
+   } else if isFinished(mpiJob.Status) && getCondition(mpiJob.Status, kubeflow.JobRunning) == nil {
It seems that we can simplify the if-else structure by doing this.
if getCondition(mpiJob.Status, kubeflow.JobRunning) == nil {
    msg := fmt.Sprintf("MPIJob %s/%s is finished but Running condition was never set.", mpiJob.Namespace, mpiJob.Name)
    cond := kubeflow.JobCondition{
        Type:    kubeflow.JobRunning,
        Status:  corev1.ConditionFalse,
        Reason:  mpiJobRunningReason,
        Message: msg,
    }
    if mpiJob.Status.CompletionTime != nil {
        cond.LastTransitionTime = *mpiJob.Status.CompletionTime
        cond.LastUpdateTime = *mpiJob.Status.CompletionTime
    } else {
        now := metav1.Now()
        cond.LastTransitionTime = now
        cond.LastUpdateTime = now
    }
    mpiJob.Status.Conditions = append(mpiJob.Status.Conditions, cond)
}
Suggested change:
-   if getCondition(mpiJob.Status, kubeflow.JobRunning) == nil {
-       msg := fmt.Sprintf("MPIJob %s/%s is finished but Running condition was never set.", mpiJob.Namespace, mpiJob.Name)
-       cond := kubeflow.JobCondition{
-           Type:    kubeflow.JobRunning,
-           Status:  corev1.ConditionFalse,
-           Reason:  mpiJobRunningReason,
-           Message: msg,
-       }
-       if mpiJob.Status.CompletionTime != nil {
-           cond.LastTransitionTime = *mpiJob.Status.CompletionTime
-           cond.LastUpdateTime = *mpiJob.Status.CompletionTime
-       } else {
-           now := metav1.Now()
-           cond.LastTransitionTime = now
-           cond.LastUpdateTime = now
-       }
-       mpiJob.Status.Conditions = append(mpiJob.Status.Conditions, cond)
-   }
+   msg := fmt.Sprintf("MPIJob %s/%s is finished but Running condition was never set.", mpiJob.Namespace, mpiJob.Name)
+   cond := kubeflow.JobCondition{
+       Type:    kubeflow.JobRunning,
+       Status:  corev1.ConditionFalse,
+       Reason:  mpiJobRunningReason,
+       Message: msg,
+   }
+   updateTime := ptr.Deref(mpiJob.Status.CompletionTime, metav1.NewTime(c.clock.Now()))
+   cond.LastTransitionTime = updateTime
+   cond.LastUpdateTime = updateTime
+   mpiJob.Status.Conditions = append(mpiJob.Status.Conditions, cond)
- Unnest the condition
- Use the clock
- Simplify pointer deref operation
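For reference, a small self-contained sketch of the ptr.Deref behavior this suggestion relies on; time.Now() stands in for the controller's clock here, and since CompletionTime is a *metav1.Time the fallback would need to be a metav1.Time (e.g. metav1.NewTime(c.clock.Now())):

package main

import (
    "fmt"
    "time"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/utils/ptr"
)

func main() {
    // Stand-in for the clock-provided fallback value.
    now := metav1.NewTime(time.Now())

    // When CompletionTime is non-nil, Deref returns the value it points to...
    completionTime := ptr.To(metav1.NewTime(time.Date(2024, 1, 1, 0, 0, 0, 0, time.UTC)))
    fmt.Println(ptr.Deref(completionTime, now)) // prints the 2024-01-01 completion time

    // ...and when it is nil, Deref falls back to the default ("now").
    completionTime = nil
    fmt.Println(ptr.Deref(completionTime, now)) // prints the fallback time
}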
startTime := metav1.Now()
completionTime := metav1.Now()
// TestLauncherSucceededWithRunningPod tests the case where a launcher Job has succeeded but its pod is still
// observed as Running due to informer lag, and the workers have already been cleaned up. The Running condition
// should be set to False rather than being re-emitted as True alongside Succeeded.
func TestLauncherSucceededWithRunningPod(t *testing.T) {
Can you explicitly check that the Running condition's LastTransitionTime/LastUpdateTime are later than or equal to completionTime?
The test utilities ignore such time comparisons: https://github.com/GonzaloSaez/mpi-operator/blob/8c953844bc5a5deb7ba98d5e1e85a324b6aa2645/pkg/controller/mpi_job_controller_test.go#L350
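As a sketch only (the helper name and signature are made up; getCondition and the kubeflow/metav1 types are the ones already used in this package): the PR description states that the Running condition's last transition time should be at or before the completion time, so this hypothetical helper asserts that direction. Adjust the comparison if the test should pin down a different relation, e.g. strict equality with completionTime.

func assertRunningNotAfterCompletion(t *testing.T, status kubeflow.JobStatus, completionTime metav1.Time) {
    t.Helper()
    // Explicit timestamp check; the generic status comparison in these tests
    // ignores condition times, so this has to be asserted separately.
    cond := getCondition(status, kubeflow.JobRunning)
    if cond == nil {
        t.Fatal("expected a Running condition to be present")
    }
    if cond.LastTransitionTime.Time.After(completionTime.Time) {
        t.Errorf("Running LastTransitionTime %v is after completionTime %v", cond.LastTransitionTime, completionTime)
    }
    if cond.LastUpdateTime.Time.After(completionTime.Time) {
        t.Errorf("Running LastUpdateTime %v is after completionTime %v", cond.LastUpdateTime, completionTime)
    }
}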
When the pod and job informers are out-of-sync, it is possible for the launcher job to be finished while its pod is still reported as running. In that case, the MPIJob may be considered completed (when using runLauncherAsWorker and the workers have finished). In this scenario, the Running condition may be re-emitted with a last transition time later than the point at which the MPIJob was deemed completed. As a result, other controllers watching the MPIJob cannot reliably derive start and end times from the last transition time of the Running condition.
To fix this, we can avoid re-emitting the Running condition. Moreover, we can ensure that the Running condition is always emitted and that its last transition time is <= the completion time.
Another solution would be to re-queue the MPIJob when the job and pod informers are detected to be out-of-sync (a rough sketch follows), but I'm not sure whether that would be harder to implement.
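Purely for illustration, the re-queue alternative could look something like the sketch below in the controller's sync path; the helper names (launcherJobFinished, launcherPodStillRunning) and the queue wiring are assumptions, not code from this PR:

// Hypothetical sketch (not part of this PR): if the job informer reports the
// launcher Job as finished while the pod informer still reports its pod as
// Running, skip the status update and retry shortly instead of recording a
// Running condition with a timestamp later than the completion time.
if launcherJobFinished(launcherJob) && launcherPodStillRunning(launcherPod) {
    c.queue.AddAfter(key, time.Second) // assumes a delaying work queue is available here
    return nil
}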