
fix: respect cleanPodPolicy when job exceeds backoffLimit #3420

Open

AviadHayumi wants to merge 1 commit into kubeflow:release-1.9 from AviadHayumi:fix/cleanpod-backoff-limit-r19

Conversation

@AviadHayumi

What is the problem?

When a training job exceeds its backoffLimit, all of its pods are deleted even when cleanPodPolicy: None is set. The cause is a condition-setting ordering bug in ReconcileJobs(): DeletePodsAndServices() is called before the JobFailed condition is set on jobStatus.

DeletePodsAndServices guards pod deletion with IsFinished(jobStatus), which checks for JobFailed/JobSucceeded conditions — not CompletionTime. Since the condition is set after the delete call in the jobExceedsLimit block, IsFinished() returns false and the cleanPodPolicy: None guard is bypassed.

Affects all V1 job types using ReconcileJobs with backoffLimit: PyTorchJob, TFJob, XGBoostJob, PaddleJob, MPIJob.

Ref: #3419

Changes

Moved UpdateJobConditions(JobFailed) and Recorder.Event before DeletePodsAndServices() in the jobExceedsLimit block (pkg/controller.v1/common/job.go lines 216-247), so that IsFinished() returns true at the time of cleanup and cleanPodPolicy is correctly respected.

Added a test case to TestDeletePodsAndServices documenting the bug scenario (unfinished job + cleanPodPolicy: None).

Before (buggy ordering)

```go
if jobExceedsLimit {
    jobStatus.CompletionTime = &now
    jc.DeletePodsAndServices(...)  // IsFinished() == false, cleanPodPolicy ignored
    // ...
    commonutil.UpdateJobConditions(&jobStatus, apiv1.JobFailed, ...)  // too late
}
```

After (fixed ordering)

```go
if jobExceedsLimit {
    jobStatus.CompletionTime = &now
    commonutil.UpdateJobConditions(&jobStatus, apiv1.JobFailed, ...)  // set first
    jc.DeletePodsAndServices(...)  // IsFinished() == true, cleanPodPolicy respected
}
```

Testing

  • All unit tests pass (go test ./pkg/controller.v1/common/)
  • Verified on a live cluster (v1.8.1 based image): deployed the fix and created a PyTorchJob with cleanPodPolicy: None and backoffLimit: 1 — pods are now correctly preserved after the job fails due to backoff limit. Before the fix, all pods were unconditionally deleted.
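A minimal manifest along the lines of the live-cluster check above might look like this (the name and image are placeholders; the field paths follow the kubeflow.org/v1 PyTorchJob API):

```yaml
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: cleanpod-repro        # placeholder name
spec:
  runPolicy:
    cleanPodPolicy: None      # pods must survive job failure
    backoffLimit: 1           # fail fast to trigger the cleanup path
  pytorchReplicaSpecs:
    Worker:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: my-registry/failing-trainer:latest  # placeholder image that exits non-zero
```

After the job fails with the backoff limit exceeded, `kubectl get pods` should still list the worker pods when the fix is applied.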

/kind bug

Member

@andreyvelich left a comment


Thanks for the fix, that looks good!
/assign @astefanutti @tenzen-y @kubeflow/wg-training-leads
/lgtm

Contributor

@astefanutti left a comment


@AviadHayumi thanks, could you please sign the commit?

Move UpdateJobConditions(JobFailed) before DeletePodsAndServices in
the jobExceedsLimit block so that IsFinished() returns true and the
cleanPodPolicy: None guard is not bypassed.

Previously, DeletePodsAndServices was called before the JobFailed
condition was set, causing all pods to be unconditionally deleted
regardless of cleanPodPolicy when a job exceeded its backoffLimit.

Tested on live cluster: pods now preserved after backoffLimit failure
with cleanPodPolicy: None.

Fixes: kubeflow#3419
Signed-off-by: aviadh <aviad.hayumi@gmail.com>
@AviadHayumi force-pushed the fix/cleanpod-backoff-limit-r19 branch from d3c2643 to 354799f on April 15, 2026 13:07
@google-oss-prow

New changes are detected. LGTM label has been removed.

@google-oss-prow (bot) removed the lgtm label on Apr 15, 2026
@google-oss-prow

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from andreyvelich. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@AviadHayumi
Author

> @AviadHayumi thanks, could you please sign the commit?

Done, signed

@andreyvelich
Member

/ok-to-test

@AviadHayumi
Author

@astefanutti

The failing integration tests are pre-existing: the last merged PR (#3221) has the exact same failures. They are JAXJob e2e timeouts on older Kubernetes versions (v1.28-v1.30).

