
fix: respect cleanPodPolicy when job exceeds backoffLimit #3420

Open

AviadHayumi wants to merge 1 commit into kubeflow:release-1.9 from AviadHayumi:fix/cleanpod-backoff-limit-r19

Conversation

@AviadHayumi

What is the problem?

When a training job exceeds its backoffLimit, all of its pods are deleted even when cleanPodPolicy: None is set. The cause is a condition-setting ordering bug in ReconcileJobs(): DeletePodsAndServices() is called before the JobFailed condition is set on jobStatus.

DeletePodsAndServices guards pod deletion with IsFinished(jobStatus), which checks for JobFailed/JobSucceeded conditions — not CompletionTime. Since the condition is set after the delete call in the jobExceedsLimit block, IsFinished() returns false and the cleanPodPolicy: None guard is bypassed.

Affects all V1 job types using ReconcileJobs with backoffLimit: PyTorchJob, TFJob, XGBoostJob, PaddleJob, MPIJob.

Ref: #3419

Changes

Moved UpdateJobConditions(JobFailed) and Recorder.Event before DeletePodsAndServices() in the jobExceedsLimit block (pkg/controller.v1/common/job.go lines 216-247), so that IsFinished() returns true at the time of cleanup and cleanPodPolicy is correctly respected.

Added a test case to TestDeletePodsAndServices documenting the bug scenario (unfinished job + cleanPodPolicy: None).

Before (buggy ordering)

```go
if jobExceedsLimit {
    jobStatus.CompletionTime = &now
    jc.DeletePodsAndServices(...)  // IsFinished() == false, cleanPodPolicy ignored
    // ...
    commonutil.UpdateJobConditions(&jobStatus, apiv1.JobFailed, ...)  // too late
}
```

After (fixed ordering)

```go
if jobExceedsLimit {
    jobStatus.CompletionTime = &now
    commonutil.UpdateJobConditions(&jobStatus, apiv1.JobFailed, ...)  // set first
    jc.DeletePodsAndServices(...)  // IsFinished() == true, cleanPodPolicy respected
}
```

Testing

  • All unit tests pass (go test ./pkg/controller.v1/common/)
  • Verified on a live cluster (v1.8.1 based image): deployed the fix and created a PyTorchJob with cleanPodPolicy: None and backoffLimit: 1 — pods are now correctly preserved after the job fails due to backoff limit. Before the fix, all pods were unconditionally deleted.
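A minimal manifest along the lines of the live-cluster check above might look like this (the name and image are placeholders; the field paths follow the kubeflow.org/v1 PyTorchJob API):

```yaml
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: cleanpod-repro        # placeholder name
spec:
  runPolicy:
    cleanPodPolicy: None      # pods must survive job failure
    backoffLimit: 1           # fail fast to trigger the cleanup path
  pytorchReplicaSpecs:
    Worker:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: my-registry/failing-trainer:latest  # placeholder image that exits non-zero
```

After the job fails with the backoff limit exceeded, `kubectl get pods` should still list the worker pods when the fix is applied.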

/kind bug

Member

@andreyvelich left a comment


Thanks for the fix, that looks good!
/assign @astefanutti @tenzen-y @kubeflow/wg-training-leads
/lgtm

Contributor

@astefanutti left a comment


@AviadHayumi thanks, could you please sign the commit?

Move UpdateJobConditions(JobFailed) before DeletePodsAndServices in
the jobExceedsLimit block so that IsFinished() returns true and the
cleanPodPolicy: None guard is not bypassed.

Previously, DeletePodsAndServices was called before the JobFailed
condition was set, causing all pods to be unconditionally deleted
regardless of cleanPodPolicy when a job exceeded its backoffLimit.

Tested on live cluster: pods now preserved after backoffLimit failure
with cleanPodPolicy: None.

Fixes: kubeflow#3419
Signed-off-by: aviadh <aviad.hayumi@gmail.com>
@AviadHayumi force-pushed the fix/cleanpod-backoff-limit-r19 branch from d3c2643 to 354799f on April 15, 2026 13:07
@google-oss-prow

New changes are detected. LGTM label has been removed.

@google-oss-prow (bot) removed the lgtm label on Apr 15, 2026
@google-oss-prow

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from andreyvelich. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@AviadHayumi
Author

> @AviadHayumi thanks, could you please sign the commit?

Done, signed

@andreyvelich
Member

/ok-to-test

@AviadHayumi
Author

@astefanutti

The failing integration tests are pre-existing: the last merged PR (#3221) has the exact same failures. They are JAXJob e2e timeouts on older Kubernetes versions (v1.28-v1.30).

