fix: respect cleanPodPolicy when job exceeds backoffLimit #3420

AviadHayumi wants to merge 1 commit into kubeflow:release-1.9
Conversation
Force-pushed from a26ff21 to 42fe179, then from 42fe179 to d3c2643.
andreyvelich
left a comment
Thanks for the fix, that looks good!
/assign @astefanutti @tenzen-y @kubeflow/wg-training-leads
/lgtm
astefanutti
left a comment
@AviadHayumi thanks, could you please sign the commit?
Move UpdateJobConditions(JobFailed) before DeletePodsAndServices in the jobExceedsLimit block so that IsFinished() returns true and the cleanPodPolicy: None guard is not bypassed.

Previously, DeletePodsAndServices was called before the JobFailed condition was set, causing all pods to be unconditionally deleted regardless of cleanPodPolicy when a job exceeded its backoffLimit.

Tested on live cluster: pods now preserved after backoffLimit failure with cleanPodPolicy: None.

Fixes: kubeflow#3419
Signed-off-by: aviadh <aviad.hayumi@gmail.com>
Force-pushed from d3c2643 to 354799f.
New changes are detected. LGTM label has been removed.
[APPROVALNOTIFIER] This PR is NOT APPROVED. Needs approval from an approver in each of these files.
Done, signed.

/ok-to-test
The failing integration tests are pre-existing - the last merged PR (#3221) has the exact same failures. These are JAXJob e2e timeouts on older K8s versions (v1.28-v1.30). |
What is the problem?
When a training job exceeds its `backoffLimit`, all pods are deleted regardless of `cleanPodPolicy: None`. This is caused by a condition-setting ordering bug in `ReconcileJobs()`, where `DeletePodsAndServices()` is called before the `JobFailed` condition is set on `jobStatus`. `DeletePodsAndServices` guards pod deletion with `IsFinished(jobStatus)`, which checks for the `JobFailed`/`JobSucceeded` conditions, not `CompletionTime`. Since the condition is set after the delete call in the `jobExceedsLimit` block, `IsFinished()` returns false and the `cleanPodPolicy: None` guard is bypassed.

Affects all V1 job types using `ReconcileJobs` with `backoffLimit`: PyTorchJob, TFJob, XGBoostJob, PaddleJob, MPIJob.

Ref: #3419
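To illustrate why the ordering matters, here is a minimal, self-contained sketch of a condition-based finished check. The types and the `IsFinished` body are hypothetical simplifications, not the real training-operator API; they only mirror the described behavior (conditions are consulted, `CompletionTime` is not):

```go
package main

import "fmt"

// Minimal, hypothetical stand-ins for the training-operator's job
// condition types; the real definitions live in the kubeflow API packages.
type JobConditionType string

const (
	JobFailed    JobConditionType = "Failed"
	JobSucceeded JobConditionType = "Succeeded"
)

type JobCondition struct {
	Type   JobConditionType
	Status bool
}

type JobStatus struct {
	Conditions []JobCondition
}

// IsFinished mirrors the described guard: it looks only at the recorded
// JobFailed/JobSucceeded conditions, never at a completion timestamp.
func IsFinished(status JobStatus) bool {
	for _, c := range status.Conditions {
		if (c.Type == JobFailed || c.Type == JobSucceeded) && c.Status {
			return true
		}
	}
	return false
}

func main() {
	status := JobStatus{}
	// Until the JobFailed condition is appended, the job does not count
	// as finished, so any cleanup guard keyed on IsFinished is bypassed.
	fmt.Println(IsFinished(status)) // false

	status.Conditions = append(status.Conditions,
		JobCondition{Type: JobFailed, Status: true})
	fmt.Println(IsFinished(status)) // true
}
```

This is why calling cleanup before the condition is appended sees an "unfinished" job, even though the job has already failed.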
Changes
- Moved `UpdateJobConditions(JobFailed)` and `Recorder.Event` before `DeletePodsAndServices()` in the `jobExceedsLimit` block (`pkg/controller.v1/common/job.go`, lines 216-247), so that `IsFinished()` returns true at the time of cleanup and `cleanPodPolicy` is correctly respected.
- Added a test case to `TestDeletePodsAndServices` documenting the bug scenario (unfinished job + `cleanPodPolicy: None`).

Before (buggy ordering)

After (fixed ordering)
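Since the original before/after snippets are not reproduced here, the two orderings can be sketched with a hypothetical, simplified model of the `jobExceedsLimit` block. The names and types below are illustrative stand-ins, not the real controller code; only the ordering logic corresponds to the fix:

```go
package main

import "fmt"

// Hypothetical, simplified model of the jobExceedsLimit cleanup path.
type cleanPodPolicy string

const policyNone cleanPodPolicy = "None"

type jobStatus struct {
	failed bool // stands in for the JobFailed condition being set
	pods   int  // stands in for the job's live pods
}

func isFinished(s jobStatus) bool { return s.failed }

// deletePodsAndServices keeps the pods of a *finished* job when the policy
// is None; for an unfinished job the policy guard never applies.
func deletePodsAndServices(s *jobStatus, p cleanPodPolicy) {
	if p == policyNone && isFinished(*s) {
		return // policy respected: leave pods in place
	}
	s.pods = 0
}

func main() {
	// Buggy ordering: cleanup runs before the JobFailed condition is set,
	// so isFinished is still false and the pods are deleted.
	buggy := jobStatus{pods: 2}
	deletePodsAndServices(&buggy, policyNone)
	buggy.failed = true // condition set too late
	fmt.Println("buggy ordering, pods left:", buggy.pods) // 0

	// Fixed ordering: set the condition first, then clean up.
	fixed := jobStatus{pods: 2}
	fixed.failed = true
	deletePodsAndServices(&fixed, policyNone)
	fmt.Println("fixed ordering, pods left:", fixed.pods) // 2
}
```

Swapping the two calls is the entire fix; the guard inside the cleanup routine is unchanged.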
Testing
- Unit tests pass (`go test ./pkg/controller.v1/common/`).
- Tested on a live cluster with `cleanPodPolicy: None` and `backoffLimit: 1`: pods are now correctly preserved after the job fails due to the backoff limit. Before the fix, all pods were unconditionally deleted.

/kind bug