fix: respect cleanPodPolicy when job exceeds backoffLimit #9

Open
AviadHayumi wants to merge 1 commit into release-1.9 from fix/cleanpod-backoff-limit-runai


@AviadHayumi

Summary

  • When a job exceeds its backoffLimit, its pods are deleted even if cleanPodPolicy is None
  • Root cause: DeletePodsAndServices() is called before the JobFailed condition is set, so IsFinished() still returns false and the cleanPodPolicy: None guard is bypassed
  • Fix: call UpdateJobConditions(JobFailed) before DeletePodsAndServices()
  • Added unit tests covering the fix

Upstream issue: kubeflow#3419
Upstream PR: kubeflow#3420

Test plan

  • Unit tests pass (go test ./pkg/controller.v1/common/)
  • Verified on a live cluster: for a PyTorchJob with cleanPodPolicy: None and backoffLimit: 1, pods are preserved after failure

Move UpdateJobConditions(JobFailed) before DeletePodsAndServices in
the jobExceedsLimit block so that IsFinished() returns true and the
cleanPodPolicy: None guard is not bypassed.

Fixes: kubeflow#3419
Signed-off-by: Aviad Hayumi <aviad.hayumi@run.ai>