[slice] Reschedule after Slice Failed #685

@pajakd

When testing the super-slice feature manually:

  1. Create a JobSet
  2. (A Slice CR gets created)
  3. Transition the Slice state to Ready
  4. The Workload gets admitted
  5. Transition the Slice state to Error
  6. The Workload gets suspended
  7. (After ~1 minute) The Slice CR is garbage-collected

After that, the Workload is never unsuspended and no new Slice CR object is created.

The reason for this is that we transition the admission check state to Rejected:

    case len(slicesByState[v1alpha1.Error]) > 0 || len(slicesByState[v1alpha1.Deformed]) > 0:
        ac.State = kueue.CheckStateRejected

A check in the Rejected state is not retried:

https://github.com/kubernetes-sigs/kueue/blob/6a1f89a58334b282f0c820b889d4137a4bdd6249/apis/kueue/v1beta1/admissioncheck_types.go#L32-L35

I think that in this case we should transition to CheckStateRetry instead, in order to:
A) Give the Slice some time to recover
B) Create a new Slice if the old one does not recover and gets deleted
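
A minimal sketch of the proposed mapping, to make the suggestion concrete. It is not the controller's actual code: the types below are local stand-ins for the kueue and v1alpha1 types, `checkStateFor` is a hypothetical helper, and the Ready/Pending branches are simplified assumptions added only so the switch is complete.

```go
package main

import "fmt"

// CheckState is a local stand-in for the Kueue admission check state type.
type CheckState string

const (
	CheckStatePending  CheckState = "Pending"
	CheckStateReady    CheckState = "Ready"
	CheckStateRetry    CheckState = "Retry"
	CheckStateRejected CheckState = "Rejected" // current behavior, kept for comparison
)

// SliceState is a local stand-in for the Slice CR state type.
type SliceState string

const (
	SliceReady    SliceState = "Ready"
	SliceError    SliceState = "Error"
	SliceDeformed SliceState = "Deformed"
)

// checkStateFor is a hypothetical helper that maps slice names grouped by
// state to an admission check state.
func checkStateFor(slicesByState map[SliceState][]string) CheckState {
	switch {
	case len(slicesByState[SliceError]) > 0 || len(slicesByState[SliceDeformed]) > 0:
		// Proposed change: Retry instead of Rejected, so the check is
		// re-evaluated rather than treated as a terminal failure.
		return CheckStateRetry
	case len(slicesByState[SliceReady]) > 0:
		// Simplified assumption: any Ready slice (and no failed one) is enough.
		return CheckStateReady
	default:
		return CheckStatePending
	}
}

func main() {
	failed := map[SliceState][]string{SliceError: {"slice-a"}}
	fmt.Println(checkStateFor(failed)) // prints "Retry"
}
```

With Retry, a Slice that recovers can make the check Ready again, and once a failed Slice is garbage-collected the check falls back to Pending in this sketch, which is the point where a new Slice could be created.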
