When testing the super-slice feature manually:
- Create Jobset
- (Slice CR gets created)
- Transition the Slice state to Ready
- Workload gets admitted
- Transition the slice state to Error
- Workload gets suspended
- (after ~1 minute) Slice CR is garbage-collected
The Workload is never unsuspended, and a new Slice CR object is not created.
The reason for this is that we transition the admission check state to Rejected:
xpk/slice/internal/controller/workload_controller.go, lines 619 to 620 in 34c7fc7:

    case len(slicesByState[v1alpha1.Error]) > 0 || len(slicesByState[v1alpha1.Deformed]) > 0:
        ac.State = kueue.CheckStateRejected

A Rejected admission check is not retried.
I think in this case we should transition to CheckStateRetry instead, in order to:
A) Give the slice some time to recover
B) Create a new slice if the old one does not recover and is deleted
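
The proposed change could look roughly like the sketch below. It is only an illustration under stated assumptions: the `SliceState` and `CheckState` types are local stand-ins for the real `v1alpha1` and `kueue` types, and `nextCheckState` is a hypothetical helper, not a function from `workload_controller.go`. The Ready and Pending branches are guesses included to make the switch complete; only the Error/Deformed branch reflects the suggestion in this issue.

```go
package main

// SliceState is a local stand-in for the slice states in v1alpha1.
type SliceState string

const (
	SliceReady    SliceState = "Ready"
	SliceError    SliceState = "Error"
	SliceDeformed SliceState = "Deformed"
)

// CheckState is a local stand-in for Kueue's admission check states.
type CheckState string

const (
	CheckStatePending  CheckState = "Pending"
	CheckStateReady    CheckState = "Ready"
	CheckStateRetry    CheckState = "Retry"
	CheckStateRejected CheckState = "Rejected"
)

// nextCheckState is a hypothetical helper that maps the observed slices,
// grouped by state, to an admission check state. total is the number of
// slices the workload expects.
func nextCheckState(slicesByState map[SliceState][]string, total int) CheckState {
	switch {
	// Proposed behavior: a failed or deformed slice asks for a retry
	// instead of rejecting the workload outright.
	case len(slicesByState[SliceError]) > 0 || len(slicesByState[SliceDeformed]) > 0:
		return CheckStateRetry
	// Assumed: the check is Ready once every expected slice is Ready.
	case total > 0 && len(slicesByState[SliceReady]) == total:
		return CheckStateReady
	// Assumed: anything else is still in progress.
	default:
		return CheckStatePending
	}
}
```

With this mapping, `nextCheckState(map[SliceState][]string{SliceError: {"slice-0"}}, 1)` yields `CheckStateRetry`, so the workload would be eligible for another attempt rather than being rejected.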