Background
I'm developing a tool (kamera) that systematically explores controller reconciliation ordering, staleness, and fault injection, and found some issues in Kratix.
Observed behavior:
When the WorkController crashes after its first write effect — the finalizer addition at work_controller.go:118 — and recovers, some orderings result in the Work object never getting its status updated after recovery.
The crash interrupts the reconcile before updateWorkStatus() (line 159) and the subsequent Status().Update() (line 200) can execute.
On recovery, the re-reconcile does reach updateWorkStatus() (the finalizer already exists so line 116 is skipped, and Scheduler.ReconcileWork() at line 137 is idempotent). However, updateWorkStatus() only writes status when it finds WorkPlacements with WriteSucceeded=False (line 176). apiMeta.IsStatusConditionFalse() returns false when the condition is absent — so WorkPlacements that haven't been reconciled yet (no WriteSucceeded condition at all) appear "healthy" to this check. The function falls through to return nil at line 204 with no status write, and the Work object remains with incomplete status conditions indefinitely.
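The absent-condition behavior can be illustrated with a minimal, self-contained sketch. The `Condition` type and `isConditionFalse` helper below are local stand-ins written for this example, not Kratix or apimachinery code; they are assumed to mirror the "present and equal to False" semantics of `apiMeta.IsStatusConditionFalse()` described above.

```go
package main

import "fmt"

// Condition is a hypothetical stand-in for metav1.Condition, reduced to the
// two fields this check reads.
type Condition struct {
	Type   string
	Status string // "True", "False", or "Unknown"
}

// isConditionFalse is a stand-in mirroring the semantics described above:
// it reports true only when the condition is present AND its status is False.
func isConditionFalse(conds []Condition, condType string) bool {
	for _, c := range conds {
		if c.Type == condType {
			return c.Status == "False"
		}
	}
	// An absent condition is not "False", so the caller cannot tell
	// "not yet reconciled" apart from "succeeded".
	return false
}

func main() {
	notYetReconciled := []Condition{} // WorkPlacement with no WriteSucceeded condition
	failed := []Condition{{Type: "WriteSucceeded", Status: "False"}}

	fmt.Println(isConditionFalse(notYetReconciled, "WriteSucceeded")) // false
	fmt.Println(isConditionFalse(failed, "WriteSucceeded"))           // true
}
```

A WorkPlacement that has never been reconciled therefore takes the same branch as one that wrote successfully, which is why the status write is skipped.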
Expected behavior:
The WorkController's reconcile should be fault tolerant — a recovery after crash should produce the same final state as a clean execution.
Proposed Fix
The issue is that updateWorkStatus() only writes status conditions when failed WorkPlacements are detected (line 181). If no WorkPlacements exist yet (because the crash happened before the Scheduler created them at line 137), or if every WorkPlacement looks healthy, the function returns nil at line 204 without writing any status.
The fix: updateWorkStatus() should set baseline status conditions (e.g., ScheduleSucceeded=Unknown, Ready=Unknown) unconditionally when no conditions exist yet on the Work object, not only in the failure path. This ensures that after any crash point, the recovery reconcile always writes a valid (if incomplete) status, and subsequent reconciles will converge to the correct final state once the Scheduler creates the WorkPlacements.
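As a rough sketch of the idea (not the actual patch), the baseline step could look like the following. The `Condition` type, the `ensureBaselineConditions` helper, and the `Pending` reason are all hypothetical names chosen for this example; a real change would presumably operate on the Work object's existing condition list and then call Status().Update() when something changed.

```go
package main

import "fmt"

// Condition is a hypothetical stand-in for metav1.Condition.
type Condition struct {
	Type   string
	Status string // "True", "False", or "Unknown"
	Reason string
}

// hasCondition reports whether a condition of the given type is present.
func hasCondition(conds []Condition, condType string) bool {
	for _, c := range conds {
		if c.Type == condType {
			return true
		}
	}
	return false
}

// ensureBaselineConditions appends an Unknown condition for every listed type
// that is missing. It never overwrites an existing condition, and it returns
// whether anything was added so the caller knows a status write is needed.
func ensureBaselineConditions(conds []Condition, types ...string) ([]Condition, bool) {
	changed := false
	for _, t := range types {
		if !hasCondition(conds, t) {
			// "Pending" is an assumed reason string, not one defined by Kratix.
			conds = append(conds, Condition{Type: t, Status: "Unknown", Reason: "Pending"})
			changed = true
		}
	}
	return conds, changed
}

func main() {
	// First reconcile after a crash: no conditions exist yet.
	conds, changed := ensureBaselineConditions(nil, "ScheduleSucceeded", "Ready")
	fmt.Println(len(conds), changed) // 2 true

	// A later reconcile finds both conditions and writes nothing.
	_, changedAgain := ensureBaselineConditions(conds, "ScheduleSucceeded", "Ready")
	fmt.Println(changedAgain) // false
}
```

Because the helper only fills in missing conditions, running it on every reconcile is idempotent and does not clobber conditions set by the failure path.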
I'm happy to put up a PR for this if it would be helpful.
Version tested: latest github.com/syntasso/kratix (k8s.io/client-go v0.34.1 / Kubernetes 1.34)