bug: WorkController crash after first write leaves Work with incomplete status #741

**Background**

I'm developing a tool that systematically explores controller reconciliation ordering, staleness, and fault injection ([kamera](https://github.com/tgoodwin/kamera)) and found some issues in Kratix.

**Observed behavior:**

When the WorkController crashes right after its first write effect (the finalizer addition at [`work_controller.go:118`](https://github.com/syntasso/kratix/blob/main/internal/controller/work_controller.go#L118)) and then recovers, some orderings result in the Work object's status never being updated after recovery.

The crash interrupts the reconcile before [`updateWorkStatus()`](https://github.com/syntasso/kratix/blob/main/internal/controller/work_controller.go#L159) (line 159) and the subsequent [`Status().Update()`](https://github.com/syntasso/kratix/blob/main/internal/controller/work_controller.go#L200) (line 200) can execute.

On recovery, the re-reconcile *does* reach `updateWorkStatus()` (the finalizer already exists so line [116](https://github.com/syntasso/kratix/blob/main/internal/controller/work_controller.go#L116) is skipped, and `Scheduler.ReconcileWork()` at line [137](https://github.com/syntasso/kratix/blob/main/internal/controller/work_controller.go#L137) is idempotent). However, `updateWorkStatus()` only writes status when it finds WorkPlacements with `WriteSucceeded=False` (line [176](https://github.com/syntasso/kratix/blob/main/internal/controller/work_controller.go#L176)). `apiMeta.IsStatusConditionFalse()` returns `false` when the condition is *absent* — so WorkPlacements that haven't been reconciled yet (no `WriteSucceeded` condition at all) appear "healthy" to this check. The function falls through to `return nil` at line [204](https://github.com/syntasso/kratix/blob/main/internal/controller/work_controller.go#L204) with no status write, and the Work object remains with incomplete status conditions indefinitely.
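To make the absent-condition behavior concrete, here is a minimal stdlib-only sketch. `Condition` is a simplified stand-in for `metav1.Condition`, `isStatusConditionFalse` mirrors the semantics of `apiMeta.IsStatusConditionFalse()` described above, and `classifyWrite` with its `writeState` values is a hypothetical helper of mine, not something that exists in Kratix.

```go
package main

import "fmt"

// Condition is a simplified stand-in for metav1.Condition (only the fields
// needed here); it is not the real Kubernetes type.
type Condition struct {
	Type   string
	Status string // "True", "False", or "Unknown"
}

// isStatusConditionFalse mirrors the semantics of meta.IsStatusConditionFalse:
// it returns true only when the condition is present AND its status is "False".
func isStatusConditionFalse(conds []Condition, condType string) bool {
	for _, c := range conds {
		if c.Type == condType {
			return c.Status == "False"
		}
	}
	return false // absent condition: looks the same as a healthy one
}

// writeState is a hypothetical three-way classification that keeps
// "not reconciled yet" separate from "succeeded".
type writeState int

const (
	writePending writeState = iota // no usable WriteSucceeded condition yet
	writeFailed
	writeSucceeded
)

// classifyWrite (hypothetical) inspects the WriteSucceeded condition and
// treats a missing or Unknown condition as pending rather than healthy.
func classifyWrite(conds []Condition) writeState {
	for _, c := range conds {
		if c.Type != "WriteSucceeded" {
			continue
		}
		switch c.Status {
		case "True":
			return writeSucceeded
		case "False":
			return writeFailed
		}
		return writePending
	}
	return writePending
}

func main() {
	var fresh []Condition // WorkPlacement that has never been reconciled
	failed := []Condition{{Type: "WriteSucceeded", Status: "False"}}

	fmt.Println(isStatusConditionFalse(fresh, "WriteSucceeded"))  // false
	fmt.Println(isStatusConditionFalse(failed, "WriteSucceeded")) // true
	fmt.Println(classifyWrite(fresh) == writePending)             // true
}
```

A never-reconciled WorkPlacement and a successfully written one both yield `false` from the first check, which is why the failure-only branch never fires in this scenario.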

**Expected behavior:**

The WorkController's reconcile should be fault tolerant — a recovery after crash should produce the same final state as a clean execution.

**Proposed Fix**

The root cause is that [`updateWorkStatus()`](https://github.com/syntasso/kratix/blob/main/internal/controller/work_controller.go#L165-L205) writes status conditions only when it detects failed WorkPlacements (line [181](https://github.com/syntasso/kratix/blob/main/internal/controller/work_controller.go#L181)). If no WorkPlacements exist yet (because the crash happened before the Scheduler created them at [line 137](https://github.com/syntasso/kratix/blob/main/internal/controller/work_controller.go#L137)), or if all WorkPlacements appear healthy, the function returns `nil` at [line 204](https://github.com/syntasso/kratix/blob/main/internal/controller/work_controller.go#L204) without writing any status.

The fix: `updateWorkStatus()` should set baseline status conditions (e.g., `ScheduleSucceeded=Unknown`, `Ready=Unknown`) unconditionally when no conditions exist yet on the Work object, not only in the failure path. This ensures that after any crash point, the recovery reconcile always writes a valid (if incomplete) status, and subsequent reconciles will converge to the correct final state once the Scheduler creates the WorkPlacements.
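A rough sketch of the baseline step, using only the standard library so it runs on its own. `Condition` is again a simplified stand-in for `metav1.Condition`; `ensureBaselineConditions`, `hasCondition`, and the `Pending` reason string are hypothetical names I made up for illustration. A real patch would presumably use `meta.FindStatusCondition` and `meta.SetStatusCondition` on the Work's status and then call `Status().Update()` only when something changed.

```go
package main

import "fmt"

// Condition is a simplified stand-in for metav1.Condition.
type Condition struct {
	Type   string
	Status string // "True", "False", or "Unknown"
	Reason string
}

// hasCondition reports whether a condition of the given type is present.
func hasCondition(conds []Condition, condType string) bool {
	for _, c := range conds {
		if c.Type == condType {
			return true
		}
	}
	return false
}

// ensureBaselineConditions (hypothetical) adds ScheduleSucceeded=Unknown and
// Ready=Unknown when they are missing and leaves existing conditions alone.
// The boolean result tells the caller whether a status write is needed.
func ensureBaselineConditions(conds []Condition) ([]Condition, bool) {
	changed := false
	for _, condType := range []string{"ScheduleSucceeded", "Ready"} {
		if hasCondition(conds, condType) {
			continue
		}
		conds = append(conds, Condition{
			Type:   condType,
			Status: "Unknown",
			Reason: "Pending", // assumed reason string, not taken from Kratix
		})
		changed = true
	}
	return conds, changed
}

func main() {
	// First reconcile after a crash: the Work has no conditions at all.
	conds, changed := ensureBaselineConditions(nil)
	fmt.Println(changed, len(conds))

	// A later reconcile finds the baseline already present and writes nothing.
	_, changed = ensureBaselineConditions(conds)
	fmt.Println(changed)
}
```

This variant fills in whichever baseline condition is missing rather than acting only when the list is empty, so repeated reconciles are no-ops and conditions already set by the failure path are not overwritten.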

I'm happy to put up a PR for this if it would be helpful.

**Version tested:** latest `github.com/syntasso/kratix` (k8s.io/client-go v0.34.1 / Kubernetes 1.34)
