bug: WorkController crash after first write leaves Work with incomplete status #741

@tgoodwin

Background

I'm developing kamera, a tool that systematically explores controller reconciliation ordering, staleness, and fault injection, and it found some issues in Kratix.

Observed behavior:

When the WorkController crashes after its first write effect (the finalizer addition at work_controller.go:118) and then recovers, some orderings result in the Work object never getting its status updated after recovery.

The crash interrupts the reconcile before updateWorkStatus() (line 159) and the subsequent Status().Update() (line 200) can execute.

On recovery, the re-reconcile does reach updateWorkStatus(): the finalizer already exists, so line 116 is skipped, and Scheduler.ReconcileWork() at line 137 is idempotent. However, updateWorkStatus() only writes status when it finds WorkPlacements with WriteSucceeded=False (line 176). apiMeta.IsStatusConditionFalse() returns false when the condition is absent, so WorkPlacements that haven't been reconciled yet (no WriteSucceeded condition at all) appear "healthy" to this check. The function falls through to return nil at line 204 with no status write, and the Work object remains with incomplete status conditions indefinitely.
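To make the absent-condition case concrete, here is a minimal standalone sketch. It does not import apimachinery: `Condition`, `isConditionFalse`, and `hasFailedPlacement` are hypothetical stand-ins, with `isConditionFalse` written to mirror the present-and-equal behavior of `IsStatusConditionFalse` described above. It is not the Kratix code.

```go
package main

import "fmt"

// Condition is a hypothetical, trimmed-down stand-in for metav1.Condition.
type Condition struct {
	Type   string
	Status string // "True", "False", or "Unknown"
}

// isConditionFalse mirrors the behavior described in this issue:
// it reports true only when the condition is present AND its status is False.
func isConditionFalse(conds []Condition, condType string) bool {
	for _, c := range conds {
		if c.Type == condType {
			return c.Status == "False"
		}
	}
	return false // an absent condition is reported as "not false"
}

// hasFailedPlacement is a hypothetical reduction of the failure check:
// each element holds the conditions of one WorkPlacement.
func hasFailedPlacement(placements [][]Condition) bool {
	for _, conds := range placements {
		if isConditionFalse(conds, "WriteSucceeded") {
			return true
		}
	}
	return false
}

func main() {
	unreconciled := []Condition{} // no WriteSucceeded condition yet
	failed := []Condition{{Type: "WriteSucceeded", Status: "False"}}

	// An unreconciled WorkPlacement is indistinguishable from a healthy one.
	fmt.Println(hasFailedPlacement([][]Condition{unreconciled})) // false
	fmt.Println(hasFailedPlacement([][]Condition{unreconciled, failed})) // true
}
```

With only unreconciled WorkPlacements, the check never fires, which matches the fall-through to `return nil` with no status write.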

Expected behavior:

The WorkController's reconcile should be fault tolerant: a recovery after a crash should produce the same final state as a clean execution.

Proposed Fix

The issue is that updateWorkStatus() only writes status conditions when failed WorkPlacements are detected (line 181). If no WorkPlacements exist yet (crash happened before the Scheduler created them at line 137), or all WorkPlacements are healthy, the function returns nil at line 204 with no status write at all.

The fix: updateWorkStatus() should set baseline status conditions (e.g., ScheduleSucceeded=Unknown, Ready=Unknown) unconditionally when no conditions exist yet on the Work object, not only in the failure path. This ensures that after any crash point, the recovery reconcile always writes a valid (if incomplete) status, and subsequent reconciles will converge to the correct final state once the Scheduler creates the WorkPlacements.
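A rough sketch of the baseline step, under stated assumptions: `Condition` and `ensureBaselineConditions` are hypothetical names, the `Pending` reason is made up for illustration, and the code is standalone rather than wired into the controller. In the real controller this would presumably map onto apimachinery's `meta.SetStatusCondition` followed by a `Status().Update()` when something changed.

```go
package main

import "fmt"

// Condition is a hypothetical, trimmed-down stand-in for metav1.Condition.
type Condition struct {
	Type   string
	Status string // "True", "False", or "Unknown"
	Reason string
}

// ensureBaselineConditions appends an Unknown condition for every listed type
// that is missing, leaves existing conditions untouched, and reports whether
// anything changed (i.e. whether a status write is needed).
func ensureBaselineConditions(conds []Condition, types ...string) ([]Condition, bool) {
	changed := false
	for _, t := range types {
		present := false
		for _, c := range conds {
			if c.Type == t {
				present = true
				break
			}
		}
		if !present {
			// "Pending" is an illustrative reason, not one Kratix defines.
			conds = append(conds, Condition{Type: t, Status: "Unknown", Reason: "Pending"})
			changed = true
		}
	}
	return conds, changed
}

func main() {
	// Recovery reconcile: the Work has no conditions yet, so a write is needed.
	conds, changed := ensureBaselineConditions(nil, "ScheduleSucceeded", "Ready")
	fmt.Println(changed, len(conds)) // true 2

	// A later reconcile is a no-op, so the step is safe to run every time.
	conds, changed = ensureBaselineConditions(conds, "ScheduleSucceeded", "Ready")
	fmt.Println(changed, len(conds)) // false 2
}
```

Returning a `changed` flag keeps the step idempotent: the status is only written when a baseline condition was actually missing, so repeated reconciles do not generate extra updates.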

I'm happy to put up a PR for this if it would be helpful.

Version tested: latest github.com/syntasso/kratix (k8s.io/client-go v0.34.1 / Kubernetes 1.34)
