Description
Background
I'm developing a tool that systematically explores controller reconciliation ordering, staleness, and fault injection (kamera).
Describe the bug
I observe that when the APIBinding reconciler crashes after its first write effect, the LogicalCluster's conditions diverge during recovery.
The APIBinding reconciler's write sequence per reconcile:
1. apibinding_reconcile.go:310 - updateLogicalCluster(): writes resource locks to LogicalCluster
2. apibinding_reconcile.go:417-445 - CRD creation in system:bound-crds
3. apibinding_reconcile.go:593-595 - sets InitialBindingCompleted=True, BindingUpToDate=True, Phase=Bound (in-memory)
4. apibinding_controller.go:497 - commit(): patches APIBinding status to API server
A crash after write 1 (the LogicalCluster update) but before write 4 (the APIBinding status commit) leaves the APIBinding in an intermediate state in which InitialBindingCompleted is not set. During recovery, the LogicalClusterController evaluates its conditions against whatever state it observes at that moment, so the conditions it writes depend on how far the other controllers have progressed before it reconciles. Different recovery orderings therefore produce different LogicalCluster conditions.
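To make the crash window concrete, here is a minimal toy model of the four writes listed above. It is only a sketch: the State type, its field names, and the reconcile function are hypothetical and do not correspond to kcp code. It shows only which effects are persisted when the reconciler stops after a given step.

```go
package main

import "fmt"

// State is a hypothetical snapshot of what has been persisted to the API server.
type State struct {
	LocksWritten    bool // write 1: resource locks on the LogicalCluster
	CRDCreated      bool // write 2: CRD in system:bound-crds
	StatusCommitted bool // write 4: APIBinding status patch
}

// reconcile runs the four steps in order and stops after crashAfter steps.
// crashAfter = 4 means the reconcile completes without a crash.
func reconcile(crashAfter int) State {
	var persisted State
	inMemoryBound := false // step 3 only changes in-memory state
	steps := []func(){
		func() { persisted.LocksWritten = true },
		func() { persisted.CRDCreated = true },
		func() { inMemoryBound = true },
		func() { persisted.StatusCommitted = inMemoryBound },
	}
	for i, step := range steps {
		if i >= crashAfter {
			break // simulated crash: remaining effects never happen
		}
		step()
	}
	return persisted
}

func main() {
	fmt.Printf("crash after write 1: %+v\n", reconcile(1))
	fmt.Printf("no crash:            %+v\n", reconcile(4))
}
```

In this model a crash after write 1 leaves the locks persisted while the APIBinding status is still uncommitted, which is the intermediate state the other controllers can observe during recovery.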
Expected Behaviour
After a crash and recovery, the LogicalCluster should converge to the same state regardless of which controller reconciles first.
Proposed Fix
APIBinderInitializerController and DefaultAPIBindingLifecycleController both write to LogicalCluster.Status concurrently, and the kcp committer (committer.go:129) patches the entire status object. When both controllers read the LogicalCluster, modify different conditions, and commit, the second commit's merge patch overwrites the first controller's condition changes, because it carries the full status as the second controller saw it (a read-modify-write race).
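The race described above can be reproduced in a few lines with a toy in-memory server. This is a sketch under assumptions: Status, Server, Read, and ReplaceStatus are hypothetical names, and the model simplifies the committer to "write back the whole status as the caller saw it". It does not call any kcp or client-go API.

```go
package main

import "fmt"

// Status is a toy stand-in for LogicalCluster.Status: condition type -> value.
type Status map[string]string

// copyStatus returns an independent snapshot of s.
func copyStatus(s Status) Status {
	out := make(Status, len(s))
	for k, v := range s {
		out[k] = v
	}
	return out
}

// Server holds the stored status.
type Server struct{ status Status }

// Read returns a snapshot, as a controller would get from a lister or a GET.
func (s *Server) Read() Status { return copyStatus(s.status) }

// ReplaceStatus models a commit that carries the full status the caller saw.
func (s *Server) ReplaceStatus(seen Status) { s.status = copyStatus(seen) }

// lostUpdate interleaves two controllers: both read, each flips one condition,
// then both commit. It returns the final stored status.
func lostUpdate() Status {
	srv := &Server{status: Status{
		"WorkspaceAPIBindingsInitialized": "False",
		"WorkspaceAPIBindingsReconciled":  "False",
	}}
	a := srv.Read() // first controller reads
	b := srv.Read() // second controller reads the same version
	a["WorkspaceAPIBindingsInitialized"] = "True"
	b["WorkspaceAPIBindingsReconciled"] = "True"
	srv.ReplaceStatus(a)
	srv.ReplaceStatus(b) // overwrites the first controller's change
	return srv.Read()
}

func main() {
	fmt.Println(lostUpdate())
}
```

After both commits, WorkspaceAPIBindingsInitialized is back to False in this model: the second write carried a stale copy of that condition. Swapping the order of the two ReplaceStatus calls loses the other condition instead, which matches the ordering-dependent end states reported here.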
I think the fix would be to use server-side apply (SSA) with a unique field manager per controller, rather than merge patch via the committer. With SSA, APIBinderInitializerController (field manager apibinder-initializer) would own WorkspaceAPIBindingsInitialized and Status.Initializers, while DefaultAPIBindingLifecycleController (field manager default-apibinding-lifecycle) would own WorkspaceAPIBindingsReconciled. Concurrent applies would not conflict because each controller only owns its specific fields.
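The ownership idea can be illustrated with a simplified model of per-field managers. This is a sketch, not the Kubernetes implementation: Server and Apply are hypothetical, and real server-side apply has additional rules (for example, it reports a conflict only when values differ and the apply is not forced, and it removes fields a manager stops applying). The field manager names are the ones proposed above.

```go
package main

import "fmt"

// Server is a toy store that tracks which field manager owns each field.
type Server struct {
	fields map[string]string // field -> value
	owners map[string]string // field -> owning field manager
}

func NewServer() *Server {
	return &Server{fields: map[string]string{}, owners: map[string]string{}}
}

// Apply merges only the fields present in applied. It rejects the request
// if any of those fields is already owned by a different manager.
func (s *Server) Apply(manager string, applied map[string]string) error {
	for f := range applied {
		if o, ok := s.owners[f]; ok && o != manager {
			return fmt.Errorf("conflict: field %q is owned by %q", f, o)
		}
	}
	for f, v := range applied {
		s.fields[f] = v
		s.owners[f] = manager
	}
	return nil
}

// applyBoth has each controller apply only the condition it owns.
func applyBoth() (map[string]string, error) {
	srv := NewServer()
	if err := srv.Apply("apibinder-initializer", map[string]string{
		"WorkspaceAPIBindingsInitialized": "True",
	}); err != nil {
		return nil, err
	}
	if err := srv.Apply("default-apibinding-lifecycle", map[string]string{
		"WorkspaceAPIBindingsReconciled": "True",
	}); err != nil {
		return nil, err
	}
	return srv.fields, nil
}

func main() {
	fields, err := applyBoth()
	fmt.Println(fields, err)
}
```

Because each apply carries only the fields its manager owns, neither write can carry a stale copy of the other controller's condition, so both changes survive in either order.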
Additional Context
The divergent LogicalCluster conditions persist as stable end states — the system converges to one of two distinct final states depending on recovery ordering, with no further reconciliation correcting the difference.
Versions
- kcp: v0.30.0 (commit 7952f476d)
- Kubernetes: simulated via kamera (based on k8s.io/client-go v0.35.0 / Kubernetes 1.35)