fix: object that creates a k8s.m ProviderConfig never becomes ready #423
erhancagirici merged 1 commit into crossplane-contrib:main
Conversation
@erhancagirici @jeanduplessis this looks like a side effect of the same change; we're also seeing this, where the Object resource never becomes ready despite successfully syncing the underlying managed K8s resource, and we keep ending up seeing a diff on Observe and issuing Update() every reconcile, perpetually, after upgrading.
erhancagirici
left a comment
@patrickleet thanks for the write-up and the PR. Looks good in general.
I wanted to add a e2e test example, but I cannot push to your fork.
Could you add the below manifest file to the PR and then add it to the UPTEST_EXAMPLE_LIST in the Makefile?
examples/namespaced/object/object-wrapped-providerconfig.yaml

```yaml
apiVersion: kubernetes.m.crossplane.io/v1alpha1
kind: Object
metadata:
  name: sample-wrapped-providerconfig
  namespace: default
  annotations:
    uptest.upbound.io/timeout: "60"
spec:
  forProvider:
    manifest:
      apiVersion: kubernetes.m.crossplane.io/v1alpha1
      kind: ProviderConfig
      metadata:
        name: demo-provider-config
        namespace: default
        labels:
          foo: bar
      spec:
        credentials:
          source: Secret
          secretRef:
            namespace: default
            name: foo-cluster-config
            key: kubeconfig
  providerConfigRef:
    kind: ClusterProviderConfig
    name: kubernetes-provider
```
erhancagirici
left a comment
Thanks for the contribution! LGTM.
Since I could not hear back, and cannot add commits to your fork, I'll open a separate PR for adding the e2e test manifest.
Just seeing your comments! Thanks for merging!
Description of your changes
An Object that creates a k8s.m ProviderConfig never becomes ready.
Workarounds
Don't use an Object, just create it directly in the control plane.
This is what I started off with, but the `ProviderConfig` doesn't have a Ready state. I like to gate `Usage` creation by the resources it is protecting actually being ready. Even with `gotemplating.fn.crossplane.io/ready: "True"` this didn't seem to satisfy the gate condition: https://github.com/hops-ops/aws-auto-eks-cluster/actions/runs/21078864135/job/60627459422

Custom Readiness Signal
This is what I'm using now, and it's "working" but there are still reconciliations being triggered under the hood, though much less frequently: https://github.com/hops-ops/aws-auto-eks-cluster/actions/runs/21084074461/job/60643913257
The change
This PR fixes a regression introduced when SSA was promoted to beta (2480bb0). At that time, provider-kubernetes began migrating legacy CSA field managers to SSA using `csaupgrade.UpgradeManagedFieldsPatch`. The migration predicate checks for a legacy manager name derived from the default REST user agent (internal/controller/fieldmanager.go), which is also used by controller-runtime Update calls for status and finalizers.

As a result, the provider's own status/finalizer updates are mistaken for legacy CSA ownership and the managed fields migration is retriggered continuously. The Object never reaches Ready because the controller keeps detecting a "needed" migration on each reconcile.
This change narrows the migration predicate to ignore:

- the `status` subresource, and
- `metadata.finalizers`.

This keeps CSA->SSA migration intact for desired-state fields (including spec-less resources like ConfigMap/Secret/RBAC with top-level `data`/`rules`), while avoiding false positives from controller bookkeeping. Tests are added to ensure spec-less CSA ownership triggers migration and to validate that status/finalizers-only entries do not.
Effect: `ProviderConfig` reaches `Ready` when it's successfully created, while still allowing CSA->SSA migration for spec-owning entries.

I have:

- Run `make reviewable test` to ensure this PR is ready for review.

How has this code been tested
Tests Added
Unit tests in `internal/controller/{cluster,namespaced}/object/syncer_test.go`:

StatusOnlyUpdateIsIgnored
Verifies status subresource entries don't trigger migration
Why ignoring status subresource entries is semantically correct:
Looking at the code in syncer.go:143-163:
Status subresource semantics:

- Status updates go through a separate endpoint (the `/status` subresource) - they're never part of SSA's "desired state" reconciliation.
- The migration is designed to migrate CSA ownership of desired state fields to SSA. Status is reported state, not applied state.
- Status ownership is not part of the conflicting-managers merge scenario that SSA solves for.
- The managed fields entry has `Subresource: "status"` specifically to distinguish it.
The original bug was in the predicate, not the migration
The original `needSSAFieldManagerUpgrade` was too broad - it matched the manager name without considering the subresource or which fields the entry actually owns.

So this isn't "changing behavior" - it's fixing the predicate to correctly express the intended behavior: only migrate CSA ownership of desired state fields. Status was never supposed to be considered for SSA migration; the original code just didn't filter it out.
FinalizersOnlyUpdateIsIgnored
Verifies finalizers-only metadata entries don't trigger migration
The key question: are finalizers part of what SSA applies?

Looking at the flow: finalizers are controller bookkeeping, not user desired state. The controller adds them via a separate Update call (not Apply), so a managed fields entry that owns only `metadata.finalizers` should not be treated as CSA-owned desired state.
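To make that concrete, here is a small stdlib-only sketch (not the provider's code) that lists which top-level field groups a FieldsV1 ownership set records; the finalizer value and field keys shown are illustrative examples:

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// topLevelOwned lists the top-level field groups (f:spec, f:data, f:metadata,
// ...) recorded in a FieldsV1 ownership set. Illustrative only; the real
// provider works with metav1.FieldsV1 from apimachinery.
func topLevelOwned(fieldsV1 string) []string {
	var m map[string]json.RawMessage
	if err := json.Unmarshal([]byte(fieldsV1), &m); err != nil {
		return nil
	}
	keys := make([]string, 0, len(m))
	for k := range m {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	return keys
}

func main() {
	// A finalizer add goes through a plain Update, and its managed fields
	// entry owns only metadata.finalizers -- no desired-state fields at all.
	finalizerEntry := `{"f:metadata":{"f:finalizers":{"v:\"finalizer.managedresource.crossplane.io\"":{}}}}`
	fmt.Println(topLevelOwned(finalizerEntry)) // [f:metadata]

	// A CSA-created ConfigMap entry owns top-level data, which SSA migration
	// must still pick up even though there is no spec.
	configMapEntry := `{"f:data":{"f:key":{}}}`
	fmt.Println(topLevelOwned(configMapEntry)) // [f:data]
}
```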
SpecUpdateTriggersUpgrade
Confirms spec ownership still triggers migration (regression guard)
TopLevelDataUpdateTriggersUpgrade
Confirms spec-less resources (ConfigMap/Secret) still migrate (regression guard)
NonLegacyManagerIsIgnored
Confirms only legacy CSA managers are considered
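The test matrix above can be sketched as a table-driven check. The case names mirror the PR's tests, but the `entry` type, the `shouldMigrate` helper, and the manager names are simplified illustrations, not the actual test code in `syncer_test.go`:

```go
package main

import "fmt"

// entry is a simplified stand-in for a managed fields entry, reduced to the
// attributes the predicate cares about.
type entry struct {
	manager, operation, subresource string
	ownsOnlyFinalizers              bool
}

// shouldMigrate is a toy version of the narrowed predicate under test.
func shouldMigrate(e entry, legacy string) bool {
	return e.manager == legacy &&
		e.operation == "Update" &&
		e.subresource != "status" &&
		!e.ownsOnlyFinalizers
}

func main() {
	const legacy = "legacy-manager" // hypothetical legacy CSA manager name
	cases := []struct {
		name string
		e    entry
		want bool
	}{
		{"StatusOnlyUpdateIsIgnored", entry{legacy, "Update", "status", false}, false},
		{"FinalizersOnlyUpdateIsIgnored", entry{legacy, "Update", "", true}, false},
		{"SpecUpdateTriggersUpgrade", entry{legacy, "Update", "", false}, true},
		{"TopLevelDataUpdateTriggersUpgrade", entry{legacy, "Update", "", false}, true},
		{"NonLegacyManagerIsIgnored", entry{"kubectl", "Update", "", false}, false},
	}
	for _, c := range cases {
		got := shouldMigrate(c.e, legacy)
		fmt.Printf("%s: got=%v want=%v\n", c.name, got, c.want)
	}
}
```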
Reproduction repository:
I also created a repository to reproduce the bug, and show the patch being applied and fixing it:
`full-test.sh`: reproduces with `.m` `ProviderConfig` + `Object`s: https://github.com/hops-ops/provider-kubernetes-patch-test-env

```shell
git clone https://github.com/hops-ops/provider-kubernetes-patch-test-env
cd provider-kubernetes-patch-test-env
PROVIDER_REPO=git@github.com:hops-ops/provider-kubernetes.git \
PROVIDER_REF=main \
./full-test.sh
```

This will set up the problem state, apply the patch from the PR, and then show the resource becoming ready.
Observed results:
AI usage
This is my first time diving into the kubernetes-provider codebase and I used AI to assist me in diagnosing the problem, so I definitely would like some eyes on this. I've been using crossplane for a few years though! :)
I asked Claude and Codex both separately to figure out the problem from a running reproduction of it, and they came up with similar answers. I made them review each other's work and they both decided that Codex's solution was better, and added some more tests to guard against regressions that Claude's work would have caused.
That said, I still needed to spend several hours digging in, making it easy to reproduce the before and after state, understanding more historical context, and asking the AI to justify its changes. This fixes the problem, but I want to make sure I'm not missing some wider contextual understanding.
Why AI thinks the change is safe

- The provider's own status/finalizer updates produce managed fields entries with `operation=Update` and the legacy manager name. Treating those entries as "needs SSA migration" is a false positive and can keep the Object reconciling indefinitely (the "never Ready" failure).
- Ignoring entries that only own `metadata.finalizers` or `status` matches the intended migration behavior: migrate ownership for desired-state fields while ignoring controller-owned bookkeeping.
- Matching only `f:spec` fields avoids this bug but risks skipping migration for resources whose desired fields are not under `spec` (e.g., ConfigMap/Secret/RBAC with `data`/`rules` at the top level).