
fix: object that creates a k8s.m ProviderConfig never becomes ready#423

Merged
erhancagirici merged 1 commit into crossplane-contrib:main from hops-ops:main
Mar 17, 2026

Conversation

@patrickleet
Contributor

Description of your changes

An Object that creates a k8s.m ProviderConfig never becomes ready.

apiVersion: kubernetes.m.crossplane.io/v1alpha1
kind: Object
metadata:
  name: {{ $am.cluster.name }}-k8s-provider-config
  labels: {{ $am.labels | toJson }}
  annotations:
    {{ setResourceNameAnnotation "k8s-provider-config" }}
spec:
  managementPolicies: {{ $am.managementPolicies | toJson }}
  forProvider:
    manifest:
      apiVersion: kubernetes.m.crossplane.io/v1alpha1
      kind: ProviderConfig
      metadata:
        name: {{ $am.cluster.name }}
        namespace: {{ $am.namespace }}
      spec:
        credentials:
          secretRef:
            key: kubeconfig
            name: {{ $am.cluster.name }}
            namespace: {{ $am.namespace }}
          source: Secret
  providerConfigRef:
    name: {{ $am.kubernetesProviderConfig.name }}
    kind: {{ $am.kubernetesProviderConfig.kind }}

Workarounds

Don't use an Object, just create it directly in the control plane.

This is what I started off with, but the ProviderConfig doesn't have a Ready state. I like to gate Usage creation by the resources it is protecting actually being ready.

Even with gotemplating.fn.crossplane.io/ready: "True" this didn't seem to satisfy the gate condition: https://github.com/hops-ops/aws-auto-eks-cluster/actions/runs/21078864135/job/60627459422

Custom Readiness Signal

spec:
  readiness:
    policy: DeriveFromCelQuery
    celQuery: "has(object.metadata.uid)"

This is what I'm using now, and it's "working" but there are still reconciliations being triggered under the hood, though much less frequently: https://github.com/hops-ops/aws-auto-eks-cluster/actions/runs/21084074461/job/60643913257

The change

This PR fixes a regression introduced when SSA was promoted to beta (2480bb0). At that time, provider-kubernetes began migrating legacy CSA field managers to SSA using csaupgrade.UpgradeManagedFieldsPatch. The migration predicate checks for a legacy manager name derived from the default REST user agent (internal/controller/fieldmanager.go), which is also used by controller-runtime Update calls for status and finalizers.

As a result, the provider's own status/finalizer updates are mistaken for legacy CSA ownership, and the managed-fields migration is retriggered continuously. The Object never reaches Ready because the controller detects a "needed" migration on every reconcile.

This change narrows the migration predicate to ignore:

  • managed fields entries for the status subresource, and
  • entries that only manage metadata.finalizers.

This keeps CSA->SSA migration intact for desired-state fields (including spec-less resources like ConfigMap/Secret/RBAC with top-level data/rules), while avoiding false positives from controller bookkeeping.

Tests are added to ensure spec-less CSA ownership triggers migration and to validate that status/finalizers-only entries do not.

Effect:

  • Object-managed ProviderConfig reaches Ready when it’s successfully created, while still allowing CSA->SSA migration for spec-owning entries.

I have:

  • Read and followed Crossplane's contribution process.
  • Run make reviewable test to ensure this PR is ready for review.

How has this code been tested

Tests Added

Unit tests in internal/controller/{cluster,namespaced}/object/syncer_test.go:

StatusOnlyUpdateIsIgnored

Verifies status subresource entries don't trigger migration

Why ignoring status subresource entries is semantically correct:

Looking at the code in syncer.go:143-163:

func (s *SSAResourceSyncer) needSSAFieldManagerUpgrade(accessor metav1.Object) bool {
    // ...
    for _, mfe := range mfes {
        if mfe.Operation != metav1.ManagedFieldsOperationUpdate || !s.legacyCSAFieldManagers.Has(mfe.Manager) {
            continue
        }
        if mfe.Subresource == "status" { // <-- The change
            continue
        }
        // ...
    }
}

Status subresource semantics:

  1. Status updates use a separate API endpoint (/status subresource) - they're never part of SSA's "desired state"
    reconciliation
  2. SSA is for desired state (spec), not observed state (status) - csaupgrade.UpgradeManagedFieldsPatch is
    designed to migrate CSA ownership of desired state fields to SSA. Status is reported state, not applied state.
  3. No conflict resolution needed for status - Status is written by a single controller; there's no multi-actor
    merge scenario that SSA solves for
  4. Status updates never go through client.Apply - They use client.Update on the status subresource. The managed
    fields entry has Subresource: "status" specifically to distinguish it.

The original bug was in the predicate, not the migration

The original needSSAFieldManagerUpgrade was too broad - it matched the manager name without considering that:

  • Status subresource updates create managed field entries with the same manager name
  • Those entries have Subresource: "status" which semantically excludes them from CSA→SSA migration

So this isn't "changing behavior" - it's fixing the predicate to correctly express the intended behavior: only migrate CSA ownership of desired state fields. Status was never supposed to be considered for SSA migration; the original code just didn't filter it out.

FinalizersOnlyUpdateIsIgnored

Verifies finalizers-only metadata entries don't trigger migration

The key question: Are finalizers part of what SSA applies?

Looking at the flow:

  1. User provides a manifest (the desired state)
  2. Provider does client.Patch(ctx, desired, client.Apply, client.FieldOwner(...))
  3. The manifest contains spec, maybe labels/annotations - but not finalizers
  4. Finalizers are added/removed separately by the controller for lifecycle management

Finalizers are controller bookkeeping, not user desired state. The controller adds them via a separate Update
call (not Apply), so:

  • Migrating finalizer ownership to the SSA field manager would be incorrect
  • The SSA manager should only own fields it actually applies
  • Multiple controllers can each own their own finalizer entries

SpecUpdateTriggersUpgrade

Confirms spec ownership still triggers migration (regression guard)

TopLevelDataUpdateTriggersUpgrade

Confirms spec-less resources (ConfigMap/Secret) still migrate (regression guard)

NonLegacyManagerIsIgnored

Confirms only legacy CSA managers are considered

Reproduction repository:

I also created a repository to reproduce the bug, and show the patch being applied and fixing it:

  • Clone and run full-test.sh:
    • creates cluster and switches context
    • installs Crossplane and provider-kubernetes
    • creates .m ProviderConfig + Objects
    • confirms Ready=False before fix
    • clones and builds the patched provider and its image, loads the image into kind
    • confirms Ready=True after fix

https://github.com/hops-ops/provider-kubernetes-patch-test-env

git clone https://github.com/hops-ops/provider-kubernetes-patch-test-env
cd provider-kubernetes-patch-test-env
PROVIDER_REPO=git@github.com:hops-ops/provider-kubernetes.git \
PROVIDER_REF=main \
./full-test.sh

This will set up the problem state, apply the patch from the PR, and then show the resource becoming ready.

Observed results:

  • Before patch: ProviderConfig Object Ready=False (Creating), Synced=True
  • After patch: ProviderConfig Object Ready=True (Available), Synced=True

AI usage

This is my first time diving into the kubernetes-provider codebase and I used AI to assist me in diagnosing the problem, so I definitely would like some eyes on this. I've been using crossplane for a few years though! :)

I asked Claude and Codex both separately to figure out the problem from a running reproduction of it, and they came up with similar answers. I made them review each other's work and they both decided that Codex's solution was better, and added some more tests to guard against regressions that Claude's work would have caused.

That said, I still needed to spend several hours digging in, making it easy to reproduce the before and after state, understanding more historical context, and asking AI to justify its changes. This fixes the problem, but I want to make sure I'm not missing some wider contextual understanding.

Why AI thinks the change is safe

  • Migration should only occur when legacy CSA owned desired-state fields that SSA will manage. Status and finalizers are not part of the desired manifest and are not managed by SSA for these resources.
  • Provider-kubernetes updates finalizers/status with operation=Update and the legacy manager name. Treating those entries as "needs SSA migration" is a false positive and can keep the Object reconciling indefinitely (the "never Ready" failure).
  • Narrowing the predicate to ignore entries that only touch metadata.finalizers or status matches the intended migration behavior: migrate ownership for desired-state fields while ignoring controller-owned bookkeeping.
  • Checking only for f:spec fields would avoid this bug but risks skipping migration for resources whose desired fields are not under spec (e.g., ConfigMap/Secret/RBAC with data/rules at the top level).
  • Filtering out just status/finalizers preserves migration for non-spec resources while avoiding the false positives that cause the ready loop.


Signed-off-by: Patrick Lee Scott <pat@patscott.io>
@patrickleet patrickleet changed the title fix: object that creates a k8s.m Provider Config never becomes ready fix: object that creates a k8s.m ProviderConfig never becomes ready Jan 18, 2026
@ravilr
Contributor

ravilr commented Feb 12, 2026

@erhancagirici @jeanduplessis this looks like a side effect of the maybeUpgradeFieldManagers() introduced in #416 . PTAL.

We're also seeing this: the Object resource never becomes ready despite successfully syncing the underlying managed K8s resource, and it keeps seeing a diff on Observe and issuing Update() on every reconcile, perpetually, after upgrading from v1.1.0 (with explicit --enable-server-side-apply enabled) to v1.2.0. The managedFields handling seems to be making some assumptions that aren't true.

Note that enable-server-side-apply itself is not the issue (that functionality has been in the codebase since v0.15.0 and has worked well without issues when explicitly enabled through provider args); the managedFields handling added in the above PR in v1.2.0 seems to be the culprit.

patrickleet added a commit to hops-ops/aws-crossplane-stack that referenced this pull request Mar 10, 2026
Collaborator

@erhancagirici erhancagirici left a comment


@patrickleet thanks for the write-up and the PR. Looks good in general.

I wanted to add an e2e test example, but I cannot push to your fork.

Could you add the manifest file below to the PR and then add it to the UPTEST_EXAMPLE_LIST in the Makefile?

examples/namespaced/object/object-wrapped-providerconfig.yaml
apiVersion: kubernetes.m.crossplane.io/v1alpha1
kind: Object
metadata:
  name: sample-wrapped-providerconfig
  namespace: default
  annotations:
    uptest.upbound.io/timeout: "60"
spec:
  forProvider:
    manifest:
      apiVersion: kubernetes.m.crossplane.io/v1alpha1
      kind: ProviderConfig
      metadata:
        name: demo-provider-config
        namespace: default
        labels:
          foo: bar
      spec:
        credentials:
          source: Secret
          secretRef:
            namespace: default
            name: foo-cluster-config
            key: kubeconfig
  providerConfigRef:
    kind: ClusterProviderConfig
    name: kubernetes-provider

Collaborator

@erhancagirici erhancagirici left a comment


Thanks for the contribution! LGTM.

Since I did not hear back and cannot add commits to your fork, I'll open a separate PR to add the e2e test manifest.

@erhancagirici erhancagirici merged commit 0495771 into crossplane-contrib:main Mar 17, 2026
6 of 7 checks passed
@patrickleet
Contributor Author

Just seeing your comments! Thanks for merging!
