Skip to content

Conversation

@a-hilaly
Copy link
Member

@a-hilaly a-hilaly commented Dec 7, 2025

Previously, when a ResourceGraphDefinition had issues (e.g., missing RBAC
permissions), the Register() method would block indefinitely on
cache.WaitForCacheSync() while holding a global lock, causing all other RGDs
to stall. This change makes registration non-blocking by moving the cache sync
wait into a goroutine and using channel-based notification when sync completes.

/hold

@k8s-ci-robot
Copy link
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot k8s-ci-robot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Dec 7, 2025
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: a-hilaly

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 7, 2025
@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Dec 7, 2025
@a-hilaly a-hilaly force-pushed the dc/non-blocking-registrations branch from 7d84bf7 to 4f083e8 Compare December 8, 2025 09:06
@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Dec 8, 2025
@tjamet
Copy link
Contributor

tjamet commented Dec 8, 2025

IIRC the problem is then RGDController tries to register an instance controller to the dynamicController, it blocks any other RGD reconciliation because the dynamic controller holds a global lock that never gets released because lazyInforner.AddHandler never returns waiting for cache sync that never happens, correct?

In the current implementation, there are 2 identified cases:

We should recover when:

  • KRO got granted permissions to list/watch the parent resource
  • KRO got granted permissions to list/watch the child resource
  • The RGD got updated and removes the need to watch resources without permissions

From quick read, I we detect permission changes that would recover a functional state.
It looks like we are already watching for informer errors.
Would updating this handler for something like

	_ = inf.SetWatchErrorHandler(func(_ *cache.Reflector, err error) {
		w.log.V(1).Error(err, "watch error for lazy informer", "gvr", w.gvr)
		if isNoPermissions(err) {
			w.startErr = err
			w.Shutdown()
		}
	})

and update the handler registration:

		if !cache.WaitForCacheSync(w.done, w.informer.HasSynced) {
			w.log.Error(fmt.Errorf("cache sync failed"), "lazy informer sync failure", "gvr", w.gvr)
			if w.cancel!= nil{
				w.cancel()
			}
			w.cancel = nil
			w.informer = nil
			w.handlers = make(map[string]cache.ResourceEventHandlerRegistration)
			if w.startErr != nil {
				return w.startErr
			}
			return fmt.Errorf("failed to sync informer for %s", w.gvr)
		}

We would also need to make sure this error is correctly propagated to the RGD reconciler and the RGD status is updated as expected

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 10, 2025
@k8s-ci-robot
Copy link
Contributor

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@a-hilaly a-hilaly changed the title WIP: Implement asynchronous watch registration [do-not-merge][experiment]: asynchronous watch registration Dec 11, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants