[WIP] fix: ResourceGroup controller bugs #1568
base: main
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
e80cfcf to 7a37956
The ResourceGroup controller has been updating the Reconciling status condition to status=True every time it attempts to reconcile. This is unnecessary, because there are no side effects during reconciling, and no other client needs to know when the ResourceGroup controller is reconciling. In fact, this causes several problems:

- kstatus constantly flaps from Current to InProgress
- the resourceVersion is constantly going up
- etcd is being filled up with changes, increasing memory and storage
- watch events are being propagated to controllers, like the RGC, which triggers perpetual unnecessary updates

This change fixes that problem by skipping the initial condition update. Now the Reconciling condition is only ever set to False, either as FinishReconciling or ExceedTimeout.

This is reasonable, because the ResourceGroup controller doesn't act like a standard operator. It is not updating the status to reflect the spec. It's updating the status to reflect the status. In fact, the applier doesn't act like a ResourceGroup operator either. The applier updates the spec and the status at the same time, using the ResourceGroup spec as a status of what has already happened in the source of truth, rather than as a request for something to happen.
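As a rough illustration of why avoiding redundant condition writes helps, here is a minimal sketch that only reports a change when a condition actually differs from what is already stored. The `Condition` type and `setCondition` helper are hypothetical stand-ins written for this example, not the controller's real types; only the `Reconciling` condition and the `FinishReconciling` and `ExceedTimeout` reasons come from the description above.

```go
package main

import "fmt"

// Condition is a simplified, hypothetical stand-in for a status condition.
type Condition struct {
	Type   string
	Status string
	Reason string
}

// setCondition returns the resulting condition list and reports whether it
// differs from the input. A caller would only issue a status update (and so
// bump resourceVersion and emit a watch event) when changed is true.
func setCondition(conds []Condition, next Condition) ([]Condition, bool) {
	for i, c := range conds {
		if c.Type != next.Type {
			continue
		}
		if c == next {
			// Identical condition already present: nothing to write.
			return conds, false
		}
		// Copy before modifying so the caller's slice is left untouched.
		out := append([]Condition(nil), conds...)
		out[i] = next
		return out, true
	}
	// No condition of this type yet: append it to a copy.
	return append(append([]Condition(nil), conds...), next), true
}

func main() {
	conds := []Condition{{Type: "Reconciling", Status: "False", Reason: "FinishReconciling"}}

	// Re-applying the same terminal condition should not require a write.
	_, changed := setCondition(conds, Condition{Type: "Reconciling", Status: "False", Reason: "FinishReconciling"})
	fmt.Println("write needed:", changed)

	// A different reason is a real change.
	_, changed = setCondition(conds, Condition{Type: "Reconciling", Status: "False", Reason: "ExceedTimeout"})
	fmt.Println("write needed:", changed)
}
```

With no intermediate status=True write, a repeated reconcile that ends in the same terminal condition produces no status change at all in this sketch.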
- Fix e2e tests to correctly update the ResourceGroup status
- Optimize e2e tests to watch, instead of polling
- Fix the ResourceGroup controller to wait for status.observedGeneration updates by the applier inventory client. This fixes a race condition and prevents status update conflicts.
- Fix TestStatusEnabledAndDisabled to allow changes to the status.generation by the applier.
- Fix the ResourceGroup controller to correctly update the kstatus and the reconcile status, not just the kstatus.
- Fix ResourceGroup controller to let the controller-manager handle retries with shared backoff, instead of using inline retries with independent backoff.
- Fix ResourceGroup controller to log errors encountered when computing the reconcile status.
- Fix ResourceGroup "root" controller to remove unnecessary "optimizations" which could cause updates to be missed. The core controller already handles these skips, and the controller-manager already handles de-duping identical events. This avoids temporarily incorrect status and delayed status updates.
7a37956 to be6cbac
@karlkfi: The following test failed, say