[WIP] fix: ResourceGroup controller bugs #1568

karlkfi · 2025-02-25T21:30:02Z

Fix e2e tests to correctly update the ResourceGroup status
Optimize e2e tests to watch, instead of polling
Fix the ResourceGroup controller to wait for status.observedGeneration updates by the applier inventory client. This fixes a race condition and prevents status update conflicts.
Fix TestStatusEnabledAndDisabled to allow changes to the status.generation by the applier.
Fix the ResourceGroup controller to skip condition updates that only change the timestamp. This avoids unnecessary updates.
Fix the ResourceGroup controller to correctly update the kstatus and the reconcile status, not just the kstatus.
Change ResourceGroup to skip setting Reconciling=True before imediately setting it to Reconciling=False. This avoids unnecessary updates.
Fix ResourceGroup controller to let the controller-manager handle retries with shared backoff, instead of using inline retries with independent backoff.
Fix ResourceGroup controller to log errors encountered when computing the reconcile status.
Fix ResourceGroup "root" controller to remove unnecessary "optimizations" which could cause updates to be missed. The core controller already handles these skips, and the controller-manager already handles de-duping identical events. This avoids temporarily incorrect status and delayed status updates.

Extracted:

google-oss-prow · 2025-02-25T21:30:09Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from karlkfi. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

The ResourceGroup controller has been updating the Reconciling status condition to status=True every time it attempts to reconcile. This is unnecessary, because there are no side-effects during reconciling, and no other client needs to know when the ResourceGroup controller is reconciling. In fact, this causes several problems: - kstatus constantly flaps from Current to InProgress - the resourceVersion is constantly going up - etcd is being filled up with changes, increasing memory and storage - watch events are being propegated to controllers, like the RGC, which triggers perpetual unnecessary updates. This change fixes that problem by skipping the initial condition update. Now the Reconciling condition is only ever set to False, either as FinishReconciling or ExceedTimeout. This is reasonable, because the ResourceGroup controller doesn't act like a standard operator. It is not updating the status to reflect the spec. It's updating the status to reflect the status. In fact, the applier doesn't act like a ResourceGroup operator either. The applier updates the spec & the status at the same time, using the ResourceGroup spec as a status of what has already happened in the source of truth, rather than as a request for something to happen.

- Fix e2e tests to correctly update the ResourceGroup status - Optimize e2e tests to watch, instead of polling - Fix the ResourceGroup controller to wait for status.observedGeneration updates by the applier inventory client. This fixes a race condition and prevents status update conflicts. - Fix TestStatusEnabledAndDisabled to allow changes to the status.generation by the applier. - Fix the ResourceGroup controller to correctly update the kstatus and the reconcile status, not just the kstatus. - Fix ResourceGroup controller to let the controller-manager handle retries with shared backoff, instead of using inline retries with independent backoff. - Fix ResourceGroup controller to log errors encountered when computing the reconcile status. - Fix ResourceGroup "root" controller to remove unnecessary "optimizations" which could cause updates to be missed. The core controller already handles these skips, and the controller-manager already handles de-duping identical events. This avoids temporarily incorrect status and delayed status updates.

google-oss-prow · 2025-02-26T01:09:17Z

@karlkfi: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
kpt-config-sync-presubmit-e2e-multi-repo-test-group1	`be6cbac`	link	true	`/test kpt-config-sync-presubmit-e2e-multi-repo-test-group1`

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

google-oss-prow bot added the do-not-merge/work-in-progress label Feb 25, 2025

google-oss-prow bot requested review from sdowell and tiffanny29631 February 25, 2025 21:30

google-oss-prow bot added the size/XL label Feb 25, 2025

karlkfi force-pushed the karl-cleanup-rgc-e2e branch from e80cfcf to 7a37956 Compare February 26, 2025 00:28

karlkfi added 2 commits February 25, 2025 16:47

karlkfi force-pushed the karl-cleanup-rgc-e2e branch from 7a37956 to be6cbac Compare February 26, 2025 00:50

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[WIP] fix: ResourceGroup controller bugs #1568

[WIP] fix: ResourceGroup controller bugs #1568

karlkfi commented Feb 25, 2025 •

edited

Loading

google-oss-prow bot commented Feb 25, 2025

google-oss-prow bot commented Feb 26, 2025

[WIP] fix: ResourceGroup controller bugs #1568

Are you sure you want to change the base?

[WIP] fix: ResourceGroup controller bugs #1568

Conversation

karlkfi commented Feb 25, 2025 • edited Loading

google-oss-prow bot commented Feb 25, 2025

google-oss-prow bot commented Feb 26, 2025

karlkfi commented Feb 25, 2025 •

edited

Loading