Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] fix: ResourceGroup controller bugs #1568

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

karlkfi
Copy link
Contributor

@karlkfi karlkfi commented Feb 25, 2025

  • Fix e2e tests to correctly update the ResourceGroup status
  • Optimize e2e tests to watch, instead of polling
  • Fix the ResourceGroup controller to wait for status.observedGeneration updates by the applier inventory client. This fixes a race condition and prevents status update conflicts.
  • Fix TestStatusEnabledAndDisabled to allow changes to the status.generation by the applier.
  • Fix the ResourceGroup controller to skip condition updates that only change the timestamp. This avoids unnecessary updates.
  • Fix the ResourceGroup controller to correctly update the kstatus and the reconcile status, not just the kstatus.
  • Change ResourceGroup to skip setting Reconciling=True before imediately setting it to Reconciling=False. This avoids unnecessary updates.
  • Fix ResourceGroup controller to let the controller-manager handle retries with shared backoff, instead of using inline retries with independent backoff.
  • Fix ResourceGroup controller to log errors encountered when computing the reconcile status.
  • Fix ResourceGroup "root" controller to remove unnecessary "optimizations" which could cause updates to be missed. The core controller already handles these skips, and the controller-manager already handles de-duping identical events. This avoids temporarily incorrect status and delayed status updates.

Extracted:

Copy link

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from karlkfi. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

The ResourceGroup controller has been updating the Reconciling
status condition to status=True every time it attempts to reconcile.
This is unnecessary, because there are no side-effects during
reconciling, and no other client needs to know when the ResourceGroup
controller is reconciling. In fact, this causes several problems:
- kstatus constantly flaps from Current to InProgress
- the resourceVersion is constantly going up
- etcd is being filled up with changes, increasing memory and storage
- watch events are being propegated to controllers, like the RGC,
  which triggers perpetual unnecessary updates.

This change fixes that problem by skipping the initial condition
update. Now the Reconciling condition is only ever set to False,
either as FinishReconciling or ExceedTimeout.

This is reasonable, because the ResourceGroup controller doesn't act
like a standard operator. It is not updating the status to reflect
the spec. It's updating the status to reflect the status. In fact,
the applier doesn't act like a ResourceGroup operator either. The
applier updates the spec & the status at the same time, using the
ResourceGroup spec as a status of what has already happened in the
source of truth, rather than as a request for something to happen.
- Fix e2e tests to correctly update the ResourceGroup status
- Optimize e2e tests to watch, instead of polling
- Fix the ResourceGroup controller to wait for
  status.observedGeneration updates by the applier inventory client.
  This fixes a race condition and prevents status update conflicts.
- Fix TestStatusEnabledAndDisabled to allow changes to the
  status.generation by the applier.
- Fix the ResourceGroup controller to correctly update the kstatus
  and the reconcile status, not just the kstatus.
- Fix ResourceGroup controller to let the controller-manager handle
  retries with shared backoff, instead of using inline retries with
  independent backoff.
- Fix ResourceGroup controller to log errors encountered when
  computing the reconcile status.
- Fix ResourceGroup "root" controller to remove unnecessary
  "optimizations" which could cause updates to be missed.
  The core controller already handles these skips, and the
  controller-manager already handles de-duping identical events.
  This avoids temporarily incorrect status and delayed status updates.
@karlkfi karlkfi force-pushed the karl-cleanup-rgc-e2e branch from 7a37956 to be6cbac Compare February 26, 2025 00:50
Copy link

@karlkfi: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
kpt-config-sync-presubmit-e2e-multi-repo-test-group1 be6cbac link true /test kpt-config-sync-presubmit-e2e-multi-repo-test-group1

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant