Skip to content

Conversation

@0xavi0
Copy link
Contributor

@0xavi0 0xavi0 commented Nov 24, 2025

This PR introduces changes to how GitRepos are deleted, as well as their dependent resources.

Two main changes are introduced:

  • The function that deletes the GitRepo has been modified to check for any pending resources, especially Bundles and ImageScans, since previously the GitRepo was removed before these resources were fully deleted.

  • A new controller is introduced for Content resources to track the number of BundleDeployments that reference the same one. This controller aims to eliminate the conflicts that occurred when updating or deleting BundleDeployments, as they were being updated or removed concurrently while modifying a shared Content resource.

With these changes, the goal is to eliminate both update conflicts (which triggered additional Reconcile calls in the controllers) and potential race conditions when deleting and recreating GitRepos that still had dependent resources pending deletion.

Refers to #4192
Refers to #4251

Additional Information

Checklist

- [ ] I have updated the documentation via a pull request in the
fleet-docs repository.

@0xavi0 0xavi0 self-assigned this Nov 24, 2025
@0xavi0 0xavi0 force-pushed the 4192-gitrepo-delete-cascade branch from a24d029 to fcc8657 Compare November 24, 2025 11:33
This PR introduces changes to how GitRepos are deleted, as well as their dependent resources.

Two main changes are introduced:

* The function that deletes the GitRepo has been modified to check for any pending resources, especially *Bundles* and *ImageScans*, since previously the *GitRepo* was removed before these resources were fully deleted.

* A new controller is introduced for *Content* resources to track the number of *BundleDeployments* that reference the same one.
  This controller aims to eliminate the conflicts that occurred when updating or deleting *BundleDeployments*, as they were being updated or removed concurrently while modifying a shared *Content* resource.

With these changes, the goal is to eliminate both update conflicts (which triggered additional Reconcile calls in the controllers) and potential race conditions when deleting and recreating *GitRepos* that still had dependent resources pending deletion.

Signed-off-by: Xavi Garcia <[email protected]>
@0xavi0 0xavi0 force-pushed the 4192-gitrepo-delete-cascade branch from fcc8657 to f59b831 Compare November 24, 2025 12:43
@0xavi0 0xavi0 added this to Fleet Nov 24, 2025
@0xavi0 0xavi0 added this to the v2.13.1 milestone Nov 24, 2025
@0xavi0 0xavi0 marked this pull request as ready for review November 24, 2025 13:09
@0xavi0 0xavi0 requested a review from a team as a code owner November 24, 2025 13:09
@0xavi0 0xavi0 moved this to Needs QA review in Fleet Nov 25, 2025
@0xavi0 0xavi0 moved this from Needs QA review to 👀 In review in Fleet Nov 25, 2025
Copy link
Contributor

@weyfonk weyfonk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels cleaner than the previous approach, if slightly more "manual".
Finalizers were not the right tool for the job, it seems :)

// resource, so the downstream agent can deploy them.
// +nullable
Content []byte `json:"content,omitempty"`
Content []byte `json:"content,omitempty"` // Content is a byte array, which contains the manifests of a bundle. The bundle resources are copied into the bundledeployment's content resource, so the downstream agent can deploy them.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this second sentence needed here? It already appears on L27-28.

workersOpts.Schedule = w
}

if d := os.Getenv("CONTENT_RECONCILER_WORKERS"); d != "" {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This env var should probably be set in the fleet-controller deployment.

Comment on lines +663 to +665
func GetNamespaceAndNameID(namespace, name string) string {
return fmt.Sprintf("%s/%s", namespace, name)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we get rid of the commented out code above, then this function is no longer needed.

return nil // continue iterating no matter what
})

return errors.Join(errs...)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will format errors with a newline between each error message (skipping nil errors). I'm not sure how well that would work in the Rancher UI.
This may not be critical, but perhaps something to test.

ShardID: shardID,
Workers: workersOpts.Content,
}).SetupWithManager(mgr); err != nil {
setupLog.Error(err, "unable to content controller", "controller", "Content")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This typo is very relatable :D

Suggested change
setupLog.Error(err, "unable to content controller", "controller", "Content")
setupLog.Error(err, "unable to create controller", "controller", "Content")

bd.Spec.ValuesHash = valuesHash

// When content resources are stored in etcd, we need to keep track of the content resource so they
// are properly gargabe collected by the content controller.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit:

Suggested change
// are properly gargabe collected by the content controller.
// are properly garbage-collected by the content controller.

// SetupWithManager sets up the controller with the Manager.
func (r *ContentReconciler) SetupWithManager(mgr ctrl.Manager) error {
return ctrl.NewControllerManagedBy(mgr).
Named("content-controller").
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The docs for this state:

By default, controllers are named using the lowercase version of their kind.

Except for a few (drift, gitrepo status and helmops status), we don't name our reconcilers. We may want to harmonise their names for metrics at some point.

In the meantime, for consistency, I'd be tempted to set:

Suggested change
Named("content-controller").
Named("content"). // equivalent to the default, as per the docs, so we might as well skip this line

or

Suggested change
Named("content-controller").
Named("content-reconciler"). // same format as in the drift controller

That being said, I'm not sure how useful metrics for this controller would be, so that's definitely not a major issue ;)

) (controllerutil.OperationResult, *fleet.BundleDeployment, error) {
logger := l.WithValues("deploymentID", bd.Spec.DeploymentID)

// When content resources are stored in etcd, we need to add finalizers.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We will want to update unit tests for this controller (e.g. here), to reflect the fact that content management is now owned by another controller.

}
}

return ctrl.Result{}, nil
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It could be nice to (unit-)test this, just to be on the safe side. WDYT?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

Status: 👀 In review

Development

Successfully merging this pull request may close these issues.

2 participants