Implement ManagedServiceAccount resources reconciliation#130

Open
yxun wants to merge 10 commits into
stolostron:mainfrom
yxun:msa-cr
Conversation

@yxun

@yxun yxun commented Jun 15, 2026

Collaborator

This PR implements the creation and cleanup of the ManagedServiceAccount resources for each managed cluster in the ClusterSet.

The goal of this PR is to obtain an API access secret for each managed cluster by creating a ManagedServiceAccount custom resource for it. Distribution of those API access secrets (tokens) will be implemented in the next PR.

References:

This PR is the third part of the original large PR:
#55

The controller reconciliation needs an update method for existing MSAs. I created issue #120 for it; that TODO item will be handled in follow-up PRs.
The e2e test suite needs new tests. I created issue #133 for them; that TODO item will be handled in follow-up PRs.

@openshift-ci

openshift-ci Bot commented Jun 15, 2026


[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: yxun

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@coderabbitai

coderabbitai Bot commented Jun 15, 2026

Contributor

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

DiscoveryConfig.TokenValidity transitions from string to metav1.Duration with default 360h, minimum 10m, and validation restricted to hours/minutes/seconds. New file endpoint_discovery.go implements three Reconciler methods: createManagedServiceAccounts creates one MSA per managed cluster with token rotation validity configured from TokenValidity, cleanupManagedServiceAccounts removes MSAs and istio-remote Secrets not in the current ClusterSet, and deleteAllManagedServiceAccounts performs full teardown during mesh deletion. The mesh controller's doReconcile invokes creation (when cert-manager is configured) and cleanup (after manifest work); handleDeletion invokes full deletion before finalizer removal. RBAC permissions updated for managedserviceaccounts. Integration tests cover MSA lifecycle: creation on cluster addition, cleanup on cluster removal or label clearing, skipping for unlabeled clusters, and multi-mesh retention/deletion semantics.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 8 | ❌ 2

❌ Failed checks (2 warnings)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 25.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Test Structure And Quality | ⚠️ Warning | Multiple test quality issues found in Endpoint discovery tests: (1) Eventually() calls lack explicit timeout/interval parameters (e.g., line 697, 722); (2) Assertion messages missing (e.g., `.Shoul... | Add timeout parameters to Eventually/Consistently calls (e.g., Eventually(..., 30*time.Second, 1*time.Second)), add assertion messages like Expect(...).Should(Succeed(), "expected MSA creation"). |

✅ Passed checks (8 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title accurately summarizes the main change: implementation of ManagedServiceAccount resource reconciliation, which is the core objective across all modified files. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Stable And Deterministic Test Names | ✅ Passed | All 7 new test titles in the "Endpoint discovery" section use stable, deterministic names with no dynamic values, variables, or concatenation. |
| No-Weak-Crypto | ✅ Passed | No weak cryptography patterns found. PR uses only standard Kubernetes APIs for token management via ManagedServiceAccount addon, with no custom crypto, weak algorithms, or insecure comparisons. |
| Container-Privileges | ✅ Passed | PR modifies only Go source and documentation files; no K8s manifests are added/modified. Existing deployment already enforces secure security contexts without privileges, root access, or dangerous... |
| No-Sensitive-Data-In-Logs | ✅ Passed | All logging statements in the PR only include safe identifiers (names, namespaces); no passwords, tokens, API keys, PII, or secret data are exposed in logs. |
| Description check | ✅ Passed | The PR description accurately describes the implementation of ManagedServiceAccount resource creation and cleanup for managed clusters, matching the changesets across multiple files. |


@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
pkg/hub/mesh/controller.go (1)

153-161: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

RBAC markers are missing permissions required by new endpoint-discovery lifecycle calls.

New reconcile paths perform Get/List/Create/Delete on ManagedServiceAccount and Delete on Secret, but the markers here do not grant those verbs/resources. This will fail with forbidden in real deployments.

At minimum, add kubebuilder RBAC for:

  • authentication.open-cluster-management.io / managedserviceaccounts: get;list;watch;create;update;patch;delete
  • core secrets: include delete (currently only get;list;watch)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/hub/mesh/controller.go` around lines 153 - 161, The RBAC markers in the
controller are missing permissions required by new endpoint-discovery lifecycle
calls. Add a new kubebuilder RBAC marker for the
authentication.open-cluster-management.io API group with resource
managedserviceaccounts and verbs get;list;watch;create;update;patch;delete to
support the reconcile paths that perform Get, List, Create, and Delete
operations on ManagedServiceAccount resources. Additionally, update the existing
kubebuilder RBAC marker for core API group secrets resource to include the
delete verb alongside the current get;list;watch verbs, since the new reconcile
paths also delete Secrets.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/hub/mesh/endpoint_discovery.go`:
- Line 35: The MSA identity and selectors are scoped only by mesh name, creating
collision risks when meshes with the same name exist in different namespaces. At
the MSA name creation in the format string at line 35 (where msaName is
constructed), include the mesh namespace in addition to the mesh name to create
a namespace-qualified identity, either by directly incorporating mesh.Namespace
into the name or by hashing the (namespace, name) tuple. Apply the same
namespace-qualified identity change at lines 71-72 and 112-113 where other MSA
name references occur. Additionally, for all MSA list selectors that appear in
the code (around the affected lines), include MeshNamespaceLabel: mesh.Namespace
in the selector to ensure MSA queries are scoped to the specific mesh namespace
and prevent cross-mesh resource deletions during cleanup.
- Around line 37-40: The current error handling in the Get call for the
ManagedServiceAccount resource treats all errors as "not found" and proceeds to
create, which masks transient, RBAC, and API errors. Refactor the error handling
to explicitly check if the returned error is a NotFound error using the
appropriate error checking utility from the client-go library. If the error is
NotFound, proceed with the create logic; if the error is any other type, log the
error with context and return/fail fast rather than attempting to create. This
ensures transient and permission-related failures are properly surfaced instead
of being silently treated as non-existent resources.
- Around line 49-52: The `TokenValidity` field is optional in the API shape but
is being unconditionally dereferenced with the `*` operator when initializing
the `Rotation` struct in the `ManagedServiceAccountRotation` assignment. To
prevent reconciliation panics when the field is absent or nil, add a nil check
before dereferencing `mesh.Spec.Security.Discovery.TokenValidity` and provide a
sensible default duration value when the pointer is nil. Use a conditional
expression or helper function to select either the dereferenced value or the
default based on whether the pointer is nil.

In `@test/integration/controller_test.go`:
- Around line 657-678: Add explicit timeout and polling interval parameters to
all Eventually and Consistently calls that interact with the cluster, as these
calls currently lack deterministic timing controls. For the Eventually call in
the "should process a ManagedServiceAccount" test block (which retrieves
ManagedServiceAccount via k8sClient.Get), add timeout and polling interval
parameters following the pattern Eventually(func() error { ... },
30*time.Second, 1*time.Second).Should(Succeed()). Apply this same pattern to all
other Eventually and Consistently assertions throughout the test file that
interact with cluster operations to ensure test reliability and meet the coding
guidelines for async cluster operations.

---

Outside diff comments:
In `@pkg/hub/mesh/controller.go`:
- Around line 153-161: The RBAC markers in the controller are missing
permissions required by new endpoint-discovery lifecycle calls. Add a new
kubebuilder RBAC marker for the authentication.open-cluster-management.io API
group with resource managedserviceaccounts and verbs
get;list;watch;create;update;patch;delete to support the reconcile paths that
perform Get, List, Create, and Delete operations on ManagedServiceAccount
resources. Additionally, update the existing kubebuilder RBAC marker for core
API group secrets resource to include the delete verb alongside the current
get;list;watch verbs, since the new reconcile paths also delete Secrets.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 17acd6b9-7cf0-47e1-8c8c-2ab62ec7d496

📥 Commits

Reviewing files that changed from the base of the PR and between 62ab8dd and 843b812.

⛔ Files ignored due to path filters (1)
  • pkg/apis/mesh/v1alpha1/zz_generated.deepcopy.go is excluded by !**/zz_generated*
📒 Files selected for processing (4)
  • pkg/apis/mesh/v1alpha1/types.go
  • pkg/hub/mesh/controller.go
  • pkg/hub/mesh/endpoint_discovery.go
  • test/integration/controller_test.go

Comment thread pkg/hub/mesh/endpoint_discovery.go Outdated
Comment thread pkg/hub/mesh/endpoint_discovery.go
Comment thread pkg/hub/mesh/endpoint_discovery.go
Comment thread test/integration/controller_test.go

@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@test/integration/controller_test.go`:
- Around line 725-728: The test currently only verifies deletion of
ManagedServiceAccounts for one mesh (meshName-istio-reader) in both clusters,
but the test scenario involves two meshes. To properly verify full multi-mesh
cleanup semantics, add two additional util.ExpectResourceDeleted calls after the
existing assertions that verify deletion of the otherMesh-istio-reader
ManagedServiceAccount in cluster1 and cluster2. Each new call should follow the
same pattern as the existing util.ExpectResourceDeleted invocations but use
fmt.Sprintf("%s-istio-reader", otherMesh) instead of meshName to ensure all
mesh-owned resources are properly cleaned up.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: dd038605-8d26-45d3-88ae-5012a76e617e

📥 Commits

Reviewing files that changed from the base of the PR and between 843b812 and a9e7acc.

📒 Files selected for processing (1)
  • test/integration/controller_test.go

Comment thread test/integration/controller_test.go Outdated
Comment thread pkg/hub/mesh/controller.go
Comment thread pkg/hub/mesh/endpoint_discovery.go Outdated
Comment thread pkg/hub/mesh/endpoint_discovery.go Outdated
Comment thread pkg/apis/mesh/v1alpha1/types.go
Comment thread pkg/hub/mesh/controller.go
Comment thread test/integration/controller_test.go
Comment thread test/integration/controller_test.go
Comment thread pkg/hub/mesh/endpoint_discovery.go Outdated
Comment thread pkg/hub/mesh/endpoint_discovery.go Outdated
Comment thread pkg/hub/mesh/endpoint_discovery.go Outdated
@yxun

yxun commented Jun 16, 2026

Collaborator Author

/hold

yxun added 2 commits June 17, 2026 10:32
Signed-off-by: Yuanlin Xu <yuanlin.xu@redhat.com>
Signed-off-by: Yuanlin Xu <yuanlin.xu@redhat.com>
@yxun

yxun commented Jun 17, 2026

Collaborator Author

/unhold

Comment thread pkg/hub/mesh/endpoint_discovery.go Outdated
Comment thread chart/templates/clusterrole.yaml

@jewertow jewertow left a comment

Contributor

LGTM, but I assume you will address multi cluster secret creation in a follow-up PR, right?

Comment thread pkg/hub/mesh/endpoint_discovery.go Outdated
@yxun

yxun commented Jun 18, 2026

Collaborator Author

> LGTM, but I assume you will address multi cluster secret creation in a follow-up PR, right?

@jewertow, yes, that was in the original PR. I will make some updates and address it in the next, smaller PR.
The remaining steps are:

MSA CR creation and update
--> The MSA controller (which is not the concern of multicluster-mesh-addon) creates a ServiceAccount and an API server access Secret containing an access token.
--> The multicluster-mesh-addon controller builds an Istio remote secret from the token in the API server access Secret and adds the required labels.
--> The controller creates a ManifestWork to distribute the Istio remote secrets.

yxun added 4 commits June 18, 2026 12:18
Signed-off-by: Yuanlin Xu <yuanlin.xu@redhat.com>
Signed-off-by: Yuanlin Xu <yuanlin.xu@redhat.com>
Signed-off-by: Yuanlin Xu <yuanlin.xu@redhat.com>
Signed-off-by: Yuanlin Xu <yuanlin.xu@redhat.com>

@sridhargaddam sridhargaddam left a comment

Contributor

@yxun, please take a look at the e2e test failures. Otherwise, LGTM.

@yxun

yxun commented Jun 22, 2026

Collaborator Author

/rerun-all

@yxun

yxun commented Jun 22, 2026

Collaborator Author

/test golangci-lint

Signed-off-by: Yuanlin Xu <yuanlin.xu@redhat.com>
Comment thread pkg/hub/mesh/endpoint_discovery.go
Comment thread pkg/hub/mesh/endpoint_discovery.go Outdated
Signed-off-by: Yuanlin Xu <yuanlin.xu@redhat.com>
@yxun

yxun commented Jun 23, 2026

Collaborator Author

The fix for the e2e test error is in #157.

It follows the installation steps documented here:
https://github.com/open-cluster-management-io/managed-serviceaccount#installation-steps

Comment thread hack/dev-env.sh Outdated

install_managed_serviceaccount() {
local hub_kubeconfig
hub_kubeconfig="$(echo "${DEV_KUBE_DIR}/${HUB}.config")"

Collaborator Author

Reason for this change:
The job for the previous commit failed:
https://github.com/stolostron/multicluster-mesh-addon/actions/runs/28129345197/job/83303594782

/home/runner/work/multicluster-mesh-addon/multicluster-mesh-addon/hack/dev-env.sh: line 273: kubeconfig_for: command not found

I could never reproduce this error locally. It is still unclear why the GitHub workflow job could not find that function.

Contributor

Please see #152

@mkolesnik mkolesnik left a comment

Contributor

Partial review. Also, please remove the parts that were moved to #157.

Comment thread hack/dev-env.sh Outdated

install_managed_serviceaccount() {
local hub_kubeconfig
hub_kubeconfig="$(echo "${DEV_KUBE_DIR}/${HUB}.config")"

Contributor

Please see #152


msaName := fmt.Sprintf("%s-%s-%s", mesh.Namespace, msaRootWord, mesh.Name)

var validity metav1.Duration

Contributor

The kubebuilder default (+kubebuilder:default="360h" on TokenValidity) guarantees this field is always populated for objects created through the API.

This nil check and the hardcoded 360 * time.Hour are therefore dead code that hides a duplicate default.
Just use the field directly.

return reconcile.Result{}, fmt.Errorf("failed to cleanup ManifestWorks: %w", err)
}

klog.Infof("Handling deletion for ManagedServiceAccount resources managed by mesh %s/%s", mesh.Namespace, mesh.Name)

Contributor

No need to log this; you already log each deleted MSA.

Suggested change
klog.Infof("Handling deletion for ManagedServiceAccount resources managed by mesh %s/%s", mesh.Namespace, mesh.Name)

expectMeshNotReady(otherMesh, otherNs)
util.DeleteResource(ctx, k8sClient, &meshv1alpha1.MultiClusterMesh{}, otherMesh, otherNs)
msa := &msav1beta1.ManagedServiceAccount{}
Expect(k8sClient.Get(ctx, types.NamespacedName{Name: fmt.Sprintf("%s-%s-%s", testNs, msaRootWord, meshName), Namespace: cluster1}, msa)).Should(Succeed())

Contributor

This is a point-in-time check that could pass before cleanup runs.
Use Consistently to verify MSAs remain present over a polling window, similar to other tests.

Comment on lines +28 to +31
if len(clusters) == 0 {
	klog.V(4).Info("The ClusterSet has no managed cluster")
	return nil
}

Contributor

This check is inconsistent with the rest of the controller

}

// Create ManagedServiceAccount resources for each cluster.
if err := r.createManagedServiceAccounts(ctx, mesh, clusters); err != nil {

Contributor

Since this needs to run for each cluster, it makes more sense to put this inside the main for _, cluster := range clusters loop on L283 and handle each cluster with ensureEndpointDiscovery or some similar terminology.
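
A rough sketch of the suggested shape, with per-cluster endpoint discovery handled inside the existing cluster loop. All types and the `ensureManifestWork` step are hypothetical stand-ins (the actual loop body is not shown in this thread), and the sketch leaves out the cert-manager condition mentioned in the walkthrough.

```go
package main

import "fmt"

// cluster and reconciler are hypothetical stand-ins for the controller's types.
type cluster struct{ Name string }

type reconciler struct {
	steps []string // records the order of per-cluster work, for illustration only
}

// ensureManifestWork stands in for whatever the existing loop already does per cluster.
func (r *reconciler) ensureManifestWork(c cluster) error {
	r.steps = append(r.steps, "manifestwork:"+c.Name)
	return nil
}

// ensureEndpointDiscovery stands in for creating the ManagedServiceAccount of one cluster.
func (r *reconciler) ensureEndpointDiscovery(c cluster) error {
	r.steps = append(r.steps, "msa:"+c.Name)
	return nil
}

// reconcileClusters handles every per-cluster concern in a single loop.
func (r *reconciler) reconcileClusters(clusters []cluster) error {
	for _, c := range clusters {
		if err := r.ensureManifestWork(c); err != nil {
			return fmt.Errorf("cluster %s: %w", c.Name, err)
		}
		if err := r.ensureEndpointDiscovery(c); err != nil {
			return fmt.Errorf("cluster %s: %w", c.Name, err)
		}
	}
	return nil
}

func main() {
	r := &reconciler{}
	if err := r.reconcileClusters([]cluster{{Name: "cluster1"}, {Name: "cluster2"}}); err != nil {
		fmt.Println("error:", err)
	}
	fmt.Println(r.steps)
}
```

One side effect of this layout is that an empty cluster list needs no special case, since the loop body simply never runs.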

util.CreateMultiClusterMesh(ctx, k8sClient, otherMesh, otherNs, testClusterSet)
})

It("should keep the ManagedServiceAccount when one mesh is deleted", func() {

Contributor

These two tests use addon-owned (shared resource) semantics, but MSAs are mesh-owned.
Each mesh creates its own MSAs, so there's no shared resource to "keep."

"should keep the ManagedServiceAccount when one mesh is deleted" (L755): deletes otherMesh, checks meshName's MSAs survive, but never checks otherMesh's MSAs are deleted.
This is a tautology for mesh-owned resources.

"should delete the ManagedServiceAccount when both meshes are deleted" (L763): passes even with broken scoping since everything is deleted.

Replace both with one test that verifies correct scoping:
delete one mesh, assert its MSAs are gone AND the other mesh's MSAs survive.

For example:

It("should delete only the deleted mesh's MSAs", func() {
    expectMeshNotReady(otherMesh, otherNs)
    // Verify both meshes have MSAs
    expectManagedServiceAccount(fmt.Sprintf("%s-%s-%s", testNs, msaRootWord, meshName), cluster1)
    expectManagedServiceAccount(fmt.Sprintf("%s-%s-%s", otherNs, msaRootWord, otherMesh), cluster1)
    // Delete one mesh
    util.DeleteResource(ctx, k8sClient, &meshv1alpha1.MultiClusterMesh{}, otherMesh, otherNs)
    // Deleted mesh's MSAs are gone
    util.ExpectResourceDeleted(ctx, k8sClient, &msav1beta1.ManagedServiceAccount{},
        fmt.Sprintf("%s-%s-%s", otherNs, msaRootWord, otherMesh), cluster1)
    util.ExpectResourceDeleted(ctx, k8sClient, &msav1beta1.ManagedServiceAccount{},
        fmt.Sprintf("%s-%s-%s", otherNs, msaRootWord, otherMesh), cluster2)
    // Surviving mesh's MSAs remain
    Expect(k8sClient.Get(ctx, types.NamespacedName{Name: fmt.Sprintf("%s-%s-%s", testNs, msaRootWord, meshName), Namespace: cluster1}, &msav1beta1.ManagedServiceAccount{})).To(Succeed())
    Expect(k8sClient.Get(ctx, types.NamespacedName{Name: fmt.Sprintf("%s-%s-%s", testNs, msaRootWord, meshName), Namespace: cluster2}, &msav1beta1.ManagedServiceAccount{})).To(Succeed())
})

Comment on lines +673 to +674
msa1 := expectManagedServiceAccount(fmt.Sprintf("%s-%s-%s", testNs, msaRootWord, meshName), cluster1)
msa2 := expectManagedServiceAccount(fmt.Sprintf("%s-%s-%s", testNs, msaRootWord, meshName), cluster2)

Contributor

This is very repetitive; you can extract a simple func to format the MSA name:

	func expectedMSAName(meshNamespace, meshName string) string {
		return fmt.Sprintf("%s-istio-reader-%s", meshNamespace, meshName)
	}

Then you can use it inside expectManagedServiceAccount and in other places where you look for MSA
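
For reference, a runnable version of such a helper. It assumes `msaRootWord` is "istio-reader", which is inferred from the snippet above rather than confirmed from the source file; the namespace and mesh names are made-up example values.

```go
package main

import "fmt"

// msaRootWord is assumed to be "istio-reader", based on the helper suggested above.
const msaRootWord = "istio-reader"

// expectedMSAName builds the namespace-qualified MSA name for a mesh.
func expectedMSAName(meshNamespace, meshName string) string {
	return fmt.Sprintf("%s-%s-%s", meshNamespace, msaRootWord, meshName)
}

func main() {
	// Two meshes with the same name in different namespaces get distinct MSA names.
	fmt.Println(expectedMSAName("ns-a", "mesh"))
	fmt.Println(expectedMSAName("ns-b", "mesh"))
}
```

Keeping the format string in one place also means the tests and the controller cannot drift apart if the naming scheme changes again.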

@yxun

yxun commented Jun 26, 2026

Collaborator Author

/hold

4 participants