feat: migrating to validating admission policies for our validating logic (1/) by michaelawyu · Pull Request #708 · kubefleet-dev/kubefleet

michaelawyu (michaelawyu) · 2026-05-13T16:18:56Z

Description of your changes

This PR is the first of many following PRs to migrate to validating admission policies for applicable validation logic.

Specifically, this PR:

adds a admission policy manager for managing admission policies when the agent starts up. This manager is best used for environments where policies cannot be easily deployed/refreshed as part of the agent Helm charts.
adds two policy generator: one for denying writes to pods + replica sets in non-reserved namespaces, the other one for denying writes to service accounts and creating token requests in reserved namespaces.

Releated to #707.

I have:

[*] Associated this change with a known KubeFleet Issue (Bug, Feature, etc).
[*] Run make reviewable to ensure this PR is ready for review.

How has this code been tested

[*] Integration tests

Special notes for your reviewer

Signed-off-by: michaelawyu <chenyu1@microsoft.com>

michaelawyu (michaelawyu) · 2026-05-13T16:19:31Z

Please note that for this PR the manager is not enabled in the hub agent yet.

To control the PR size, additional tests will be submitted as a separate PR.

codecov · 2026-05-13T16:55:03Z

Codecov Report

❌ Patch coverage is 70.47619% with 93 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
pkg/admissionpolicymanager/manager.go	51.38%	50 Missing and 20 partials ⚠️
...kg/admissionpolicymanager/svcaccountsntokenreqs.go	85.71%	6 Missing and 6 partials ⚠️
pkg/admissionpolicymanager/commons.go	61.11%	5 Missing and 2 partials ⚠️
pkg/admissionpolicymanager/podsnreplicasets.go	94.02%	2 Missing and 2 partials ⚠️

📢 Thoughts on this report? Let us know!

Copilot

Pull request overview

This PR introduces the first slice of work to move KubeFleet's validation logic from validating webhooks to Kubernetes ValidatingAdmissionPolicies (VAP). It adds a startup-time policy manager that reconciles a configurable set of generated VAPs and their bindings, plus two initial generators (deny pod/replicaset writes outside reserved namespaces, deny serviceaccount/token writes in reserved namespaces). It also exports the namespace prefix constants from pkg/utils.

Changes:

New pkg/admissionpolicymanager package: manager that create-or-updates and garbage-collects VAPs labeled by Fleet, with two generators (pods/replicasets, serviceaccounts/tokenrequests).
Integration test suite (envtest) validating the manager produces the expected VAP/VAPB objects (effects deferred to E2E).
Exports KubePrefix, FleetPrefix, FleetMemberNamespacePrefix from pkg/utils/common.go for reuse in the generator configs.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 15 comments.

Show a summary per file

File	Description
pkg/utils/common.go	Exports previously unexported namespace prefix constants so the admission policy generator configs can reuse them.
pkg/admissionpolicymanager/manager.go	Core policy manager: reflection-based generator selection, create-or-update with retry, GC of orphan policies/bindings via label selector.
pkg/admissionpolicymanager/configs.go	Top-level configuration struct exposing per-generator config and a default.
pkg/admissionpolicymanager/podsnreplicasets.go	Generator producing a VAP that denies CREATE of pods/replicasets outside reserved namespaces.
pkg/admissionpolicymanager/svcaccountsntokenreqs.go	Generator producing a VAP that denies serviceaccount CRUD and tokenrequest CREATE in reserved namespaces, with user/group whitelist.
pkg/admissionpolicymanager/suite_test.go	Ginkgo/envtest suite bootstrap; starts the policy manager once.
pkg/admissionpolicymanager/manager_integration_test.go	Asserts the expected VAP and VAPB objects exist with correct spec/labels.

michaelawyu (michaelawyu) · 2026-05-18T23:58:55Z

+	// Exempt whitelisted users from this admission policy.
+	for _, username := range g.WhitelistedUsernames {
+		celExprAccSegs = append(celExprAccSegs, fmt.Sprintf(`request.userInfo.username == "%s"`, username))
+	}
+	// Exempt whitelisted user groups from this admission policy.
+	for _, userGroup := range g.WhitelistedUserGroups {
+		celExprAccSegs = append(celExprAccSegs, fmt.Sprintf(`"%s" in request.userInfo.groups`, userGroup))
+	}
+	// Exempt requests from the Kubernetes scheduler, any of the nodes, and (esp.) the
+	// Kubernetes controller manager from this admission policy.
+	//
+	// Important: the Kubernetes controller manager, when deployed with the option
+	// --use-service-account-credentials=true, creates a service account token for many of its controllers
+	// and uses those tokens to authenticate to the Kubernetes API server. It retrieves a token
+	// via the TokenRequest API; failure to exempt this scenario may lead to critical errors.
+	celExprAccSegs = append(celExprAccSegs, fmt.Sprintf(`request.userInfo.username == "%s"`, kubeSchedulerUserName))
+	celExprAccSegs = append(celExprAccSegs, fmt.Sprintf(`request.userInfo.username == "%s"`, kubeControllerManagerUserName))
+	celExprAccSegs = append(celExprAccSegs, fmt.Sprintf(`"%s" in request.userInfo.groups`, kubeNodeUserGroup))
+	// Exempt requests from admin users from this admission policy.
+	celExprAccSegs = append(celExprAccSegs, fmt.Sprintf(`"%s" in request.userInfo.groups`, adminUserGroup))
+
+	celExprAcc := strings.Join(celExprAccSegs, " || ")
+
+	celExprNSSegs := []string{}
+	for _, prefix := range g.ReservedNamespacePrefixes {
+		celExprNSSegs = append(celExprNSSegs, fmt.Sprintf(`object.metadata.namespace.startsWith("%s")`, prefix))
+	}
+	celExprNS := strings.Join(celExprNSSegs, " || ")
+
+	celExpr := fmt.Sprintf("!(%s) || (%s)", celExprNS, celExprAcc)
+


Note: this restriction is added in consistency with our existing webhook setup. system:masters is supposed to have unlimited access to all resources BTW.

michaelawyu (michaelawyu) · 2026-05-18T23:59:50Z

+					// The error message has been set to match with the checks in some of the E2E tests
+					// in our existing release pipeline.
+					Message: "creating pods and replicas is disallowed in the fleet hub cluster",


N/A: as mentioned, the error msg is explicitly kept in the same style as some of the existing checks for releasing purposes.

Signed-off-by: michaelawyu <chenyu1@microsoft.com>

Wei Weng (weng271190436) · 2026-05-21T18:30:29Z

+type ValidatingAdmissionPolicyGenerator interface {
+	Name() string
+	Validate() error
+	Policies() []*admissionregistrationv1.ValidatingAdmissionPolicy


Instead of independent Policies and PolicyBindings members, can we do

type PolicyWithBindings struct { Policy admissionregistrationv1.ValidatingAdmissionPolicy Bindings []admissionregistrationv1.ValidatingAdmissionPolicyBinding } type ValidatingAdmissionPolicyGenerator interface { Name() string Validate() error PoliciesWithBindings() []PolicyWithBindings }

?

The concern is that right now the relationship between Policy and Bindings is not explicit and is more error-proned (e.g. someone created a Policy but no Binding; someone deleted a Policy but forgot to delete the corresponding Bindings)

Yeah, I guess... I do not particularly have a strong opinion on this; to me personally it is really an internal implementation that are more guarded by various checks and tests rather than structural limitations, not to say that even with this limitation it still does not guarantee any form of correctness (one can still write misconfigurations). But it is true that it doesn't hurt either.

You can actually do this to guarantee correctness such that the bindings are always tied to the policy?

// NewPolicyWithBindings creates a PolicyWithBindings and automatically sets // each binding's spec.policyName to match the policy's name. func NewPolicyWithBindings( policy *admissionregistrationv1.ValidatingAdmissionPolicy, bindings []*admissionregistrationv1.ValidatingAdmissionPolicyBinding, ) PolicyWithBindings { for _, b := range bindings { b.Spec.PolicyName = policy.Name } return PolicyWithBindings{ Policy: policy, Bindings: bindings, } }

Bindings have other fields that depend on semantics (it has its own resource selectors, filters, etc.), right? Ultimately any check we do here won't be able to guarantee that it will work (especially considering that this is part of the internal logic that functions statically); it is still the UTs/ITs/E2Es that guard against any misconfigurations. My concern is that, if we are going to check them there anyway (and it's a word by word, diff-based check), is it really needed to add a very superficial check in the logic?

I am not saying that I am not a fan of defensive programming - for a dynamic system, a bit of check can help. But, for the internal, static logic part of the code, I have my reservations.

It is not about defensive programming. I actually don't like some defensive programming which is usually error handling things that can hardly happen.

I am advocating for structuring the code logically so that it helps understanding. The reason I made this comment was that when I first read this piece of code, one of the first thoughts that came to my mind was "Does the Policies and Bindings under the same VAP generator have to be related to each other in a certain way? Or are they just arbitrary collections of Policies and Bindings." And I thought about it a little bit and figured that we mostly likely do not want to have arbitrary collections of Policies and Bindings. But one Policy can have multiple Bindings. So I suggested this interface.

And the intention is that the interface itself is logical and self-documenting so that each time this piece of code is read or each time a new generator gets reviewed, people don't have to spend more time thinking about the relationship between the Policies and Bindings.

Wei Weng (weng271190436) · 2026-05-21T18:55:31Z

+							Rule: admissionregistrationv1.Rule{
+								APIGroups:   []string{""},
+								APIVersions: []string{"v1"},
+								Resources:   []string{"pods"},


We have constants for "pods" and "replicasets" in webhook.go. Shall we start sharing them across webhook package and this new package by pulling them out into a common package? And eventually webhook.go will be deleted in one PR together with other webhook logic? Want to make sure that we still have these as constants after replacing webhook by VAP.

Hi Wei! For this one, I didn't set the code to share variables as the plan is to drop certain webhooks after VAPs become fully available (not all of them can be replaced by VAPs I fear). At this moment they are both present for release safety reasons.

If we are going to keep webhook.go forever I think it is important to share the constants across webhook and admissionpolicymanager so that they are not being kept in 2 places?

Hi Wei! Sorry for any misunderstanding; what I meant was that, webhook.go will stay, but not all the webhooks within it. The pod/replicaset webhooks can be dropped safely eventually as VAPs can replace the logic completely; on the other hand, we have webhooks that cross-refs objects, which are not yet ready to migrate to VAP.

Wei Weng (weng271190436) · 2026-05-21T19:41:53Z

+
+	// Exempt whitelisted users from this admission policy.
+	for _, username := range g.WhitelistedUsernames {
+		celExprAccSegs = append(celExprAccSegs, fmt.Sprintf(`request.userInfo.username == "%s"`, username))


Have a concern that this CEL construction code is not very readable. Might be difficult for someone to spot a mistake if there is any.

Wondering if it is worth investing in a celbuilder.go like this?

// CELExpr represents a CEL expression fragment. type CELExpr string // String returns the CEL expression as a string. func (e CELExpr) String() string { return string(e) } // UsernameEquals returns a CEL expression matching a specific username. func UsernameEquals(username string) CELExpr { return CELExpr(fmt.Sprintf(`request.userInfo.username == "%s"`, username)) } // InGroup returns a CEL expression matching membership in a group. func InGroup(group string) CELExpr { return CELExpr(fmt.Sprintf(`"%s" in request.userInfo.groups`, group)) } // NamespaceStartsWith returns a CEL expression matching a namespace prefix. func NamespaceStartsWith(prefix string) CELExpr { return CELExpr(fmt.Sprintf(`request.namespace.startsWith("%s")`, prefix)) } // Or joins expressions with ||. func Or(exprs ...CELExpr) CELExpr { parts := make([]string, len(exprs)) for i, e := range exprs { parts[i] = string(e) } return CELExpr(strings.Join(parts, " || ")) } // Not negates an expression. func Not(expr CELExpr) CELExpr { return CELExpr(fmt.Sprintf("!(%s)", expr)) }

then we can build and use CEL like this:

func (g *SvcAccountsAndTokenRequestsVAPGenerator) Policies() []*admissionregistrationv1.ValidatingAdmissionPolicy { // Namespace condition var nsClauses []CELExpr for _, prefix := range g.ReservedNamespacePrefixes { nsClauses = append(nsClauses, NamespaceStartsWith(prefix)) } // User exemptions var exemptions []CELExpr for _, u := range g.WhitelistedUsernames { exemptions = append(exemptions, UsernameEquals(u)) } for _, grp := range g.WhitelistedUserGroups { exemptions = append(exemptions, InGroup(grp)) } exemptions = append(exemptions, UsernameEquals(kubeSchedulerUserName), UsernameEquals(kubeControllerManagerUserName), InGroup(kubeNodeUserGroup), InGroup(adminUserGroup), ) // Allow if: not reserved namespace OR user is exempt celExpr := Or(Not(Or(nsClauses...)), Or(exemptions...)) policy := &admissionregistrationv1.ValidatingAdmissionPolicy{ // ... Spec: admissionregistrationv1.ValidatingAdmissionPolicySpec{ Validations: []admissionregistrationv1.Validation{ { Expression: celExpr.String(), Message: "writing service accounts in reserved namespaces or requesting tokens from such service accounts is disallowed", Reason: ptr.To(metav1.StatusReasonForbidden), }, }, }, } return []*admissionregistrationv1.ValidatingAdmissionPolicy{policy} }

podsnreplicasets.go can also do similar:

func (g *PodsAndReplicaSetsVAPGenerator) Policies() []*admissionregistrationv1.ValidatingAdmissionPolicy { var nsClauses []CELExpr for _, prefix := range g.ReservedNamespacePrefixes { nsClauses = append(nsClauses, NamespaceStartsWith(prefix)) } celExpr := Not(Or(nsClauses...)) policy := &admissionregistrationv1.ValidatingAdmissionPolicy{ // ... Spec: admissionregistrationv1.ValidatingAdmissionPolicySpec{ Validations: []admissionregistrationv1.Validation{ { Expression: celExpr.String(), Message: "creating pods and replicas is disallowed in the fleet hub cluster", Reason: ptr.To(metav1.StatusReasonForbidden), }, }, }, } return []*admissionregistrationv1.ValidatingAdmissionPolicy{policy} }

Hi Wei! Thanks for this 🙏 TBH I was thinking a bit about roughly the same thing when the code was gen'd, but the concern back then was that, to add proper support, we would need to add operators like Or, And, Not, plus brackets. In the end it makes things more complicated but does not really help with the readability much.

For example, the one we use with service accounts and tokens would become:

Or(Not(Or(ExprA, ExprB, ...)), Bracket(Or(ExprX, ExprY, ...)))

and it feels a bit even more difficult to reason about?

Considering this is part of the internal code (i.e., not public-facing API), I feel that it might be better not to over-complicate things (not to say that the current organization is perfect, that's for sure)?

Plus, LLMs are pretty good at generating the truth tables for CEL expressions; it might be easier to just let it validate. I will add some UTs in future PRs to validate the CEL expressions generated just to be extra safe.

I still think CEL expression wrappers make it easier for human to read. Regarding the LLM argument, my concern is that, when LLM generated code caused issue, humans are still being held accountable (not just the person who created the PR, but also the PR reviewer).

var nsClauses []CELExpr for _, prefix := range g.ReservedNamespacePrefixes { nsClauses = append(nsClauses, NamespaceStartsWith(prefix)) } // User exemptions var exemptions []CELExpr for _, u := range g.WhitelistedUsernames { exemptions = append(exemptions, UsernameEquals(u)) } for _, grp := range g.WhitelistedUserGroups { exemptions = append(exemptions, InGroup(grp)) } exemptions = append(exemptions, UsernameEquals(kubeSchedulerUserName), UsernameEquals(kubeControllerManagerUserName), InGroup(kubeNodeUserGroup), InGroup(adminUserGroup), ) // Allow if: not reserved namespace OR user is exempt celExpr := Or(Not(Or(nsClauses...)), Or(exemptions...))

Ah, I 100% agree with the LLM responsibility part; it is always us to bear the consequences of changes, not the machine. What I was trying to say, however, is that, for this part of the logic, LLM can easily generate the end result plus its truth table; and we can easily build UTs based on the truth table to make sure that the expression is correct both in form and in semantics.

For the builder part, to do things truly correctly and elegantly, we need to start building a CEL expression tree. But the ultimate question to me, at this moment, is that, is it really worth it? We will be adding many layers of indirections and a composing logic of its own for a part of the code that is a) internal, b) largely static, c) only run once at startup, d) with limited complexity (a few clauses at max.) and e) not likely to change much over time.

I understand that this is a very subjective discussion at the moment; to unblock progress (as we really need to move to VAPs sooner), let me add a TO-DO to the code; as soon as we know that VAP setup is working (it is not connected to the main entrypoint yet) and releases seem fine, I will send another PR to switch to tree-building. Would this be alright in your opinion?

Sure can add TODO and come back to this. Thank you.

Also want to explain why I keep emphasizing on readability. When one writes a piece of code, it always makes sense to the person writing it because the person knows the piece of code very well. When a person reads it for the first time, the perspective is very different and appropriate abstraction can really help.

I am suggesting this because I spent significant time trying to understand this part of the code when reading it. So I suggested this because I think it helps with understanding.

exemptions = append(exemptions, UsernameEquals(kubeSchedulerUserName), UsernameEquals(kubeControllerManagerUserName), InGroup(kubeNodeUserGroup), InGroup(adminUserGroup), ) // Allow if: not reserved namespace OR user is exempt celExpr := Or(Not(Or(nsClauses...)), Or(exemptions...))

And readability is not just for someone reading this code 2 months from now because they want to change the validation logic slightly or add a new VAP generator. The most immediate benefit is that it helps review. Sometimes people open a PR, read a little bit, find it hard to understand, and they will just give up.

And well-defined structure helps. To a reader, there is a big difference between custom strings.Join and Or because or is a well-understood logical operator while strings.Join can be used for a lot of different things. By appealing to well-understood concepts, the code is self-documenting and easy to understand.

Thanks, Wei! Yeah, added the TO-DO comment (added to the package level for better visibility). Let me refactor this to a CEL expression tree based solution in a later PR after VAPs are known to be working.

Signed-off-by: michaelawyu <chenyu1@microsoft.com>

michaelawyu (michaelawyu) · 2026-05-27T05:52:37Z

Merging this PR to unblock progress -> if there's any concern, please let me know.

michaelawyu (michaelawyu) added 2 commits May 14, 2026 01:57

Added VAP manager

65001c1

Signed-off-by: michaelawyu <chenyu1@microsoft.com>

Minor fixes

82268c5

Signed-off-by: michaelawyu <chenyu1@microsoft.com>

Yetkin Timocin (ytimocin) requested a review from Copilot May 18, 2026 16:12

Copilot started reviewing on behalf of Yetkin Timocin (ytimocin) May 18, 2026 16:14 View session

Copilot AI reviewed May 18, 2026

View reviewed changes

michaelawyu (michaelawyu) added 2 commits May 19, 2026 10:11

Minor fixes

21a5b94

Signed-off-by: michaelawyu <chenyu1@microsoft.com>

Minor fixes

a873cca

Signed-off-by: michaelawyu <chenyu1@microsoft.com>

Wei Weng (weng271190436) reviewed May 21, 2026

View reviewed changes

michaelawyu (michaelawyu) added 2 commits May 23, 2026 03:12

Minor fixes

5d74e45

Signed-off-by: michaelawyu <chenyu1@microsoft.com>

Minor fixes

5e5483a

Signed-off-by: michaelawyu <chenyu1@microsoft.com>

Wei Weng (weng271190436) approved these changes May 26, 2026

View reviewed changes

michaelawyu (michaelawyu) merged commit 515a744 into kubefleet-dev:main May 27, 2026
14 of 16 checks passed

michaelawyu (michaelawyu) deleted the feat/migrate-to-vap-1 branch May 27, 2026 05:52

Conversation

michaelawyu (michaelawyu) commented May 13, 2026

Description of your changes

How has this code been tested

Special notes for your reviewer

Uh oh!

michaelawyu (michaelawyu) commented May 13, 2026

Uh oh!

codecov Bot commented May 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

michaelawyu (michaelawyu) May 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

michaelawyu (michaelawyu) May 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

michaelawyu (michaelawyu) May 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

michaelawyu (michaelawyu) May 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Wei Weng (weng271190436) May 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

michaelawyu (michaelawyu) May 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

michaelawyu (michaelawyu) May 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Wei Weng (weng271190436) May 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

codecov Bot commented May 13, 2026 •

edited

Loading

michaelawyu (michaelawyu) May 18, 2026 •

edited

Loading

michaelawyu (michaelawyu) May 23, 2026 •

edited

Loading

michaelawyu (michaelawyu) May 22, 2026 •

edited

Loading

michaelawyu (michaelawyu) May 23, 2026 •

edited

Loading

Wei Weng (weng271190436) May 21, 2026 •

edited

Loading

michaelawyu (michaelawyu) May 22, 2026 •

edited

Loading

michaelawyu (michaelawyu) May 22, 2026 •

edited

Loading

Wei Weng (weng271190436) May 22, 2026 •

edited

Loading

michaelawyu (michaelawyu) May 23, 2026 •

edited

Loading