Skip to content

feat: migrating to validating admission policies for our validating logic (1/)#708

Merged
michaelawyu (michaelawyu) merged 6 commits into
kubefleet-dev:mainfrom
michaelawyu:feat/migrate-to-vap-1
May 27, 2026
Merged

feat: migrating to validating admission policies for our validating logic (1/)#708
michaelawyu (michaelawyu) merged 6 commits into
kubefleet-dev:mainfrom
michaelawyu:feat/migrate-to-vap-1

Conversation

@michaelawyu

Copy link
Copy Markdown
Member

Description of your changes

This PR is the first of many following PRs to migrate to validating admission policies for applicable validation logic.

Specifically, this PR:

  • adds a admission policy manager for managing admission policies when the agent starts up. This manager is best used for environments where policies cannot be easily deployed/refreshed as part of the agent Helm charts.
  • adds two policy generator: one for denying writes to pods + replica sets in non-reserved namespaces, the other one for denying writes to service accounts and creating token requests in reserved namespaces.

Releated to #707.

I have:

  • [*] Associated this change with a known KubeFleet Issue (Bug, Feature, etc).
  • [*] Run make reviewable to ensure this PR is ready for review.

How has this code been tested

  • [*] Integration tests

Special notes for your reviewer

Signed-off-by: michaelawyu <chenyu1@microsoft.com>
Signed-off-by: michaelawyu <chenyu1@microsoft.com>
@michaelawyu

Copy link
Copy Markdown
Member Author

Please note that for this PR the manager is not enabled in the hub agent yet.

To control the PR size, additional tests will be submitted as a separate PR.

@codecov

codecov Bot commented May 13, 2026

Copy link
Copy Markdown

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces the first slice of work to move KubeFleet's validation logic from validating webhooks to Kubernetes ValidatingAdmissionPolicies (VAP). It adds a startup-time policy manager that reconciles a configurable set of generated VAPs and their bindings, plus two initial generators (deny pod/replicaset writes outside reserved namespaces, deny serviceaccount/token writes in reserved namespaces). It also exports the namespace prefix constants from pkg/utils.

Changes:

  • New pkg/admissionpolicymanager package: manager that create-or-updates and garbage-collects VAPs labeled by Fleet, with two generators (pods/replicasets, serviceaccounts/tokenrequests).
  • Integration test suite (envtest) validating the manager produces the expected VAP/VAPB objects (effects deferred to E2E).
  • Exports KubePrefix, FleetPrefix, FleetMemberNamespacePrefix from pkg/utils/common.go for reuse in the generator configs.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 15 comments.

Show a summary per file
File Description
pkg/utils/common.go Exports previously unexported namespace prefix constants so the admission policy generator configs can reuse them.
pkg/admissionpolicymanager/manager.go Core policy manager: reflection-based generator selection, create-or-update with retry, GC of orphan policies/bindings via label selector.
pkg/admissionpolicymanager/configs.go Top-level configuration struct exposing per-generator config and a default.
pkg/admissionpolicymanager/podsnreplicasets.go Generator producing a VAP that denies CREATE of pods/replicasets outside reserved namespaces.
pkg/admissionpolicymanager/svcaccountsntokenreqs.go Generator producing a VAP that denies serviceaccount CRUD and tokenrequest CREATE in reserved namespaces, with user/group whitelist.
pkg/admissionpolicymanager/suite_test.go Ginkgo/envtest suite bootstrap; starts the policy manager once.
pkg/admissionpolicymanager/manager_integration_test.go Asserts the expected VAP and VAPB objects exist with correct spec/labels.

Comment thread pkg/admissionpolicymanager/svcaccountsntokenreqs.go
Comment thread pkg/admissionpolicymanager/svcaccountsntokenreqs.go
Comment thread pkg/admissionpolicymanager/podsnreplicasets.go
Comment thread pkg/admissionpolicymanager/podsnreplicasets.go Outdated
Comment thread pkg/admissionpolicymanager/manager.go Outdated
Comment thread pkg/utils/common.go Outdated
Comment on lines +102 to +132
// Exempt whitelisted users from this admission policy.
for _, username := range g.WhitelistedUsernames {
celExprAccSegs = append(celExprAccSegs, fmt.Sprintf(`request.userInfo.username == "%s"`, username))
}
// Exempt whitelisted user groups from this admission policy.
for _, userGroup := range g.WhitelistedUserGroups {
celExprAccSegs = append(celExprAccSegs, fmt.Sprintf(`"%s" in request.userInfo.groups`, userGroup))
}
// Exempt requests from the Kubernetes scheduler, any of the nodes, and (esp.) the
// Kubernetes controller manager from this admission policy.
//
// Important: the Kubernetes controller manager, when deployed with the option
// --use-service-account-credentials=true, creates a service account token for many of its controllers
// and uses those tokens to authenticate to the Kubernetes API server. It retrieves a token
// via the TokenRequest API; failure to exempt this scenario may lead to critical errors.
celExprAccSegs = append(celExprAccSegs, fmt.Sprintf(`request.userInfo.username == "%s"`, kubeSchedulerUserName))
celExprAccSegs = append(celExprAccSegs, fmt.Sprintf(`request.userInfo.username == "%s"`, kubeControllerManagerUserName))
celExprAccSegs = append(celExprAccSegs, fmt.Sprintf(`"%s" in request.userInfo.groups`, kubeNodeUserGroup))
// Exempt requests from admin users from this admission policy.
celExprAccSegs = append(celExprAccSegs, fmt.Sprintf(`"%s" in request.userInfo.groups`, adminUserGroup))

celExprAcc := strings.Join(celExprAccSegs, " || ")

celExprNSSegs := []string{}
for _, prefix := range g.ReservedNamespacePrefixes {
celExprNSSegs = append(celExprNSSegs, fmt.Sprintf(`object.metadata.namespace.startsWith("%s")`, prefix))
}
celExprNS := strings.Join(celExprNSSegs, " || ")

celExpr := fmt.Sprintf("!(%s) || (%s)", celExprNS, celExprAcc)

@michaelawyu michaelawyu (michaelawyu) May 18, 2026

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: this restriction is added in consistency with our existing webhook setup. system:masters is supposed to have unlimited access to all resources BTW.

Comment on lines +119 to +121
// The error message has been set to match with the checks in some of the E2E tests
// in our existing release pipeline.
Message: "creating pods and replicas is disallowed in the fleet hub cluster",

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

N/A: as mentioned, the error msg is explicitly kept in the same style as some of the existing checks for releasing purposes.

Comment thread pkg/admissionpolicymanager/podsnreplicasets.go Outdated
Comment thread pkg/admissionpolicymanager/manager_integration_test.go
Signed-off-by: michaelawyu <chenyu1@microsoft.com>
Signed-off-by: michaelawyu <chenyu1@microsoft.com>
Comment thread pkg/admissionpolicymanager/manager.go
Comment thread pkg/admissionpolicymanager/podsnreplicasets.go Outdated
Comment thread pkg/admissionpolicymanager/manager.go Outdated
type ValidatingAdmissionPolicyGenerator interface {
Name() string
Validate() error
Policies() []*admissionregistrationv1.ValidatingAdmissionPolicy

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of independent Policies and PolicyBindings members, can we do

type PolicyWithBindings struct {
    Policy   admissionregistrationv1.ValidatingAdmissionPolicy
    Bindings []admissionregistrationv1.ValidatingAdmissionPolicyBinding
}

type ValidatingAdmissionPolicyGenerator interface {
	Name() string
	Validate() error
    PoliciesWithBindings() []PolicyWithBindings
}

?

The concern is that right now the relationship between Policy and Bindings is not explicit and is more error-proned (e.g. someone created a Policy but no Binding; someone deleted a Policy but forgot to delete the corresponding Bindings)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I guess... I do not particularly have a strong opinion on this; to me personally it is really an internal implementation that are more guarded by various checks and tests rather than structural limitations, not to say that even with this limitation it still does not guarantee any form of correctness (one can still write misconfigurations). But it is true that it doesn't hurt either.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can actually do this to guarantee correctness such that the bindings are always tied to the policy?

// NewPolicyWithBindings creates a PolicyWithBindings and automatically sets
// each binding's spec.policyName to match the policy's name.
func NewPolicyWithBindings(
    policy *admissionregistrationv1.ValidatingAdmissionPolicy,
    bindings []*admissionregistrationv1.ValidatingAdmissionPolicyBinding,
) PolicyWithBindings {
    for _, b := range bindings {
        b.Spec.PolicyName = policy.Name
    }
    return PolicyWithBindings{
        Policy:   policy,
        Bindings: bindings,
    }
}

@michaelawyu michaelawyu (michaelawyu) May 23, 2026

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bindings have other fields that depend on semantics (it has its own resource selectors, filters, etc.), right? Ultimately any check we do here won't be able to guarantee that it will work (especially considering that this is part of the internal logic that functions statically); it is still the UTs/ITs/E2Es that guard against any misconfigurations. My concern is that, if we are going to check them there anyway (and it's a word by word, diff-based check), is it really needed to add a very superficial check in the logic?

I am not saying that I am not a fan of defensive programming - for a dynamic system, a bit of check can help. But, for the internal, static logic part of the code, I have my reservations.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not about defensive programming. I actually don't like some defensive programming which is usually error handling things that can hardly happen.

I am advocating for structuring the code logically so that it helps understanding. The reason I made this comment was that when I first read this piece of code, one of the first thoughts that came to my mind was "Does the Policies and Bindings under the same VAP generator have to be related to each other in a certain way? Or are they just arbitrary collections of Policies and Bindings." And I thought about it a little bit and figured that we mostly likely do not want to have arbitrary collections of Policies and Bindings. But one Policy can have multiple Bindings. So I suggested this interface.

And the intention is that the interface itself is logical and self-documenting so that each time this piece of code is read or each time a new generator gets reviewed, people don't have to spend more time thinking about the relationship between the Policies and Bindings.

Comment thread pkg/admissionpolicymanager/manager.go Outdated
Comment thread pkg/admissionpolicymanager/manager.go Outdated
Comment thread pkg/admissionpolicymanager/manager.go Outdated
Comment thread pkg/admissionpolicymanager/manager.go
Rule: admissionregistrationv1.Rule{
APIGroups: []string{""},
APIVersions: []string{"v1"},
Resources: []string{"pods"},

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have constants for "pods" and "replicasets" in webhook.go. Shall we start sharing them across webhook package and this new package by pulling them out into a common package? And eventually webhook.go will be deleted in one PR together with other webhook logic? Want to make sure that we still have these as constants after replacing webhook by VAP.

@michaelawyu michaelawyu (michaelawyu) May 22, 2026

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Wei! For this one, I didn't set the code to share variables as the plan is to drop certain webhooks after VAPs become fully available (not all of them can be replaced by VAPs I fear). At this moment they are both present for release safety reasons.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we are going to keep webhook.go forever I think it is important to share the constants across webhook and admissionpolicymanager so that they are not being kept in 2 places?

@michaelawyu michaelawyu (michaelawyu) May 23, 2026

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Wei! Sorry for any misunderstanding; what I meant was that, webhook.go will stay, but not all the webhooks within it. The pod/replicaset webhooks can be dropped safely eventually as VAPs can replace the logic completely; on the other hand, we have webhooks that cross-refs objects, which are not yet ready to migrate to VAP.

Comment thread pkg/admissionpolicymanager/svcaccountsntokenreqs.go Outdated

// Exempt whitelisted users from this admission policy.
for _, username := range g.WhitelistedUsernames {
celExprAccSegs = append(celExprAccSegs, fmt.Sprintf(`request.userInfo.username == "%s"`, username))

@weng271190436 Wei Weng (weng271190436) May 21, 2026

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have a concern that this CEL construction code is not very readable. Might be difficult for someone to spot a mistake if there is any.

Wondering if it is worth investing in a celbuilder.go like this?

// CELExpr represents a CEL expression fragment.
type CELExpr string

// String returns the CEL expression as a string.
func (e CELExpr) String() string {
    return string(e)
}

// UsernameEquals returns a CEL expression matching a specific username.
func UsernameEquals(username string) CELExpr {
    return CELExpr(fmt.Sprintf(`request.userInfo.username == "%s"`, username))
}

// InGroup returns a CEL expression matching membership in a group.
func InGroup(group string) CELExpr {
    return CELExpr(fmt.Sprintf(`"%s" in request.userInfo.groups`, group))
}

// NamespaceStartsWith returns a CEL expression matching a namespace prefix.
func NamespaceStartsWith(prefix string) CELExpr {
    return CELExpr(fmt.Sprintf(`request.namespace.startsWith("%s")`, prefix))
}

// Or joins expressions with ||.
func Or(exprs ...CELExpr) CELExpr {
    parts := make([]string, len(exprs))
    for i, e := range exprs {
        parts[i] = string(e)
    }
    return CELExpr(strings.Join(parts, " || "))
}

// Not negates an expression.
func Not(expr CELExpr) CELExpr {
    return CELExpr(fmt.Sprintf("!(%s)", expr))
}

then we can build and use CEL like this:

func (g *SvcAccountsAndTokenRequestsVAPGenerator) Policies() []*admissionregistrationv1.ValidatingAdmissionPolicy {
    // Namespace condition
    var nsClauses []CELExpr
    for _, prefix := range g.ReservedNamespacePrefixes {
        nsClauses = append(nsClauses, NamespaceStartsWith(prefix))
    }

    // User exemptions
    var exemptions []CELExpr
    for _, u := range g.WhitelistedUsernames {
        exemptions = append(exemptions, UsernameEquals(u))
    }
    for _, grp := range g.WhitelistedUserGroups {
        exemptions = append(exemptions, InGroup(grp))
    }
    exemptions = append(exemptions,
        UsernameEquals(kubeSchedulerUserName),
        UsernameEquals(kubeControllerManagerUserName),
        InGroup(kubeNodeUserGroup),
        InGroup(adminUserGroup),
    )

    // Allow if: not reserved namespace OR user is exempt
    celExpr := Or(Not(Or(nsClauses...)), Or(exemptions...))

    policy := &admissionregistrationv1.ValidatingAdmissionPolicy{
        // ...
        Spec: admissionregistrationv1.ValidatingAdmissionPolicySpec{
            Validations: []admissionregistrationv1.Validation{
                {
                    Expression: celExpr.String(),
                    Message:    "writing service accounts in reserved namespaces or requesting tokens from such service accounts is disallowed",
                    Reason:     ptr.To(metav1.StatusReasonForbidden),
                },
            },
        },
    }
    return []*admissionregistrationv1.ValidatingAdmissionPolicy{policy}
}

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

podsnreplicasets.go can also do similar:

func (g *PodsAndReplicaSetsVAPGenerator) Policies() []*admissionregistrationv1.ValidatingAdmissionPolicy {
    var nsClauses []CELExpr
    for _, prefix := range g.ReservedNamespacePrefixes {
        nsClauses = append(nsClauses, NamespaceStartsWith(prefix))
    }

    celExpr := Not(Or(nsClauses...))

    policy := &admissionregistrationv1.ValidatingAdmissionPolicy{
        // ...
        Spec: admissionregistrationv1.ValidatingAdmissionPolicySpec{
            Validations: []admissionregistrationv1.Validation{
                {
                    Expression: celExpr.String(),
                    Message:    "creating pods and replicas is disallowed in the fleet hub cluster",
                    Reason:     ptr.To(metav1.StatusReasonForbidden),
                },
            },
        },
    }
    return []*admissionregistrationv1.ValidatingAdmissionPolicy{policy}
}

@michaelawyu michaelawyu (michaelawyu) May 22, 2026

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Wei! Thanks for this 🙏 TBH I was thinking a bit about roughly the same thing when the code was gen'd, but the concern back then was that, to add proper support, we would need to add operators like Or, And, Not, plus brackets. In the end it makes things more complicated but does not really help with the readability much.

For example, the one we use with service accounts and tokens would become:

Or(Not(Or(ExprA, ExprB, ...)), Bracket(Or(ExprX, ExprY, ...)))

and it feels a bit even more difficult to reason about?

@michaelawyu michaelawyu (michaelawyu) May 22, 2026

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Considering this is part of the internal code (i.e., not public-facing API), I feel that it might be better not to over-complicate things (not to say that the current organization is perfect, that's for sure)?

Plus, LLMs are pretty good at generating the truth tables for CEL expressions; it might be easier to just let it validate. I will add some UTs in future PRs to validate the CEL expressions generated just to be extra safe.

@weng271190436 Wei Weng (weng271190436) May 22, 2026

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still think CEL expression wrappers make it easier for human to read. Regarding the LLM argument, my concern is that, when LLM generated code caused issue, humans are still being held accountable (not just the person who created the PR, but also the PR reviewer).

    var nsClauses []CELExpr
    for _, prefix := range g.ReservedNamespacePrefixes {
        nsClauses = append(nsClauses, NamespaceStartsWith(prefix))
    }

    // User exemptions
    var exemptions []CELExpr
    for _, u := range g.WhitelistedUsernames {
        exemptions = append(exemptions, UsernameEquals(u))
    }
    for _, grp := range g.WhitelistedUserGroups {
        exemptions = append(exemptions, InGroup(grp))
    }
    exemptions = append(exemptions,
        UsernameEquals(kubeSchedulerUserName),
        UsernameEquals(kubeControllerManagerUserName),
        InGroup(kubeNodeUserGroup),
        InGroup(adminUserGroup),
    )

    // Allow if: not reserved namespace OR user is exempt
    celExpr := Or(Not(Or(nsClauses...)), Or(exemptions...))

@michaelawyu michaelawyu (michaelawyu) May 23, 2026

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I 100% agree with the LLM responsibility part; it is always us to bear the consequences of changes, not the machine. What I was trying to say, however, is that, for this part of the logic, LLM can easily generate the end result plus its truth table; and we can easily build UTs based on the truth table to make sure that the expression is correct both in form and in semantics.

For the builder part, to do things truly correctly and elegantly, we need to start building a CEL expression tree. But the ultimate question to me, at this moment, is that, is it really worth it? We will be adding many layers of indirections and a composing logic of its own for a part of the code that is a) internal, b) largely static, c) only run once at startup, d) with limited complexity (a few clauses at max.) and e) not likely to change much over time.

I understand that this is a very subjective discussion at the moment; to unblock progress (as we really need to move to VAPs sooner), let me add a TO-DO to the code; as soon as we know that VAP setup is working (it is not connected to the main entrypoint yet) and releases seem fine, I will send another PR to switch to tree-building. Would this be alright in your opinion?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure can add TODO and come back to this. Thank you.

Also want to explain why I keep emphasizing on readability. When one writes a piece of code, it always makes sense to the person writing it because the person knows the piece of code very well. When a person reads it for the first time, the perspective is very different and appropriate abstraction can really help.

I am suggesting this because I spent significant time trying to understand this part of the code when reading it. So I suggested this because I think it helps with understanding.

    exemptions = append(exemptions,
        UsernameEquals(kubeSchedulerUserName),
        UsernameEquals(kubeControllerManagerUserName),
        InGroup(kubeNodeUserGroup),
        InGroup(adminUserGroup),
    )

    // Allow if: not reserved namespace OR user is exempt
    celExpr := Or(Not(Or(nsClauses...)), Or(exemptions...))

And readability is not just for someone reading this code 2 months from now because they want to change the validation logic slightly or add a new VAP generator. The most immediate benefit is that it helps review. Sometimes people open a PR, read a little bit, find it hard to understand, and they will just give up.

And well-defined structure helps. To a reader, there is a big difference between custom strings.Join and Or because or is a well-understood logical operator while strings.Join can be used for a lot of different things. By appealing to well-understood concepts, the code is self-documenting and easy to understand.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, Wei! Yeah, added the TO-DO comment (added to the package level for better visibility). Let me refactor this to a CEL expression tree based solution in a later PR after VAPs are known to be working.

Signed-off-by: michaelawyu <chenyu1@microsoft.com>
Signed-off-by: michaelawyu <chenyu1@microsoft.com>
@michaelawyu

Copy link
Copy Markdown
Member Author

Merging this PR to unblock progress -> if there's any concern, please let me know.

@michaelawyu michaelawyu (michaelawyu) merged commit 515a744 into kubefleet-dev:main May 27, 2026
14 of 16 checks passed
@michaelawyu michaelawyu (michaelawyu) deleted the feat/migrate-to-vap-1 branch May 27, 2026 05:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants