feat: migrating to validating admission policies for our validating logic (1/)#708
Conversation
Signed-off-by: michaelawyu <chenyu1@microsoft.com>
Signed-off-by: michaelawyu <chenyu1@microsoft.com>
|
Please note that for this PR the manager is not enabled in the hub agent yet. To control the PR size, additional tests will be submitted as a separate PR. |
Codecov Report❌ Patch coverage is 📢 Thoughts on this report? Let us know! |
There was a problem hiding this comment.
Pull request overview
This PR introduces the first slice of work to move KubeFleet's validation logic from validating webhooks to Kubernetes ValidatingAdmissionPolicies (VAP). It adds a startup-time policy manager that reconciles a configurable set of generated VAPs and their bindings, plus two initial generators (deny pod/replicaset writes outside reserved namespaces, deny serviceaccount/token writes in reserved namespaces). It also exports the namespace prefix constants from pkg/utils.
Changes:
- New
pkg/admissionpolicymanagerpackage: manager that create-or-updates and garbage-collects VAPs labeled by Fleet, with two generators (pods/replicasets, serviceaccounts/tokenrequests). - Integration test suite (
envtest) validating the manager produces the expected VAP/VAPB objects (effects deferred to E2E). - Exports
KubePrefix,FleetPrefix,FleetMemberNamespacePrefixfrompkg/utils/common.gofor reuse in the generator configs.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 15 comments.
Show a summary per file
| File | Description |
|---|---|
| pkg/utils/common.go | Exports previously unexported namespace prefix constants so the admission policy generator configs can reuse them. |
| pkg/admissionpolicymanager/manager.go | Core policy manager: reflection-based generator selection, create-or-update with retry, GC of orphan policies/bindings via label selector. |
| pkg/admissionpolicymanager/configs.go | Top-level configuration struct exposing per-generator config and a default. |
| pkg/admissionpolicymanager/podsnreplicasets.go | Generator producing a VAP that denies CREATE of pods/replicasets outside reserved namespaces. |
| pkg/admissionpolicymanager/svcaccountsntokenreqs.go | Generator producing a VAP that denies serviceaccount CRUD and tokenrequest CREATE in reserved namespaces, with user/group whitelist. |
| pkg/admissionpolicymanager/suite_test.go | Ginkgo/envtest suite bootstrap; starts the policy manager once. |
| pkg/admissionpolicymanager/manager_integration_test.go | Asserts the expected VAP and VAPB objects exist with correct spec/labels. |
| // Exempt whitelisted users from this admission policy. | ||
| for _, username := range g.WhitelistedUsernames { | ||
| celExprAccSegs = append(celExprAccSegs, fmt.Sprintf(`request.userInfo.username == "%s"`, username)) | ||
| } | ||
| // Exempt whitelisted user groups from this admission policy. | ||
| for _, userGroup := range g.WhitelistedUserGroups { | ||
| celExprAccSegs = append(celExprAccSegs, fmt.Sprintf(`"%s" in request.userInfo.groups`, userGroup)) | ||
| } | ||
| // Exempt requests from the Kubernetes scheduler, any of the nodes, and (esp.) the | ||
| // Kubernetes controller manager from this admission policy. | ||
| // | ||
| // Important: the Kubernetes controller manager, when deployed with the option | ||
| // --use-service-account-credentials=true, creates a service account token for many of its controllers | ||
| // and uses those tokens to authenticate to the Kubernetes API server. It retrieves a token | ||
| // via the TokenRequest API; failure to exempt this scenario may lead to critical errors. | ||
| celExprAccSegs = append(celExprAccSegs, fmt.Sprintf(`request.userInfo.username == "%s"`, kubeSchedulerUserName)) | ||
| celExprAccSegs = append(celExprAccSegs, fmt.Sprintf(`request.userInfo.username == "%s"`, kubeControllerManagerUserName)) | ||
| celExprAccSegs = append(celExprAccSegs, fmt.Sprintf(`"%s" in request.userInfo.groups`, kubeNodeUserGroup)) | ||
| // Exempt requests from admin users from this admission policy. | ||
| celExprAccSegs = append(celExprAccSegs, fmt.Sprintf(`"%s" in request.userInfo.groups`, adminUserGroup)) | ||
|
|
||
| celExprAcc := strings.Join(celExprAccSegs, " || ") | ||
|
|
||
| celExprNSSegs := []string{} | ||
| for _, prefix := range g.ReservedNamespacePrefixes { | ||
| celExprNSSegs = append(celExprNSSegs, fmt.Sprintf(`object.metadata.namespace.startsWith("%s")`, prefix)) | ||
| } | ||
| celExprNS := strings.Join(celExprNSSegs, " || ") | ||
|
|
||
| celExpr := fmt.Sprintf("!(%s) || (%s)", celExprNS, celExprAcc) | ||
|
|
There was a problem hiding this comment.
Note: this restriction is added in consistency with our existing webhook setup. system:masters is supposed to have unlimited access to all resources BTW.
| // The error message has been set to match with the checks in some of the E2E tests | ||
| // in our existing release pipeline. | ||
| Message: "creating pods and replicas is disallowed in the fleet hub cluster", |
There was a problem hiding this comment.
N/A: as mentioned, the error msg is explicitly kept in the same style as some of the existing checks for releasing purposes.
Signed-off-by: michaelawyu <chenyu1@microsoft.com>
Signed-off-by: michaelawyu <chenyu1@microsoft.com>
| type ValidatingAdmissionPolicyGenerator interface { | ||
| Name() string | ||
| Validate() error | ||
| Policies() []*admissionregistrationv1.ValidatingAdmissionPolicy |
There was a problem hiding this comment.
Instead of independent Policies and PolicyBindings members, can we do
type PolicyWithBindings struct {
Policy admissionregistrationv1.ValidatingAdmissionPolicy
Bindings []admissionregistrationv1.ValidatingAdmissionPolicyBinding
}
type ValidatingAdmissionPolicyGenerator interface {
Name() string
Validate() error
PoliciesWithBindings() []PolicyWithBindings
}?
The concern is that right now the relationship between Policy and Bindings is not explicit and is more error-proned (e.g. someone created a Policy but no Binding; someone deleted a Policy but forgot to delete the corresponding Bindings)
There was a problem hiding this comment.
Yeah, I guess... I do not particularly have a strong opinion on this; to me personally it is really an internal implementation that are more guarded by various checks and tests rather than structural limitations, not to say that even with this limitation it still does not guarantee any form of correctness (one can still write misconfigurations). But it is true that it doesn't hurt either.
There was a problem hiding this comment.
You can actually do this to guarantee correctness such that the bindings are always tied to the policy?
// NewPolicyWithBindings creates a PolicyWithBindings and automatically sets
// each binding's spec.policyName to match the policy's name.
func NewPolicyWithBindings(
policy *admissionregistrationv1.ValidatingAdmissionPolicy,
bindings []*admissionregistrationv1.ValidatingAdmissionPolicyBinding,
) PolicyWithBindings {
for _, b := range bindings {
b.Spec.PolicyName = policy.Name
}
return PolicyWithBindings{
Policy: policy,
Bindings: bindings,
}
}There was a problem hiding this comment.
Bindings have other fields that depend on semantics (it has its own resource selectors, filters, etc.), right? Ultimately any check we do here won't be able to guarantee that it will work (especially considering that this is part of the internal logic that functions statically); it is still the UTs/ITs/E2Es that guard against any misconfigurations. My concern is that, if we are going to check them there anyway (and it's a word by word, diff-based check), is it really needed to add a very superficial check in the logic?
I am not saying that I am not a fan of defensive programming - for a dynamic system, a bit of check can help. But, for the internal, static logic part of the code, I have my reservations.
There was a problem hiding this comment.
It is not about defensive programming. I actually don't like some defensive programming which is usually error handling things that can hardly happen.
I am advocating for structuring the code logically so that it helps understanding. The reason I made this comment was that when I first read this piece of code, one of the first thoughts that came to my mind was "Does the Policies and Bindings under the same VAP generator have to be related to each other in a certain way? Or are they just arbitrary collections of Policies and Bindings." And I thought about it a little bit and figured that we mostly likely do not want to have arbitrary collections of Policies and Bindings. But one Policy can have multiple Bindings. So I suggested this interface.
And the intention is that the interface itself is logical and self-documenting so that each time this piece of code is read or each time a new generator gets reviewed, people don't have to spend more time thinking about the relationship between the Policies and Bindings.
| Rule: admissionregistrationv1.Rule{ | ||
| APIGroups: []string{""}, | ||
| APIVersions: []string{"v1"}, | ||
| Resources: []string{"pods"}, |
There was a problem hiding this comment.
We have constants for "pods" and "replicasets" in webhook.go. Shall we start sharing them across webhook package and this new package by pulling them out into a common package? And eventually webhook.go will be deleted in one PR together with other webhook logic? Want to make sure that we still have these as constants after replacing webhook by VAP.
There was a problem hiding this comment.
Hi Wei! For this one, I didn't set the code to share variables as the plan is to drop certain webhooks after VAPs become fully available (not all of them can be replaced by VAPs I fear). At this moment they are both present for release safety reasons.
There was a problem hiding this comment.
If we are going to keep webhook.go forever I think it is important to share the constants across webhook and admissionpolicymanager so that they are not being kept in 2 places?
There was a problem hiding this comment.
Hi Wei! Sorry for any misunderstanding; what I meant was that, webhook.go will stay, but not all the webhooks within it. The pod/replicaset webhooks can be dropped safely eventually as VAPs can replace the logic completely; on the other hand, we have webhooks that cross-refs objects, which are not yet ready to migrate to VAP.
|
|
||
| // Exempt whitelisted users from this admission policy. | ||
| for _, username := range g.WhitelistedUsernames { | ||
| celExprAccSegs = append(celExprAccSegs, fmt.Sprintf(`request.userInfo.username == "%s"`, username)) |
There was a problem hiding this comment.
Have a concern that this CEL construction code is not very readable. Might be difficult for someone to spot a mistake if there is any.
Wondering if it is worth investing in a celbuilder.go like this?
// CELExpr represents a CEL expression fragment.
type CELExpr string
// String returns the CEL expression as a string.
func (e CELExpr) String() string {
return string(e)
}
// UsernameEquals returns a CEL expression matching a specific username.
func UsernameEquals(username string) CELExpr {
return CELExpr(fmt.Sprintf(`request.userInfo.username == "%s"`, username))
}
// InGroup returns a CEL expression matching membership in a group.
func InGroup(group string) CELExpr {
return CELExpr(fmt.Sprintf(`"%s" in request.userInfo.groups`, group))
}
// NamespaceStartsWith returns a CEL expression matching a namespace prefix.
func NamespaceStartsWith(prefix string) CELExpr {
return CELExpr(fmt.Sprintf(`request.namespace.startsWith("%s")`, prefix))
}
// Or joins expressions with ||.
func Or(exprs ...CELExpr) CELExpr {
parts := make([]string, len(exprs))
for i, e := range exprs {
parts[i] = string(e)
}
return CELExpr(strings.Join(parts, " || "))
}
// Not negates an expression.
func Not(expr CELExpr) CELExpr {
return CELExpr(fmt.Sprintf("!(%s)", expr))
}then we can build and use CEL like this:
func (g *SvcAccountsAndTokenRequestsVAPGenerator) Policies() []*admissionregistrationv1.ValidatingAdmissionPolicy {
// Namespace condition
var nsClauses []CELExpr
for _, prefix := range g.ReservedNamespacePrefixes {
nsClauses = append(nsClauses, NamespaceStartsWith(prefix))
}
// User exemptions
var exemptions []CELExpr
for _, u := range g.WhitelistedUsernames {
exemptions = append(exemptions, UsernameEquals(u))
}
for _, grp := range g.WhitelistedUserGroups {
exemptions = append(exemptions, InGroup(grp))
}
exemptions = append(exemptions,
UsernameEquals(kubeSchedulerUserName),
UsernameEquals(kubeControllerManagerUserName),
InGroup(kubeNodeUserGroup),
InGroup(adminUserGroup),
)
// Allow if: not reserved namespace OR user is exempt
celExpr := Or(Not(Or(nsClauses...)), Or(exemptions...))
policy := &admissionregistrationv1.ValidatingAdmissionPolicy{
// ...
Spec: admissionregistrationv1.ValidatingAdmissionPolicySpec{
Validations: []admissionregistrationv1.Validation{
{
Expression: celExpr.String(),
Message: "writing service accounts in reserved namespaces or requesting tokens from such service accounts is disallowed",
Reason: ptr.To(metav1.StatusReasonForbidden),
},
},
},
}
return []*admissionregistrationv1.ValidatingAdmissionPolicy{policy}
}There was a problem hiding this comment.
podsnreplicasets.go can also do similar:
func (g *PodsAndReplicaSetsVAPGenerator) Policies() []*admissionregistrationv1.ValidatingAdmissionPolicy {
var nsClauses []CELExpr
for _, prefix := range g.ReservedNamespacePrefixes {
nsClauses = append(nsClauses, NamespaceStartsWith(prefix))
}
celExpr := Not(Or(nsClauses...))
policy := &admissionregistrationv1.ValidatingAdmissionPolicy{
// ...
Spec: admissionregistrationv1.ValidatingAdmissionPolicySpec{
Validations: []admissionregistrationv1.Validation{
{
Expression: celExpr.String(),
Message: "creating pods and replicas is disallowed in the fleet hub cluster",
Reason: ptr.To(metav1.StatusReasonForbidden),
},
},
},
}
return []*admissionregistrationv1.ValidatingAdmissionPolicy{policy}
}There was a problem hiding this comment.
Hi Wei! Thanks for this 🙏 TBH I was thinking a bit about roughly the same thing when the code was gen'd, but the concern back then was that, to add proper support, we would need to add operators like Or, And, Not, plus brackets. In the end it makes things more complicated but does not really help with the readability much.
For example, the one we use with service accounts and tokens would become:
Or(Not(Or(ExprA, ExprB, ...)), Bracket(Or(ExprX, ExprY, ...)))
and it feels a bit even more difficult to reason about?
There was a problem hiding this comment.
Considering this is part of the internal code (i.e., not public-facing API), I feel that it might be better not to over-complicate things (not to say that the current organization is perfect, that's for sure)?
Plus, LLMs are pretty good at generating the truth tables for CEL expressions; it might be easier to just let it validate. I will add some UTs in future PRs to validate the CEL expressions generated just to be extra safe.
There was a problem hiding this comment.
I still think CEL expression wrappers make it easier for human to read. Regarding the LLM argument, my concern is that, when LLM generated code caused issue, humans are still being held accountable (not just the person who created the PR, but also the PR reviewer).
var nsClauses []CELExpr
for _, prefix := range g.ReservedNamespacePrefixes {
nsClauses = append(nsClauses, NamespaceStartsWith(prefix))
}
// User exemptions
var exemptions []CELExpr
for _, u := range g.WhitelistedUsernames {
exemptions = append(exemptions, UsernameEquals(u))
}
for _, grp := range g.WhitelistedUserGroups {
exemptions = append(exemptions, InGroup(grp))
}
exemptions = append(exemptions,
UsernameEquals(kubeSchedulerUserName),
UsernameEquals(kubeControllerManagerUserName),
InGroup(kubeNodeUserGroup),
InGroup(adminUserGroup),
)
// Allow if: not reserved namespace OR user is exempt
celExpr := Or(Not(Or(nsClauses...)), Or(exemptions...))There was a problem hiding this comment.
Ah, I 100% agree with the LLM responsibility part; it is always us to bear the consequences of changes, not the machine. What I was trying to say, however, is that, for this part of the logic, LLM can easily generate the end result plus its truth table; and we can easily build UTs based on the truth table to make sure that the expression is correct both in form and in semantics.
For the builder part, to do things truly correctly and elegantly, we need to start building a CEL expression tree. But the ultimate question to me, at this moment, is that, is it really worth it? We will be adding many layers of indirections and a composing logic of its own for a part of the code that is a) internal, b) largely static, c) only run once at startup, d) with limited complexity (a few clauses at max.) and e) not likely to change much over time.
I understand that this is a very subjective discussion at the moment; to unblock progress (as we really need to move to VAPs sooner), let me add a TO-DO to the code; as soon as we know that VAP setup is working (it is not connected to the main entrypoint yet) and releases seem fine, I will send another PR to switch to tree-building. Would this be alright in your opinion?
There was a problem hiding this comment.
Sure can add TODO and come back to this. Thank you.
Also want to explain why I keep emphasizing on readability. When one writes a piece of code, it always makes sense to the person writing it because the person knows the piece of code very well. When a person reads it for the first time, the perspective is very different and appropriate abstraction can really help.
I am suggesting this because I spent significant time trying to understand this part of the code when reading it. So I suggested this because I think it helps with understanding.
exemptions = append(exemptions,
UsernameEquals(kubeSchedulerUserName),
UsernameEquals(kubeControllerManagerUserName),
InGroup(kubeNodeUserGroup),
InGroup(adminUserGroup),
)
// Allow if: not reserved namespace OR user is exempt
celExpr := Or(Not(Or(nsClauses...)), Or(exemptions...))And readability is not just for someone reading this code 2 months from now because they want to change the validation logic slightly or add a new VAP generator. The most immediate benefit is that it helps review. Sometimes people open a PR, read a little bit, find it hard to understand, and they will just give up.
And well-defined structure helps. To a reader, there is a big difference between custom strings.Join and Or because or is a well-understood logical operator while strings.Join can be used for a lot of different things. By appealing to well-understood concepts, the code is self-documenting and easy to understand.
There was a problem hiding this comment.
Thanks, Wei! Yeah, added the TO-DO comment (added to the package level for better visibility). Let me refactor this to a CEL expression tree based solution in a later PR after VAPs are known to be working.
Signed-off-by: michaelawyu <chenyu1@microsoft.com>
Signed-off-by: michaelawyu <chenyu1@microsoft.com>
|
Merging this PR to unblock progress -> if there's any concern, please let me know. |
515a744
into
kubefleet-dev:main
Description of your changes
This PR is the first of many following PRs to migrate to validating admission policies for applicable validation logic.
Specifically, this PR:
Releated to #707.
I have:
make reviewableto ensure this PR is ready for review.How has this code been tested
Special notes for your reviewer