Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

✨ Check known required permissions for install before installing with the helm applier #1858

Open
wants to merge 24 commits into
base: main
Choose a base branch
from

Conversation

bentito
Copy link
Contributor

@bentito bentito commented Mar 10, 2025

Description

This is a successor PR to #1716 and is primarily the contributions of @trgeiger and @joelanford .

Goal and title, remain the same. Approach is a bit modified:

Pulls in RBAC authorization code from k8s.is/kubernetes, uses that code to check GET and other verb permissions as prelude to and as response from a Helm dry-run

To pull in the RBAC auth code concisely, repeatably and with warnings if the used code changes, we add a maintenance utility that adds the needed replace directives for all related staging modules (e.g., k8s.io/api, k8s.io/apimachinery, etc.) and they are automatically pinned to the corresponding published version.

All this code is initially called at

missingRules, err := h.PreAuthorizer.PreAuthorize(ctx, &ceServiceAccount, strings.NewReader(tmplRel.Manifest))

in internal/operator-controller/applier/helm.go

Reviewer Checklist

  • API Go Documentation
  • Tests: Unit Tests (and E2E Tests, if appropriate)
  • Comprehensive Commit Messages
  • Links to related GitHub Issue(s)

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 10, 2025
@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 10, 2025
Copy link

netlify bot commented Mar 10, 2025

Deploy Preview for olmv1 ready!

Name Link
🔨 Latest commit 9a80b06
🔍 Latest deploy log https://app.netlify.com/sites/olmv1/deploys/67dc4805469c3500095563b8
😎 Deploy Preview https://deploy-preview-1858--olmv1.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site configuration.

@bentito bentito force-pushed the rbac-auth-k8s-replacer branch from 2991d5d to 65ef8a2 Compare March 10, 2025 20:03
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 10, 2025
@bentito bentito marked this pull request as ready for review March 10, 2025 20:04
@bentito bentito requested a review from a team as a code owner March 10, 2025 20:04
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 10, 2025
Copy link

codecov bot commented Mar 11, 2025

Codecov Report

Attention: Patch coverage is 32.27273% with 447 lines in your changes missing coverage. Please review.

Project coverage is 64.76%. Comparing base (44ba340) to head (9a80b06).
Report is 3 commits behind head on main.

Files with missing lines Patch % Lines
hack/tools/k8smaintainer/main.go 0.00% 245 Missing ⚠️
internal/operator-controller/authorization/rbac.go 59.33% 116 Missing and 19 partials ⚠️
internal/operator-controller/applier/helm.go 6.94% 66 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1858      +/-   ##
==========================================
- Coverage   68.94%   64.76%   -4.19%     
==========================================
  Files          66       68       +2     
  Lines        5236     5883     +647     
==========================================
+ Hits         3610     3810     +200     
- Misses       1394     1821     +427     
- Partials      232      252      +20     
Flag Coverage Δ
e2e 46.15% <6.74%> (-4.28%) ⬇️
unit 53.85% <30.60%> (-2.95%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Contributor

@azych azych left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good, will you be adding testing in the authorization package?

}
}

func (a *rbacPreAuthorizer) PreAuthorize(ctx context.Context, manifestManager user.Info, manifestReader io.Reader) (map[string][]rbacv1.PolicyRule, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a short doc comment would be nice, especially about the return value

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agreed, adding in next commit

gvk := uObj.GroupVersionKind()
restMapping, err := a.restMapper.RESTMapping(gvk.GroupKind(), gvk.Version)
if err != nil {
errs = append(errs, fmt.Errorf("could not get REST mapping for object %d in manifest with GVK %s: %w", i, gvk, err))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would it make sense to also try and add name of the object if possible via uObj.GetName?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, adding the object’s name should make debugging easier. Adding next commit.

}

func (a *rbacPreAuthorizer) attributesAllowed(ctx context.Context, attributesRecord authorizer.AttributesRecord) (bool, error) {
decision, _, err := a.authorizer.Authorize(ctx, attributesRecord)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

second returned value is a 'reason' string. Would that be included in error? If not, maybe it would make sense to either return it or wrap the returned error with it for better context?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep we should return it, done in next commit

}

for ns, nsMissingRules := range missingRules {
if compactMissingRules, err := validation.CompactRules(nsMissingRules); err == nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can see that CompactRules cannot return a non nil error right now, but a comment about why we're not handling error might be a good idea for the future

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding in next commit

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any reason why we shouldn't just go ahead and account for that possibility for future-proofing reasons?

Copy link
Member

@joelanford joelanford Mar 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I guess the way this works is that if there is an error, then we just stick with the original non-compacted set of missing rules, which seems like a reasonable fallback. I think that is actually the important detail to call out in the comment.

}

func hasAggregationRule(clusterRole *rbacv1.ClusterRole) bool {
return clusterRole.AggregationRule != nil && len(clusterRole.AggregationRule.ClusterRoleSelectors) > 0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if len(clusterRole.AggregationRule.ClusterRoleSelectors) is 0 would it not mean "match all" in this case?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a comment to try to clarify, does this help? see next commit for comment

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is all mostly copied code from kubernetes (e.g. https://github.com/kubernetes/kubernetes/blob/fd0a3484176f8f29f038b4b28132dd3c7583eb00/pkg/registry/rbac/clusterrole/policybased/storage.go#L124-L126)

Looking at this again, I almost wonder if we could have the escalationChecker simply call this code with a fake storage backend. That way we'd pick up any changes that are made here and avoid maintaining our own copy of it.

missingRules, err := h.PreAuthorizer.PreAuthorize(ctx, &ceServiceAccount, strings.NewReader(tmplRel.Manifest))

var preAuthErrors []error
if len(missingRules) > 0 {
Copy link
Contributor

@azych azych Mar 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: missingRules iteration below would not fire if it has no items, so we should be ok with skipping this if

missingRuleDescriptions = append(missingRuleDescriptions, ruleDescription(ns, rule))
}
}
slices.Sort(missingRuleDescriptions)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

if len(preAuthErrors) > 0 {
return nil, "", fmt.Errorf("pre-authorization failed: %v", preAuthErrors)
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since this is a fairly self-contained logic, how about extracting a helper function to enclose it and make Apply easier to follow?
Something like:

if features.OperatorControllerFeatureGate.Enabled(features.PreflightPermissions) {
  if _, err := preAuthPreflight(...); err != nil {...}
}

@@ -164,6 +197,34 @@ func (h *Helm) buildHelmChart(bundleFS fs.FS, ext *ocv1.ClusterExtension) (*char
return h.BundleToHelmChartFn(bundleFS, ext.Spec.Namespace, watchNamespace)
}

func (h *Helm) template(ctx context.Context, ext *ocv1.ClusterExtension, chrt *chart.Chart, values chartutil.Values, post postrender.PostRenderer) (*release.Release, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would it make sense to merge it with getReleaseState somehow? I guess the main thing is i.ClientOnly and the fact that we need to instantiate a separate client for that.
If it doesn't and ends up clumsy, I'd still probably adjust the name to be closer togetReleaseState, since logically it's almost the same

"sigs.k8s.io/controller-runtime/pkg/client"
)

type PreAuthorizer interface {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we could declare this interface on the applier/helm side to decouple it completely from this package as a dependency, but I saw that we're usually declaring interfaces on the producer side, probably because those are then reused, so seems fine as is now too

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're going to add unit tests based on your general comment I think. If so it might one more reason to change where the interface is made, could help with testing...

@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 14, 2025
@trgeiger
Copy link
Contributor

I added some tests but they still need to be tweaked/finalized. I noticed while writing them up that due to the order of the logic where missing rules are checked before escalation, if bind/escalate are in play but we're missing the explicit permissions that bind/escalate would give us we end up with a result where there's no error but we do have missing rules. @joelanford is that what we would want? I would think if we can bind or escalate that we would not return that we're missing those rules since the SA can grant them.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: move to hack/tools/k8smaintainer/main.go (and move the doc to hack/tools/k8smaintainer/README.md)?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, make it a separate go module to avoid the golang.org/x/mod dependencies being added to our main module?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the move and naming makes sense, but we already have an indirect usage of x/mod in our go.mod on main: golang.org/x/mod v0.24.0 // indirect and we have that after go mod tidy even with the creation of a separate module:

from a run on main:

operator-controller main $ go mod why -m -vendor golang.org/x/mod
# golang.org/x/mod
github.com/operator-framework/operator-controller/hack/ci/custom-linters/analyzers
github.com/operator-framework/operator-controller/hack/ci/custom-linters/analyzers.test
golang.org/x/tools/go/analysis/analysistest
golang.org/x/tools/internal/testenv
golang.org/x/mod/modfile

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

without a new go module, the changes are made in 7a6a943

// - nil: indicates that the authorization check passed and the operation is permitted.
// - non-nil error: indicates that the authorization failed (either due to insufficient permissions
// or an error encountered during the check), the error provides a slice of several failures at once.
func (a *rbacPreAuthorizer) PreAuthorize(ctx context.Context, manifestManager user.Info, manifestReader io.Reader) (map[string][]rbacv1.PolicyRule, error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd recommend avoiding returning a map because of the issues it causes with random iteration order.

WDYT about defining type ScopedPolicyRules struct that includes the namespace and list of policy rules, and then returning []ScopedPolicyRules instead?

Comment on lines +61 to +62
// - non-nil error: indicates that the authorization failed (either due to insufficient permissions
// or an error encountered during the check), the error provides a slice of several failures at once.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Non-nil errors are just when there were issues evaluating the permissions, right? If we successfully evaluate permissions and determine that some rules are missing, then we'll return a nil error with some missing rules.

I would also say that the returned error may include multiple failures encountered during rule evaluation (but it also may not, for example if there was a failure decoding the manifest).

joelanford and others added 12 commits March 18, 2025 09:10
This is the manual version. Needed to change rbac.go a bit to allow for v1.32.2 code changes, but basically as-was

Signed-off-by: Brett Tofel <[email protected]>
go.mod made in the current form the tool generates

Signed-off-by: Brett Tofel <[email protected]>
Signed-off-by: Brett Tofel <[email protected]>
Signed-off-by: Brett Tofel <[email protected]>
Signed-off-by: Brett Tofel <[email protected]>
bentito added 2 commits March 18, 2025 09:25
rbac_test.go likely coming soon

Signed-off-by: Brett Tofel <[email protected]>
@bentito bentito force-pushed the rbac-auth-k8s-replacer branch from 7a6a943 to e974006 Compare March 18, 2025 13:44
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 18, 2025
@trgeiger trgeiger force-pushed the rbac-auth-k8s-replacer branch from e974006 to 8f76fa8 Compare March 18, 2025 14:29
trgeiger and others added 5 commits March 18, 2025 10:48
Use []ScopedPolicyRules struct for first return value in PreAuthorize()
to avoid issues with random iteration order in previous map return
value.

Signed-off-by: Tayler Geiger <[email protected]>
Name: fmt.Sprintf("system:serviceaccount:%s:%s", ns, saName),
}
// Ensure the manifest contains only allowed rules so that the escalation check succeeds with no missing rules
modifiedManifest := strings.Replace(testManifest, `- apiGroups: ["*"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this necessary? The escalation check should pass with the create certificates bit included, that's the verb that we're testing we can escalate for, no?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this is meant to be a happy path test, right? It seems like the dry-run w/ fake storage we're now using is correctly catching that the manifest includes a certificates rule that the service account w/ escalatingClusterRole doesn’t have permission to escalate to create the certificate. I think we were missing what should have been an escalation error before?

This version of the test: uses privilegedClusterRole, new test name, passes:

func TestPreAuthorize_SuccessWithFullPermissions(t *testing.T) {
	t.Run("preauthorize succeeds with full permissions using unchanged manifest", func(t *testing.T) {
		featuregatetesting.SetFeatureGateDuringTest(t, features.OperatorControllerFeatureGate, features.PreflightPermissions, true)
		// Use a client configured with a privileged cluster role that grants all permissions,
		// including certificate creation.
		fakeClient := setupFakeClient(privilegedClusterRole)
		preAuth := NewRBACPreAuthorizer(fakeClient)
		testServiceAccount := user.DefaultInfo{
			Name: fmt.Sprintf("system:serviceaccount:%s:%s", ns, saName),
		}
		// Use the unchanged testManifest which includes the certificates rule.
		missingRules, err := preAuth.PreAuthorize(context.TODO(), &testServiceAccount, strings.NewReader(testManifest))
		require.NoError(t, err)
		require.Equal(t, []ScopedPolicyRules{}, missingRules)
	})
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I must still be misunderstanding how the escalation works or something because I wrote the test the way it was originally, where the SA does not have the "create" on "certificates" explicitly granted, so we can test if the permissions checking logic results in no missing rules and passes, because as I understood it, if the SA has all the verbs on RBAC objects and is also given bind/escalate, then additional rules like "create" on "certificates" that it doesn't explicitly start with should not cause a failure of the permissions check because it can bind or escalate those. When I ran this test locally in its original form where the SA has bind/escalate but not "create certificates," it passes with no missing rules as I expected.

`
missingRules, err := preAuth.PreAuthorize(context.TODO(), &testServiceAccount, strings.NewReader(manifest))
require.Error(t, err)
// Instead of calling it opaque, we check that the error contains "forbidden:" to indicate it's from storage-layer checks
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this not just a failed escalation check? How is that the storage layer?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably redo the comments, I just made them quickly to keep it straight in my head where the logic that created the error came from, but that is from the Create with fakestorage, coming from this code, for instance:

https://github.com/kubernetes/kubernetes/blob/master/pkg/registry/rbac/clusterrolebinding/policybased/storage.go#L123

@@ -255,16 +257,20 @@ type decodedManifest struct {
func (dm *decodedManifest) rbacObjects() []client.Object {
objects := make([]client.Object, 0, len(dm.clusterRoles)+len(dm.roles)+len(dm.clusterRoleBindings)+len(dm.roleBindings))
for obj := range maps.Values(dm.clusterRoles) {
objects = append(objects, &obj)
o := obj // avoid aliasing
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is no longer a problem in 1.22+

Comment on lines 80 to 81
// If there are errors here, you might choose to log them or collect them.
// We'll still try to run the upstream escalation check.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We still need to collect and return these errors. This is the equivalent of the SelfSubjectAccessReview. That the check itself fails to even tell us whether we had missing rules, we need to tell the user what happened.

for _, obj := range dm.rbacObjects() {
if err := ec.checkEscalation(ctx, manifestManager, obj); err != nil {
// In Kubernetes 1.32.2 the specialized PrivilegeEscalationError is gone.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the error type that I added in my copied code so that we could extract the missing rules and separate out the rule evaluation errors.

When we get to phase 2 of this work, we're going to have to do something like that regardless.

Comment on lines 105 to 106
// If escalation check fails, return the detailed missing rules along with the error.
if len(escalationErrors) > 0 {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The missingRules are separate from the escalation check. If there are missingRules, but there are no escalation errors, we still need to report the missing rules.

For example, consider a manifest where the only contents are a secret and configmap. There will never be an escalation there because only RBAC objects can fail escalation checks. If a user lacks permissions on either secrets or configmaps, we need to fail return those and fail the preflight check.

return allMissingPolicyRules, fmt.Errorf("authorization evaluation errors: %w", errors.Join(preAuthEvaluationErrors...))
}
return allMissingPolicyRules, nil
// If the escalation check passed, override any computed missing rules (since the final decision is allowed).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See my other comment. This is incorrect. The missing rules from authorizeAttributesRecords need to be unioned with the missing rules from the escalation checks.

The fact that all we're getting now is the opaque string-based errors from the escalation check is problematic even now because we can't compute this union, which means we are very likely going to repeat missing rules and make the condition messages longer and more confusing.

Comment on lines 402 to 406
// Merge extra Role rules if present.
key := types.NamespacedName{Namespace: v.Namespace, Name: v.Name}
if extra, ok := ec.extraRoles[key]; ok {
v.Rules = append(v.Rules, extra.Rules...)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not the right place to merge extra rules. The extraRoles and extraClusterRoles need to be used to augment rules returned by the rules resolver when evaluating a role binding or cluster role binding.

The rules resolver only knows about roles and cluster roles that exist in our cache (via KAS and etcd). However, we may have role bindings and cluster role bindings in our manifest that reference other roles and cluster roles in our manifest.

So when ConfirmNoEscalation eventually executes, the rules passed to it need to include the rules that would be bound if the referenced roles or cluster roles did exist.

featuregatetesting.SetFeatureGateDuringTest(t, features.OperatorControllerFeatureGate, features.PreflightPermissions, true)
fakeClient := setupFakeClient(privilegedClusterRole)
preAuth := NewRBACPreAuthorizer(fakeClient)
testServiceAccount := user.DefaultInfo{Name: fmt.Sprintf("system:serviceaccount:%s:%s", ns, saName)}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can DRY up these testServiceAccount initializations by making it a global since I think we use the exact same value every time

@trgeiger
Copy link
Contributor

Opened kubernetes/kubernetes#130955 to hopefully get the PrivilegedEscalationError upstreamed in k8s

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants