docs: add RFC about CRD naming and policy lifecycle #45

|              |                                                       |
| :----------- | :---------------------------------------------------- |
| Feature Name | CRD revisit and user workflow                         |
| Start Date   | 20 Nov 2025                                           |
| Category     | CRDs                                                  |
| RFC PR       | https://github.com/neuvector/runtime-enforcer/pull/45 |
| State        | **ACCEPTED**                                          |

# Summary

[summary]: #summary

This RFC summarizes the discussion so far about the policy lifecycle and aims to stabilize the CRDs in terms of lifecycle, names, and possible interactions.

# Motivation

[motivation]: #motivation

Before implementing a runtime enforcement workflow, in this post-POC phase we want to reach consensus on two topics:

- Kubernetes CRD names and specifications
- The user journey and workflow, especially when it is not driven by a UI

## Examples / User Stories

[examples]: #examples

The following user stories are intended as examples:

- As a user, I want to configure a security policy for a given workload
- As a user, I want the processes that run in my workloads to be learned automatically and proposed to me
- As a user, I want to inherit the security policy for my workload from a pre-existing template
- As a user, I want to promote a policy proposal to an actually deployed security policy

# Detailed design

[design]: #detailed-design

## CRDs Overview

This is a quick overview of all the CRDs we’re going to define. Each one of them is described in depth in the following sections.

| CRD Current Name               | CRD New Name           | Description                                                                                                        |
| ------------------------------ | ---------------------- | ------------------------------------------------------------------------------------------------------------------ |
| WorkloadSecurityPolicyProposal | WorkloadPolicyProposal | Proposed policies learned from workload behavior. Now includes per-container rules.                                |
| WorkloadSecurityPolicy         | WorkloadPolicy         | Defines the enforcement policy (monitor/protect) for a workload, grouping per-container rules or image references. |
| ClusterWorkloadSecurityPolicy  | (Removed)              | Replaced by ImagePolicy for cluster-wide reusable profiles.                                                        |
| (New)                          | ImagePolicy            | Defines reusable runtime rules (templates) based on container image, used for policy templating.                   |

Changes from the previous version:
- WorkloadSecurityPolicy was renamed to WorkloadPolicy

## Learning Phase

During learning mode, we create WorkloadPolicyProposal resources, structured as follows:

```yaml
apiVersion: security.rancher.io/v1alpha1
kind: WorkloadPolicyProposal
metadata:
  name: deploy-pgsql-8646457455 # <workload_type>-<workload_name>
  ownerReferences:
    - apiVersion: apps/v1
      kind: Deployment
      name: pgsql-8646457455
      uid: 39a32022-4c8f-424e-a8b6-3c92af3acb2e
spec:
  rulesByContainer:
    "db-migration": # rules applied to the container named "db-migration"
      executables:
        allowed:
          - /bin/bash
          - /usr/bin/psql
    "postgres": # rules applied to the container named "postgres"
      executables:
        allowed:
          - /usr/bin/psql
    "otel-collector": # rules applied to the container named "otel-collector"
      executables:
        allowed:
          - /usr/bin/otel-collector
```

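The `<workload_type>-<workload_name>` convention from the comment in the example above can be illustrated with a minimal Go sketch. Only the `deploy` prefix for a Deployment comes from the RFC; the other abbreviations, the `kindPrefix` table, and the `proposalName` helper are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// kindPrefix maps a workload kind to its <workload_type> prefix.
// Only "deploy" (Deployment) appears in the RFC; the rest are hypothetical.
var kindPrefix = map[string]string{
	"Deployment":  "deploy",
	"StatefulSet": "sts",
	"DaemonSet":   "ds",
	"CronJob":     "cronjob",
	"Job":         "job",
}

// proposalName builds <workload_type>-<workload_name>, falling back to the
// lowercased kind for kinds that are missing from kindPrefix.
func proposalName(kind, workloadName string) string {
	prefix, ok := kindPrefix[kind]
	if !ok {
		prefix = strings.ToLower(kind)
	}
	return prefix + "-" + workloadName
}

func main() {
	fmt.Println(proposalName("Deployment", "pgsql-8646457455")) // deploy-pgsql-8646457455
	fmt.Println(proposalName("StatefulSet", "pgsql"))           // sts-pgsql
}
```

A real implementation would also have to keep the generated name within the Kubernetes object-name length limit.
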
Changes compared to the current implementation:
- The `rules` section has been replaced by `rulesByContainer`. This new field holds a map with container names as keys and the corresponding container rules as values.

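To make the shape of `rulesByContainer` concrete, the following Go sketch folds observed process executions into a map keyed by container name, holding a sorted, de-duplicated list of allowed executables. The `ExecEvent` type and the `learnRules` function are hypothetical; the RFC does not describe how the learning pipeline is implemented.

```go
package main

import (
	"fmt"
	"sort"
)

// ExecEvent is a hypothetical record of one observed process execution.
type ExecEvent struct {
	Container  string
	Executable string
}

// ExecutableRules mirrors the `executables.allowed` part of a proposal.
type ExecutableRules struct {
	Allowed []string
}

// learnRules builds container name -> sorted, de-duplicated allowed executables.
func learnRules(events []ExecEvent) map[string]ExecutableRules {
	seen := make(map[string]map[string]bool)
	for _, e := range events {
		if seen[e.Container] == nil {
			seen[e.Container] = make(map[string]bool)
		}
		seen[e.Container][e.Executable] = true
	}
	rules := make(map[string]ExecutableRules, len(seen))
	for container, exes := range seen {
		allowed := make([]string, 0, len(exes))
		for exe := range exes {
			allowed = append(allowed, exe)
		}
		sort.Strings(allowed) // map iteration order is random; sort for stable output
		rules[container] = ExecutableRules{Allowed: allowed}
	}
	return rules
}

func main() {
	rules := learnRules([]ExecEvent{
		{"db-migration", "/usr/bin/psql"},
		{"db-migration", "/bin/bash"},
		{"postgres", "/usr/bin/psql"},
		{"db-migration", "/bin/bash"}, // duplicates are collapsed
	})
	fmt.Println(rules["db-migration"].Allowed) // [/bin/bash /usr/bin/psql]
	fmt.Println(rules["postgres"].Allowed)     // [/usr/bin/psql]
}
```
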
Notes on the behavior:

- The WorkloadPolicyProposal has an ownerReference that ties it back to the workload resource whose behavior was observed.
- When the observed workload is deleted, the associated WorkloadPolicyProposal is deleted as well.
- When a proposal is promoted to a real policy, the proposal is deleted and is not recreated.
- During a workload rollout, the WorkloadPolicyProposal keeps learning as if nothing had happened.

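Promoting a proposal to a policy can be sketched in Go as a structural copy: the proposal keys executables directly under each container, while the WorkloadPolicy nests them under `rules` and adds a `mode`. All type and function names here are hypothetical, and deleting the proposal after promotion, as described in the notes above, is left to the caller.

```go
package main

import "fmt"

// ExecutableRules mirrors `executables.allowed`.
type ExecutableRules struct{ Allowed []string }

// ProposalContainerRules mirrors a proposal entry: container -> executables.
type ProposalContainerRules struct{ Executables ExecutableRules }

// Rules and PolicyContainerRules mirror a policy entry: container -> rules -> executables.
type Rules struct{ Executables ExecutableRules }
type PolicyContainerRules struct{ Rules Rules }

// Mode is the enforcement mode of a WorkloadPolicy.
type Mode string

const (
	ModeMonitor Mode = "monitor"
	ModeProtect Mode = "protect"
)

// WorkloadPolicySpec is a simplified stand-in for the WorkloadPolicy spec.
type WorkloadPolicySpec struct {
	Mode             Mode
	RulesByContainer map[string]PolicyContainerRules
}

// promote converts learned per-container rules into a WorkloadPolicy spec.
func promote(proposal map[string]ProposalContainerRules, mode Mode) (WorkloadPolicySpec, error) {
	if mode != ModeMonitor && mode != ModeProtect {
		return WorkloadPolicySpec{}, fmt.Errorf("invalid mode %q: must be %q or %q", mode, ModeMonitor, ModeProtect)
	}
	spec := WorkloadPolicySpec{
		Mode:             mode,
		RulesByContainer: make(map[string]PolicyContainerRules, len(proposal)),
	}
	for container, r := range proposal {
		// Copy the slice so the policy does not alias the proposal's data.
		allowed := append([]string(nil), r.Executables.Allowed...)
		spec.RulesByContainer[container] = PolicyContainerRules{
			Rules: Rules{Executables: ExecutableRules{Allowed: allowed}},
		}
	}
	return spec, nil
}

func main() {
	proposal := map[string]ProposalContainerRules{
		"postgres": {Executables: ExecutableRules{Allowed: []string{"/usr/bin/psql"}}},
	}
	spec, err := promote(proposal, ModeMonitor)
	if err != nil {
		panic(err)
	}
	fmt.Println(spec.Mode, spec.RulesByContainer["postgres"].Rules.Executables.Allowed)
	// monitor [/usr/bin/psql]
}
```
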
## The WorkloadPolicy resource

Policies are defined using the WorkloadPolicy resource, which looks like this:

```yaml
apiVersion: security.rancher.io/v1alpha1
kind: WorkloadPolicy
metadata:
  name: deploy-pgsql-8646457455
  namespace: default
spec:
  mode: monitor # monitor | protect
  rulesByContainer:
    postgres:
      rules:
        executables:
          allowed:
            - /usr/bin/psql
    otel-collector:
      rules:
        executables:
          allowed:
            - /usr/bin/otel-collector
    db-migration:
      rules:
        executables:
          allowed:
            - /bin/bash
            - /usr/bin/psql
```

Changes compared to the current implementation:

- The `rules` section has been replaced by `rulesByContainer`. This new field holds a map with container names as keys and the corresponding container rules as values.

Notes on the behavior:

- When the enforced workload is deleted, the WorkloadPolicy is kept and must be deleted manually.
- During a workload rollout, the WorkloadPolicy remains unchanged. If the policy causes issues with the rollout, the user is responsible for rolling back to the previous version or deleting the policy.

## Binding a WorkloadPolicy

A workload is protected by a WorkloadPolicy through a podSelector. We suggest using a unique label, `security.rancher.io/policy`, but we don’t enforce it by default, since adding it to `spec.template` would trigger a rollout.

- Basic user -> use default k8s workload selectors -> everything works out of the box, no rollout required.
- Advanced user (real production scenario) -> enforce a unique label on workloads and use this label as a selector -> a rollout could be required if the workload was initially created without the label.

Since the label is not mandatory, we cannot rely on it to determine whether a workload is covered. Instead, we should provide a kubectl plugin that scans the resources and helps the user understand the situation (potential conflicts, partial workload coverage, ...).

Users who choose to apply the unique label can still rely on it and on simple kubectl commands. Our kubectl plugin should be generic and also cover cases where the label is not used.

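The coverage analysis such a kubectl plugin would perform can be sketched in Go with plain `matchLabels` semantics: every key/value pair in a policy's selector must appear on the Pod. The `Pod` and `PolicySelector` types and the `coverage` helper are hypothetical simplifications (no `matchExpressions`, no namespaces). A Pod selected by no policy is uncovered, and one selected by several policies is a potential conflict.

```go
package main

import (
	"fmt"
	"sort"
)

// Pod is a simplified view of a Pod: a name and its labels.
type Pod struct {
	Name   string
	Labels map[string]string
}

// PolicySelector is a simplified WorkloadPolicy podSelector (matchLabels only).
type PolicySelector struct {
	PolicyName  string
	MatchLabels map[string]string
}

// matches implements matchLabels semantics: every selector pair must be present
// on the Pod. An empty selector matches every Pod.
func matches(selector, labels map[string]string) bool {
	for k, v := range selector {
		if got, ok := labels[k]; !ok || got != v {
			return false
		}
	}
	return true
}

// coverage returns, for each Pod name, the sorted names of the policies selecting it.
func coverage(pods []Pod, policies []PolicySelector) map[string][]string {
	out := make(map[string][]string, len(pods))
	for _, p := range pods {
		names := []string{}
		for _, pol := range policies {
			if matches(pol.MatchLabels, p.Labels) {
				names = append(names, pol.PolicyName)
			}
		}
		sort.Strings(names)
		out[p.Name] = names
	}
	return out
}

func main() {
	pods := []Pod{
		{Name: "pgsql-0", Labels: map[string]string{"app": "postgres", "security.rancher.io/policy": "postgres-policy"}},
		{Name: "web-0", Labels: map[string]string{"app": "web"}},
	}
	policies := []PolicySelector{
		{PolicyName: "postgres-policy", MatchLabels: map[string]string{"security.rancher.io/policy": "postgres-policy"}},
		{PolicyName: "legacy-db", MatchLabels: map[string]string{"app": "postgres"}},
	}
	cov := coverage(pods, policies)
	for _, p := range pods {
		switch names := cov[p.Name]; len(names) {
		case 0:
			fmt.Printf("%s: not covered\n", p.Name)
		case 1:
			fmt.Printf("%s: covered by %s\n", p.Name, names[0])
		default:
			fmt.Printf("%s: potential conflict %v\n", p.Name, names)
		}
	}
	// pgsql-0: potential conflict [legacy-db postgres-policy]
	// web-0: not covered
}
```
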
## Using the ImagePolicy to inherit rules from pre-made templates

Pods are made of containers, each running a container image. The same container image can be reused by multiple Pods, and its runtime behavior stays mostly the same across them.

Most of the time, a Redis/Tomcat/NodeJS container image behaves the same way. There can be exceptions, and we must take that scenario into account.

Vendors already distribute maintained container images through their platforms. It makes sense to tie our profiles to container images rather than to the concept of a “workload”.

Let's define an ImagePolicy:

```yaml
apiVersion: security.rancher.io/v1alpha1
kind: ImagePolicy
metadata:
  name: otel-collector
spec:
  image: # optional - inspired by SBOMScanner's imageMetadata
    registry: "registry.suse.com"
    repository: "otel-collector"
    tag: "v1.0.0"
    digest: "sha256:1234567890"
  rules:
    executables:
      allowed:
        - /usr/bin/otel-collector
```

Then it can be consumed by a WorkloadPolicy in this way:

```yaml
apiVersion: security.rancher.io/v1alpha1
kind: WorkloadPolicy
metadata:
  name: postgres-policy
  namespace: default
spec:
  mode: monitor # monitor | protect
  rulesByContainer:
    postgres:
      rules:
        executables:
          allowed:
            - /usr/bin/psql
    otel-collector:
      imagePolicyRef: otel-collector # name of the ImagePolicy
    db-migration:
      rules:
        executables:
          allowed:
            - /bin/bash
            - /usr/bin/psql
```

When defining the rules of a container, the user can either define a list of explicit rules or reference an existing ImagePolicy by using the `imagePolicyRef` attribute. In the first implementation, it will not be possible to define both `rules` and `imagePolicyRef` for the same container.

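The either/or rule for a container entry can be sketched in Go as a resolver that returns the effective rules for one container. The types and `effectiveRules` are hypothetical, and rejecting an entry that sets neither field is an assumption of this sketch, not something the RFC specifies. ImagePolicies are looked up by name only, because they are cluster-scoped.

```go
package main

import (
	"errors"
	"fmt"
)

// ExecutableRules mirrors `executables.allowed`.
type ExecutableRules struct{ Allowed []string }

// Rules mirrors the explicit `rules` block of a container.
type Rules struct{ Executables ExecutableRules }

// ContainerRules is one rulesByContainer entry: inline Rules or an
// ImagePolicyRef, never both in the first implementation.
type ContainerRules struct {
	Rules          *Rules
	ImagePolicyRef string
}

// ImagePolicies is keyed by name only, since ImagePolicy is cluster-scoped.
type ImagePolicies map[string]Rules

// effectiveRules resolves the rules that apply to a single container.
func effectiveRules(c ContainerRules, ips ImagePolicies) (Rules, error) {
	switch {
	case c.Rules != nil && c.ImagePolicyRef != "":
		return Rules{}, errors.New("rules and imagePolicyRef are mutually exclusive")
	case c.Rules != nil:
		return *c.Rules, nil
	case c.ImagePolicyRef != "":
		r, ok := ips[c.ImagePolicyRef]
		if !ok {
			return Rules{}, fmt.Errorf("ImagePolicy %q not found", c.ImagePolicyRef)
		}
		return r, nil
	default:
		// Assumption: an entry with neither field set is rejected.
		return Rules{}, errors.New("either rules or imagePolicyRef must be set")
	}
}

func main() {
	ips := ImagePolicies{
		"otel-collector": {Executables: ExecutableRules{Allowed: []string{"/usr/bin/otel-collector"}}},
	}
	r, err := effectiveRules(ContainerRules{ImagePolicyRef: "otel-collector"}, ips)
	fmt.Println(r.Executables.Allowed, err) // [/usr/bin/otel-collector] <nil>
}
```
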
To avoid references to ImagePolicy objects that don't exist, we must:

- Introduce a validating webhook that ensures all the ImagePolicy objects referenced by a WorkloadPolicy exist. The webhook would process CREATE and UPDATE events.
- Add a finalizer to each ImagePolicy, so that an ImagePolicy resource is actually removed only when no WorkloadPolicy references it.

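Both safeguards boil down to reference checks, sketched below in Go. `missingRefs` is the kind of check the validating webhook could run on CREATE and UPDATE of a WorkloadPolicy, and `canRemoveFinalizer` is the condition under which the ImagePolicy controller could drop its finalizer. The types and function names are hypothetical, and a real implementation would list objects through the API server rather than take plain slices and maps.

```go
package main

import (
	"fmt"
	"sort"
)

// WorkloadPolicy is a simplified view: container name -> referenced
// ImagePolicy name (empty when the container uses inline rules).
type WorkloadPolicy struct {
	Namespace       string
	Name            string
	ImagePolicyRefs map[string]string
}

// missingRefs returns "container -> imagePolicy" pairs whose ImagePolicy does
// not exist; the webhook would reject the request if this is non-empty.
func missingRefs(p WorkloadPolicy, existing map[string]bool) []string {
	var missing []string
	for container, ref := range p.ImagePolicyRefs {
		if ref != "" && !existing[ref] {
			missing = append(missing, container+" -> "+ref)
		}
	}
	sort.Strings(missing) // map iteration order is random; sort for stable messages
	return missing
}

// referencedBy lists "namespace/name" of every WorkloadPolicy that still
// references the given ImagePolicy.
func referencedBy(imagePolicy string, policies []WorkloadPolicy) []string {
	var refs []string
	for _, p := range policies {
		for _, ref := range p.ImagePolicyRefs {
			if ref == imagePolicy {
				refs = append(refs, p.Namespace+"/"+p.Name)
				break // one match per policy is enough
			}
		}
	}
	return refs
}

// canRemoveFinalizer reports whether an ImagePolicy marked for deletion can be released.
func canRemoveFinalizer(imagePolicy string, policies []WorkloadPolicy) bool {
	return len(referencedBy(imagePolicy, policies)) == 0
}

func main() {
	existing := map[string]bool{"otel-collector": true}
	policy := WorkloadPolicy{
		Namespace: "default",
		Name:      "postgres-policy",
		ImagePolicyRefs: map[string]string{
			"otel-collector": "otel-collector",
			"postgres":       "",
			"db-migration":   "psql-init",
		},
	}
	fmt.Println(missingRefs(policy, existing))                                  // [db-migration -> psql-init]
	fmt.Println(canRemoveFinalizer("otel-collector", []WorkloadPolicy{policy})) // false
}
```

The finalizer matters because the webhook only checks references at write time: an ImagePolicy deleted afterwards would otherwise leave a WorkloadPolicy pointing at nothing.
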
ImagePolicy resources aren't namespaced: they are cluster-scoped resources that can be referenced by any other resource.

## Handling Violations in Monitor/Protect Mode

When a WorkloadPolicy is in monitor or protect mode, the runtime enforcer generates violation notifications for processes that are not on the allow list. The difference is that in monitor mode the violating processes are still allowed to run, while in protect mode they are blocked.

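The difference between the two modes can be summarized as a small decision function. This Go sketch covers only the policy logic for a single execution attempt, not how the enforcer actually intercepts process execution; `decide`, `Decision`, and `Mode` are hypothetical names.

```go
package main

import "fmt"

// Mode is the WorkloadPolicy enforcement mode.
type Mode string

const (
	Monitor Mode = "monitor"
	Protect Mode = "protect"
)

// Decision is the outcome for a single process execution.
type Decision struct {
	Violation bool // the executable is not on the allow list
	Block     bool // the execution is prevented (protect mode only)
}

// decide applies the allow list: both modes report violations, only protect blocks them.
func decide(mode Mode, allowed []string, executable string) Decision {
	for _, a := range allowed {
		if a == executable {
			return Decision{}
		}
	}
	return Decision{Violation: true, Block: mode == Protect}
}

func main() {
	allowed := []string{"/usr/bin/psql"}
	fmt.Printf("%+v\n", decide(Monitor, allowed, "/usr/bin/wget")) // {Violation:true Block:false}
	fmt.Printf("%+v\n", decide(Protect, allowed, "/usr/bin/wget")) // {Violation:true Block:true}
	fmt.Printf("%+v\n", decide(Protect, allowed, "/usr/bin/psql")) // {Violation:false Block:false}
}
```
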
A notification is sent to the Security Hub in the form of an OpenTelemetry event.

In this version we are going to create a new CRD for tuning a WorkloadPolicy, which holds the policy's violation data while the policy is in **monitor** mode.

When the policy is in protect mode, OpenTelemetry events will be the only way to get notified about attempted violations.

At the moment, the idea is to store this data in the tuning CRD to keep the WorkloadPolicy resource small.

```yaml
apiVersion: security.rancher.io/v1alpha1
kind: WorkloadPolicyTuning
metadata:
  name: postgres-policy
  namespace: default
spec:
  # ...
status:
  violations:
    lastObservedTimestamp: "2025-11-14T17:40:00Z"
    totalViolations: 42
    latestEvents:
      - containerName: postgres
        executable: /usr/bin/wget
        timestamp: "2025-11-14T17:39:50Z"
      - containerName: db-migration
        executable: /bin/sh
        timestamp: "2025-11-14T17:39:55Z"
```

The design is not definitive, but the idea is:

- Users without the UI will simply update the tuning resource manually if they want to tolerate some violations.
- The Rancher extension will use this status to patch the resource with the desired changes, based on the user input.

An alternative design, with a map of unique violations, could be the following:

```yaml
status:
  violations:
    lastObservedTimestamp: "2025-11-14T17:40:00Z"
    totalViolations: 42
    containerViolations:
      postgres:
        "/usr/bin/wget":
          count: 15
          lastObservedMode: protect
          lastObservedTimestamp: "2025-11-14T17:39:50Z"
        "/usr/local/bin/curl":
          count: 1
          lastObservedMode: monitor
          lastObservedTimestamp: "2025-11-14T17:40:00Z"
      db-migration:
        "/bin/sh":
          count: 27
          lastObservedMode: monitor
          lastObservedTimestamp: "2025-11-14T17:39:55Z"
```

At this stage we don't want to commit to the name of the WorkloadPolicyTuning resource, since we will certainly revisit at least its name. We decided to defer that to a dedicated RFC once we implement tuning for policies.

# Drawbacks

[drawbacks]: #drawbacks

We didn't observe any particular drawback in the workflow. However, there are some considerations to make:

- Specifying rules per container gives us more granularity and lets us support more scenarios (init containers, sidecars); on the other hand, it will have a performance impact that we'll have to measure and document.

# Alternatives

[alternatives]: #alternatives

We considered several alternatives, for example merging the ImagePolicy and the WorkloadPolicy:

```yaml
apiVersion: security.rancher.io/v1alpha1
kind: WorkloadSecurityPolicy
metadata:
  name: database
spec:
  mode: monitor # monitor/protect
  selector:
    matchLabels:
      app: postgres
  policies:
    # ImagePolicy profile to apply to the container named "db-migration"
    "db-migration": psql-init
    "postgres": psql
    "otel-collector": otel-sidecar
```

This didn't work out, because this approach makes it very hard to achieve the granularity we wanted, even for a first POC meant to stand the test of time.

We also experimented with applying annotations to Pods that directly reference the ImagePolicy, but this didn't lead to any good-enough conclusion.

# Unresolved questions

[unresolved]: #unresolved-questions

- How do we name the policy tuning CRD?