rancher-sandbox · dottorblaster · Nov 26, 2025 · Nov 20, 2025 · Nov 21, 2025 · Nov 24, 2025
diff --git a/docs/rfc/0004-crds-policy-lifecycle.md b/docs/rfc/0004-crds-policy-lifecycle.md
@@ -0,0 +1,310 @@
+|              |                                                       |
+| :----------- | :---------------------------------------------------- |
+| Feature Name | CRD revisit and user workflow                         |
+| Start Date   | 20 Nov 2025                                           |
+| Category     | CRDs                                                  |
+| RFC PR       | https://github.com/neuvector/runtime-enforcer/pull/45 |
+| State        | **ACCEPTED**                                          |
+
+# Summary
+
+[summary]: #summary
+
+This RFC summarizes the discussion so far about the policy lifecycle, and aims to stabilize the CRDs in terms of lifecycle, names, and possible interactions.
+
+# Motivation
+
+[motivation]: #motivation
+
+Before implementing a runtime enforcement workflow, in this post-POC phase we want to reach consensus on two topics:
+
+- Kubernetes' CRD names and specifications
+- The user journey and workflow, especially when not backed by a UI of some sort
+
+## Examples / User Stories
+
+[examples]: #examples
+
+The following user stories are intended as examples:
+
+- As a user I want to configure a security policy for a given workload
+- As a user I want the processes that run in my workloads to be learned automatically and proposed to me
+- As a user I want to inherit the security policy for my workload from a pre-existing template
+- As a user I want to promote a policy proposal to an actual deployed security policy
+
+# Detailed design
+
+[design]: #detailed-design
+
+## CRDs Overview
+This is a quick overview of all the CRDs we’re going to define. Each one of them is going to be described in depth in the next sections.
+
+| CRD Current Name               | CRD New Name           | Description                                                                                                        |
+| ------------------------------ | ---------------------- | ------------------------------------------------------------------------------------------------------------------ |
+| WorkloadSecurityPolicyProposal | WorkloadPolicyProposal | Proposed policies learned from workload behavior. Now includes per-container rules.                                |
+| WorkloadSecurityPolicy         | WorkloadPolicy         | Defines the enforcement policy (monitor/protect) for a workload, grouping per-container rules or image references. |
+| ClusterWorkloadSecurityPolicy  | (Removed)              | Replaced by ImagePolicy for cluster-wide reusable profiles.                                                        |
+| (New)                          | ImagePolicy            | Defines reusable runtime rules (templates) based on container image, used for policy templating.                   |
+
+Changes from the previous version:
+- WorkloadSecurityPolicy was renamed to WorkloadPolicy
+
+## Learning Phase
+
+During learning mode, we create WorkloadPolicyProposal resources. These resources are structured in this way:
+
+```yaml
+apiVersion: security.rancher.io/v1alpha1
+kind: WorkloadPolicyProposal
+metadata:
+  name: statefulsets-pgsql # <workload_type>-<workload_name>
+  ownerReferences:
+  - apiVersion: apps/v1
+    kind: StatefulSet
+    name: pgsql
+    uid: 39a32022-4c8f-424e-a8b6-3c92af3acb2e
+spec:
+  rulesByContainer:
+    "db-migration": # rules applied to the container named "db-migration"
+       executables:
+         allowed:
+           - /bin/bash
+           - /usr/bin/psql
+    "postgres": # rules applied to the container named "postgres"
+       executables:
+         allowed:
+           - /usr/bin/psql
+    "otel-collector": # rules applied to the container named "otel-collector"
+       executables:
+         allowed:
+           - /usr/bin/otel-collector
+```
+
+Changes compared to the current implementation:
+- The rules section has been replaced by rulesByContainer. This new field holds a map keyed by container name, whose values are the rules for that container.
+
+Notes on the behavior:
+
+- The WorkloadPolicyProposal has an `ownerReference` that ties it back to the workload resource for which the behaviour was observed.
+- When the observed workload is deleted, the associated WorkloadPolicyProposal is deleted as well.
+- When we switch from a proposal to a real policy, we delete the proposal and do not recreate it.
+- In case of a workload rollout, the WorkloadPolicyProposal continues to learn as if nothing happened.
+
+## The WorkloadPolicy resource
+Policies are defined using the WorkloadPolicy resource. This is how this resource looks:
+
+```yaml
+apiVersion: security.rancher.io/v1alpha1
+kind: WorkloadPolicy
+metadata:
+  name: statefulsets-pgsql
+  namespace: default
+spec:
+  mode: monitor # monitor | protect
+  rulesByContainer:
+    postgres:
+      rules:
+        executables:
+          allowed:
+            - /usr/bin/psql
+    otel-collector: 
+      rules:
+        executables:
+          allowed:
+            - /usr/bin/otel-collector
+    db-migration:
+      rules:
+        executables:
+          allowed:
+            - /bin/bash
+            - /usr/bin/psql
+```
+
+Changes compared to the current implementation:
+
+- The rules section has been replaced by rulesByContainer. This new field holds a map keyed by container name, whose values are the rules for that container.
+
+Notes on the behavior:
+
+- When the enforced workload is deleted, the WorkloadPolicy is not deleted automatically; it must be deleted manually.
+- In case of a workload rollout, the WorkloadPolicy remains unchanged. If it causes issues with the rollout, the user is in charge of rolling back to the previous version or deleting the policy.
+
+## Binding a WorkloadPolicy
+A workload is protected by a WorkloadPolicy through a podSelector. We suggest using a unique label, `security.rancher.io/policy`, but we don’t enforce it by default, since adding it to `spec.template` would cause a rollout.
+
+- Basic user -> use default k8s workload selectors -> everything works out of the box, no rollout required.
+- Advanced user (real production scenario) -> enforce a unique label on workloads and use this label as a selector -> a rollout could be required if the workload was initially created without the label
+
+Since the label is not compulsory, we cannot rely on it to understand whether a workload is covered or not; we should provide a kubectl plugin that inspects the resources and helps the user understand the situation (potential conflicts, partial workload coverage, ...).
+
+Users who choose to use the unique label can still rely on it, and therefore on simple kubectl commands. Our kubectl plugin should be generic and also cover cases where the label is not used.
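+
+As a minimal sketch of the advanced-user case, the following WorkloadPolicy selects Pods through the suggested label. The `selector`/`matchLabels` layout is an assumption borrowed from the example in the Alternatives section, and the label value `pgsql` is hypothetical; this RFC does not fix the final name or shape of the selector field.
+
+```yaml
+# Hypothetical sketch: the field names under spec.selector are assumptions.
+apiVersion: security.rancher.io/v1alpha1
+kind: WorkloadPolicy
+metadata:
+  name: statefulsets-pgsql
+  namespace: default
+spec:
+  mode: monitor # monitor | protect
+  selector:
+    matchLabels:
+      security.rancher.io/policy: pgsql # hypothetical label value
+  rulesByContainer:
+    postgres:
+      rules:
+        executables:
+          allowed:
+            - /usr/bin/psql
+```
+
+With this approach the workload's Pod template must carry the same label, which is what may trigger a rollout when the label is added after the workload was created.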
+
+## Using the ImagePolicy to inherit rules from pre-made templates
+
+Pods are made of containers, each of them running a container image. The same container image can be reused by multiple Pods, and its runtime behavior is mostly the same.
+
+Most of the time, a Redis/Tomcat/NodeJS container image is going to behave in the same way. There could be some exceptions, and we must take that scenario into account.
+
+Vendors already distribute maintained container images through their platforms. It would make sense to tie our profiles to the container images, rather than thinking about the concept of “workload”.
+
+Let's define an ImagePolicy:
+
+```yaml
+apiVersion: security.rancher.io/v1alpha1
+kind: ImagePolicy
+metadata:
+  name: otel-collector
+spec:
+  image: # optional - inspired by SBOMScanner's imageMetadata
+    registry: "registry.suse.com"
+    repository: "otel-collector"
+    tag: "v1.0.0"
+    digest: "sha256:1234567890"
+  rules:
+    executables:
+      allowed:
+        - /usr/bin/otel-collector
+```
+
+Then it can be consumed by a WorkloadPolicy in this way:
+
+```yaml
+apiVersion: security.rancher.io/v1alpha1
+kind: WorkloadPolicy
+metadata:
+  name: postgres-policy
+  namespace: default
+spec:
+  mode: monitor # monitor | protect
+  rulesByContainer:
+    postgres:
+      rules:
+        executables:
+          allowed:
+            - /usr/bin/psql
+    otel-collector:
+      rules:
+        executables:
+          imagePolicyRef: otel-collector # name of the ImagePolicy
+    db-migration:
+      rules:
+        executables:
+          allowed:
+            - /bin/bash
+            - /usr/bin/psql
+```
+
+When defining the rules of a container, the user can either define a list of explicit rules or reference an existing ImagePolicy by using the `imagePolicyRef` attribute. In the first implementation it will not be possible to define both `rules` and `imagePolicyRef` for the same container.
+
+To avoid uncertainty we must:
+
+- Introduce a ValidatingWebhook that ensures all the ImagePolicy objects referenced by a WorkloadPolicy exist. The webhook would process CREATE and UPDATE events.
+- Add a finalizer to each ImagePolicy. The deletion of an ImagePolicy resource must be allowed only when no WorkloadPolicy is referencing it.
+
+ImagePolicy resources aren't namespaced; they are cluster-wide available resources that can be referenced by any other resource.
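+
+To illustrate the finalizer approach, here is a sketch of how a protected ImagePolicy could look while it is still referenced. The finalizer name `security.rancher.io/image-policy-in-use` is hypothetical and not defined by this RFC; the assumption is that the controller removes it once no WorkloadPolicy references the ImagePolicy anymore.
+
+```yaml
+# Hypothetical sketch: the finalizer name is an assumption.
+apiVersion: security.rancher.io/v1alpha1
+kind: ImagePolicy
+metadata:
+  name: otel-collector # cluster-scoped, so no namespace
+  finalizers:
+    - security.rancher.io/image-policy-in-use # hypothetical finalizer name
+spec:
+  rules:
+    executables:
+      allowed:
+        - /usr/bin/otel-collector
+```
+
+While the finalizer is present, a delete request only sets the deletion timestamp on the object; Kubernetes removes the resource after the finalizer list is empty.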
+
+## Handling Violations in Monitor/Protect Mode
+
+When a WorkloadPolicy is in monitor or protect mode, the runtime enforcer generates a violation notification for every process that is not on the allow list. The difference is that in monitor mode the violating process is still allowed to run, while in protect mode it is blocked.
+
+A notification is sent to the Security Hub in the form of an OpenTelemetry event.
+
+In this version we are going to create a new CRD related to the tuning aspects of a WorkloadPolicy, which holds the violation data for the policy while the policy is set in **monitor** mode.
+
+When the policy is in protect mode, the only way of getting a notification about attempted violations will be OpenTelemetry events.
+
+At the moment, the idea is to use the tuning CRD in order to save space on the WorkloadPolicy one.
+
+```yaml
+apiVersion: security.rancher.io/v1alpha1
+kind: WorkloadPolicyTuning
+metadata:
+  name: postgres-policy
+  namespace: default
+spec:
+  # ...
+status:
+  violations:
+    lastObservedTimestamp: "2025-11-14T17:40:00Z"
+    totalViolations: 42
+    latestEvents:
+      - containerName: postgres
+        executable: /usr/bin/wget
+        timestamp: "2025-11-14T17:39:50Z"
+      - containerName: db-migration
+        executable: /bin/sh
+        timestamp: "2025-11-14T17:39:55Z"
+```
+
+The design is not definitive, but the idea is:
+
+- Users without the UI will simply update the tuning resource manually if they want to tolerate some violations.
+- The Rancher extension will use this status to run a kubectl patch with the desired changes based on the user input.
+
+An alternative design with a map of unique violations could be the following:
+
+```yaml 
+status:
+  violations:
+    lastObservedTimestamp: "2025-11-14T17:40:00Z"
+    totalViolations: 42
+    containerViolations:
+      postgres:
+        "/usr/bin/wget":
+          count: 15
+          lastObservedMode: protect
+          lastObservedTimestamp: "2025-11-14T17:39:50Z"
+        "/usr/local/bin/curl":
+          count: 1
+          lastObservedMode: monitor
+          lastObservedTimestamp: "2025-11-14T17:40:00Z"      
+      db-migration:
+        "/bin/sh":
+          count: 27
+          lastObservedMode: monitor
+          lastObservedTimestamp: "2025-11-14T17:39:55Z"
+```
+
+At this stage we don't want to commit to the name of the WorkloadPolicyTuning resource, as we will revisit at least its naming later. We decided to defer that to a dedicated RFC, to be written when we implement tuning for policies.
+
+# Drawbacks
+
+[drawbacks]: #drawbacks
+
+We didn't observe any particular drawback in the workflow. There is, however, one consideration to make:
+
+- Specifying rules per container gives us more granularity and lets us support more scenarios (init containers, sidecars). On the other hand, it will have a performance impact that we'll have to measure and document.
+
+# Alternatives
+
+[alternatives]: #alternatives
+
+We considered several alternatives, for example combining the ImagePolicy and the WorkloadPolicy:
+
+```yaml
+apiVersion: security.rancher.io/v1alpha1
+kind: WorkloadSecurityPolicy
+metadata:
+  name: database
+spec:
+  mode: monitor # monitor/protect
+  selector:
+    matchLabels:
+      app: postgres
+  policies:
+    # ImagePolicy profile to apply to the container named "db-migration"
+    "db-migration": psql-init
+    "postgres": psql
+    "otel-collector": otel-sidecar
+```
+
+But it didn't work out, because this way it becomes very hard to achieve the granularity we wanted, even for a first POC that could stand the test of time.
+
+We also experimented with applying annotations that reference the ImagePolicy directly to pods, but that didn't lead us to any good-enough conclusion.
+
+# Unresolved questions
+
+[unresolved]: #unresolved-questions
+
+- How do we name the policy tuning CRD?
+