# CNI Readiness

In many Kubernetes clusters, the CNI plugin runs as a DaemonSet. When a new node joins the cluster, there is a race condition:
1. The Node object is created and marked `Ready` by the Kubelet.
2. The Scheduler sees the node as `Ready` and schedules application pods.
3. However, the CNI DaemonSet might still be initializing networking on that node.

This guide demonstrates how to use the Node Readiness Controller to prevent pods from being scheduled on a node until the Container Network Interface (CNI) plugin (e.g., Calico) is fully initialized and ready.

The high-level steps are:
1. The node is bootstrapped with a [startup taint](https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/) `readiness.k8s.io/NetworkReady=pending:NoSchedule` immediately upon joining (see the kubelet sketch after this list).
2. A sidecar is added to the CNI agent's DaemonSet to monitor the CNI's health and report it to the API server as a node condition (`network.k8s.io/CalicoReady`).
3. The Node Readiness Controller removes the taint only when the CNI reports that it is ready.

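One way to apply the startup taint at registration time is the kubelet's `registerWithTaints` setting. This is only a sketch, assuming you manage the kubelet configuration directly; managed distributions and provisioning tools usually expose their own option for startup taints.

```yaml
# Sketch: fragment of a KubeletConfiguration that registers the node
# with the startup taint. Merge this into your existing kubelet config.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
registerWithTaints:
- key: "readiness.k8s.io/NetworkReady"
  value: "pending"
  effect: "NoSchedule"
```
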
## Step-by-Step Guide

This example uses **Calico**, but the pattern applies to any CNI.

> **Note**: You can find all the manifests used in this guide in the [`examples/cni-readiness`](https://github.com/kubernetes-sigs/node-readiness-controller/tree/main/examples/cni-readiness) directory.

### 1. Deploy the Readiness Condition Reporter

We need to bridge Calico's internal health status to a Kubernetes Node Condition. We will add a **sidecar container** to the Calico DaemonSet.

This sidecar checks Calico's local health endpoint (`http://localhost:9099/readiness`) and updates a node condition `network.k8s.io/CalicoReady`.

**Patch your Calico DaemonSet:**

```yaml
# cni-patcher-sidecar.yaml
- name: cni-status-patcher
  image: registry.k8s.io/node-readiness-controller/node-readiness-reporter:v0.1.1
  imagePullPolicy: IfNotPresent
  env:
  - name: NODE_NAME
    valueFrom:
      fieldRef:
        fieldPath: spec.nodeName
  - name: CHECK_ENDPOINT
    value: "http://localhost:9099/readiness" # update to your CNI health endpoint
  - name: CONDITION_TYPE
    value: "network.k8s.io/CalicoReady" # update to your node condition
  - name: CHECK_INTERVAL
    value: "15s"
  resources:
    limits:
      cpu: "10m"
      memory: "32Mi"
    requests:
      cpu: "10m"
      memory: "32Mi"
```
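
One way to apply the patch is with a strategic merge patch. This is only a sketch: `calico-sidecar-patch.yaml` is a hypothetical file that nests the container above under `spec.template.spec.containers`, and the DaemonSet is assumed to be `calico-node` in `kube-system`.

```sh
# Sketch: add the sidecar to the Calico DaemonSet via a strategic merge patch.
# calico-sidecar-patch.yaml is a hypothetical wrapper that places the container
# above under spec.template.spec.containers.
kubectl -n kube-system patch daemonset calico-node \
  --patch-file calico-sidecar-patch.yaml
```
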
> **Note**: In this example the CNI pod's health is monitored by a sidecar, so the watcher's lifecycle is tied to the pod's lifecycle. If the Calico pod is crashlooping, the sidecar will not run and cannot report readiness. For robust, continuous readiness reporting, the watcher should run externally to the pod.

### 2. Grant Permissions (RBAC)

The sidecar needs permission to update the Node object's status.

```yaml
# calico-rbac-node-status-patch-role.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: node-status-patch-role
rules:
- apiGroups: [""]
  resources: ["nodes/status"]
  verbs: ["patch", "update"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: calico-node-status-patch-binding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: node-status-patch-role
subjects:
# Bind to CNI's ServiceAccount
- kind: ServiceAccount
  name: calico-node
  namespace: kube-system
```
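
After applying the RBAC, you can sanity-check the binding (assuming the ServiceAccount bound above, `calico-node` in `kube-system`):

```sh
# Should print "yes" once the ClusterRoleBinding is in place.
kubectl auth can-i patch nodes/status \
  --as=system:serviceaccount:kube-system:calico-node
```
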

### 3. Create the Node Readiness Rule

Now define the rule that enforces the requirement. This tells the controller: *"Keep the `readiness.k8s.io/NetworkReady` taint on the node until `network.k8s.io/CalicoReady` is True."*

```yaml
# network-readiness-rule.yaml
apiVersion: readiness.node.x-k8s.io/v1alpha1
kind: NodeReadinessRule
metadata:
  name: network-readiness-rule
spec:
  # The condition(s) to monitor
  conditions:
  - type: "network.k8s.io/CalicoReady"
    requiredStatus: "True"

  # The taint to manage
  taint:
    key: "readiness.k8s.io/NetworkReady"
    effect: "NoSchedule"
    value: "pending"

  # "bootstrap-only" means: once the CNI has reported ready, enforcement stops for that node.
  enforcementMode: "bootstrap-only"

  # Update to target only the nodes that need to be protected by this guardrail
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/worker: ""
```
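
The `nodeSelector` above only matches nodes that carry the worker role label. If your distribution does not set it automatically, you can add it yourself (hypothetical node name):

```sh
# Label a node so the rule's nodeSelector matches it.
kubectl label node worker-1 node-role.kubernetes.io/worker=
```
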

## Test scripts

1. **Create the Readiness Rule**:
   ```sh
   cd examples/cni-readiness
   kubectl apply -f network-readiness-rule.yaml
   ```

2. **Install Calico CNI and Apply the RBAC**:
   ```sh
   chmod +x apply-calico.sh
   sh apply-calico.sh
   ```

## Verification

To test this, add a new node to the cluster (example commands follow this list).

1. **Check the Node Taints**:
   Immediately upon joining, the node should have the taint:
   `readiness.k8s.io/NetworkReady=pending:NoSchedule`.

2. **Check Node Conditions**:
   Watch the node conditions. You will initially see `network.k8s.io/CalicoReady` as `False` or missing.
   Once Calico starts, the sidecar will update it to `True`.

3. **Check Taint Removal**:
   As soon as the condition becomes `True`, the Node Readiness Controller will remove the taint, and workloads will be scheduled.
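
A minimal set of commands to follow the flow, assuming a hypothetical node named `worker-1`:

```sh
NODE=worker-1  # hypothetical node name

# 1. The startup taint should be present right after the node joins.
kubectl get node "$NODE" -o jsonpath='{.spec.taints}'

# 2. Watch the reported condition; it should flip to True once Calico is healthy.
kubectl get node "$NODE" \
  -o jsonpath='{.status.conditions[?(@.type=="network.k8s.io/CalicoReady")].status}'

# 3. Once the condition is True, the controller should have removed the taint.
kubectl describe node "$NODE" | grep -A1 Taints
```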