
Commit f6c2a43

Merge branch 'kubernetes-sigs:main' into main
2 parents: c978f55 + 21d133a

20 files changed: +572 −27 lines

.github/pull_request_template.md

Lines changed: 62 additions & 0 deletions
@@ -0,0 +1,62 @@
<!-- Thanks for sending a pull request! Here are some tips for you:

- If this is your first time, please read our contributor guidelines: https://git.k8s.io/community/contributors/guide#your-first-contribution and the developer guide: https://git.k8s.io/community/contributors/devel/development.md#development-guide
- If you want *faster* PR reviews, read how: https://git.k8s.io/community/contributors/guide/pull-requests.md#best-practices-for-faster-reviews
- If the PR is unfinished, see how to mark it: https://git.k8s.io/community/contributors/guide/pull-requests.md#marking-unfinished-pull-requests
-->

## Description
<!-- Brief description of changes -->

## Related Issue
<!--
Automatically closes the linked issue when the PR is merged.
Usage: `Fixes #<issue number>`, or `Fixes (paste link of issue)`.

Fixes #

or

None
-->

## Type of Change

<!--
Add one of the following kinds:
/kind bug
/kind cleanup
/kind documentation
/kind feature
/kind design

Optionally add one or more of the following kinds if applicable:
/kind api-change
/kind deprecation
/kind failing-test
/kind flake
/kind regression
-->

## Testing
<!-- How was this tested? -->

## Checklist
- [ ] `make test` passes
- [ ] `make lint` passes

## Does this PR introduce a user-facing change?

<!--
If no, just write "NONE" in the release-note block below.
If yes, a release note is required:
1. Enter your extended release note in the block below. If the PR requires additional action from users switching to the new release, include the string "action required".
2. Add 'Doc #(issue)' after the block if there is a follow-up.
For more information on release notes see: https://git.k8s.io/community/contributors/guide/release-notes.md
-->

```release-note

```
Doc #(issue)

config/manager/manager.yaml

Lines changed: 1 addition & 0 deletions
@@ -55,6 +55,7 @@ spec:
       #   operator: In
       #   values:
       #   - platform # Modify this value based on your platform node labels.
+      priorityClassName: system-cluster-critical
       nodeSelector:
         "node-role.kubernetes.io/control-plane": ""
       tolerations:

demo.tape

Lines changed: 26 additions & 7 deletions
@@ -1,18 +1,34 @@
 # Run `vhs demo.tape` from rootdir to recreate demo
 # Output file
-Output docs/demo.gif
+# Output docs/demo.gif
+
+# record a video demo
+Output demo.mp4

 # Terminal styling
-Set FontSize 16
-Set Width 1200
-Set Height 600
+# Set FontSize 16
+# Set Width 1200
+# Set Height 600
+# Set Theme "GitHub Dark"
+
+# optimal dimensions for better YT quality
+Set Width 1920
+Set Height 1080
+Set FontSize 22
 Set Theme "GitHub Dark"
+Set FontFamily "Fira Code"
+Set LetterSpacing 0
+Set LineHeight 1.2
+Set Framerate 30
+Set Padding 60

 # Typing speed configuration
 Set TypingSpeed 35ms

 # --- SETUP ---
 Hide
+Type "export PS1='\[\e[32m\]admin\$\[\e[0m\] '"
+Enter
 # Expect the kind cluster is already created following docs/TEST_README.md
 Type "kubectl config use-context kind-nrr-test"
 Enter
@@ -52,17 +68,20 @@ Show
 Sleep 2s
 Enter
 Type@15ms `kubectl get node nrr-test-worker -o custom-columns="NAME:.metadata.name,STATUS:.status.conditions[?(@.type=='network.k8s.io/CalicoReady')].status,TAINTS:.spec.taints[*].key"`
-Enter 1
+Enter 2
 Sleep 3s

 # 3. Show the Rule enforcing this
 Hide
-Enter 2
+Type "clear"
+Enter
+Show
+Enter 1
 Type "# 2. Why? The NodeReadinessRule requires 'network.k8s.io/CalicoReady'"
 Show
 Sleep 2s
 Enter
-Type "cat examples/network-readiness-rule.yaml"
+Type "cat examples/cni-readiness/network-readiness-rule.yaml"
 Enter 1
 Sleep 5s

docs/book/src/SUMMARY.md

Lines changed: 4 additions & 5 deletions
@@ -11,8 +11,7 @@

 # Examples

-<!-- - [Integration Patterns](./examples/integration-patterns.md) -->
-<!-- - [CNI Installation](./examples/cni-readiness.md) -->
+- [CNI Installation](./examples/cni-readiness.md)
 <!-- - [Storage Drivers](./examples/storage-readiness.md) -->
 - [Security Agent](./examples/security-agent-readiness.md)
 <!-- - [Device Drivers](./examples/dra-readiness.md) -->
@@ -21,18 +20,18 @@

 - [Release Notes](./releases.md)

-# Operations
+<!-- # Operations -->

 <!-- - [Monitoring](./operations/monitoring.md) -->
 <!-- - [Troubleshooting](./operations/troubleshooting.md) -->
 <!-- - [Security](./operations/security.md) -->

-# Development
+<!-- # Development -->

 <!-- - [Architecture](./development/architecture.md) [TODO] high level components involved -->
 <!-- - [Testing](./development/testing.md) -- [TODO] Migrate TEST_README.md here -->

-# Design
+<!-- # Design -->

 <!-- - [Controller Internals](./design/controller-internals/overview.md) -->
 <!-- - [Node Events](./design/controller-internals/node-events.md) -->
docs/book/src/examples/cni-readiness.md

Lines changed: 146 additions & 0 deletions
@@ -0,0 +1,146 @@
# CNI Readiness

In many Kubernetes clusters, the CNI plugin runs as a DaemonSet. When a new node joins the cluster, there is a race condition:

1. The Node object is created and marked `Ready` by the kubelet.
2. The scheduler sees the node as `Ready` and schedules application pods.
3. However, the CNI DaemonSet might still be initializing networking on that node.

This guide demonstrates how to use the Node Readiness Controller to prevent pods from being scheduled on a node until the Container Network Interface (CNI) plugin (e.g., Calico) is fully initialized and ready.

The high-level steps are:

1. The node is bootstrapped with a [startup taint](https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/) `readiness.k8s.io/NetworkReady=pending:NoSchedule` immediately upon joining.
2. A sidecar is patched into the CNI agent to monitor the CNI's health and report it to the API server as a node condition (`network.k8s.io/CalicoReady`).
3. The Node Readiness Controller untaints the node only when the CNI reports it is ready.

## Step-by-Step Guide

This example uses **Calico**, but the pattern applies to any CNI.

> **Note**: You can find all the manifests used in this guide in the [`examples/cni-readiness`](https://github.com/kubernetes-sigs/node-readiness-controller/tree/main/examples/cni-readiness) directory.

### 1. Deploy the Readiness Condition Reporter

We need to bridge Calico's internal health status to a Kubernetes node condition. We will add a **sidecar container** to the Calico DaemonSet.

This sidecar checks Calico's local health endpoint (`http://localhost:9099/readiness`) and updates the node condition `network.k8s.io/CalicoReady`.

**Patch your Calico DaemonSet:**
```yaml
# cni-patcher-sidecar.yaml
- name: cni-status-patcher
  image: registry.k8s.io/node-readiness-controller/node-readiness-reporter:v0.1.1
  imagePullPolicy: IfNotPresent
  env:
  - name: NODE_NAME
    valueFrom:
      fieldRef:
        fieldPath: spec.nodeName
  - name: CHECK_ENDPOINT
    value: "http://localhost:9099/readiness" # update to your CNI health endpoint
  - name: CONDITION_TYPE
    value: "network.k8s.io/CalicoReady" # update this node condition
  - name: CHECK_INTERVAL
    value: "15s"
  resources:
    limits:
      cpu: "10m"
      memory: "32Mi"
    requests:
      cpu: "10m"
      memory: "32Mi"
```

> **Note**: In this example, the CNI pod's health is monitored by a sidecar, so the watcher's lifecycle is the same as the pod's lifecycle. If the Calico pod is crashlooping, the sidecar will not run and cannot report readiness. For robust, continuous readiness reporting, the watcher should be external to the pod.

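The reporter image above is a prebuilt component, and its internals are not shown in this guide. As a rough illustration of what such a reporter does (a hypothetical Python sketch, not the actual reporter code; the function names are invented), it polls the health endpoint and renders the strategic-merge patch it would send to the node's `status` subresource:

```python
import json
import urllib.request
from datetime import datetime, timezone


def probe(endpoint: str) -> str:
    """Return "True" if the health endpoint answers 200 OK, else "False"."""
    try:
        with urllib.request.urlopen(endpoint, timeout=2) as resp:
            return "True" if resp.status == 200 else "False"
    except OSError:
        return "False"


def build_status_patch(condition_type: str, status: str) -> str:
    """Render the body a reporter would PATCH to /api/v1/nodes/<NODE_NAME>/status."""
    condition = {
        "type": condition_type,
        "status": status,
        "reason": "CNIHealthProbe",
        "message": "reported by cni-status-patcher sidecar",
        "lastHeartbeatTime": datetime.now(timezone.utc).isoformat(),
    }
    return json.dumps({"status": {"conditions": [condition]}})


if __name__ == "__main__":
    # CHECK_ENDPOINT and CONDITION_TYPE correspond to the env vars above.
    status = probe("http://localhost:9099/readiness")
    print(build_status_patch("network.k8s.io/CalicoReady", status))
```

In the real sidecar, this check repeats every `CHECK_INTERVAL`, and the patch is sent to the API server using the pod's ServiceAccount credentials, which is why the RBAC grant in the next step is required.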
### 2. Grant Permissions (RBAC)

The sidecar needs permission to update the Node object's status.

```yaml
# calico-rbac-node-status-patch-role.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: node-status-patch-role
rules:
- apiGroups: [""]
  resources: ["nodes/status"]
  verbs: ["patch", "update"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: calico-node-status-patch-binding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: node-status-patch-role
subjects:
# Bind to the CNI's ServiceAccount
- kind: ServiceAccount
  name: calico-node
  namespace: kube-system
```

### 3. Create the Node Readiness Rule

Now define the rule that enforces the requirement. This tells the controller: *"Keep the `readiness.k8s.io/NetworkReady` taint on the node until `network.k8s.io/CalicoReady` is True."*

```yaml
# network-readiness-rule.yaml
apiVersion: readiness.node.x-k8s.io/v1alpha1
kind: NodeReadinessRule
metadata:
  name: network-readiness-rule
spec:
  # The condition(s) to monitor
  conditions:
  - type: "network.k8s.io/CalicoReady"
    requiredStatus: "True"

  # The taint to manage
  taint:
    key: "readiness.k8s.io/NetworkReady"
    effect: "NoSchedule"
    value: "pending"

  # "bootstrap-only" means: once the CNI is ready once, we stop enforcing.
  enforcementMode: "bootstrap-only"

  # Update to target only the nodes that need to be protected by this guardrail
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/worker: ""
```

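The comment on `enforcementMode` above describes the intended semantics. As an illustrative sketch of bootstrap-only enforcement (hypothetical Python, not the controller's actual implementation; the function and the "continuous" mode name are assumptions for this example), the taint decision can be thought of as:

```python
def taint_required(condition_status: str, was_ready_once: bool,
                   enforcement_mode: str) -> bool:
    """Decide whether the readiness taint should be present on a node.

    Illustrative semantics only:
    - "bootstrap-only": enforce until the condition first becomes True,
      then stop enforcing even if the condition later degrades.
    - "continuous" (assumed name for the alternative): always mirror
      the condition.
    """
    if enforcement_mode == "bootstrap-only" and was_ready_once:
        return False
    return condition_status != "True"


# Under bootstrap-only enforcement, a node that was ready once keeps
# running workloads even if the condition flaps back to False.
print(taint_required("False", True, "bootstrap-only"))   # False: taint stays off
print(taint_required("False", False, "bootstrap-only"))  # True: still bootstrapping
```

This is why bootstrap-only mode suits CNI installation: the taint guards the initial bring-up window, and a later transient health blip does not evict or block already-running workloads.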
## Test scripts

1. **Create the Readiness Rule**:

   ```sh
   cd examples/cni-readiness
   kubectl apply -f network-readiness-rule.yaml
   ```

2. **Install Calico CNI and Apply the RBAC**:

   ```sh
   chmod +x apply-calico.sh
   sh apply-calico.sh
   ```

## Verification

To test this, add a new node to the cluster.

1. **Check the Node Taints**:
   Immediately upon joining, the node should have the taint
   `readiness.k8s.io/NetworkReady=pending:NoSchedule`.

2. **Check Node Conditions**:
   Watch the node conditions. You will initially see `network.k8s.io/CalicoReady` as `False` or missing.
   Once Calico starts, the sidecar will update it to `True`.

3. **Check Taint Removal**:
   As soon as the condition becomes `True`, the Node Readiness Controller will remove the taint, and workloads will be scheduled.

docs/book/src/introduction.md

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 # Node Readiness Controller

-<img style="float: right; margin: auto;" width="180px" src="https://raw.githubusercontent.com/kubernetes-sigs/node-readiness-controller/main/docs/logo/node-readiness-controller-logo.svg"/>
+<img style="float: right; margin: auto;" width="180px" src="/logo/node-readiness-controller-logo.svg"/>

 A Kubernetes controller that provides fine-grained, declarative readiness for nodes. It ensures nodes only accept workloads when all required components (e.g., network agents, GPU drivers, storage drivers, or custom health-checks) are fully ready on the node.

docs/book/src/logo/node-readiness-controller-logo.svg

Lines changed: 34 additions & 0 deletions
(Binary/SVG content not shown)

docs/book/src/user-guide/installation.md

Lines changed: 8 additions & 0 deletions
@@ -26,6 +26,14 @@ kubectl apply -f https://github.com/kubernetes-sigs/node-readiness-controller/re

 This will deploy the controller into the `nrr-system` namespace on any available node in your cluster.

+#### Controller priority
+
+The controller is deployed with the `system-cluster-critical` priority class to prevent eviction during node resource pressure.
+
+If the controller were evicted under resource pressure, nodes could not transition to the Ready state, blocking all workload scheduling cluster-wide.
+
+This is the same priority class used by other critical cluster components (e.g., CoreDNS).

 **Images**: The official releases use multi-arch images (AMD64, ARM64).

 ### Option 2: Deploy Using Kustomize

docs/demo/demo.mp4

945 KB (binary file not shown)

docs/demo/kind_demo.webm

-23.1 MB (binary file not shown)
