4 changes: 4 additions & 0 deletions api/v1alpha1/nodereadinessrule_types.go
@@ -69,6 +69,10 @@ type NodeReadinessRuleSpec struct {
// taint defines the specific Taint (Key, Value, and Effect) to be managed
// on Nodes that meet the defined condition criteria.
//
// Supported effects: NoSchedule, PreferNoSchedule, NoExecute.
// Caution: NoExecute evicts existing pods and can cause significant disruption
// when combined with continuous enforcement mode. Prefer NoSchedule for most use cases.
//
// +required
Taint corev1.Taint `json:"taint,omitempty,omitzero"`

@@ -157,6 +157,10 @@ spec:
description: |-
taint defines the specific Taint (Key, Value, and Effect) to be managed
on Nodes that meet the defined condition criteria.
Supported effects: NoSchedule, PreferNoSchedule, NoExecute.
Caution: NoExecute evicts existing pods and can cause significant disruption
when combined with continuous enforcement mode. Prefer NoSchedule for most use cases.
properties:
effect:
description: |-
2 changes: 1 addition & 1 deletion docs/book/src/SUMMARY.md
@@ -6,7 +6,7 @@

- [Core Concepts](./user-guide/concepts.md)
- [Installation](./user-guide/installation.md)
<!-- - [Creating Rules](./user-guide/creating-rules.md) -->
- [Getting Started](./user-guide/getting-started.md)
<!-- - [Quickstart](./user-guide/quickstart.md) -- [TODO] How to run the Calico example in KIND cluster -->

# Examples
222 changes: 87 additions & 135 deletions docs/getting-started.md → docs/book/src/user-guide/getting-started.md
@@ -1,6 +1,10 @@

## Getting Started

This guide covers creating and configuring `NodeReadinessRule` resources.

> **Prerequisites**: Node Readiness Controller must be installed. See [Installation](./installation.md).

### API Spec

#### Example: Storage Readiness Rule (Bootstrap-only)
@@ -11,17 +15,18 @@ This rule ensures nodes have working storage before removing the storage readiness
apiVersion: readiness.node.x-k8s.io/v1alpha1
kind: NodeReadinessRule
metadata:
name: storage-readiness-rule
name: nfs-storage-readiness-rule
spec:
conditions:
- type: "storage.kubernetes.io/CSIReady"
requiredStatus: "True"
- type: "storage.kubernetes.io/VolumePluginReady"
requiredStatus: "True"
conditions:
- type: "csi.example.net/NodePluginRegistered"
requiredStatus: "True"
- type: "csi.example.net/BackendReachable"
requiredStatus: "True"
- type: "DiskPressure"
requiredStatus: "False"
taint:
key: "readiness.k8s.io/StorageReady"
key: "readiness.k8s.io/vendor.com/nfs-unhealthy"
effect: "NoSchedule"
value: "pending"
enforcementMode: "bootstrap-only"
nodeSelector:
matchLabels:
@@ -43,82 +48,19 @@ spec:
| `nodeSelector` | Label selector to target specific nodes | No |
| `dryRun` | Preview changes without applying them | No |
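
Putting these fields together, here is a minimal sketch that also exercises the optional `nodeSelector` and `dryRun` fields; the condition type, taint key, and labels are illustrative placeholders, not canonical values:

```sh
kubectl apply -f - <<'EOF'
apiVersion: readiness.node.x-k8s.io/v1alpha1
kind: NodeReadinessRule
metadata:
  name: example-dry-run-rule
spec:
  conditions:
    - type: "csi.example.net/NodePluginRegistered"   # illustrative condition type
      requiredStatus: "True"
  taint:
    key: "readiness.k8s.io/example.net/storage-not-ready"
    effect: "NoSchedule"
  enforcementMode: "bootstrap-only"
  nodeSelector:              # optional: restrict the rule to matching nodes
    matchLabels:
      node-role.kubernetes.io/worker: ""
  dryRun: true               # optional: preview taint changes without applying them
EOF
```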

### Deployment

#### Option 1: Install official release images

Node-Readiness Controller offers two variants of the container image to support different cluster architectures.

Released container images are available for:
* **x86_64** (AMD64)
* **Arm64** (AArch64)

The controller image is available in the Kubernetes staging registry:

```sh
REPO="us-central1-docker.pkg.dev/k8s-staging-images/node-readiness-controller/node-readiness-controller"

TAG=$(skopeo list-tags docker://$REPO | jq '.Tags[-1]' | tr -d '"')

docker pull $REPO:$TAG
```

#### Option 2: Deploy Using Make Commands

**Build and push your image to the location specified by `IMG_PREFIX`:`IMG_TAG`:**

```sh
make docker-build docker-push IMG_PREFIX=<some-registry>/nrr-controller IMG_TAG=tag
```

```sh
# Install the CRDs
make install

# Deploy the controller
make deploy IMG_PREFIX=<some-registry>/nrr-controller IMG_TAG=tag

# Create sample rules
kubectl apply -f examples/network-readiness-rule.yaml
```

#### Option 3: Deploy Using Kustomize Directly

```sh
# Install CRDs
kubectl apply -k config/crd

# Deploy controller and RBAC
kubectl apply -k config/default

# Create sample rules
kubectl apply -f examples/network-readiness-rule.yaml
```

### Uninstallation

> **Important**: Follow this order to avoid stuck resources due to finalizers.

The controller adds a finalizer (`readiness.node.x-k8s.io/cleanup-taints`) to each `NodeReadinessRule` to ensure node taints are cleaned up before the rule is deleted. This means you must delete CRs **while the controller is still running**.

```sh
# 1. Delete all rule instances first (while controller is running)
kubectl delete nodereadinessrules --all

# 2. Delete the controller
make undeploy

# 3. Delete the CRDs
make uninstall
```

#### Recovering from Stuck Resources
### Enforcement Modes

If you deleted the controller before removing the CRs, the finalizer will block CR deletion. To recover, manually remove the finalizer:
#### Bootstrap-only Mode
- Removes bootstrap taint when conditions are first satisfied
- Marks completion with node annotation
- Stops monitoring after successful removal (fail-safe)
- Ideal for one-time setup conditions (e.g., installing node daemons such as a security agent, or applying a kernel-module update)

```sh
kubectl patch nodereadinessrule <rule-name> -p '{"metadata":{"finalizers":[]}}' --type=merge
```
#### Continuous Mode
- Continuously monitors conditions
- Adds taint when any condition becomes unsatisfied
- Removes taint when all conditions become satisfied
- Ideal for ongoing health monitoring (network connectivity, resource availability)
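
For instance, a continuous rule that gates scheduling on CNI health might look like the following sketch (the condition type and taint key are placeholders for whatever your network agent actually reports):

```sh
kubectl apply -f - <<'EOF'
apiVersion: readiness.node.x-k8s.io/v1alpha1
kind: NodeReadinessRule
metadata:
  name: network-health-rule
spec:
  conditions:
    - type: "cniplugin.example.net/NetworkReady"   # placeholder condition type
      requiredStatus: "True"
  taint:
    key: "readiness.k8s.io/example.net/network-not-ready"
    effect: "NoSchedule"   # NoSchedule avoids evicting running pods on flaps
  enforcementMode: "continuous"
EOF
```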

## Operations
> **Contributor:** I think Operations + Monitoring could go onto its own page including the metrics improvement we handled recently. We could leave this for a follow up #94

@@ -151,7 +93,7 @@ Test rules safely before applying:
spec:
dryRun: true # Enable dry run mode
conditions:
- type: "storage.kubernetes.io/CSIReady"
- type: "csi.example.net/NodePluginRegistered"
requiredStatus: "True"
# ... rest of spec
```
@@ -162,28 +104,73 @@ Check dry run results:
kubectl get nodereadinessrule <rule-name> -o jsonpath='{.status.dryRunResults}'
```

### Enforcement Modes
### Rule Validation and Constraints

#### Bootstrap-only Mode
- Removes bootstrap taint when conditions are first satisfied
- Marks completion with node annotation
- Stops monitoring after successful removal (fail-safe)
- Ideal for one-time setup conditions (e.g., storage setup, installing node daemons such as a security agent, or applying a kernel-module update)
#### NoExecute Taint Effect Warning

#### Continuous Mode
- Continuously monitors conditions
- Adds taint when any condition becomes unsatisfied
- Removes taint when all conditions become satisfied
- Ideal for ongoing health monitoring (network connectivity, resource availability)
**`NoExecute` with `continuous` enforcement mode will evict existing workloads when conditions fail.**

## Configuration
If a readiness condition on the node fails temporarily (e.g., the component restarted) and the rule is configured with a `NoExecute` taint, all pods without matching tolerations are immediately evicted from the node. Use `NoSchedule` to prevent new scheduling without disrupting running workloads.
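
For components that must keep running (or start) while the taint is present, typically the very agent that remediates the failing condition, add a matching toleration to their pod spec. A minimal sketch with illustrative names and image:

```sh
kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nfs-agent            # illustrative remediation agent
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: nfs-agent
  template:
    metadata:
      labels:
        app: nfs-agent
    spec:
      tolerations:
        - key: "readiness.k8s.io/vendor.com/nfs-unhealthy"  # taint key from the rule above
          operator: "Exists"
          effect: "NoExecute"
      containers:
        - name: agent
          image: registry.example.com/nfs-agent:latest      # illustrative image
EOF
```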

The admission webhook warns when using `NoExecute`.
> **@ajaysundark (Contributor, Feb 16, 2026):** Did you get a chance to test your changes w/ warnings? You may need to follow this PR: #122 for testing the webhook.
>
> nit / optional: If you get a chance to verify it, you could also add the warning here as response to kubectl.
See [Kubernetes taints documentation](https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/) for taint behavior details.
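
Assuming the webhook is deployed, the warning surfaces directly in the kubectl response, roughly like this (illustrative session; the message text is taken from the webhook change below):

```sh
$ kubectl apply -f noexecute-rule.yaml
Warning: CAUTION: Using NoExecute taint effect with continuous enforcement mode. This configuration will evict existing pods when conditions fail, which may cause workload disruption if conditions are unstable. Consider using NoSchedule effect instead, or bootstrap-only enforcement mode. See: https://node-readiness-controller.sigs.k8s.io/user-guide/getting-started.html
nodereadinessrule.readiness.node.x-k8s.io/noexecute-rule created
```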

#### Avoiding Taint Key Conflicts

The admission webhook prevents multiple rules from using the same `taint.key` and `taint.effect` on overlapping node selectors.

**Example conflict:**

> **Contributor:** let us also include conditions to show the risks / conflicts

```yaml
# Rule 1
spec:
  conditions:
    - type: "device.gpu-vendor.net/DevicePluginRegistered"
      requiredStatus: "True"
  nodeSelector:
    matchLabels:
      feature.node.kubernetes.io/pci-10de.present: "true"
  taint:
    key: "readiness.k8s.io/vendor.com/gpu-not-ready"
    effect: "NoSchedule"

# Rule 2 - This will be REJECTED
spec:
  conditions:
    - type: "cniplugin.example.net/rdma/NetworkReady"
      requiredStatus: "True"
  nodeSelector:
    matchLabels:
      feature.node.kubernetes.io/pci-10de.present: "true"
  taint:
    key: "readiness.k8s.io/vendor.com/gpu-not-ready" # Same (taint key + effect) but different conditions = conflict
    effect: "NoSchedule"
```

Use unique, descriptive taint keys for different readiness checks.

#### Taint Key Naming

Follow [Kubernetes naming conventions](https://kubernetes.io/docs/concepts/overview/working-with-objects/names/).

Taint keys must use the `readiness.k8s.io/` prefix to clearly identify readiness-related taints and avoid conflicts with other controllers.

### Security
**Valid:**
```yaml
taint:
key: "readiness.k8s.io/vendor.com/network-not-ready"
key: "readiness.k8s.io/vendor.com/gpu-not-ready"
```

The controller requires the following RBAC permissions:
- **Nodes**: `get`, `list`, `watch`, `patch`, `update` (for taint management)
- **NodeReadinessRules**: Full CRUD access
- **Events**: `create` (for status reporting)
**Invalid:**
```yaml
taint:
key: "network-ready" # Missing prefix
key: "node.kubernetes.io/ready" # Wrong prefix
```
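
Before picking a key, it can also help to audit which taint keys are already in use on the cluster, for example:

```sh
# List every node together with the keys of its current taints
kubectl get nodes -o custom-columns='NODE:.metadata.name,TAINT-KEYS:.spec.taints[*].key'
```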


## Configuration

### Performance and Scalability

@@ -211,38 +198,3 @@ conditions:
- type: "readiness.k8s.io/mycompany.example.com/CacheWarmed"
requiredStatus: "True"
```
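
The controller only consumes these conditions; something on the node (or a central agent) has to report them. Here is a sketch of how an agent could do that with plain kubectl (requires the `--subresource` flag, kubectl v1.24+; the condition type and timestamps are illustrative):

```sh
# Report a custom readiness condition on a node.
# The default strategic merge patch merges node conditions by "type",
# so other conditions are left untouched.
kubectl patch node "$NODE_NAME" --subresource=status -p '{
  "status": {
    "conditions": [{
      "type": "readiness.k8s.io/mycompany.example.com/CacheWarmed",
      "status": "True",
      "reason": "CacheWarmed",
      "message": "local cache warmed",
      "lastHeartbeatTime": "2026-01-01T00:00:00Z",
      "lastTransitionTime": "2026-01-01T00:00:00Z"
    }]
  }
}'
```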

#### With Cluster Autoscaler
The NodeReadinessController works well with cluster autoscaling:
- New nodes start with restrictive taints
- Controller removes taints once conditions are satisfied
- Autoscaler can safely scale knowing nodes are truly ready

## Development

### Building from Source

```sh
# Clone the repository
git clone https://sigs.k8s.io/node-readiness-controller.git
cd node-readiness-controller

# Run tests
make test

# Build binary
make build

# Generate manifests
make manifests
```

### Running Locally

```sh
# Install CRDs
make install

# Run against cluster (requires KUBECONFIG)
make run
```
13 changes: 8 additions & 5 deletions docs/book/src/user-guide/installation.md
@@ -6,8 +6,6 @@ Follow this guide to install the Node Readiness Controller in your Kubernetes cluster.

### Option 1: Install Official Release (Recommended)
> **Contributor:** we could keep this as well
The easiest way to get started is by applying the official release manifests.

First, to install the CRDs, apply the `crds.yaml` manifest:

```sh
# ...
```
@@ -34,11 +32,16 @@ If it gets evicted during resource pressure, nodes can't transition to Ready state.

This is the priority class used by other critical cluster components (e.g., CoreDNS).

**Images**: The official releases use multi-arch images (AMD64, Arm64).
#### Images

### Option 2: Deploy Using Kustomize

> **Contributor:** please revert to existing, let us keep the kustomize option
The official releases use multi-arch images (AMD64, Arm64) and are available at `registry.k8s.io/node-readiness-controller/node-readiness-controller`.

If you have cloned the repository and want to deploy from source, you can use Kustomize.
> **Contributor:** we already have installation flows captured: simple consume from official release images, and using kustomize for advanced setup. I think documenting 3 different installation paths would just be confusing the reader. Maybe we could hold on to this for now and come back to updating the installation section for #94.
>
> **Author:** makes sense, I've simplified the installation method for now, we can re-visit later as per your suggestion.
>
> **Contributor:** I think we could skip this detail

```sh
REPO="registry.k8s.io/node-readiness-controller/node-readiness-controller"
TAG=$(skopeo list-tags docker://$REPO | jq '.Tags[-1]' | tr -d '"')
docker pull $REPO:$TAG
```
### Option 2: Deploy Using Kustomize

```sh
# 1. Install Custom Resource Definitions (CRDs)
# ...
```
40 changes: 38 additions & 2 deletions internal/webhook/nodereadinessgaterule_webhook.go
@@ -20,6 +20,7 @@ import (
"context"
"fmt"

corev1 "k8s.io/api/core/v1"
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
"k8s.io/apimachinery/pkg/runtime"
"k8s.io/apimachinery/pkg/util/validation/field"
@@ -165,6 +166,35 @@ func (w *NodeReadinessRuleWebhook) nodSelectorsOverlap(selector1, selector2 meta
return sel1.String() == sel2.String()
}

// generateNoExecuteWarnings generates admission warnings for NoExecute taint usage.
// NoExecute taints cause immediate pod eviction, which can be disruptive when
// used with continuous enforcement mode.
func (w *NodeReadinessRuleWebhook) generateNoExecuteWarnings(spec readinessv1alpha1.NodeReadinessRuleSpec) admission.Warnings {
var warnings admission.Warnings

if spec.Taint.Effect != corev1.TaintEffectNoExecute {
return warnings
}

// NoExecute with continuous mode is particularly risky
if spec.EnforcementMode == readinessv1alpha1.EnforcementModeContinuous {
warnings = append(warnings,
"CAUTION: Using NoExecute taint effect with continuous enforcement mode. "+
> **Contributor:** Thanks for also handling API warnings. Though these are nice to have, there seem to be length restrictions on processing the admission warnings. ref: https://kubernetes.io/docs/reference/access-authn-authz/extensible-admission-controllers/#response
>
> Could you check if this satisfies the criteria?

"This configuration will evict existing pods when conditions fail, which may cause "+
"workload disruption if conditions are unstable. Consider using NoSchedule "+
"effect instead, or bootstrap-only enforcement mode. "+
"See: https://node-readiness-controller.sigs.k8s.io/user-guide/getting-started.html")
} else {
// NoExecute with bootstrap-only is less risky but still worth noting
warnings = append(warnings,
"NOTE: Using NoExecute taint effect. This will evict existing pods that do not "+
"tolerate this taint when applied. Ensure critical system pods have appropriate tolerations. "+
"See: https://node-readiness-controller.sigs.k8s.io/user-guide/getting-started.html")
}

return warnings
}

// SetupWithManager sets up the webhook with the manager.
func (w *NodeReadinessRuleWebhook) SetupWithManager(mgr ctrl.Manager) error {
return ctrl.NewWebhookManagedBy(mgr).
@@ -185,7 +215,10 @@ func (w *NodeReadinessRuleWebhook) ValidateCreate(ctx context.Context, obj runtime.Object) (admission.Warnings, error) {
if allErrs := w.validateNodeReadinessRule(ctx, rule, false); len(allErrs) > 0 {
return nil, fmt.Errorf("validation failed: %v", allErrs)
}
return nil, nil

// Generate warnings for NoExecute taint usage
warnings := w.generateNoExecuteWarnings(rule.Spec)
return warnings, nil
}

func (w *NodeReadinessRuleWebhook) ValidateUpdate(ctx context.Context, oldObj, newObj runtime.Object) (admission.Warnings, error) {
@@ -197,7 +230,10 @@ func (w *NodeReadinessRuleWebhook) ValidateUpdate(ctx context.Context, oldObj, newObj runtime.Object) (admission.Warnings, error) {
if allErrs := w.validateNodeReadinessRule(ctx, rule, true); len(allErrs) > 0 {
return nil, fmt.Errorf("validation failed: %v", allErrs)
}
return nil, nil

// Generate warnings for NoExecute taint usage
warnings := w.generateNoExecuteWarnings(rule.Spec)
return warnings, nil
}

func (w *NodeReadinessRuleWebhook) ValidateDelete(ctx context.Context, obj runtime.Object) (admission.Warnings, error) {