Description:
Goal
We propose to build a remediation loop that integrates Node Problem Detector (NPD), the Node Readiness Controller (NRC), and Descheduler to enable self-healing workloads based on node-level health signals.
This would allow the system to:
- Detect node-level issues via NPD.
- Automatically taint unhealthy nodes.
- Trigger pod eviction and rescheduling via Descheduler for workloads that do not tolerate the taint.
Proposed Architecture
- Detection (NPD)
  - NPD runs custom health checks (e.g., some hardware status).
  - On failure, it sets a custom NodeCondition (e.g., `CustomCondition/MyComponentReady=False`).
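For the detection step above, a minimal sketch of an NPD custom plugin monitor config is shown below. The check script path `/custom-plugins/check_my_component.sh` and the `source`/`reason` strings are placeholders, and since NPD reports a detected problem by setting the configured condition to `True`, the condition naming (or the readiness rule that consumes it) may need to be inverted relative to the wording above; please verify against the NPD documentation.

```json
{
  "plugin": "custom",
  "pluginConfig": {
    "invoke_interval": "30s",
    "timeout": "5s"
  },
  "source": "my-component-monitor",
  "conditions": [
    {
      "type": "MyComponentReady",
      "reason": "MyComponentIsHealthy",
      "message": "my component is functioning properly"
    }
  ],
  "rules": [
    {
      "type": "permanent",
      "condition": "MyComponentReady",
      "reason": "MyComponentIsUnhealthy",
      "path": "/custom-plugins/check_my_component.sh",
      "timeout": "3s"
    }
  ]
}
```

The check script exits 0 when the component is healthy and non-zero when it is not, and NPD updates the node's conditions accordingly.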
- Tainting (Node Readiness Controller)
  - A `NodeReadinessGateRule` watches the custom condition.
  - When the condition is not `True`, it adds a specific taint (e.g., `readiness.k8s.io/my-component-ready=false:NoSchedule`) to the node.
  - When the condition recovers, the taint is automatically removed.
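A readiness gate rule for that condition might look roughly like the following. This is a hypothetical sketch only: the group/version and field names (`conditionType`, `requiredStatus`, `taint`) are assumptions for illustration, not the actual `NodeReadinessGateRule` schema, which should be taken from this project's API.

```yaml
# Hypothetical sketch; field names are illustrative, not the real CRD schema.
apiVersion: readiness.k8s.io/v1alpha1   # assumed group/version
kind: NodeReadinessGateRule
metadata:
  name: my-component-ready
spec:
  # Condition published by NPD that this rule watches.
  conditionType: MyComponentReady
  # Any status other than "True" is treated as not ready.
  requiredStatus: "True"
  # Taint added while the node is not ready; removed once the condition recovers.
  taint:
    key: readiness.k8s.io/my-component-ready
    value: "false"
    effect: NoSchedule
```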
- Rescheduling (Descheduler)
  - Descheduler runs with the `RemovePodsViolatingNodeTaints` strategy.
  - It is configured with `includedTaints: ["readiness.k8s.io/my-component-ready"]` to act only on our custom taint.
  - Pods without a matching toleration are evicted and rescheduled by the default scheduler onto healthy nodes.
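On the Descheduler side, assuming a recent release that supports the `v1alpha2` policy API and the `includedTaints` argument for `RemovePodsViolatingNodeTaints`, the policy could look roughly like this:

```yaml
apiVersion: "descheduler/v1alpha2"
kind: "DeschedulerPolicy"
profiles:
  - name: node-taint-remediation
    pluginConfig:
      - name: "RemovePodsViolatingNodeTaints"
        args:
          # Act only on our custom readiness taint, not on every NoSchedule taint.
          includedTaints:
            - "readiness.k8s.io/my-component-ready"
    plugins:
      deschedule:
        enabled:
          - "RemovePodsViolatingNodeTaints"
```

Workloads that should keep running on a degraded node (for example, node-local agents) are not considered violating and are left alone because they tolerate the taint in their pod spec:

```yaml
tolerations:
  - key: "readiness.k8s.io/my-component-ready"
    operator: "Exists"
    effect: "NoSchedule"
```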
┌──────────────────────┐ ┌─────────────────────────────┐
│ Node Problem │ │ Node Readiness │
│ Detector (NPD) │ │ Controller (NRC) │
└──────────────────────┘ └─────────────────────────────┘
│ ▲
│ Detects hardware/daemon │ Watches NodeCondition
│ failure & sets condition │
▼ │
┌───────────────────────────────────────────────────────────┐
│ Node Condition: CustomCondition/MyComponentReady=False │
└───────────────────────────────────────────────────────────┘
│
│ Triggers taint logic
▼
┌───────────────────────────────────────────────────────────┐
│ Node Taint: readiness.k8s.io/my-component-ready=false:NoSchedule │
└───────────────────────────────────────────────────────────┘
│
│ Node now unschedulable for non-tolerant pods
▼
┌──────────────────────┐ ┌─────────────────────────────┐
│ Pods on this Node │ │ Descheduler │
│ (without toleration) │◄───┤ • Strategy: │
└──────────────────────┘ │ RemovePodsViolatingNodeTaints │
│ • includedTaints: │
│ ["readiness.k8s.io/my-component-ready"] │
└─────────────────────────────┘
│
│ Evicts violating pods
▼
┌─────────────────────────────┐
│ Kubernetes Scheduler │
│ • Re-schedules evicted pods │
└─────────────────────────────┘
│
▼
┌─────────────────────────────┐
│ Healthy Nodes │
│ (no matching taint) │
└─────────────────────────────┘
Benefits
- Automated recovery: No manual intervention needed for common node-level failures.
- Kubernetes-native: Built entirely on standard APIs (Conditions, Taints/Tolerations).
- Modular & extensible: New health checks can be added by defining new NPD rules + NRC gates.
Request
We’d like to:
- Confirm this integration pattern aligns with the project’s direction.
- Discuss whether support for such workflows should be documented or facilitated (e.g., example configs, Helm values).
- Explore whether any enhancements are needed in the current `NodeReadinessGateRule` CRD or controller logic to better support this use case.