Description:
Goal
We propose to build a remediation loop that integrates Node Problem Detector (NPD), the Node Readiness Controller (NRC), and Descheduler to enable self-healing workloads based on node-level health signals.
This would allow the system to:
- Detect node-level issues via NPD.
- Automatically taint unhealthy nodes.
- Trigger pod eviction and rescheduling via Descheduler for workloads that do not tolerate the taint.
Proposed Architecture
- Detection (NPD)
  - NPD runs custom health checks (e.g., some hardware status).
  - On failure, it sets a custom NodeCondition (e.g., `CustomCondition/MyComponentReady=False`).
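For the detection step above, a minimal sketch of an NPD custom plugin monitor config is shown below. The check script path `/custom-plugins/check_my_component.sh` and the `source`/`reason` strings are placeholders, and since NPD reports a detected problem by setting the configured condition to `True`, the condition naming (or the readiness rule that consumes it) may need to be inverted relative to the wording above; please verify against the NPD documentation.

```json
{
  "plugin": "custom",
  "pluginConfig": {
    "invoke_interval": "30s",
    "timeout": "5s"
  },
  "source": "my-component-monitor",
  "conditions": [
    {
      "type": "MyComponentReady",
      "reason": "MyComponentIsHealthy",
      "message": "my component is functioning properly"
    }
  ],
  "rules": [
    {
      "type": "permanent",
      "condition": "MyComponentReady",
      "reason": "MyComponentIsUnhealthy",
      "path": "/custom-plugins/check_my_component.sh",
      "timeout": "3s"
    }
  ]
}
```

The check script exits 0 when the component is healthy and non-zero when it is not, and NPD updates the node's conditions accordingly.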
- Tainting (Node Readiness Controller)
  - A `NodeReadinessGateRule` watches the custom condition.
  - When the condition is not `True`, it adds a specific taint (e.g., `readiness.k8s.io/my-component-ready=false:NoSchedule`) to the node.
  - When the condition recovers, the taint is automatically removed.
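A readiness gate rule for that condition might look roughly like the following. This is a hypothetical sketch only: the group/version and field names (`conditionType`, `requiredStatus`, `taint`) are assumptions for illustration, not the actual `NodeReadinessGateRule` schema, which should be taken from this project's API.

```yaml
# Hypothetical sketch; field names are illustrative, not the real CRD schema.
apiVersion: readiness.k8s.io/v1alpha1   # assumed group/version
kind: NodeReadinessGateRule
metadata:
  name: my-component-ready
spec:
  # Condition published by NPD that this rule watches.
  conditionType: MyComponentReady
  # Any status other than "True" is treated as not ready.
  requiredStatus: "True"
  # Taint added while the node is not ready; removed once the condition recovers.
  taint:
    key: readiness.k8s.io/my-component-ready
    value: "false"
    effect: NoSchedule
```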
- Rescheduling (Descheduler)
  - Descheduler runs with the `RemovePodsViolatingNodeTaints` strategy.
  - It is configured with `includedTaints: ["readiness.k8s.io/my-component-ready"]` to act only on our custom taint.
  - Pods without a matching toleration are evicted and rescheduled by the default scheduler onto healthy nodes.
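On the Descheduler side, assuming a recent release that supports the `v1alpha2` policy API and the `includedTaints` argument for `RemovePodsViolatingNodeTaints`, the policy could look roughly like this:

```yaml
apiVersion: "descheduler/v1alpha2"
kind: "DeschedulerPolicy"
profiles:
  - name: node-taint-remediation
    pluginConfig:
      - name: "RemovePodsViolatingNodeTaints"
        args:
          # Act only on our custom readiness taint, not on every NoSchedule taint.
          includedTaints:
            - "readiness.k8s.io/my-component-ready"
    plugins:
      deschedule:
        enabled:
          - "RemovePodsViolatingNodeTaints"
```

Workloads that should keep running on a degraded node (for example, node-local agents) are not considered violating and are left alone because they tolerate the taint in their pod spec:

```yaml
tolerations:
  - key: "readiness.k8s.io/my-component-ready"
    operator: "Exists"
    effect: "NoSchedule"
```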
┌──────────────────────┐ ┌─────────────────────────────┐
│ Node Problem │ │ Node Readiness │
│ Detector (NPD) │ │ Controller (NRC) │
└──────────────────────┘ └─────────────────────────────┘
│ ▲
│ Detects hardware/daemon │ Watches NodeCondition
│ failure & sets condition │
▼ │
┌───────────────────────────────────────────────────────────┐
│ Node Condition: CustomCondition/MyComponentReady=False │
└───────────────────────────────────────────────────────────┘
│
│ Triggers taint logic
▼
┌───────────────────────────────────────────────────────────┐
│ Node Taint: readiness.k8s.io/my-component-ready=false:NoSchedule │
└───────────────────────────────────────────────────────────┘
│
│ Node now unschedulable for non-tolerant pods
▼
┌──────────────────────┐ ┌─────────────────────────────┐
│ Pods on this Node │ │ Descheduler │
│ (without toleration) │◄───┤ • Strategy: │
└──────────────────────┘ │ RemovePodsViolatingNodeTaints │
│ • includedTaints: │
│ ["readiness.k8s.io/my-component-ready"] │
└─────────────────────────────┘
│
│ Evicts violating pods
▼
┌─────────────────────────────┐
│ Kubernetes Scheduler │
│ • Re-schedules evicted pods │
└─────────────────────────────┘
│
▼
┌─────────────────────────────┐
│ Healthy Nodes │
│ (no matching taint) │
└─────────────────────────────┘
Benefits
- Automated recovery: No manual intervention needed for common node-level failures.
- Kubernetes-native: Built entirely on standard APIs (Conditions, Taints/Tolerations).
- Modular & extensible: New health checks can be added by defining new NPD rules + NRC gates.
Request
We’d like to:
- Confirm this integration pattern aligns with the project’s direction.
- Discuss whether support for such workflows should be documented or facilitated (e.g., example configs, Helm values).
- Explore whether any enhancements are needed in the current `NodeReadinessGateRule` CRD or controller logic to better support this use case.