Description
Is this a BUG REPORT or FEATURE REQUEST?:
QUESTION
What happened:
We are running node reaper in our kube cluster to reap nodes older than 7d for security and compliance reasons. Some of our workloads are ML workloads (Apache Flink jobs) that run on a dedicated node group. Due to the architecture of Flink (and many other ML architectures), reaping any node in that node group forces the entire job to restart. So when nodes in the group are reaped one by one (multiple times a day once they pass 7d), the job is restarted repeatedly, and the restarts add up to significant processing lag. Since reaping a single node older than 7d effectively has the same impact as reaping every node in the group, we are wondering:
- Is there an option in node reaper to wait until all nodes in a specific node group are older than N days and then reap them all at once?
- Is there an option to reap nodes in a specific node group only within a specific time window, so that we can contain the downtime?
- Any other suggestions to address our scenario?
As noted above, we are looking for advanced configuration for a few special node groups, in addition to the regular options that apply to all other nodes.
Thanks in advance for any help on this.
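To make the two behaviors we are asking about concrete, here is a minimal sketch of the checks we would want node reaper to apply per node group. The function names (`group_ready_to_reap`, `in_maintenance_window`) and the 02:00-04:00 UTC window are illustrative assumptions, not existing node reaper APIs; node creation times would come from each node's `.metadata.creationTimestamp`:

```python
from datetime import datetime, timedelta, timezone

def group_ready_to_reap(node_creation_times, min_age=timedelta(days=7), now=None):
    """Return True only when EVERY node in the group is older than min_age,
    so the whole group can be reaped in one batch (a single job restart)."""
    now = now or datetime.now(timezone.utc)
    return all(now - created >= min_age for created in node_creation_times)

def in_maintenance_window(now=None, start_hour=2, end_hour=4):
    """Return True when the current UTC hour falls inside an allowed reap
    window (02:00-04:00 UTC here, an assumed example window)."""
    now = now or datetime.now(timezone.utc)
    return start_hour <= now.hour < end_hour
```

With both checks, the group would only be drained once per cycle, inside the window, instead of restarting the Flink job on every individual node expiry.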
What you expected to happen:
An option to configure specific node groups differently from the rest of the cluster.
How to reproduce it (as minimally and precisely as possible):
N/A
Anything else we need to know?:
N/A
Environment:
- Kubernetes version: v1.23
Other debugging information (if applicable):
- relevant logs: N/A