KEP-3953: Node Resource Hot Plug #3955
base: master
Conversation
Karthik-K-N commented Apr 17, 2023 (edited)
- One-line PR description: Node Resource Hot Plug
- Issue link: Node Resource Hot Plug #3953
- Other comments:
a7bc843 to 03e927f
/assign @mrunalp @SergeyKanzhelev @klueska
/cc
/cc
/cc
Thanks @ffromani, @bart0sh for the inputs. I have updated the KEP with more details; please take a look when time permits.
As you might be aware, there is an ongoing effort to fetch the machine info via CRI, and in the future I think it should not take much effort to adopt it.
Co-authored-by: kishen-v <[email protected]>
92d631d to a7b59f4
Sure. I can't recall which hardware representation model (if we agreed on one already) is going to be used in that case.
Thank you @Karthik-K-N!
Let me know if you have any questions regarding the NodeSwap KEP.
The kubelet will periodically fetch this information, subsequently entrusting the node status updater to disseminate these updates at the node level across the cluster.
Moreover, this KEP aims to refine the initialization and reinitialization processes of resource managers, including the memory manager and CPU manager, to ensure their adaptability to changes in node configurations.

### User Stories
Currently, if a node is hotplugged with additional memory, do swap limits get updated? IIUC the answer is no, and this can perhaps serve as a justification for why this approach is needed.
More generally, IIUC restarting kubelet is a workaround in the sense that kubelet doesn't expect a situation where it would spawn on a node with already running containers that need to be modified based on node's resources being changed. Please keep me honest here as I'm not 100% sure that's the case.
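For reference, a minimal sketch of the proportional calculation the NodeSwap KEP's LimitedSwap behaviour describes (the names and rounding here are illustrative, not the kubelet's actual code); it shows why a memory hot plug invalidates previously computed swap limits:

```go
package main

import "fmt"

// containerSwapLimit approximates the LimitedSwap idea: each container gets a
// share of node swap proportional to its memory request relative to node
// memory capacity. Exact kubelet behaviour may differ; this is illustrative.
func containerSwapLimit(memoryRequest, nodeMemoryCapacity, nodeSwapCapacity int64) int64 {
	if nodeMemoryCapacity == 0 {
		return 0
	}
	return int64(float64(memoryRequest) / float64(nodeMemoryCapacity) * float64(nodeSwapCapacity))
}

func main() {
	const gi = int64(1) << 30
	// Before hot plug: 8Gi node, 4Gi swap, container requests 2Gi -> 1Gi swap limit.
	fmt.Println(containerSwapLimit(2*gi, 8*gi, 4*gi))
	// After hot plugging to 16Gi the proportional share drops to 512Mi, but the
	// cgroup still holds the old 1Gi limit unless the kubelet recomputes and re-applies it.
	fmt.Println(containerSwapLimit(2*gi, 16*gi, 4*gi))
}
```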
Another thought: is it dangerous to hotplug the node without restarting kubelet?
If it is, then ensuring the kubelet would restart can be considered as a stability enhancement.
Co-authored-by: kishen-v <[email protected]>
Thanks for the PRR update. PRR lgtm for alpha. Separate from PRR, some kind of reaction when resources are removed seems far better than doing nothing, even if it's only to indicate a problem in node status. Approving PRR. /approve
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: deads2k, Karthik-K-N
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
* Enable the re-initialization of resource managers (CPU manager, memory manager) to accommodate alterations in the node's resource allocation.
* Recalculating the OOMScoreAdj and swap memory limit for existing pods.
Would those two result in new CRI API calls? Which ones? How will we order these calls when we have many Pods, and how will we react to failing calls?
We decided to use the UpdateContainerResources CRI method for updating both OOMScoreAdj and swap. The initial plan is to serially update the containers across Pods.
In case of errors we will have a retry mechanism similar to SyncPod.
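A rough sketch of that serial walk, assuming a thin wrapper around the CRI UpdateContainerResources call (the interface and helper names here are hypothetical, not actual kubelet symbols):

```go
package hotplug

import (
	"context"
	"fmt"
	"time"
)

// ContainerUpdate carries the two LinuxContainerResources fields this KEP
// cares about when node capacity changes.
type ContainerUpdate struct {
	OomScoreAdj            int64
	MemorySwapLimitInBytes int64
}

// runtimeService is a hypothetical stand-in for the kubelet's CRI client,
// exposing only the UpdateContainerResources call discussed above.
type runtimeService interface {
	UpdateContainerResources(ctx context.Context, containerID string, update ContainerUpdate) error
}

// updateAllContainers walks containers serially (as proposed), retrying each a
// few times with backoff, roughly in the spirit of SyncPod's retry behaviour.
// Failures are collected rather than aborting the whole pass.
func updateAllContainers(ctx context.Context, rs runtimeService, updates map[string]ContainerUpdate) []error {
	var failures []error
	for containerID, update := range updates {
		var err error
		for attempt := 0; attempt < 3; attempt++ {
			if err = rs.UpdateContainerResources(ctx, containerID, update); err == nil {
				break
			}
			time.Sleep(time.Second << attempt) // simple exponential backoff
		}
		if err != nil {
			failures = append(failures, fmt.Errorf("container %s: %w", containerID, err))
		}
	}
	return failures
}
```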
I would appreciate a bit more detail here. If a resize failed on a pod, does it mean we need to fail it? Or notify the user somehow? Do we report the size change back to the scheduler BEFORE all Pods confirmed the resize or AFTER? What would be the implications if the scheduler wants to use extra resources BEFORE all Pods were actually resized?
Sure, definitely. We will update the KEP with more information.
Co-authored-by: kishen-v <[email protected]>
* Dynamically adjust system reserved and kube reserved values.
* Hot unplug of node resources.
* Update the autoscaler to utilize resource hot plugging.
Components should not depend on hotplugging being the implementation of Node resize, see my comment in Motivation. In particular IMO Cluster Autoscaler should watch Node changes and observe allocatable changing, that's all it needs to know to make its decisions AFAIK. Same goes for Scheduler?
- Handling resource demand with a limited set of nodes by increasing the capacity of existing nodes instead of creating new nodes.
- Creating new nodes takes more time compared to increasing the capacity of existing nodes.

### Goals
Components should not depend on hotplugging being the implementation of Node resize, see my comment in Motivation.
For now, we will introduce an error mode in the kubelet to inform users about the shrink in the available resources in case of hotunplug.

Few of the concerns surrounding hotunplug are listed below
* Pod re-admission:
To avoid terminating Pods we could perform Node scale-down in 3 phases:
- Block admission on the Node
- Perform scale-down (hotunplug)
- Unblock admission on the Node
These phases could be performed either manually by an operator or automatically with calls from some control plane (which is out of scope for KEP).
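A hedged sketch of how an operator-driven version of that three-phase flow could look with client-go (the node name and performHotUnplug are placeholders; the actual unplug is provider-specific and outside the kubelet):

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	ctx := context.Background()
	nodeName := "worker-1" // example node

	setUnschedulable(ctx, client, nodeName, true)  // phase 1: block admission (cordon)
	performHotUnplug(nodeName)                     // phase 2: provider-specific scale-down
	setUnschedulable(ctx, client, nodeName, false) // phase 3: unblock admission (uncordon)
}

func setUnschedulable(ctx context.Context, client kubernetes.Interface, name string, v bool) {
	node, err := client.CoreV1().Nodes().Get(ctx, name, metav1.GetOptions{})
	if err != nil {
		panic(err)
	}
	node.Spec.Unschedulable = v
	if _, err := client.CoreV1().Nodes().Update(ctx, node, metav1.UpdateOptions{}); err != nil {
		panic(err)
	}
	fmt.Printf("node %s unschedulable=%v\n", name, v)
}

func performHotUnplug(node string) {
	// hypothetical placeholder: e.g. an ACPI/virtio-mem operation or a cloud provider API call
	fmt.Println("hot unplug performed on", node)
}
```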
Noted, but we may need to evaluate resource availability for already running pods, hence running podReAdmission is necessary. Ref: https://docs.google.com/document/d/1KfjPRmCc8Xk0xxa4S8ZRle6VMzc1C6MQg4ivOaoB150/edit?disco=AAAAt7P2DTA
This implies that the scale-down is something kubernetes knows about before it happens, right? How does it come to learn about an imminent scale event?
And even then, we still need to define what happens when you get a surprise.
FYI, eviction is a topic in itself, especially being able to assess what the impact would be on affected workloads and how to migrate pods with as little impact as possible. But all of it is well described in KEP-4563 (#4565).
### Non-Goals

* Dynamically adjust system reserved and kube reserved values.
Why can we afford not to adjust reserved resources? If we increase a Node's CPU/memory significantly, we allow way more Pods to run and we reinitialize existing components to detect more resources.
Some background on why this is a non-goal: https://docs.google.com/document/d/1KfjPRmCc8Xk0xxa4S8ZRle6VMzc1C6MQg4ivOaoB150/edit?disco=AAABVzcLYu4
I think it's in-scope to be very clear that these values CAN CHANGE (so consumers know), but maybe out-of-scope on exactly if/how kubelet allows changing them on the fly?
In our exploration we identified that the ratio can vary across different providers or flavors of Kubernetes offerings. For example, in GKE it is calculated as described in https://cloud.google.com/kubernetes-engine/docs/concepts/plan-node-sizes
Are you suggesting we standardize this formula and maybe let downstream controllers override it later?
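For illustration only, a rough sketch of the kind of step function that GKE page describes for memory reservations (percentages paraphrased from that doc, so treat the doc as authoritative); the point is just that each provider bakes in its own curve, which is why automatic adjustment is hard to standardize:

```go
package main

import (
	"fmt"
	"math"
)

// gkeStyleReservedMemoryGiB approximates the tiered memory reservation the
// linked GKE doc describes: a percentage of each successive memory band.
// The exact percentages and the small-machine special case are defined by the
// provider; treat these numbers as illustrative.
func gkeStyleReservedMemoryGiB(machineGiB float64) float64 {
	bands := []struct{ upTo, fraction float64 }{
		{4, 0.25}, {8, 0.20}, {16, 0.10}, {128, 0.06}, {math.Inf(1), 0.02},
	}
	reserved, prev := 0.0, 0.0
	for _, b := range bands {
		if machineGiB <= prev {
			break
		}
		reserved += (math.Min(machineGiB, b.upTo) - prev) * b.fraction
		prev = b.upTo
	}
	return reserved
}

func main() {
	// A node hot plugged from 8 GiB to 64 GiB would also need its reservation revisited.
	for _, g := range []float64{8, 16, 64} {
		fmt.Printf("%3.0f GiB machine -> ~%.2f GiB reserved\n", g, gkeStyleReservedMemoryGiB(g))
	}
}
```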
Kind of agree that these values are set and calculated by individual vendors/clouds today. It's not so meaningful to include some standard formula and adjust them automatically.
But we can discuss how someone can change this on-the-fly if needed, and maybe provide some knobs to do so.
But we can discuss how someone can change this on-the-fly if needed to, and maybe provide some knobs to do so.
Sure, I think this can be considered an extension to this KEP and will add it to Future Work.
In a conventional Kubernetes environment, the cluster resources might necessitate modification because of inaccurate resource allocation during cluster initialization or escalating workload over time,
necessitating supplementary resources within the cluster.

Contemporarily, kernel capabilities enable the dynamic addition of CPUs and memory to a node (for example: https://docs.kernel.org/core-api/cpu_hotplug.html and https://docs.kernel.org/core-api/memory-hotplug.html).
We tried Node resizing with hot plug/unplug in GKE and noticed some components don't handle onlining of previously offlined CPUs well, e.g. Cilium not listening to perf events on the newly onlined CPUs. We also found other libraries that could be affected. We later switched to cgroups limiting on the guest + VMM-level throttling as a more reliable alternative.
I treat Node resizing as still a research space so I want to make sure there's an API for telling Kubelet what should be the current resources, as opposed to it discovering it on its own. It's what we discussed in the past with a new API for NRI plugins. This way hotplugging could be the default implementation, but replaceable with other solutions.
Agreed. Maybe once we have some concrete resource API for the kubelet, we can consider swapping in alternatives in the future if required, without disrupting the overall system.
e.g. Cilium not listening to perf events on the newly onlined CPUs
Not blaming Cilium here, it's easy to assume something never changes when it has never actually changed. Even though we all KNOW that cpu hotplug is a thing. I do think we should apply pressure to these components to do proper fixes.
The interaction sequence is as follows
1. Kubelet will be polling in interval to fetch the machine resource information from cAdvisor's cache, Which is currently updated every 5 minutes.
Note that with an external signal about the current VM size coming from NRI Kubelet could maybe react within ~1s as opposed to the current 5 minutes of cadvisor cache refresh (see my comment in Motivation).
True, but cAdvisor currently uses a default poll interval of 5 min, which can be customized through update_machine_info_interval for aggressive polling. On similar lines, if needed we can customize the polling interval in the kubelet through a flag.
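To make the shape concrete, a minimal sketch of such a kubelet-side polling loop over cAdvisor's cached MachineInfo (the interface and function names are illustrative, assuming cAdvisor's v1 MachineInfo type):

```go
package hotplug

import (
	"context"
	"time"

	cadvisorapi "github.com/google/cadvisor/info/v1"
)

// machineInfoProvider mirrors the slice of the kubelet's cAdvisor interface
// needed here: just the cached machine info.
type machineInfoProvider interface {
	MachineInfo() (*cadvisorapi.MachineInfo, error)
}

// pollMachineInfo re-reads machine info on a configurable interval and invokes
// onChange when CPU count or memory capacity differs from the last observation.
// cAdvisor's own cache refresh still bounds how fresh this data can be.
func pollMachineInfo(ctx context.Context, provider machineInfoProvider, interval time.Duration, onChange func(*cadvisorapi.MachineInfo)) {
	var lastCores int
	var lastMemory uint64
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			info, err := provider.MachineInfo()
			if err != nil {
				continue // transient error; try again on the next tick
			}
			if info.NumCores != lastCores || info.MemoryCapacity != lastMemory {
				lastCores, lastMemory = info.NumCores, info.MemoryCapacity
				onChange(info)
			}
		}
	}
}
```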
We should define an interface and assert a desired SLO. If that means polling more frequently or watching a different kernel mechanism, that goes BEHIND the interface
Sure, we will explore and update as per your suggestion #3955 (comment)
Co-authored-by: kishen-v <[email protected]>
I looked at this from a high level, but given the timeframe it seems unlikely to make this KEP freeze -- how bad is that?
### Non-Goals

* Dynamically adjust system reserved and kube reserved values.
* Hot unplug of node resources.
Should it be removed from non-goals, then?
* Dynamically adjust system reserved and kube reserved values.
* Hot unplug of node resources.
* Update the autoscaler to utilize resource hot plugging.
"Depend on" and "take advantage of" are different.
Should VPA know whether a node is vertically scalable? If it knew that, might it make better decisions? What SLOs would be needed to make it useful?
Or maybe that's just too complicated? I'm assuming that some smart, multi-dimensional scaling HAS to emerge, so it can decide between adding replicas or scaling-up pods (or some mix thereof). With IPPR it can really only look at the node capacity to make that decision. If a node could express "I have 12 cores, but I could have up to 32 with an 86% chance of success", would that be useful?
## Proposal

This KEP strives to enable node resource hot plugging by incorporating a polling mechanism within the kubelet to retrieve machine-information from cAdvisor's cache, which is already updated periodically.
From a design POV, why is polling the way to go? With the luxury of distance, I would think something like this would be simpler:
- Define an interface which calls a callback function when resources change.
- Implement said interface in terms of cadvisor (which might poll underneath).
- Consume those callbacks in kubelet (possibly through a queue and a periodic handler).
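To sketch what that could look like (all names here are hypothetical, not an existing kubelet API):

```go
package hotplug

import cadvisorapi "github.com/google/cadvisor/info/v1"

// MachineResourceObserver is a hypothetical callback-style interface along the
// lines suggested above. The implementation (cAdvisor-backed today, an NRI or
// provider signal later) owns how changes are detected; consumers only see the
// callback.
type MachineResourceObserver interface {
	// Register adds a callback invoked whenever node capacity changes.
	Register(onChange func(old, updated *cadvisorapi.MachineInfo))
	// Start begins watching (or polling underneath) until stop is closed.
	Start(stop <-chan struct{})
}
```

In the kubelet the callback would most likely just enqueue work, so the node status updater and resource managers stay driven by a single periodic, rate-limited handler.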
Sure, thank you for the idea. We were inclined to polling in the kubelet as it was the existing design in cAdvisor.
We will surely explore this path, as it can be managed better for rate limiting if needed.
As a Cluster administrator, I want to resize a Kubernetes node dynamically, so that I can quickly hot plug resources without waiting for new nodes to join the cluster.
I like this topic.
Why NOT restart kubelet? It's certainly going to be less complicated, technically.
Is it too slow? Is that something we should fix anyway? Having kubelet restarts be reliably fast and safe is a win.
Would that only be a partial solution (e.g. kubelet plugins have the same problem)?
etc.
I am not saying "do it with a restart" but "convince me that a restart is bad or insufficient"
- Lack of coordination about change in resource availability across kubelet/runtime/plugins.
- The plugins/runtime should be updated to react to change in resource information on the node.

- Kubelet missing hotplug event or too many hotplug events
We should always be level triggered. Events are great, but frequent re-validation is where it's at.
Agreed. The intention was to mention the kubelet failing to react to, or ignoring, any hot plug instances. We will rephrase accordingly to mention it is level triggered.
### Notes/Constraints/Caveats (Optional)

### Risks and Mitigations
I would add the risk that bringing new NAMED resources (as opposed to aggregate resources) on-line can trip up naive consumers.
- Anything that spawns per-CPU things (threads, workpools, etc)
- Anything that is aware of relationships (e.g NUMA)
- Hot-plugging other devices (e.g. GPUs) might trip up applications or libraries (e.g. is CUDA/NCCL ready for this?)
/cc
### Goals

* Achieve seamless node capacity expansion through hot plugging resources.
* Enable the re-initialization of resource managers (CPU manager, memory manager) and kube runtime manager to accommodate alterations in the node's resource allocation.
If some existing pods have been pinned to the old remaining CPU set, would the re-init update them after new CPU cores appear?
No, hotplug only supplements additional CPUs rather than re-assigning a new set of required CPUs.
- https://github.com/kubernetes/kubernetes/issues/125579
- https://github.com/kubernetes/kubernetes/issues/127793

Hence, it is necessary to handle the updates in the compute capacity in a graceful fashion across the cluster, than adopting to reset the cluster components to achieve the same.
Suggested change:
Hence, it is necessary to handle capacity updates gracefully across the cluster, rather than resetting the cluster components to achieve the same outcome.
This KEP strives to enable node resource hot plugging by incorporating a polling mechanism within the kubelet to retrieve machine-information from cAdvisor's cache, which is already updated periodically.
The kubelet will periodically fetch this information, subsequently entrusting the node status updater to disseminate these updates at the node level across the cluster.
Moreover, this KEP aims to refine the initialization and reinitialization processes of resource managers, including the memory manager and CPU manager, to ensure their adaptability to changes in node configurations.
With this proposal its also necessary to recalculate and update OOMScoreAdj and swap limit for the pods that had been existing before resize. But this carries small overhead due to recalculation of swap and OOMScoreAdj.
Suggested change:
With this proposal its also necessary to recalculate and update OOMScoreAdj and swap limit for the pods that had been existing before resize. But this carries a small overhead due to recalculation of swap and OOMScoreAdj.
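As a worked illustration of why that recalculation matters (the formula below only approximates the kubelet's burstable-pod OOM score calculation; the real code has additional clamping and QoS special cases):

```go
package hotplug

// burstableOOMScoreAdj approximates how the kubelet derives oom_score_adj for
// burstable containers from the memory request relative to node capacity:
// higher values are killed first. The real kubelet clamps slightly differently.
func burstableOOMScoreAdj(memoryRequestBytes, memoryCapacityBytes int64) int64 {
	adj := 1000 - (1000*memoryRequestBytes)/memoryCapacityBytes
	if adj < 2 {
		adj = 2
	}
	if adj > 999 {
		adj = 999
	}
	return adj
}

// Example: a container requesting 2Gi on an 8Gi node gets adj=750; after the
// node is hot plugged to 16Gi the same request maps to adj=875, so containers
// started before the resize keep a stale relative kill priority until updated.
```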
* Node resource information before and after resource hot plug for the following scenarios.
* upsize -> downsize
* upsize -> downsize -> upsize
* downsize- > upsize
Suggested change:
* downsize -> upsize
* Since the total capacity of the node has changed, values associated with the nodes memory capacity must be recomputed.
* Handling unplug of reserved CPUs.

we intend to propose a separate KEP dedicated to hotunplug of resources to address the same.
Suggested change:
We intend to propose a separate KEP dedicated to hotunplug of resources to address that.
In case of rollout failures, running workloads are not affected, If the pods are on pending state they remain
in the pending state only.
Suggested change:
In case of rollout failures, running workloads are not affected. If pods are in the pending state, they remain pending.
In a conventional Kubernetes environment, the cluster resources might necessitate modification because of inaccurate resource allocation during cluster initialization or escalating workload over time,
necessitating supplementary resources within the cluster.
Suggested change:
In a conventional Kubernetes environment, cluster resources might need modification because of inaccurate resource allocation or due to escalating workloads over time, requiring supplementary resources within the cluster.
<!--
Even if applying deprecation policies, they may still surprise some users.
-->
No
Would we add a flag to control the poll interval for the kubelet reading new machine info from cAdvisor?
Flag, possibly not. Did you mean "field within the kubelet configuration file?"
(using command line arguments to kubelet is mostly deprecated in favor of using configuration fields)
As suggested in the comment, we have decided to move away from the poll mechanism to a level-triggered mechanism, where we continuously watch for resize changes from cAdvisor and handle them accordingly.
Coming from the ping on the SIG-Scheduling Slack, I took a glance at it.
Overall I don't see any mention of what changes are needed in which part of the scheduler for this proposal. Do you mean we don't need any change in the scheduler because the node's status is updated and the scheduler watches it? (If yes, can you update the KEP to clarify that point?)
Also, it only mentions the "With increase in cluster resources" scenario. Would it be impossible to decrease the resources?
EDIT: looks like that's out of scope of this KEP.
https://kubernetes.slack.com/archives/C09TP78DV/p1739505791133219?thread_ts=1739425555.326269&cid=C09TP78DV
When we graduate to beta, and on Linux, we could consider a test along the lines of:
Co-authored-by: kishen-v <[email protected]>