diff --git a/keps/prod-readiness/sig-node/3953.yaml b/keps/prod-readiness/sig-node/3953.yaml new file mode 100644 index 00000000000..7416eaf941f --- /dev/null +++ b/keps/prod-readiness/sig-node/3953.yaml @@ -0,0 +1,3 @@ +kep-number: 3953 +alpha: + approver: "@deads2k" diff --git a/keps/sig-node/3953-node-resource-hot-plug/README.md b/keps/sig-node/3953-node-resource-hot-plug/README.md new file mode 100644 index 00000000000..fabb53bb311 --- /dev/null +++ b/keps/sig-node/3953-node-resource-hot-plug/README.md @@ -0,0 +1,927 @@ +# KEP-3953: Node Resource Hot Plug + + + + +- [Release Signoff Checklist](#release-signoff-checklist) +- [Glossary](#glossary) +- [Summary](#summary) +- [Motivation](#motivation) + - [Goals](#goals) + - [Non-Goals](#non-goals) +- [Proposal](#proposal) + - [User Stories](#user-stories) + - [Story 1](#story-1) + - [Story 2](#story-2) + - [Story 3](#story-3) + - [Story 4](#story-4) + - [Story 5](#story-5) + - [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional) + - [Risks and Mitigations](#risks-and-mitigations) +- [Design Details](#design-details) + - [Handling hotplug events](#handling-hotplug-events) + - [Flow Control for updating swap limit for containers](#flow-control-for-updating-swap-limit-for-containers) + - [Handling HotUnplug Events](#handling-hotunplug-events) + - [Flow Control](#flow-control) + - [Test Plan](#test-plan) + - [Unit tests](#unit-tests) + - [e2e tests](#e2e-tests) + - [Graduation Criteria](#graduation-criteria) + - [Phase 1: Alpha (target 1.34)](#phase-1-alpha-target-134) + - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy) + - [Upgrade](#upgrade) + - [Downgrade](#downgrade) + - [Version Skew Strategy](#version-skew-strategy) +- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire) + - [Feature Enablement and Rollback](#feature-enablement-and-rollback) + - [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning) + - [Monitoring Requirements](#monitoring-requirements) + - [Dependencies](#dependencies) + - [Scalability](#scalability) + - [Troubleshooting](#troubleshooting) +- [Implementation History](#implementation-history) +- [Drawbacks](#drawbacks) +- [Alternatives](#alternatives) +- [Infrastructure Needed (Optional)](#infrastructure-needed-optional) +- [Future Work](#future-work) + + +## Release Signoff Checklist + +Items marked with (R) are required *prior to targeting to a milestone / release*. 
+
+- [ ] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
+- [ ] (R) KEP approvers have approved the KEP status as `implementable`
+- [ ] (R) Design details are appropriately documented
+- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
+  - [ ] e2e Tests for all Beta API Operations (endpoints)
+  - [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
+  - [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free
+- [ ] (R) Graduation criteria is in place
+  - [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
+- [ ] (R) Production readiness review completed
+- [ ] (R) Production readiness review approved
+- [ ] "Implementation History" section is up-to-date for milestone
+- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
+- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
+
+[kubernetes.io]: https://kubernetes.io/
+[kubernetes/enhancements]: https://git.k8s.io/enhancements
+[kubernetes/kubernetes]: https://git.k8s.io/kubernetes
+[kubernetes/website]: https://git.k8s.io/website
+
+## Glossary
+
+Hotplug: Dynamically add compute resources (CPU, Memory, Swap Capacity and HugePages) to the node, either via software (bringing offlined resources online) or via hardware (physical additions while the system is running).
+
+Hotunplug: Dynamically remove compute resources (CPU, Memory, Swap Capacity and HugePages) from the node, either via software (taking resources offline) or via hardware (physical removal while the system is running).
+
+## Summary
+
+The proposal seeks to facilitate hot plugging of node compute resources (CPU, Memory, Swap Capacity and HugePages), thereby streamlining cluster resource capacity updates through node compute resource resizing rather than introducing new nodes to the cluster.
+The revised node configuration will be automatically propagated at both the node and cluster levels.
+
+Furthermore, this proposal intends to enhance the initialization and reinitialization processes of resource managers, including the CPU manager and memory manager, in response to alterations in a node's CPU and memory configuration, and
+aims to optimize resource management, improve scalability, and minimize disruptions to cluster operations.
+
+## Motivation
+
+Currently, the node's resource configuration is recorded solely during the kubelet bootstrap phase and subsequently cached, assuming the node's compute capacity remains unchanged throughout the cluster's lifecycle.
+In a conventional Kubernetes environment, cluster resources might need to be modified because of inaccurate resource allocation during cluster initialization or a workload that grows over time,
+requiring supplementary resources within the cluster.
+
+Today, kernel capabilities enable the dynamic addition of CPUs and memory to a node (for example: https://docs.kernel.org/core-api/cpu_hotplug.html and https://docs.kernel.org/core-api/memory-hotplug.html).
+This capability spans different architectures and compute environments such as cloud, bare metal and VMs. During such a live-resize, Kubernetes can be left unaware of the node's altered compute capacity,
+causing the node to retain outdated information and leading to inconsistencies or an imbalance in the cluster, thus affecting the optimal scheduling and deployment of workloads. As a side-effect, it is also possible for workloads
+to be force-migrated to a different node, causing an undesirable temporary spike in CPU/memory utilisation.
+
+With the current state of implementation in Kubernetes, the available workaround to make the cluster aware of changes in a node's capacity is to
+restart the node, or at least restart the kubelet, neither of which has a well-defined set of best practices to follow.
+
+This approach also carries a few drawbacks, such as:
+  - Introducing downtime for existing and to-be-scheduled workloads on the cluster until the node is available again.
+  - The need to reconfigure the underlying services after a node reboot.
+  - Managing the nuances associated with a kubelet restart or node reboot, such as:
+    - https://github.com/kubernetes/kubernetes/issues/109595
+    - https://github.com/kubernetes/kubernetes/issues/119645
+    - https://github.com/kubernetes/kubernetes/issues/125579
+    - https://github.com/kubernetes/kubernetes/issues/127793
+
+Hence, it is necessary to handle updates to compute capacity gracefully across the cluster, rather than resetting cluster components to achieve the same result.
+
+Also, given that the capability to live-resize a node exists in the Linux and Windows kernels, making the kubelet aware of the underlying changes in the node's compute capacity removes the manual actions otherwise required
+of the Kubernetes administrator.
+
+Node resource hot plugging proves advantageous in scenarios such as:
+- Efficiently managing resource demands with a limited number of nodes by increasing the capacity of existing nodes instead of provisioning new ones.
+- Provisioning new nodes is considerably more time-intensive than expanding the capacity of current nodes.
+- Improved inter-pod network latencies, as inter-node traffic can be reduced if more pods can be hosted on a single node.
+- Easier cluster management with fewer nodes, which brings less overhead on the control plane.
+- Mitigating a few of the existing limitations/issues that are associated with a node/kubelet restart.
+
+Implementing this KEP will empower nodes to recognize and adapt to changes in their compute configuration, facilitating the efficient and effective deployment of pod workloads to nodes capable of meeting the required compute demands.
+
+### Goals
+
+* Achieve seamless node capacity expansion through hot plugging of resources.
+* Enable the re-initialization of resource managers (CPU manager, memory manager) and the kube runtime manager to accommodate alterations in the node's resource allocation.
+* Recalculate and update the OOMScoreAdj and swap memory limit for existing pods.
+
+### Non-Goals
+
+* Dynamically adjust system reserved and kube reserved values.
+* Hot unplug of node resources.
+* Update the autoscaler to utilize resource hot plugging.
+* Re-balance workloads across the nodes.
+* Update runtime/NRI plugins with host resource changes.
+
+## Proposal
+
+This KEP strives to enable node resource hot plugging by making the kubelet watch and retrieve machine resource information from cAdvisor's cache as and when it changes; cAdvisor's cache is already updated periodically.
+The kubelet will fetch this information, subsequently entrusting the node status updater to disseminate these updates at the node level across the cluster.
+Moreover, this KEP aims to refine the initialization and reinitialization processes of resource managers, including the memory manager and CPU manager, to ensure their adaptability to changes in node configurations.
+With this proposal it is also necessary to recalculate and update the OOMScoreAdj and swap limit for pods that existed before the resize, which carries a small overhead due to the recalculation of the swap limit and OOMScoreAdj.
+
+### User Stories
+
+#### Story 1
+
+As a Kubernetes user, I want to allocate more resources (CPU, memory) to a node with existing specialized hardware or CPU capabilities (for example: https://www.kernel.org/doc/html/v5.8/arm64/elf_hwcaps.html)
+so that additional workloads can leverage that hardware and be efficiently scheduled and run without manual intervention.
+
+#### Story 2
+
+As a Kubernetes Application Developer, I want the kernel to optimize system performance by making better use of local resources when a node is resized, so that my applications run faster with fewer disruptions. This is achieved through:
+- Fewer context switches: with more CPU cores and memory on a resized node, the kernel has a better chance to spread workloads out efficiently. This can reduce contention between processes, leading to fewer context switches (which can be costly in terms of CPU time),
+  less process interference and lower latency.
+- Better memory allocation: if the kernel has more memory available, it can allocate larger contiguous memory blocks, which can lead to better memory locality (i.e., keeping related data closer in physical memory) and improved paging and swap limits, thus
+  reducing latency for applications that rely on large datasets, as in the case of database applications.
+
+#### Story 3
+
+As a Site Reliability Engineer (SRE), I want to reduce the operational complexity of managing multiple worker nodes, so that I can focus on fewer resources and simplify troubleshooting and monitoring.
+
+#### Story 4
+
+As a Cluster administrator, I want to resize a Kubernetes node dynamically, so that I can quickly hot plug resources without waiting for new nodes to join the cluster.
+
+#### Story 5
+
+As a Cluster administrator, I expect my existing workloads to keep functioning without the disruption induced when capacity addition is followed by a node/kubelet restart to
+detect the change in compute capacity, which can bring additional complications.
+
+### Notes/Constraints/Caveats (Optional)
+
+### Risks and Mitigations
+
+- Change in OOMScoreAdjust value:
+  - The formula to calculate OOMScoreAdjust is `1000 - (1000*containerMemReq)/memoryCapacity`.
+  - When memoryCapacity changes after an up-scale, the existing OOMScoreAdjust may no longer be in line with the
+    value that should apply to existing pods.
+  - This can be mitigated by recalculating the OOMScoreAdjust value for the existing pods. However, there can be an associated overhead for
+    recalculating the scores.
+- Change in Swap limit:
+  - The formula to calculate the swap limit is `(containerMemoryRequest / nodeTotalMemory) * totalPodsSwapAvailable`.
+  - When nodeTotalMemory or totalPodsSwapAvailable changes after an up-scale, the existing swap limit may no longer be in line with the
+    value that should apply to existing pods.
+  - This can be mitigated by recalculating the swap limit for the existing pods. However, there can be an associated overhead for
+    recalculating the limits.
+
+- After an up-scale, any failure to resync the resource managers may lead to incorrect or rejected allocations, which can result in underperforming or rejected workloads.
+  - To mitigate this risk, adequate tests should be added to cover the scenarios where a failure to resync the resource managers can occur.
+
+- Lack of coordination about changes in resource availability across the kubelet, runtime and plugins.
+  - The plugins/runtime should be updated to react to changes in the resource information on the node.
+
+- Kubelet missing one or more hotplug events
+  - The kubelet observes the underlying node for hotplugged resources as and when they occur;
+    the capacity is updated at set intervals, so the kubelet technically cannot miss updating to the actual capacity obtained from cAdvisor.
+
+- Handling downsize events
+  - Though this KEP does not support handling a node downsize, it is the onus of the cluster administrator to resize responsibly to avoid disruption, as this lies outside the Kubernetes realm.
+  - However, in the event of a downsize, the kubelet enters an error mode and the node is marked as `NotReady`.
+
+- Workloads that are dependent on the initial node configuration, such as:
+  - Workloads that spawn per-CPU processes (threads, workpools, etc.)
+  - Workloads that depend on CPU-memory relationships (e.g. processes that depend on NUMA alignment)
+  - Dependency on external libraries/device drivers that must support CPU hotplug.
+
+## Design Details
+
+The diagram below shows the interaction between the kubelet, the node and cAdvisor.
+
+```mermaid
+sequenceDiagram
+    participant node
+    participant kubelet
+    participant cAdvisor-cache
+    participant machine-info
+    kubelet->>cAdvisor-cache: fetch
+    cAdvisor-cache->>machine-info: fetch
+    machine-info->>cAdvisor-cache: update
+    cAdvisor-cache->>kubelet: update
+    alt if increase in resource
+      kubelet->>node: recalculate and update OOMScoreAdj and swap limit of containers
+      kubelet->>node: re-initialize resource managers
+      kubelet->>node: node status update with new capacity
+    else if decrease in resource
+      kubelet->>node: set node status to not ready
+    end
+```
+
+The interaction sequence is as follows:
+1. The kubelet fetches machine resource information from cAdvisor's cache, whose refresh interval is configurable via the cAdvisor flag `update_machine_info_interval`.
+2. If the machine resources have increased:
+   * Recalculate and update the OOMScoreAdj and swap limit of all running containers.
+   * Re-initialize the resource managers.
+   * Update the node with the new resources.
+3. If the machine resources have decreased:
+   * Set the node status to not ready. (This will be reverted when the current capacity exceeds or matches either the previous hot-plug capacity, or the initial capacity
+     in case there is no history of hotplug.)
+
+With an increase in cluster resources the following components will be updated:
+1. Change in OOM score adjust:
+   * Currently, the OOM score adjust is calculated by
+     `1000 - (1000*containerMemReq)/memoryCapacity`
+   * An increase in memoryCapacity will result in an updated OOM score adjust for pods deployed post resize, and the same will be recalculated for existing pods.
+
+2. Change in Swap Memory limit:
+   * Currently, the swap memory limit is calculated by
+     `(containerMemoryRequest / nodeTotalMemory) * totalPodsSwapAvailable`
+   * An increase in nodeTotalMemory or totalPodsSwapAvailable will result in an updated swap memory limit for pods deployed post resize, and the same will be recalculated for existing pods.
+
+3. Resource managers will be re-initialized.
+
+4. Node allocatable capacity will be updated.
+
+5. Scheduler:
+   * The scheduler will automatically schedule any pending pods.
+   * This is the expected behavior and does not require any changes to the existing design of the scheduler, as the scheduler `watches` the
+     available capacity of the node and places pods accordingly.
+
+### Handling hotplug events
+
+Once the capacity of the node is altered, the following sequence of events occurs in the kubelet. If any errors are
+observed in any of the steps, the operation is retried from step 1 and a `FailedNodeResize` event is recorded on the node object.
+1. Resizing existing containers:
+   a. With the increased memory capacity of the node, the kubelet proceeds to update fields that are directly related to
+   the available memory on the host. This leads to the recalculation of oom_score_adj and swap limits.
+   b. This is achieved by invoking the CRI API `UpdateContainerResources`.
+
+2. Reinitialise resource managers:
+   a. Resource managers such as the CPU and memory managers are updated with the latest available capacities on the host. This posts the latest
+   available capacities under the node.
+   b. This is achieved by calling `ResyncComponents()` of the ContainerManager interface to re-sync the resource managers.
+3. Updating the node allocatable resources:
+   a. As the scheduler keeps tabs on the available resources of the node, once the available capacities are updated,
+   the scheduler proceeds to schedule any pending pods.
+
+#### Flow Control for updating swap limit for containers
+
+Formula to calculate the swap limit: `(containerMemoryRequest / nodeTotalMemory) * totalPodsSwapAvailable`
+```
+T=0: Node Resources:
+       - Memory: 6G
+       - Swap: 4G
+     Pod:
+       - container1
+         - MemoryRequest: 2G
+         - State: Running
+     Runtime:
+       - <container cgroup>/memory.swap.max: 1.33G
+
+T=1: Resize Instance to Hotplug Memory:
+       - Memory: 8G
+       - Swap: 4G
+     Pod:
+       - container1
+         - MemoryRequest: 2G
+         - State: Running
+     Runtime:
+       - <container cgroup>/memory.swap.max: 1G
+```
+
+A similar flow applies to updating oom_score_adj.
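+
+To make the recalculation concrete, below is a minimal, self-contained Go sketch of the two formulas above, evaluated with the same numbers as the flow control example. The helper names (`swapLimit`, `oomScoreAdj`) are illustrative only and are not actual kubelet symbols.
+
+```go
+package main
+
+import "fmt"
+
+// swapLimit computes (containerMemoryRequest / nodeTotalMemory) * totalPodsSwapAvailable.
+func swapLimit(containerMemoryRequest, nodeTotalMemory, totalPodsSwapAvailable int64) int64 {
+  if nodeTotalMemory == 0 {
+    return 0
+  }
+  return int64(float64(containerMemoryRequest) / float64(nodeTotalMemory) * float64(totalPodsSwapAvailable))
+}
+
+// oomScoreAdj computes 1000 - (1000*containerMemReq)/memoryCapacity.
+func oomScoreAdj(containerMemReq, memoryCapacity int64) int64 {
+  if memoryCapacity == 0 {
+    return 1000
+  }
+  return 1000 - (1000*containerMemReq)/memoryCapacity
+}
+
+func main() {
+  const gi = int64(1) << 30 // 1Gi in bytes
+  memReq, swapTotal := 2*gi, 4*gi
+
+  // T=0: 6Gi of node memory -> swap limit ~1.33Gi.
+  fmt.Printf("T=0: swap limit %.2fGi, oom_score_adj %d\n",
+    float64(swapLimit(memReq, 6*gi, swapTotal))/float64(gi), oomScoreAdj(memReq, 6*gi))
+
+  // T=1: memory hotplugged to 8Gi -> swap limit shrinks to 1Gi and oom_score_adj changes,
+  // which is why both values must be recomputed for containers that existed before the resize.
+  fmt.Printf("T=1: swap limit %.2fGi, oom_score_adj %d\n",
+    float64(swapLimit(memReq, 8*gi, swapTotal))/float64(gi), oomScoreAdj(memReq, 8*gi))
+}
+```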
+
+### Handling HotUnplug Events
+
+Though this KEP focuses only on resource hotplug, it will enable the kubelet to capture the current available capacity of the node (regardless of whether resources were hotplugged or hotunplugged).
+For now, we will introduce an error mode in the kubelet to inform users that the available resources have shrunk in the case of a hotunplug.
+
+As hot-unplug events are not fully handled in this KEP, it is imperative in such cases to move the node to the NotReady state when the current capacity of the node
+is less than the initial capacity of the node. This is only to indicate that the resources on the node have shrunk and may need attention/intervention.
+
+Once the node has transitioned to the NotReady state, it will revert to the Ready state once the node's capacity is reconfigured to match or exceed the last valid configuration.
+Here, valid configuration refers to either the previous hot-plug capacity, or the initial capacity in case there is no history of hotplug.
+
+#### Flow Control
+
+```
+T=0: Node initial Resources:
+       - Memory: 10G
+       - Pod: Memory
+
+T=1: Resize Instance to Hotplug Memory
+       - Current Memory: 10G
+       - Updated Memory: 15G
+       - Node state: Ready
+
+T=2: Resize Instance to HotUnplug Memory
+       - Current Memory: 15G
+       - Updated Memory: 5G
+       - Node state: NotReady
+
+T=3: Resize Instance to Hotplug Memory
+       - Current Memory: 5G
+       - Updated Memory: 15G
+       - Node state: Ready
+```
+
+A few of the concerns surrounding hotunplug are listed below:
+* Pod re-admission:
+  * Given that the current pod resource usage may exceed the available capacity of the node, it is necessary to check whether a pod can continue running
+    or has to be terminated due to the resource crunch.
+* Recalculate OOM adjust score and swap limits:
+  * Since the total capacity of the node has changed, values associated with the node's memory capacity must be recomputed.
+* Handling unplug of reserved CPUs.
+
+We intend to propose a separate KEP dedicated to hotunplug of resources to address these.
+
+**Proposed Code changes**
+
+**Pseudocode for Resource Hotplug**
+
+```go
+func (kl *Kubelet) syncLoopIteration(ctx context.Context, configCh <-chan kubetypes.PodUpdate, handler SyncHandler,
+syncCh <-chan time.Time, housekeepingCh <-chan time.Time, plegCh <-chan *pleg.PodLifecycleEvent) bool {
+  .
+  .
+  case machineInfo := <-kl.nodeResourceManager.MachineInfo():
+    // Resize the containers.
+    klog.InfoS("Resizing containers due to change in MachineInfo")
+    if err := resizeContainers(); err != nil {
+      klog.ErrorS(err, "Failed to resize containers with change in machine info")
+      kl.recorder.Eventf(kl.nodeRef, v1.EventTypeWarning, events.FailedNodeResize, err.Error())
+      break
+    }
+
+    // Resync the resource managers.
+    klog.InfoS("ResyncComponents resource managers because of change in MachineInfo")
+    if err := kl.containerManager.ResyncComponents(machineInfo); err != nil {
+      klog.ErrorS(err, "Failed to resync resource managers with machine info update")
+      kl.recorder.Eventf(kl.nodeRef, v1.EventTypeWarning, events.FailedNodeResize, err.Error())
+      break
+    }
+
+    // Update the cached MachineInfo.
+    kl.setCachedMachineInfo(machineInfo)
+  .
+  .
+}
+```
+
+**Changes to resource managers to adapt to hotplug of resources**
+
+1. Adding a `ResyncComponents()` method to the ContainerManager interface:
+```go
+  // Manages the containers running on a machine.
+  type ContainerManager interface {
+  .
+  .
+  // ResyncComponents will resync the resource managers like the cpu, memory and topology managers
+  // with the updated machineInfo
+  ResyncComponents(machineInfo *cadvisorapi.MachineInfo) error
+  .
+  .
+  }
+```
+
+2. Adding a `SyncMachineInfo` method to all the resource managers, which will be invoked whenever there is a resource hotplug:
+
+```go
+  // SyncMachineInfo will sync the Manager with the latest machine info
+  SyncMachineInfo(machineInfo *cadvisorapi.MachineInfo) error
+```
+
+### Test Plan
+
+[x] I/we understand the owners of the involved components may require updates to
+existing tests to make this code solid enough prior to committing the changes necessary
+to implement this enhancement.
+
+##### Unit tests
+
+1. Add necessary tests in kubelet_node_status_test.go to check the node status behaviour with a dynamic node scale up.
+2. Add necessary tests in kubelet_pods_test.go to check the pod cleanup and pod addition workflow.
+3. Add necessary tests in eventhandlers_test.go to check the scheduler behaviour with a dynamic node capacity change.
+4. Add necessary tests in the resource managers to check the managers' behaviour in adopting a dynamic node capacity change.
+5. Add necessary tests to validate the change in oom_score_adj and swap limit for containers post resize.
+
+##### e2e tests
+
+The following scenarios need to be covered:
+
+* Node resource information before and after resource hot plug for the following scenarios:
+  * upsize -> downsize
+  * upsize -> downsize -> upsize
+  * downsize -> upsize
+* State of pods pending due to lack of resources after a resource hot plug.
+* Resource manager states after the resync of components.
+
+### Graduation Criteria
+
+#### Phase 1: Alpha (target 1.34)
+
+* Feature is disabled by default. It is an opt-in feature which can be enabled via the `NodeResourceHotPlug`
+  feature gate.
+* Unit test coverage.
+* E2E tests.
+* Documentation mentioning the high level design.
+
+### Upgrade / Downgrade Strategy
+
+##### Upgrade
+
+To upgrade the cluster to use this feature, the kubelet should be updated to enable the feature gate.
+Existing clusters are not impacted, as the node resources have already been recorded during cluster creation.
+
+##### Downgrade
+
+It is always possible to trivially downgrade to the previous kubelet. This has no impact, other than that future node resource hot plugs won't be reflected in the cluster.
+
+### Version Skew Strategy
+
+Not relevant, as this is a kubelet-specific feature and does not impact other components.
+
+## Production Readiness Review Questionnaire
+
+### Feature Enablement and Rollback
+
+###### How can this feature be enabled / disabled in a live cluster?
+
+- [x] Feature gate (also fill in values in `kep.yaml`)
+  - Feature gate name: NodeResourceHotPlug
+  - Components depending on the feature gate: kubelet
+- [ ] Other
+  - Describe the mechanism:
+  - Will enabling / disabling the feature require downtime of the control
+    plane?
+  - Will enabling / disabling the feature require downtime or reprovisioning
+    of a node?
+
+###### Does enabling the feature change any default behavior?
+
+No. This feature is guarded by a feature gate. Existing default behavior does not change if the
+feature is not used.
+Even if the feature is enabled via the feature gate, if there is no change in
+node configuration the system will continue to work in the same way.
+
+###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?
+
+Yes. The feature can be disabled by restarting the kubelet with the feature gate off.
+Once disabled, any hot plug of resources won't be reflected at the cluster level.
+
+###### What happens if we reenable the feature if it was previously rolled back?
+
+To re-enable the feature, turn on the feature gate and restart the kubelet.
+With the feature re-enabled, node resources can be hot plugged again, and the cluster will automatically be updated
+with the new resource information. If there are any pods pending due to a lack of resources, they will transition into the
+Running state.
+
+###### Are there any tests for feature enablement/disablement?
+
+Yes, the tests will be added along with the alpha implementation.
+* Validate that resources hot plugged into the machine are reflected at the node resource level.
+* Validate that a hot plug of resources causes pending pods to transition into the Running state.
+* Validate that the resource managers are updated with the latest machine information after a hot plug of resources.
+
+### Rollout, Upgrade and Rollback Planning
+
+###### How can a rollout or rollback fail? Can it impact already running workloads?
+
+A rollout may fail if the resource managers are not re-synced properly due to programmatic errors.
+In case of rollout failures, running workloads are not affected; pods that are in the Pending state simply remain
+pending.
+A rollback failure should not affect running workloads.
+
+###### What specific metrics should inform a rollback?
+
+A significant increase in the `node_resize_resync_errors_total` metric means the feature is not working as expected.
+Likewise, if there are pending pods and resources have been hot plugged, but the `scheduler_pending_pods` metric still does not change,
+the feature is not working as expected.
+
+###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
+
+It will be tested manually as part of the implementation, and there will also be automated tests to cover these scenarios.
+
+###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
+
+No.
+
+### Monitoring Requirements
+
+Monitor the following metrics:
+- `node_resize_resync_request_total`
+- `node_resize_resync_errors_total`
+
+###### How can an operator determine if the feature is in use by workloads?
+
+This feature is built into the kubelet and guarded by a feature gate. Examining the kubelet feature gates helps
+determine whether the feature is in use. The enablement of the kubelet feature gate can be determined from the
+`kubernetes_feature_enabled` metric.
+
+In addition, the newly added metrics `node_resize_resync_request_total` and `node_resize_resync_errors_total` are incremented on a resource up-scale
+and on a failure to re-sync the resource managers, respectively.
+
+###### How can someone using this feature know that it is working for their instance?
+
+An end user can hot plug a resource and verify that the change is reflected at the node resource level.
+If any pods were pending prior to the resource hot plug, those pods should transition into Running with the addition
+of the new resources.
+
+###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?
+
+For each node, the value of the metric `node_resize_resync_request_total` is expected to match the number of times the node is resized.
+For each node, the value of the metric `node_resize_resync_errors_total` is expected to be zero.
+
+###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
+
+- [X] Metrics
+  - Metric name:
+    - `node_resize_resync_request_total`
+    - `node_resize_resync_errors_total`
+  - Components exposing the metric: kubelet
+
+###### Are there any missing metrics that would be useful to have to improve observability of this feature?
+
+The following metrics will be added for this feature:
+- `node_resize_resync_request_total`
+- `node_resize_resync_errors_total`
+
+### Dependencies
+
+###### Does this feature depend on any specific services running in the cluster?
+
+No, it does not depend on any service running in the cluster, but it depends on the cAdvisor package to fetch
+the machine resource information.
+
+### Scalability
+
+###### Will enabling / using this feature result in any new API calls?
+
+No, it won't add or modify any user-facing APIs.
+The resource managers might need to be updated with new methods to resync their components with the updated
+machine information.
+
+###### Will enabling / using this feature result in introducing new API types?
+
+No.
+
+###### Will enabling / using this feature result in any new calls to the cloud provider?
+
+No.
+
+###### Will enabling / using this feature result in increasing size or count of the existing API objects?
+
+No.
+
+###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
+
+Negligible. In the case of a resource hot plug, the resource managers may take some time to resync.
+
+###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
+
+Negligible computational overhead might be introduced into the kubelet, as it periodically needs to fetch machine information
+from the cAdvisor cache and resync all the resource managers with the updated machine information.
+
+###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?
+
+Yes, it could.
+Since the node's computational capacity is increased dynamically, more pods might be scheduled on the node.
+This is, however, mitigated by the maxPods kubelet configuration, which limits the number of pods on a node.
+
+### Troubleshooting
+
+###### How does this feature react if the API server and/or etcd is unavailable?
+
+This feature is node local and mainly handled in the kubelet; it has no dependency on etcd.
+If there are pending pods and resources are hot plugged, the scheduler relies on the API server to fetch node information.
+Without access to the API server, it cannot make scheduling decisions because the node resources are not updated, and the pending pods would remain in the same condition.
+
+###### What are other known failure modes?
+
+This feature mainly does two things: fetch machine information from cAdvisor and reinitialize the resource managers.
+Failure scenarios can occur at the cAdvisor level, i.e. if its cache is wrongly updated with incorrect machine information.
+
+###### What steps should be taken if SLOs are not being met to determine the problem?
+
+If the SLOs are not being met, one can examine the kubelet logs; it is also advised not to hotplug node resources in the meantime.
+
+## Implementation History
+
+## Drawbacks
+
+Currently, this KEP only focuses on resource hotplug; however, in a case where the node is downsized, it is possible that the
+node's capacity may be lower than the existing workloads' memory requirements.
+
+## Alternatives
+
+* Horizontally scale the cluster by incorporating additional compute nodes.
+* Use fake placeholder resources that are available but not enabled (e.g., balloon drivers) + + + +## Infrastructure Needed (Optional) +VMs of cluster should support hot plug of compute resources for e2e tests. + +## Future Work + +* Support hot-unplug of node resources: + * Pod re-admission: + * Given that there is probability that the current Pod resource usage may exceed the available capacity of node, its necessary to check if the pod can continue Running + or if it has to be terminated due to resource crunch. + * Recalculate OOM adjust score and Swap limits: + * Since the total capacity of the node has changed, values associated with the nodes memory capacity must be recomputed. + * Handling unplug of reserved CPUs. + +* Fetching machine info via CRI + * At present, the machine data is retrieved from cAdvisor's cache through periodic checks. There is ongoing development to utilize CRI APIs for this purpose. + * Presently, resource managers are updated through regular polling. Once the CRI APIs are enhanced to fetch machine information, we can significantly enhance the reinitialization of resource managers, + enabling them to respond more effectively to resize events. + +* Knobs to alter Kube and System reserved + * Currently, these values are calculated and set by individual cloud providers or vendors. + * This can be further explored to enable options to set the kube and system reserved capacities as tunables. \ No newline at end of file diff --git a/keps/sig-node/3953-node-resource-hot-plug/kep.yaml b/keps/sig-node/3953-node-resource-hot-plug/kep.yaml new file mode 100644 index 00000000000..89444d4672c --- /dev/null +++ b/keps/sig-node/3953-node-resource-hot-plug/kep.yaml @@ -0,0 +1,51 @@ +title: Node Resource Hot Plug +kep-number: 3953 +authors: + - "@Karthik-K-N" + - "@mkumatag" + - "@kishen-v" +owning-sig: sig-node +participating-sigs: + - sig-node +status: provisional +creation-date: 2023-10-04 +reviewers: + - "@smarterclayton" + - "@ffromani" + - "@SergeyKanzhelev" + - "@haircommander" + - "@tallclair" +approvers: + - "@haircommander" + - "@SergeyKanzhelev" + - "@ffromani" + - "@mrunalp" +see-also: +replaces: + +# The target maturity stage in the current dev cycle for this KEP. +stage: "alpha" + +# The most recent milestone for which work toward delivery of this KEP has been +# done. This can be the current (upcoming) milestone, if it is being actively +# worked on. +latest-milestone: "v1.34" + +# The milestone at which this feature was, or is targeted to be, at each stage. +milestone: + alpha: "" + beta: "" + stable: "" + +# The following PRR answers are required at alpha release +# List the feature gate name and the components for which it must be enabled +feature-gates: + - name: NodeResourceHotPlug + components: + - kubelet +disable-supported: true + +# The following PRR answers are required at beta release +metrics: + - node_resize_resync_request_total + - node_resize_resync_errors_total