- Release Signoff Checklist
- Summary
- Motivation
- Proposal
- Design Details
  - Unit tests
  - Integration tests
  - e2e tests
- Production Readiness Review Questionnaire
- Implementation History
- Drawbacks
- Alternatives
- Infrastructure Needed (Optional)
Items marked with (R) are required prior to targeting to a milestone / release.
- (R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
- (R) KEP approvers have approved the KEP status as `implementable`
- (R) Design details are appropriately documented
- (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
  - e2e Tests for all Beta API Operations (endpoints)
  - (R) Ensure GA e2e tests meet requirements for Conformance Tests
  - (R) Minimum Two Week Window for GA e2e tests to prove flake free
- (R) Graduation criteria is in place
  - (R) all GA Endpoints must be hit by Conformance Tests
- (R) Production readiness review completed
- (R) Production readiness review approved
- "Implementation History" section is up-to-date for milestone
- User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
- Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
This KEP proposes optimizing the loop iteration period (currently fixed at 100ms) of the Desired State of the World Populator (DSWP) in the kubelet volume manager. The enhancement dynamically increases the sleep period when no changes are detected, and reacts to state changes signaled over the pod manager and pod worker channels.
In the volume manager, the Desired State of the World Populator executes a populator loop every 100ms, regardless of whether any changes have occurred. This fixed frequency wastes CPU cycles during idle periods and also increases the waiting period during the pod sync loop iteration. By adopting an event-based approach, the kubelet can respond precisely when changes occur, improving performance and reducing system overhead.
The diagram below illustrates how a kubelet sync loop iteration works, with a focus on Volume Manager behavior:
On the other hand, the Unmount process follows this flow:
- Reducing the waiting period during the sync loop iteration allows pods to start and delete more quickly.
- Dynamically adjust the populator loop interval based on system activity.
- Respond promptly to events, ensuring up-to-date DSWP cache.
- Maintain existing functionality as a fallback to ensure reliability.
- Completely remove the batch loop period.
- Change the existing DSWP logic.
The Desired State of the World Populator will listen on the pod manager and pod worker channels. Every change made by the pod manager (add and update actions) or the pod worker (the completeTerminating action) will trigger the populator loop immediately. During periods of inactivity, the populator loop interval will increase in 100ms increments after the third execution, up to a maximum of 1 second. If an event is detected, the interval resets to the default 100ms. This approach ensures responsiveness while reducing CPU usage.
Since the events are emitted by the kubelet (pod manager/pod worker) for the kubelet (DSWP), the risk of losing an event is minimal.
Trigger the existing DSWP implementation using a channel provided by the Pod Manager and the pod worker. The Pod Manager acts as the source of truth for the DSWP, and its channel carries every change it makes.
On the Pod Manager side, these functions will emit an event on the state channel whenever there is a change in its state.
kubernetes/pkg/kubelet/pod/pod_manager.go
type Manager interface {
    // ...
    SetPods(pods []*v1.Pod)
    AddPod(pod *v1.Pod)
    UpdatePod(pod *v1.Pod)
    RemovePod(pod *v1.Pod) // Unmount is triggered on the pod worker side
}
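A minimal sketch of how these functions could signal the DSWP; the `stateCh` field and `notifyStateChange` helper below are illustrative assumptions, not the actual implementation. A buffered channel with a non-blocking send coalesces bursts of updates so the pod manager never blocks:

```go
package pod

import (
    v1 "k8s.io/api/core/v1"
)

// Sketch only: stateCh and notifyStateChange are hypothetical names.
type basicManager struct {
    // ... existing pod manager fields ...
    stateCh chan struct{} // created with make(chan struct{}, 1)
}

// notifyStateChange wakes the DSWP without blocking the pod manager.
// If a wakeup is already pending, the duplicate is dropped (coalesced).
func (pm *basicManager) notifyStateChange() {
    select {
    case pm.stateCh <- struct{}{}:
    default:
    }
}

func (pm *basicManager) AddPod(pod *v1.Pod) {
    // ... existing AddPod logic ...
    pm.notifyStateChange()
}
```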
On the pod worker side, this function will emit an event on the state channel whenever there is a change in its state.
kubernetes/pkg/kubelet/pod_workers.go
func (p *podWorkers) completeTerminating(podUID types.UID) {
    // ...
}
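The pod worker side would follow the same pattern; the channel send shown here is an assumption about where the hook would go, not the actual code:

```go
func (p *podWorkers) completeTerminating(podUID types.UID) {
    // ... existing cleanup for the terminated pod ...

    // Hypothetical hook: wake the DSWP so unmounting starts promptly.
    select {
    case p.stateCh <- struct{}{}:
    default: // a wakeup is already pending
    }
}
```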
After the third execution (so as not to impact the existing retry logic), gradually increase the sleep period by 100ms on each iteration, up to a 1 second maximum. If any event is detected, reset the interval back to the initial value (100ms).
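A sketch of the dynamic interval described above, assuming a single merged event channel from the pod manager and pod worker; the real populator loop structure may differ:

```go
package populator

import "time"

const (
    basePeriod = 100 * time.Millisecond // default DSWP loop period
    maxPeriod  = 1 * time.Second        // upper bound for the idle backoff
)

// run is an illustrative adaptive populator loop. events is assumed to be
// the merged pod manager/pod worker channel; populatorLoop stands in for
// the existing DSWP iteration, which is unchanged.
func run(populatorLoop func(), events, stopCh <-chan struct{}) {
    period := basePeriod
    idleIterations := 0
    for {
        populatorLoop()

        select {
        case <-events:
            // State changed: run again promptly and reset the backoff.
            period = basePeriod
            idleIterations = 0
        case <-time.After(period):
            // Idle: after the third execution, grow the interval by
            // 100ms per iteration, capped at 1 second.
            idleIterations++
            if idleIterations >= 3 && period < maxPeriod {
                period += 100 * time.Millisecond
            }
        case <-stopCh:
            return
        }
    }
}
```

Resetting on every event keeps the worst-case 100ms latency confined to truly idle periods, which is what allows the backoff without slowing pod startup.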
The new diagram reflects the changes after enabling the feature:
- Unit tests for the dynamic sleep period
- Unit tests for the sleep period increase logic
- Verify the desired state of the world cache is updated correctly when pod manager/pod worker events are received.
- Generate a large number of pod manager events within a short window and verify that the desired state of the world populator loop is triggered promptly.
- The existing node e2e tests and integration tests for the DSWP must pass. All validation tests are designed and implemented during the integration test phase.
- Feature implemented behind a feature flag
- Existing node e2e tests and integration tests around the DSWP must pass
- Add integration tests
- Allowing time for feedback
- Wait two releases before going to GA
N/A
Since the batch mode will coexist with the event mode, no deprecation is needed.
N/A
N/A.
Since this feature only changes how the kubelet determines the DSWP sleep period, this section is not relevant to this feature.
- Feature gate (also fill in values in `kep.yaml`)
  - Feature gate name: EventedDesiredStateOfWorldPopulator
This feature does not introduce any user-facing changes, although users may notice improved kubelet performance.
Yes, the kubelet needs to be restarted to disable this feature.
If re-enabled, the kubelet will again start adjusting the DSWP sleep period based on pod manager/pod worker events. Every time this feature is enabled or disabled, the kubelet must be restarted.
Current unit tests run without enabling/disabling the feature gate; for integration and e2e testing (beta graduation), the feature gate will need to be enabled.
This feature relies on a channel provided by the pod manager/pod worker to dynamically adjust the DSWP sleep period, so no external component (CRI, for example) is involved at this stage.
Failures during rollout or rollback are unlikely to impact already running workloads, as the core functionality of the DSWP remains unchanged, and the system defaults to the original polling behavior.
N/A.
Yes, I tested this feature locally using `./hack/local-up-cluster.sh`.
Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
No.
Whenever a pod is updated (added, updated, or removed), the kubelet metric `evented_pod_manager_update_count` is increased consistently.
Observe the `pod_manager_update_count` metric.
The DSWP runs immediately, or at most 100ms after the desired state of the pod (pod manager/pod worker) changes.
What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
- Metrics
- Metric name: pod_manager_update_count
- Components exposing the metric: kubelet
Are there any missing metrics that would be useful to have to improve observability of this feature?
- Metrics
- Metric name: evented_dswp_process_event_processing_delay
- Metric description: the delay between the event emitted by the pod manager/pod worker and the time the DSWP actually executes.
- Components exposing the metric: kubelet
N/A.
No.
No.
No.
No.
No.
Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
No.
Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
No.
Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?
No.
The feature does not depend directly on the API server/etcd, but on pod manager/pod worker (kubelet) behavior.
N/A
- Proposal 1: kubernetes/kubernetes#126450: this PR allows users to customize or override the loop period configuration using the kubelet configuration file. Reason/suggestion (SIG Node): move to an event-based approach: kubernetes/kubernetes#126049 (comment)
- Proposal 2: kubernetes/kubernetes#126668: this proposal increases the timer without the event-based approach; if a change is detected, the function resets the sleep period. This PR will likely be closed since changes are detected late.
- Proposal 3: react based on CRI events: container creation (a CRI event) does not precede volume mounting, so a CRI event would arrive too late to trigger mounts.