
KEP-4979: Evented desired state of world populator in kubelet volume manager

Release Signoff Checklist

Items marked with (R) are required prior to targeting to a milestone / release.

  • (R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
  • (R) KEP approvers have approved the KEP status as implementable
  • (R) Design details are appropriately documented
  • (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
    • e2e Tests for all Beta API Operations (endpoints)
    • (R) Ensure GA e2e tests meet requirements for Conformance Tests
    • (R) Minimum Two Week Window for GA e2e tests to prove flake free
  • (R) Graduation criteria is in place
  • (R) Production readiness review completed
  • (R) Production readiness review approved
  • "Implementation History" section is up-to-date for milestone
  • User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
  • Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

Summary

This KEP proposes optimizing the loop iteration period (currently fixed at 100ms) of the Desired State of the World Populator (DSWP) in the kubelet volume manager. The enhancement involves dynamically increasing the sleep period when no changes are detected and reacting to state events from the pod manager and pod worker channels.

Motivation

In the volume manager, the Desired State of the World Populator executes a populator loop every 100ms, regardless of whether any changes have occurred. This fixed frequency may result in unnecessary CPU cycles during idle periods and also increases the waiting period during the pod sync loop iteration. By adopting an event-based approach, the kubelet can respond precisely when changes occur, improving performance and reducing system overhead.

The diagram below illustrates how a kubelet sync loop iteration works, with a focus on Volume Manager behavior: Sync pod process

On the other hand, the Unmount process follows this flow: Sync terminating process

Goals

  1. Reduce the waiting period during the sync loop iteration so that pods can start and be deleted more quickly.
  2. Dynamically adjust the populator loop interval based on system activity.
  3. Respond promptly to events, ensuring an up-to-date DSWP cache.
  4. Maintain existing functionality as a fallback to ensure reliability.

Non-Goals

  1. Completely remove the batch loop period.
  2. Change the existing DSWP logic.

Proposal

The Desired State of the World Populator will listen to the pod manager and pod worker channels. Every change made by the pod manager (add and update actions) and the pod worker (completeTerminating action) will trigger the populator loop immediately. During periods of inactivity, the populator loop interval will increase in 100ms increments after the third execution, up to a maximum of 1 second. If an event is detected, the interval resets to the default 100ms. This approach ensures responsiveness while reducing CPU usage, as sketched below.
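A minimal sketch of such a loop, assuming a hypothetical podStateChanges channel fed by the pod manager and pod worker; the names and plumbing here are illustrative, not the actual kubelet implementation:

```go
package populator

import "time"

const (
	basePeriod = 100 * time.Millisecond // default DSWP loop period
	maxPeriod  = time.Second            // backoff cap during inactivity
)

// runLoop triggers populate() immediately on pod events and falls back to
// periodic batch runs whose interval backs off while the system is idle.
func runLoop(podStateChanges <-chan struct{}, populate func(), stopCh <-chan struct{}) {
	period := basePeriod
	idleRuns := 0
	timer := time.NewTimer(period)
	defer timer.Stop()

	for {
		select {
		case <-stopCh:
			return
		case <-podStateChanges:
			// An event arrived: run immediately and reset the backoff.
			populate()
			period = basePeriod
			idleRuns = 0
			// Stop and drain the unfired timer before reusing it.
			if !timer.Stop() {
				select {
				case <-timer.C:
				default:
				}
			}
		case <-timer.C:
			// Fallback batch run, preserved for reliability.
			populate()
			idleRuns++
			// After the third idle run, back off by 100ms per iteration.
			if idleRuns >= 3 && period < maxPeriod {
				period += 100 * time.Millisecond
			}
		}
		timer.Reset(period)
	}
}
```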

Risks and Mitigations

Since the event is emitted by the kubelet (pod manager/pod worker) for the kubelet (DSWP), the risk of losing an event is minimal. Even if an event were lost, the periodic batch loop is kept as a fallback and would pick up the change on its next iteration.

Design Details

Trigger the existing DSWP implementation using a channel provided by the Pod Manager and pod worker. The Pod Manager acts as the source of truth for the DSWP, and its channel carries a notification for every change the Pod Manager makes.

On the Pod Manager side, these functions will emit an event on the state channel whenever there is a change in its state.

kubernetes/pkg/kubelet/pod/pod_manager.go

```go
type Manager interface {
	// ...
	SetPods(pods []*v1.Pod)
	AddPod(pod *v1.Pod)
	UpdatePod(pod *v1.Pod)
	RemovePod(pod *v1.Pod) // Unmount is triggered on the pod worker side
	// ...
}
```
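As an illustration, a non-blocking send on a small buffered channel would let these functions signal the DSWP without ever blocking the Pod Manager; the type, field, and helper below are hypothetical, and the pod worker side would use the same pattern:

```go
package pod

import v1 "k8s.io/api/core/v1"

// Hypothetical sketch, not the actual kubelet code: the pod manager owns a
// coalescing state channel, created with make(chan struct{}, 1). A buffer of
// one is enough because the DSWP only needs to know that "something
// changed", not how many times.
type basicManager struct {
	// ... existing fields elided ...
	podStateChanges chan struct{} // consumed by the DSWP loop
}

// notifyStateChange signals the DSWP without blocking; if a signal is
// already pending, the new one is coalesced into it.
func (pm *basicManager) notifyStateChange() {
	select {
	case pm.podStateChanges <- struct{}{}:
	default:
	}
}

func (pm *basicManager) AddPod(pod *v1.Pod) {
	// ... existing AddPod bookkeeping elided ...
	pm.notifyStateChange()
}
```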

On the pod worker side, this function will emit an event on the state channel whenever there is a change in its state.

kubernetes/pkg/kubelet/pod_workers.go

```go
func (p *podWorkers) completeTerminating(podUID types.UID) {
	// ...
}
```

After the third execution, gradually increase the sleep period by 100ms on each iteration, up to a 1 second maximum; waiting until the third run avoids impacting the existing retry logic. If any event is detected, reset the interval back to the initial value (100ms).
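As a worked example with no events arriving: runs 1 through 3 each wait the default 100ms; run 4 waits 200ms, run 5 waits 300ms, and so on until run 12 reaches the 1 second cap, where the interval stays until the next pod event resets it to 100ms.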

The new diagram reflects the changes after enabling the feature: Sync pod process

On the Unmount side: Terminating pod process

Unit tests
  • Unit tests for the dynamic sleep period
  • Unit tests for the sleep period increase and reset
Integration tests
  1. Verify the desired state of the world cache is updated correctly when the Pod manager/pod worker events are received.
  2. Generate a large number of pod manager events within a short period of time and check if the desired state of the world loop is triggered correctly within a short period.
e2e tests
  • The existing node e2e tests and integration tests for DSWP must pass. All validation tests are designed and implemented during the integration test phase.

Graduation Criteria

Alpha

  • Feature implemented behind a feature flag
  • Existing node e2e tests and integration tests around DSWP must pass

Beta

  • Add integration tests

GA

  • Allowing time for feedback
  • Wait two releases before going to GA

Deprecation

N/A

Since the batch mode will coexist with the event mode, no deprecation is needed.

Upgrade / Downgrade Strategy

N/A

Version Skew Strategy

N/A.

Since this feature alters only the way the kubelet determines the DSWP sleep period, version skew is not a concern.

Production Readiness Review Questionnaire

Feature Enablement and Rollback

How can this feature be enabled / disabled in a live cluster?
  • Feature gate (also fill in values in kep.yaml)
    • Feature gate name: EventedDesiredStateOfWorldPopulator
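For illustration, the gate is toggled through the kubelet's standard feature-gate mechanism, for example in the KubeletConfiguration file:

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  EventedDesiredStateOfWorldPopulator: true
```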
Does enabling the feature change any default behavior?

This feature does not introduce any user-facing changes, although users may notice improved kubelet performance.

Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

Yes, the kubelet needs to be restarted to disable this feature.

What happens if we reenable the feature if it was previously rolled back?

If reenabled, the kubelet will again start adjusting the DSWP sleep period based on pod manager/pod worker events. Every time this feature is enabled or disabled, the kubelet will need to be restarted.

Are there any tests for feature enablement/disablement?

Current unit tests run without enabling/disabling the feature gate; for integration and e2e testing (beta graduation), the feature gate will need to be enabled.

Rollout, Upgrade and Rollback Planning

How can a rollout or rollback fail? Can it impact already running workloads?

This feature relies on a channel provided by the pod manager/pod worker to dynamically adjust the DSWP sleep period, so no external component (CRI, for example) is involved at this stage.

Failures during rollout or rollback are unlikely to impact already running workloads, as the core functionality of the DSWP remains unchanged, and the system defaults to the original polling behavior.

What specific metrics should inform a rollback?

N/A.

Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

Yes, I tested this feature locally using ./hack/local-up-cluster.sh.

Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

No.

Monitoring Requirements

How can an operator determine if the feature is in use by workloads?

Whenever a pod changes (is added, updated, or removed), the kubelet metric evented_pod_manager_update_count is incremented.

How can someone using this feature know that it is working for their instance?

Observe the pod_manager_update_count metric.
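For example, assuming API server proxy access to the node, the metric can be read straight from the kubelet's metrics endpoint:

```
kubectl get --raw "/api/v1/nodes/<node-name>/proxy/metrics" | grep pod_manager_update_count
```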

What are the reasonable SLOs (Service Level Objectives) for the enhancement?

The DSWP runs immediately, or at most 100ms after, the desired state of the pod (as tracked by the pod manager/pod worker) has changed.

What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
  • Metrics
    • Metric name: pod_manager_update_count
    • Components exposing the metric: kubelet
Are there any missing metrics that would be useful to have to improve observability of this feature?
  • Metrics
    • Metric name: evented_dswp_process_event_processing_delay
    • Metric description: the delay between the event emitted by the pod manager/pod worker and the time the DSWP actually executes.
    • Components exposing the metric: kubelet

Dependencies

N/A.

Does this feature depend on any specific services running in the cluster?

No.

Scalability

Will enabling / using this feature result in any new API calls?

No.

Will enabling / using this feature result in introducing new API types?

No.

Will enabling / using this feature result in any new calls to the cloud provider?

No.

Will enabling / using this feature result in increasing size or count of the existing API objects?

No.

Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?

No.

Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?

No.

Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?

No.

Troubleshooting

How does this feature react if the API server and/or etcd is unavailable?

The feature does not depend directly on the API server/etcd, but on the pod manager/pod worker (kubelet) behavior.

What are other known failure modes?

N/A

What steps should be taken if SLOs are not being met to determine the problem?

Implementation History

Drawbacks

Alternatives

  1. Proposal 1: kubernetes/kubernetes#126450: the PR allows users to customize or override the loop period configuration using the kubelet configuration file.

Reason/suggestion (SIG Node): move to an event-based approach: kubernetes/kubernetes#126049 (comment)

  2. Proposal 2: kubernetes/kubernetes#126668: this proposal increases the timer without the event-based approach. If a change is detected, the function resets the sleep period. However, this PR will likely be closed since changes are detected too late.

  3. Proposal 3: React based on CRI events: container creation (the CRI event) does not precede volume mounting, so the event would arrive too late to trigger mounts.

Infrastructure Needed (Optional)