Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[evented pleg]: using real-time container events for pod state determination #129355

Open
wants to merge 7 commits into
base: master
Choose a base branch
from

Conversation

HirazawaUi
Copy link
Contributor

@HirazawaUi HirazawaUi commented Dec 21, 2024

What type of PR is this?

/kind bug

Which issue(s) this PR fixes:

Fixes #124704 #121003

Special notes for your reviewer:

The purpose of this PR is to resolve #124704

I have carefully introduced the causes of this problem and various solutions we've attempted to resolve it in https://docs.google.com/document/d/1TPrY56q9MNW8r1FuzKDFkBBhOjQ0hqi7wJAbIP1O-4g/


This PR proposes that during its execution cycle, podWorkerLoop should directly use the latest events reported by the container runtime as the current Pod state. If we can confirm that the events reported by the container runtime are reliable and up-to-date, we can stop relying on timestamps to determine whether the state is current. In both Generic PLEG and the current Evented PLEG, if the container state does not change, the timestamps in the cache are still updated unnecessarily (every 5 seconds and 1 second, respectively). This shows that timestamp-based checks are meaningless when the container state remains unchanged

The current container lifecycle code is not yet ready to directly accept the states reported by the container runtime. To address this, the evented PLEG needs to filter out sandbox creation and deletion events, as the container runtime reports the same events for both sandbox containers and regular containers, and kubelet has never handled these issues. Consequently, an additional condition has been added to ensure that sandbox container creation events do not trigger lifecycle processing before regular container creation, and sandbox container deletion events do not trigger lifecycle processing after regular container deletion. However, the cache will still be updated to reflect the latest state.

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

- [KEP]: https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/3386-kubelet-evented-pleg/README.md

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-kind Indicates a PR lacks a `kind/foo` label and requires one. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Dec 21, 2024
@k8s-ci-robot
Copy link
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the needs-priority Indicates a PR lacks a `priority/foo` label and requires one. label Dec 21, 2024
@k8s-ci-robot k8s-ci-robot added area/kubelet sig/node Categorizes an issue or PR as relevant to SIG Node. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Dec 21, 2024
@HirazawaUi
Copy link
Contributor Author

/test pull-kubernetes-node-crio-cgrpv1-evented-pleg-e2e
/test pull-kubernetes-node-crio-cgrpv1-evented-pleg-e2e-kubetest2

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Dec 21, 2024
@HirazawaUi
Copy link
Contributor Author

/test pull-kubernetes-node-crio-cgrpv1-evented-pleg-e2e
/test pull-kubernetes-node-crio-cgrpv1-evented-pleg-e2e-kubetest2

@HirazawaUi
Copy link
Contributor Author

/test pull-kubernetes-node-crio-cgrpv1-evented-pleg-e2e

@HirazawaUi
Copy link
Contributor Author

/test pull-kubernetes-node-crio-cgrpv1-evented-pleg-e2e
/test pull-kubernetes-node-crio-cgrpv1-evented-pleg-e2e-kubetest2

@HirazawaUi
Copy link
Contributor Author

/test pull-kubernetes-node-crio-cgrpv1-evented-pleg-e2e
/test pull-kubernetes-node-crio-cgrpv1-evented-pleg-e2e-kubetest2

@HirazawaUi
Copy link
Contributor Author

/test pull-kubernetes-node-crio-cgrpv1-evented-pleg-e2e
/test pull-kubernetes-node-crio-cgrpv1-evented-pleg-e2e-kubetest2

@HirazawaUi
Copy link
Contributor Author

/test pull-kubernetes-node-crio-cgrpv1-evented-pleg-e2e

@k8s-ci-robot k8s-ci-robot added area/test sig/testing Categorizes an issue or PR as relevant to SIG Testing. labels Dec 27, 2024
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Feb 26, 2025
@HirazawaUi
Copy link
Contributor Author

/test pull-kubernetes-node-crio-cgrpv1-evented-pleg-e2e
/test pull-kubernetes-node-crio-cgrpv1-evented-pleg-e2e-kubetest2
/test pull-kubernetes-e2e-kind-evented-pleg

@harche
Copy link
Contributor

harche commented Feb 26, 2025

@HirazawaUi #129355 (review)

that comment would be applicable everywhere except in evented.go file. We are taking a very cautious approach to make sure the changes remain isolated from existing generic pleg.

@HirazawaUi
Copy link
Contributor Author

@HirazawaUi #129355 (review)

that comment would be applicable everywhere except in evented.go file. We are taking a very cautious approach to make sure the changes remain isolated from existing generic pleg.

Thanks for the reminder!

@HirazawaUi
Copy link
Contributor Author

/test pull-kubernetes-node-crio-cgrpv1-evented-pleg-e2e
/test pull-kubernetes-node-crio-cgrpv1-evented-pleg-e2e-kubetest2
/test pull-kubernetes-e2e-kind-evented-pleg

@HirazawaUi
Copy link
Contributor Author

@harche @haircommander Will we be able to move forward and merge this PR within the v1.33 cycle? IMO, if we can merge it earlier, it will allow users and developers to validate the reliability of this feature sooner, and we can also advance to the beta stage more quickly :)

@harche
Copy link
Contributor

harche commented Mar 10, 2025

@harche @haircommander Will we be able to move forward and merge this PR within the v1.33 cycle? IMO, if we can merge it earlier, it will allow users and developers to validate the reliability of this feature sooner, and we can also advance to the beta stage more quickly :)

@HirazawaUi I am trying to test these changes in CRIO CI, cri-o/cri-o#9053, but looks like I might be goofing up in setting it up correctly. Looking into it.

@HirazawaUi
Copy link
Contributor Author

@HirazawaUi I am trying to test these changes in CRIO CI, cri-o/cri-o#9053, but looks like I might be goofing up in setting it up correctly. Looking into it.

Thanks!

@harche
Copy link
Contributor

harche commented Mar 11, 2025

/test pull-kubernetes-node-crio-cgrpv1-evented-pleg-e2e
/test pull-kubernetes-node-crio-cgrpv1-evented-pleg-e2e-kubetest2
/test pull-kubernetes-e2e-kind-evented-pleg

1 similar comment
@harche
Copy link
Contributor

harche commented Mar 11, 2025

/test pull-kubernetes-node-crio-cgrpv1-evented-pleg-e2e
/test pull-kubernetes-node-crio-cgrpv1-evented-pleg-e2e-kubetest2
/test pull-kubernetes-e2e-kind-evented-pleg

@harche
Copy link
Contributor

harche commented Mar 11, 2025

@HirazawaUi: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-kubernetes-node-crio-cgrpv1-evented-pleg-e2e 16b9490 link false /test pull-kubernetes-node-crio-cgrpv1-evented-pleg-e2e
pull-kubernetes-e2e-kind-evented-pleg 16b9490 link false /test pull-kubernetes-e2e-kind-evented-pleg
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

/hold

@HirazawaUi
Copy link
Contributor Author

/test pull-kubernetes-node-crio-cgrpv1-evented-pleg-e2e
/test pull-kubernetes-node-crio-cgrpv1-evented-pleg-e2e-kubetest2
/test pull-kubernetes-e2e-kind-evented-pleg

@HirazawaUi
Copy link
Contributor Author

It seems that #130599 made some adjustments to features related to pleg, resulting in the removal of import records. Consequently, after merging the main branch code, this branch also lost those import records, causing compilation failures.

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Mar 12, 2025
@HirazawaUi
Copy link
Contributor Author

The latest submitted code removes unrelated code formatting changes (such as whitespace deletions) and modifies the duration for determining whether a container has started into a constant.

/test pull-kubernetes-node-crio-cgrpv1-evented-pleg-e2e
/test pull-kubernetes-node-crio-cgrpv1-evented-pleg-e2e-kubetest2
/test pull-kubernetes-e2e-kind-evented-pleg

@HirazawaUi
Copy link
Contributor Author

/retest

@HirazawaUi
Copy link
Contributor Author

/hold cancel
Since the #129355 (comment) have been addressed.

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 18, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/kubelet area/test cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. release-note-none Denotes a PR that doesn't merit a release note. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

Pod phase transition is slower when EventedPLEG is enabled
10 participants