KEP-5825: update CRI List Streaming for beta graduation in v1.38 by bitoku · Pull Request #6146 · kubernetes/enhancements

bitoku · 2026-06-03T14:51:18Z

Assisted-by: Claude

One-line PR description: Update KEP-5825 for beta graduation

Issue link: CRI List Streaming #5825

Other comments:
- k/k change in Alpha
  - [KEP-5825] cri-api: Add streaming RPCs for CRI list operations kubernetes#136987

Update KEP metadata, PRR answers, and metrics strategy. Replace proposed streaming-specific metrics with existing kubelet CRI metrics that cover streaming transparently.

k8s-ci-robot · 2026-06-03T14:51:27Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: bitoku
Once this PR has been reviewed and has the lgtm label, please assign deads2k, mrunalp for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.


Needs approval from an approver in each of these files:

keps/prod-readiness/OWNERS
keps/sig-node/OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

kannon92 · 2026-06-03T15:16:45Z

+An unexpected increase in list operation latency after enabling the feature
+gate may indicate streaming overhead.

 ###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?


Can you do something like https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/3085-pod-conditions-for-starting-completition-of-sandbox-creation#were-upgrade-and-rollback-tested-was-the-upgrade-downgrade-upgrade-path-tested?

We want to confirm that you are testing upgrades of this feature even if manually.

kannon92 · 2026-06-03T15:21:07Z

+- `kubelet_runtime_operations_duration_seconds` for list operations stable: no
+  latency regression from streaming.

 ###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?


We have https://github.com/kubernetes/community/blob/main/sig-scalability/slos/pod_startup_latency.md#pod-startup-latency-slislo-details but I'm not sure if CRI list will impact pod startup?

I guess it shouldn't, but maybe we should include that as an SLO, since increased pod startup time would be concerning operationally.

kannon92 · 2026-06-03T15:23:53Z

@@ -369,7 +369,8 @@ kubelet does not have integration tests.
 - Feature gate enabled by default
 - Both containerd and CRI-O implement streaming RPCs


Do we have containerd support for this?

Not yet, the PR is pending.
containerd/containerd#13187

So the requirement for beta is to have both implement this? Should we push this feature to 1.38?

I'm not as clear on CRI support for containerd skew. Even if this merges will we have a production release in time for 1.37?

> So the requirement for beta is to have both implement this?

Yes.

> Should we push this feature to 1.38?
> I'm not as clear on CRI support for containerd skew. Even if this merges will we have a production release in time for 1.37?

I was thinking we could merge the KEP while we're waiting for the PR to be merged.
However, as you said, the containerd 2.4 release is scheduled for August, which is after the 1.37 Code Freeze (end of July). We can't write tests against the new containerd version before then, right?

I guess we may want to push this to 1.38.

ref: https://containerd.io/releases/#current-state-of-containerd-releases

cc @haircommander @samuelkarp

If the PR is staged for a release candidate before test freeze, then I think it's acceptable. We could also defer if needed; I just think this feature is gaining in importance as we look at extending the number of pods per node. cc @SergeyKanzhelev

I disagree. We won't be able to have testing for containerd before code freeze. There is also no test freeze anymore.

So we would effectively be going to beta without testing containerd in CI.

I think we need to wait for containerd support. It is an important feature, but not urgent. I am a bit sad we didn't merge the PR into containerd in time for 2.3.

Sure, thank you all for taking a look. I'll keep the PR open to discuss the design further, but I'll update the KEP issue to say it will be delayed to 1.38.

kannon92 · 2026-06-03T15:28:22Z

+- Mid-stream failures already propagate as errors to the instrumented
+  layer and are captured by `kubelet_runtime_operations_errors_total`.
+- Fallback to unary is cached after the first occurrence (via
+  `atomic.Bool`) and only happens a few times at startup. Kubelet logs
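
The two behaviors in the quoted KEP text (mid-stream failures surfacing as ordinary errors, and the unary fallback being cached in an `atomic.Bool`) can be sketched roughly as follows. This is a minimal illustration, not the cri-client code: `runtimeService`, `fakeRuntime`, `errUnimplemented`, and the `StreamPods`/`ListPods` names are hypothetical stand-ins, and a real client would check the gRPC `Unimplemented` status code instead of a sentinel error.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"sync/atomic"
)

// errUnimplemented stands in for a gRPC "Unimplemented" status (assumption).
var errUnimplemented = errors.New("streaming RPC not implemented")

// runtimeService is a hypothetical, simplified view of the CRI runtime.
type runtimeService interface {
	// StreamPods returns a next() function that yields items until io.EOF.
	StreamPods() (next func() (string, error), err error)
	ListPods() ([]string, error)
}

type client struct {
	rt       runtimeService
	useUnary atomic.Bool // set once the runtime is known not to support streaming
}

func (c *client) ListPods() ([]string, error) {
	if c.useUnary.Load() {
		return c.rt.ListPods() // cached decision: skip the streaming attempt
	}
	next, err := c.rt.StreamPods()
	if errors.Is(err, errUnimplemented) {
		c.useUnary.Store(true) // remember the fallback for later calls
		return c.rt.ListPods()
	}
	if err != nil {
		return nil, err
	}
	var pods []string
	for {
		p, err := next()
		if errors.Is(err, io.EOF) {
			return pods, nil
		}
		if err != nil {
			// Mid-stream failure: drop partial results and return the error,
			// so an instrumented caller counts it like any other CRI error.
			return nil, fmt.Errorf("list stream interrupted: %w", err)
		}
		pods = append(pods, p)
	}
}

// fakeRuntime is a test double used only for this sketch.
type fakeRuntime struct {
	streaming   bool // whether the streaming RPC is implemented
	failMid     bool // fail after the first streamed item
	streamCalls int
}

func (r *fakeRuntime) StreamPods() (func() (string, error), error) {
	r.streamCalls++
	if !r.streaming {
		return nil, errUnimplemented
	}
	items := []string{"pod-a", "pod-b"}
	i := 0
	return func() (string, error) {
		if r.failMid && i == 1 {
			return "", errors.New("connection reset")
		}
		if i == len(items) {
			return "", io.EOF
		}
		i++
		return items[i-1], nil
	}, nil
}

func (r *fakeRuntime) ListPods() ([]string, error) {
	return []string{"pod-a", "pod-b"}, nil
}

func main() {
	rt := &fakeRuntime{streaming: false}
	c := &client{rt: rt}
	for i := 0; i < 3; i++ {
		pods, err := c.ListPods()
		fmt.Println(pods, err)
	}
	fmt.Println("streaming attempts:", rt.streamCalls)
}
```

In this sketch the streaming RPC is attempted once and every later call goes straight to the unary path. Nothing in the return value tells the caller which path was used, which is the observability gap discussed in this thread.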


Logs are a poor way to support a feature rollout.

I did an AI review and it has this to say:

No streaming-specific observability — this is the biggest concern

The alpha KEP originally promised dedicated streaming metrics (`kubelet_cri_list_streaming_failure_total`, `kubelet_cri_list_streaming_fallback_total`). The beta update removes these and argues existing metrics are sufficient. The justification (avoiding metrics in the cri-client staging package) is reasonable from a code-cleanliness perspective, but operationally this means:

- An operator cannot tell whether streaming is active or the kubelet silently fell back to unary. The only signal is a kubelet Info-level log line. At scale, log-based detection is fragile: operators rely on metrics and dashboards.
- You cannot distinguish a streaming mid-stream failure from any other CRI error. `kubelet_runtime_operations_errors_total` is a catch-all. If streaming introduces a new failure mode, it's invisible in the existing metric without label enrichment.
- Recommend at minimum: a log-based metric or a label on the existing metric (`path=streaming|unary`) to let operators confirm the code path. If the architectural argument holds (no metrics in cri-client), consider emitting a kubelet-level event or condition that's scrapeable.
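
To make the "label on the existing metric" suggestion concrete, here is a minimal sketch of an error counter keyed by operation and code path. It uses only the standard library so it runs on its own. `pathCounter` and the `list_containers` operation name are illustrative assumptions; the kubelet's real metrics are registered through its own metrics framework, and this is not how `kubelet_runtime_operations_errors_total` is implemented.

```go
package main

import (
	"fmt"
	"sync"
)

// pathCounter is a tiny stand-in for a counter metric with two labels:
// operation and path ("streaming" or "unary").
type pathCounter struct {
	mu     sync.Mutex
	counts map[[2]string]int // key: {operation, path}
}

func newPathCounter() *pathCounter {
	return &pathCounter{counts: map[[2]string]int{}}
}

// Inc records one error for the given operation and code path.
func (c *pathCounter) Inc(operation, path string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.counts[[2]string{operation, path}]++
}

// Get returns the current count for one label combination.
func (c *pathCounter) Get(operation, path string) int {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.counts[[2]string{operation, path}]
}

func main() {
	errs := newPathCounter()
	errs.Inc("list_containers", "streaming")
	errs.Inc("list_containers", "streaming")
	errs.Inc("list_containers", "unary")
	fmt.Println(errs.Get("list_containers", "streaming"), errs.Get("list_containers", "unary")) // prints "2 1"
}
```

With a label like this, a rise in errors on the streaming path would be visible separately from errors on the unary path, at the cost of one extra label dimension on the metric.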

I still think something more than logs is needed to detect fallback. How would you detect that this feature is misbehaving on a 10,000-node cluster?

It is difficult to add streaming-specific metrics because cri-client is a reusable client library with no metrics. Registering metrics via init() would inject kubelet-specific side effects into every importer (tests, third-party tools).

I'm exploring possible solutions, and found we could do something like kubernetes/kubernetes#139500.
It doesn't affect other importers and can emit metrics from the kubelet.

I'd like to hear opinions about it.
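
One general way to get that property (metrics emitted by the kubelet, no side effects for other importers) is to have the library accept an optional observer instead of registering metrics in `init()`. The sketch below illustrates that pattern only and is not taken from kubernetes/kubernetes#139500. `StreamObserver`, `NewClient`, `kubeletObserver`, and the `list_containers` operation name are all hypothetical.

```go
package main

import "fmt"

// StreamObserver is a hypothetical hook a client library could expose.
// The library only calls it; it never registers metrics itself.
type StreamObserver interface {
	ObserveListPath(operation string, streaming bool)
}

// noopObserver is the default, so importers that pass nothing get no side effects.
type noopObserver struct{}

func (noopObserver) ObserveListPath(string, bool) {}

// Client is a stand-in for the reusable CRI client.
type Client struct {
	obs StreamObserver
}

// NewClient accepts an optional observer; nil means "no instrumentation".
func NewClient(obs StreamObserver) *Client {
	if obs == nil {
		obs = noopObserver{}
	}
	return &Client{obs: obs}
}

// ListContainers reports which code path served the call. The RPC itself is
// omitted; streamingSupported stands in for the result of the real negotiation.
func (c *Client) ListContainers(streamingSupported bool) {
	c.obs.ObserveListPath("list_containers", streamingSupported)
}

// kubeletObserver is the kubelet-side implementation. In a real kubelet this
// would update a registered metric; here it records the last observed state.
type kubeletObserver struct {
	streamingEnabled map[string]bool
}

func (o *kubeletObserver) ObserveListPath(op string, streaming bool) {
	o.streamingEnabled[op] = streaming
}

func main() {
	obs := &kubeletObserver{streamingEnabled: map[string]bool{}}
	NewClient(obs).ListContainers(true)
	fmt.Println(obs.streamingEnabled["list_containers"]) // prints "true"

	// Another importer (a test or a third-party tool) passes nil and is unaffected.
	NewClient(nil).ListContainers(false)
}
```

The design choice here is that the metric definition and registration live entirely on the kubelet side, while the library depends only on a small interface.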

I think this approach is reasonable and similar to other instrumented services

cc @SergeyKanzhelev

kannon92

Out of diff:

Can you fill out the checkbox for https://github.com/bitoku/kubernetes-enhancements/blob/def4672e67e92e65dc510ec5def8738c0a9df4bd/keps/sig-node/5825-cri-pagination/README.md#release-signoff-checklist

bitoku · 2026-06-05T14:30:57Z

/hold
delay it to 1.38

KEP-5825: update CRI List Streaming for beta graduation in v1.37

def4672

Update KEP metadata, PRR answers, and metrics strategy. Replace proposed streaming-specific metrics with existing kubelet CRI metrics that cover streaming transparently.

k8s-ci-robot added the kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory label Jun 3, 2026

k8s-ci-robot requested review from derekwaynecarr and kikisdeliveryservice June 3, 2026 14:51

k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jun 3, 2026

bitoku mentioned this pull request Jun 3, 2026

CRI List Streaming #5825

Open

13 tasks

kannon92 reviewed Jun 3, 2026


kannon92 mentioned this pull request Jun 5, 2026

Add cri_streaming_enabled metric for CRI list streaming RPCs kubernetes/kubernetes#139500

Open

k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 5, 2026

bitoku changed the title ~~KEP-5825: update CRI List Streaming for beta graduation in v1.37~~ KEP-5825: update CRI List Streaming for beta graduation in v1.38 Jun 5, 2026



5 participants
