KEP-5825: update CRI List Streaming for beta graduation in v1.38#6146

Open
bitoku wants to merge 1 commit into kubernetes:master from bitoku:kep-5825-beta

@bitoku bitoku commented Jun 3, 2026

Contributor

Assisted-by: Claude

  • One-line PR description: Update KEP-5825 for beta graduation

Update KEP metadata, PRR answers, and metrics strategy. Replace
proposed streaming-specific metrics with existing kubelet CRI metrics
that cover streaming transparently.
@k8s-ci-robot

Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: bitoku
Once this PR has been reviewed and has the lgtm label, please assign deads2k, mrunalp for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory label Jun 3, 2026
@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jun 3, 2026
@bitoku bitoku mentioned this pull request Jun 3, 2026
13 tasks
An unexpected increase in list operation latency after enabling the feature
gate may indicate streaming overhead.

###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

Contributor

- `kubelet_runtime_operations_duration_seconds` for list operations stable: no
latency regression from streaming.

###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

Contributor

We have https://github.com/kubernetes/community/blob/main/sig-scalability/slos/pod_startup_latency.md#pod-startup-latency-slislo-details, but I'm not sure whether CRI list will impact pod startup.

I guess it shouldn't, but maybe we should include that as an SLO, since increased pod startup time would be concerning operationally.

@@ -369,7 +369,8 @@ kubelet does not have integration tests.
- Feature gate enabled by default
- Both containerd and CRI-O implement streaming RPCs

Contributor

Do we have containerd support for this?

Contributor Author

Not yet, the PR is pending.
containerd/containerd#13187

Contributor

So the requirement for beta is to have both implement this? Should we push this feature to 1.38?

I'm not as clear on CRI support for containerd skew. Even if this merges will we have a production release in time for 1.37?

@bitoku bitoku Jun 4, 2026

Contributor Author

> So the requirement for beta is to have both implement this?

Yes.

> Should we push this feature to 1.38?
> I'm not as clear on CRI support for containerd skew. Even if this merges will we have a production release in time for 1.37?

I was thinking we could merge the KEP while we wait for the PR to be merged.
However, as you said, the containerd 2.4 release is scheduled for August, which is after the 1.37 Code Freeze (end of July). We can't write tests against the new containerd version, right?

I guess we may want to push to 1.38.

ref: https://containerd.io/releases/#current-state-of-containerd-releases

cc @haircommander @samuelkarp

Contributor

If the PR is staged for a release candidate before test freeze, then I think it's acceptable. But we could also defer if needed. I just think this feature is gaining in importance as we look at extending the number of pods per node. cc @SergeyKanzhelev

Contributor

I disagree. We won't be able to have testing for containerd before code freeze. There is also no test freeze anymore.

So we would be effectively going beta without testing containerd in CI.

Member

I think we need to wait for containerd support. It is an important feature, but not urgent. I am a bit sad we didn't merge the PR to containerd in time for 2.3.

Contributor Author

Sure, thank you all for taking a look. I'll keep the PR open to discuss the design further, but I'll update the KEP issue to say it will be delayed to 1.38.

- Mid-stream failures already propagate as errors to the instrumented
layer and are captured by `kubelet_runtime_operations_errors_total`.
- Fallback to unary is cached after the first occurrence (via
`atomic.Bool`) and only happens a few times at startup. Kubelet logs

Contributor

Logs are a poor way to support a feature rollout.

I did an AI review and it has this to say:

  1. No streaming-specific observability — this is the biggest concern
The alpha KEP originally promised dedicated streaming metrics
(kubelet_cri_list_streaming_failure_total,
kubelet_cri_list_streaming_fallback_total). The beta update removes these and
argues existing metrics are sufficient. The justification (avoiding metrics in
the cri-client staging package) is reasonable from a code-cleanliness
perspective, but operationally this means:

- An operator cannot tell whether streaming is active or the kubelet silently 
fell back to unary. The only signal is a kubelet Info-level log line. At
scale, log-based detection is fragile — operators rely on metrics and
dashboards.
- You cannot distinguish a streaming mid-stream failure from any other CRI 
error. kubelet_runtime_operations_errors_total is a catch-all. If streaming
introduces a new failure mode, it's invisible in the existing metric without
label enrichment.
- Recommend at minimum: a log-based metric or a label on the existing metric
(path=streaming|unary) to let operators confirm code path. If the
architectural argument holds (no metrics in cri-client), consider emitting a
kubelet-level event or condition that's scrapeable.

I still think something more than logs to detect fallback is important. How would you detect if this feature is misbehaving on a 10000-node cluster?
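To make the `path=streaming|unary` label suggestion concrete, here is a minimal Go sketch using only the standard library. It is not kubelet code: `ErrorCounter`, `Inc`, `Dump`, the `path` label, and the metric name passed to `Dump` are all hypothetical, and the `operation_type` label name is assumed to match the existing kubelet metric. A real implementation would presumably use a labelled Prometheus counter rather than a map.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// labels identifies one time series. Both fields are illustrative.
type labels struct {
	operation string // e.g. "list_containers"
	path      string // "streaming" or "unary"
}

// ErrorCounter is a stdlib stand-in for a labelled counter metric.
type ErrorCounter struct {
	mu     sync.Mutex
	values map[labels]int
}

func NewErrorCounter() *ErrorCounter {
	return &ErrorCounter{values: map[labels]int{}}
}

// Inc records one error for the given operation and code path.
func (c *ErrorCounter) Inc(operation, path string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.values[labels{operation: operation, path: path}]++
}

// Dump renders every series in a Prometheus-like text form,
// sorted so the output order is stable.
func (c *ErrorCounter) Dump(name string) []string {
	c.mu.Lock()
	defer c.mu.Unlock()
	out := make([]string, 0, len(c.values))
	for l, v := range c.values {
		out = append(out, fmt.Sprintf("%s{operation_type=%q,path=%q} %d",
			name, l.operation, l.path, v))
	}
	sort.Strings(out)
	return out
}

func main() {
	c := NewErrorCounter()
	c.Inc("list_containers", "streaming")
	c.Inc("list_containers", "streaming")
	c.Inc("list_containers", "unary")
	for _, line := range c.Dump("runtime_operations_errors_total") {
		fmt.Println(line)
	}
}
```

With the extra label, errors on the streaming path end up in their own series, so a dashboard could separate them from unary errors without parsing logs.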

@bitoku bitoku Jun 4, 2026

Contributor Author

It's difficult to add streaming-specific metrics because cri-client is a reusable client library with no metrics of its own. Registering metrics via init() would inject kubelet-specific side effects into every importer (tests, third-party tools).

I'm exploring possible solutions and found we could do something like
kubernetes/kubernetes#139500
It doesn't affect other importers, and the metrics can be emitted from the kubelet.

I'd like to hear opinions about it.
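One general way to keep a reusable client free of `init()` side effects is to let the caller inject an observer. The Go sketch below illustrates that pattern together with the `atomic.Bool` fallback caching mentioned in the KEP text. It is not taken from kubernetes/kubernetes#139500, and every name in it (`Observer`, `Client`, `ErrUnimplemented`, `countingObserver`) is hypothetical. The sentinel error stands in for a gRPC Unimplemented status.

```go
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
)

// ErrUnimplemented stands in for a gRPC Unimplemented status (assumption).
var ErrUnimplemented = errors.New("streaming RPC unimplemented")

// Observer is a hook the importer may inject. The library registers
// no metrics itself, so other importers see no side effects.
type Observer interface {
	ObserveList(path string, err error)
}

// noopObserver is used when the importer injects nothing.
type noopObserver struct{}

func (noopObserver) ObserveList(string, error) {}

// Client is a toy stand-in for a reusable CRI client.
type Client struct {
	stream   func() ([]string, error)
	unary    func() ([]string, error)
	obs      Observer
	useUnary atomic.Bool // set once the runtime reports streaming as unsupported
}

func NewClient(stream, unary func() ([]string, error), obs Observer) *Client {
	if obs == nil {
		obs = noopObserver{}
	}
	return &Client{stream: stream, unary: unary, obs: obs}
}

// List tries streaming first and caches the decision to fall back.
func (c *Client) List() ([]string, error) {
	if !c.useUnary.Load() {
		items, err := c.stream()
		if !errors.Is(err, ErrUnimplemented) {
			c.obs.ObserveList("streaming", err)
			return items, err
		}
		c.useUnary.Store(true)
	}
	items, err := c.unary()
	c.obs.ObserveList("unary", err)
	return items, err
}

// countingObserver is what a kubelet-side caller might plug in. A real one
// would update metrics instead of a map, and would need to be goroutine-safe.
type countingObserver struct{ calls map[string]int }

func (o *countingObserver) ObserveList(path string, _ error) { o.calls[path]++ }

func main() {
	obs := &countingObserver{calls: map[string]int{}}
	streamCalls := 0
	c := NewClient(
		func() ([]string, error) { streamCalls++; return nil, ErrUnimplemented },
		func() ([]string, error) { return []string{"pod-a"}, nil },
		obs,
	)
	for i := 0; i < 3; i++ {
		_, _ = c.List()
	}
	fmt.Println(obs.calls["unary"], obs.calls["streaming"], streamCalls)
}
```

In this sketch the streaming call is attempted only once. Later calls go straight to unary, and the injected observer records which path served each request, which is the signal an operator would need to confirm whether streaming is actually in use.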

Contributor

I think this approach is reasonable and similar to other instrumented services.

@kannon92 kannon92 left a comment

bitoku commented Jun 5, 2026

Contributor Author

/hold
delay it to 1.38

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 5, 2026
@bitoku bitoku changed the title KEP-5825: update CRI List Streaming for beta graduation in v1.37 KEP-5825: update CRI List Streaming for beta graduation in v1.38 Jun 5, 2026
Labels

- cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA.)
- do-not-merge/hold (Indicates that a PR should not merge because someone has issued a /hold command.)
- kind/kep (Categorizes KEP tracking issues and PRs modifying the KEP directory)
- sig/node (Categorizes an issue or PR as relevant to SIG Node.)
- size/M (Denotes a PR that changes 30-99 lines, ignoring generated files.)

5 participants