watch: Reproduce the deadlock between in-process client-server due to cancellation storm #21064
Reproduces issues #18879 and #20716
The follow-up PR #21065 properly fixes these issues.
These two issues share the same root cause, and it only shows up on the in-process client–server path, not on a regular gRPC connection. Over the network, HTTP/2 flow control and socket teardown naturally apply backpressure and reset state, but the in-process transport uses `chanStream` in `server/proxy/grpcproxy/adapter/chan_stream.go`, which relies purely on Go channels. Under a storm of watch cancel requests, it can easily deadlock. Refer to #20716 for a diagram analyzing the deadlock.
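To make the failure mode concrete, here is a minimal, self-contained Go sketch (a hypothetical illustration, not the actual `chanStream` code) of how two peers connected only by unbuffered channels can block each other when one side floods the other with sends:

```go
// Hypothetical, minimal sketch of the failure mode (not the actual chanStream code).
// The two "peers" are connected only by unbuffered Go channels, so every send blocks
// until the other side receives. If both sides send at the same time -- the client
// flooding cancel requests while the server is still pushing watch events -- neither
// send can complete and both goroutines block forever.
package main

import "fmt"

func main() {
	clientToServer := make(chan string) // cancel requests
	serverToClient := make(chan string) // watch events

	// "Server" side: tries to deliver an event before reading the next request.
	go func() {
		serverToClient <- "watch event" // blocks: the client is not receiving
		fmt.Println("server got:", <-clientToServer)
	}()

	// "Client" side: tries to send a cancel before reading the next event.
	clientToServer <- "cancel watcher" // blocks: the server is not receiving
	fmt.Println("client got:", <-serverToClient)

	// Never reached. In this toy program the Go runtime reports
	// "all goroutines are asleep - deadlock!"; in a real server other goroutines
	// keep running, so the stuck stream simply hangs silently.
}
```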
Conceptually, each watch stream should be owned by a single client: if we rerun the program, we should expect to start from a clean slate. Unfortunately, with the in-process design there are effectively two layers of clients: the client in the user's program, and the in-process client that bridges it to the server. Rerunning the user program does not reset the in-process client, so the connection between the in-process client and the server remains stuck in the deadlock, and every subsequent rerun is stuck as well (as long as the user program goes through this in-process client; if you use
`etcdctl` or another program that talks to the etcd server directly, the watch feature still works fine).

This PR adds an integration test to reproduce the issue. It creates a large number of watchers, verifies that they work at first, and then cancels most of them. The remaining watchers stop receiving any watch events, and the `etcd_debugging_mvcc_slow_watcher_total` and `etcd_debugging_mvcc_pending_events_total` metrics become non-zero.
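For reference, a rough sketch of the create-many-watchers-then-cancel pattern against the public `clientv3` API is below. The actual integration test drives the in-process client through the gRPC proxy adapter; the endpoint, key names, and watcher counts here are illustrative assumptions only:

```go
// Hedged sketch of the reproduction pattern, written against the public clientv3 API
// rather than the actual integration test harness. Endpoint, key names, and counts
// are assumptions for illustration.
package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{Endpoints: []string{"localhost:2379"}})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	const total, keep = 1000, 10

	// Create many watchers, keeping a handful and collecting cancel funcs for the rest.
	var keepChans []clientv3.WatchChan
	var cancels []context.CancelFunc
	for i := 0; i < total; i++ {
		ctx, cancel := context.WithCancel(context.Background())
		wch := cli.Watch(ctx, fmt.Sprintf("/repro/key-%d", i))
		if i < keep {
			keepChans = append(keepChans, wch)
		}
		cancels = append(cancels, cancel)
	}

	// Cancellation storm: cancel most of the watchers at once.
	for i := keep; i < total; i++ {
		cancels[i]()
	}

	// Write to a key a surviving watcher is watching; with the bug present on the
	// in-process transport, the event never arrives.
	if _, err := cli.Put(context.Background(), "/repro/key-0", "v"); err != nil {
		panic(err)
	}
	select {
	case resp := <-keepChans[0]:
		fmt.Println("got event:", resp.Events)
	case <-time.After(5 * time.Second):
		fmt.Println("no event received: remaining watchers appear stuck")
	}
}
```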