watch: Reproduce the deadlock between in-process client-server due to cancellation storm #21064
Reproduces issues #18879 and #20716
The follow-up PR #21065 properly fixes these issues.
These two issues share the same root cause, and it only shows up on the in-process client–server path, not on a regular gRPC connection. Over the network, HTTP/2 flow control and socket teardown naturally apply backpressure and reset state, but the in-process transport uses `chanStream` in `server/proxy/grpcproxy/adapter/chan_stream.go`, which relies purely on Go channels. Under a storm of watch cancel requests, it can easily deadlock. Refer to #20716 for a diagram analyzing the deadlock.
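To make the failure mode concrete, here is a minimal, self-contained Go sketch (a hypothetical illustration, not the actual `chanStream` code) of how two peers connected only by unbuffered channels can block each other when one side floods the other with sends:

```go
// Hypothetical, minimal sketch of the failure mode (not the actual chanStream code).
// The two "peers" are connected only by unbuffered Go channels, so every send blocks
// until the other side receives. If both sides send at the same time -- the client
// flooding cancel requests while the server is still pushing watch events -- neither
// send can complete and both goroutines block forever.
package main

import "fmt"

func main() {
	clientToServer := make(chan string) // cancel requests
	serverToClient := make(chan string) // watch events

	// "Server" side: tries to deliver an event before reading the next request.
	go func() {
		serverToClient <- "watch event" // blocks: the client is not receiving
		fmt.Println("server got:", <-clientToServer)
	}()

	// "Client" side: tries to send a cancel before reading the next event.
	clientToServer <- "cancel watcher" // blocks: the server is not receiving
	fmt.Println("client got:", <-serverToClient)

	// Never reached. In this toy program the Go runtime reports
	// "all goroutines are asleep - deadlock!"; in a real server other goroutines
	// keep running, so the stuck stream simply hangs silently.
}
```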
Conceptually, each watch stream should be owned by a single client: if we rerun the program, we should expect to start from a clean slate. Unfortunately, with the in-process design there are effectively two layers of clients: the client in the user's program, and the in-process client that bridges it to the server. Rerunning the user program does not reset the in-process client, so the connection between the in-process client and the server remains stuck in the deadlock, and every subsequent rerun is stuck as well (as long as the user program goes through this in-process client; if you use
`etcdctl` or another program that talks to the etcd server directly, the watch feature still works fine).

This PR adds an integration test to reproduce the issue. It creates a large number of watchers, verifies that they work at first, and then cancels most of them. The remaining watchers stop receiving any watch events, and the `etcd_debugging_mvcc_slow_watcher_total` and `etcd_debugging_mvcc_pending_events_total` metrics become non-zero.
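For reference, a rough sketch of the create-many-watchers-then-cancel pattern against the public `clientv3` API is below. The actual integration test drives the in-process client through the gRPC proxy adapter; the endpoint, key names, and watcher counts here are illustrative assumptions only:

```go
// Hedged sketch of the reproduction pattern, written against the public clientv3 API
// rather than the actual integration test harness. Endpoint, key names, and counts
// are assumptions for illustration.
package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{Endpoints: []string{"localhost:2379"}})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	const total, keep = 1000, 10

	// Create many watchers, keeping a handful and collecting cancel funcs for the rest.
	var keepChans []clientv3.WatchChan
	var cancels []context.CancelFunc
	for i := 0; i < total; i++ {
		ctx, cancel := context.WithCancel(context.Background())
		wch := cli.Watch(ctx, fmt.Sprintf("/repro/key-%d", i))
		if i < keep {
			keepChans = append(keepChans, wch)
		}
		cancels = append(cancels, cancel)
	}

	// Cancellation storm: cancel most of the watchers at once.
	for i := keep; i < total; i++ {
		cancels[i]()
	}

	// Write to a key a surviving watcher is watching; with the bug present on the
	// in-process transport, the event never arrives.
	if _, err := cli.Put(context.Background(), "/repro/key-0", "v"); err != nil {
		panic(err)
	}
	select {
	case resp := <-keepChans[0]:
		fmt.Println("got event:", resp.Events)
	case <-time.After(5 * time.Second):
		fmt.Println("no event received: remaining watchers appear stuck")
	}
}
```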