Description
Title: Performance issues caused by abnormal HTTP/2 SETTINGS frame exchange
We experienced two problems in gRPC communication between our Java client and Python server:
- Java client version: 1.51.0
- Python server version: 1.70
1. HTTP/2 SETTINGS frame exchange issue
When a Java gRPC client creates a channel, the client must wait for the server to return an HTTP/2 SETTINGS frame before updating the channel state to READY (this is implemented in `io.grpc.netty.NettyClientHandler.FrameListener#onSettingsRead`). However, due to a version compatibility problem in our Python gRPC server, the server failed to return the SETTINGS frame correctly, leaving the client's channel stuck in the CONNECTING state for a long time.
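For context, a minimal client-side guard that makes the stuck state observable (this is our own sketch, not part of grpc-java; the address `localhost:50051` and the 10-second timeout are hypothetical):

```java
import io.grpc.ConnectivityState;
import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class ChannelReadyGuard {
  /** Returns true if the channel reaches READY within the timeout, false otherwise. */
  static boolean awaitReady(ManagedChannel channel, long timeoutMs) throws InterruptedException {
    long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
    // getState(true) also requests a connection attempt if the channel is IDLE.
    ConnectivityState state = channel.getState(true);
    while (state != ConnectivityState.READY) {
      long remaining = deadline - System.nanoTime();
      if (remaining <= 0) {
        return false;
      }
      CountDownLatch latch = new CountDownLatch(1);
      channel.notifyWhenStateChanged(state, latch::countDown);
      latch.await(remaining, TimeUnit.NANOSECONDS);
      state = channel.getState(false);
    }
    return true;
  }

  public static void main(String[] args) throws InterruptedException {
    ManagedChannel channel = ManagedChannelBuilder
        .forAddress("localhost", 50051) // hypothetical server address
        .usePlaintext()
        .build();
    if (!awaitReady(channel, 10_000)) {
      System.err.println("Channel stuck in " + channel.getState(false) + "; aborting RPCs");
      channel.shutdownNow();
    }
  }
}
```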
2. Performance bottleneck analysis
In this state, if the client sends a large number of RPC requests without deadlines, those requests accumulate as pending streams in the `pendingStreams` property of `DelayedClientTransport`. In our environment, about one million such requests were queued.
When the TCP connection fails to establish, `io.grpc.internal.DelayedClientTransport#reprocess` is triggered. Inside this method, the transport calls `pendingStreams.removeAll(toRemove)`. Because `toRemove` is an ArrayList rather than a Set, each `contains` lookup performed during `removeAll` is O(n), making the whole call O(n²). With n around one million, this blocked the IO thread for roughly 30 minutes, causing severe IO thread stalls.
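The quadratic cost is reproducible outside grpc-java. The following self-contained micro-benchmark mimics `pendingStreams` (a set) and `toRemove` (an ArrayList holding most of its elements); when the set is not larger than the argument, `AbstractSet.removeAll` iterates the set and calls `toRemove.contains(e)` for each element, which is an O(n) scan for an ArrayList:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class RemoveAllBenchmark {
  public static void main(String[] args) {
    int n = 100_000;
    // Mimic pendingStreams (a set) and toRemove (an ArrayList of the same elements).
    Set<Integer> pending = new LinkedHashSet<>();
    List<Integer> toRemove = new ArrayList<>();
    for (int i = 0; i < n; i++) {
      pending.add(i);
      toRemove.add(i);
    }

    long start = System.nanoTime();
    // ArrayList.contains scans linearly, so this removeAll is O(n^2) overall.
    pending.removeAll(toRemove);
    System.out.printf("ArrayList argument: %d ms%n", (System.nanoTime() - start) / 1_000_000);

    // Rebuild and retry with a HashSet argument: contains() is O(1), removeAll is O(n).
    for (int i = 0; i < n; i++) {
      pending.add(i);
    }
    start = System.nanoTime();
    pending.removeAll(new HashSet<>(toRemove));
    System.out.printf("HashSet argument:   %d ms%n", (System.nanoTime() - start) / 1_000_000);
  }
}
```

The first call takes seconds even at n = 100,000, while the HashSet variant finishes in milliseconds; at one million pending streams the gap grows to the scale we observed.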
Suggestions for optimization
Based on these observations, we propose a couple of potential improvements for grpc-java:
- Introduce a maximum waiting time for the HTTP/2 SETTINGS frame exchange to avoid waiting indefinitely for incompatible server responses.
- Change the `toRemove` collection in `DelayedClientTransport#reprocess` to a Set implementation, reducing the complexity from O(n²) to O(n) (see the sketch below).
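To make the second suggestion concrete, here is a minimal sketch; the stream type, the filtering loop, and `canAssignTransport` are placeholders, not grpc-java's actual logic:

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Set;

public class ReprocessSketch {
  // Stand-in for DelayedClientTransport's pending stream type.
  static final class PendingStream {}

  private final Collection<PendingStream> pendingStreams = new LinkedHashSet<>();

  void reprocess() {
    // Suggested change: collect streams to drop into a HashSet instead of an
    // ArrayList, so that pendingStreams.removeAll(toRemove) does O(1) lookups.
    Set<PendingStream> toRemove = new HashSet<>();
    for (PendingStream stream : pendingStreams) {
      if (canAssignTransport(stream)) {
        toRemove.add(stream);
      }
    }
    pendingStreams.removeAll(toRemove); // O(n) with a Set, O(n^2) with an ArrayList
  }

  private boolean canAssignTransport(PendingStream stream) {
    return true; // placeholder for the real picker logic
  }
}
```

An even smaller change with the same effect would be to leave `toRemove` as an ArrayList and call `pendingStreams.removeAll(new HashSet<>(toRemove))`.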