Description
Summary
A single unanswered TCP keepalive probe sent from Dirk breaks the Vouch gRPC connection, causing a large number of missed attestations. This is most likely the result of an incomplete fix in gRPC Go for grpc-go issue 6250. No workaround is known so far.
Details
The attached tcpdump.txt was captured on the Dirk server side (10.22.0.12), with Dirk bound to port 8304. You can see that approximately every 15 seconds a TCP keepalive packet of length 0 is sent from port 8304 (Dirk) to the connected Vouch client on 10.22.0.4. Usually, the Vouch side promptly responds with another 0-length packet.
At timestamp 20:45:44.390399 Dirk sends another keepalive probe, which Vouch fails to answer. After a timeout of another 15 seconds Dirk sends a RST packet. The same happens at timestamp 20:47:48.294498, with the Dirk side sending a RST after Vouch failed to reply 15 seconds earlier. It looks like Vouch then tries to send another data packet at 20:48:30.004706 and receives another RST.
This is very likely grpc/grpc-go#6250 . See also https://github.com/grpc/grpc-go/blob/master/dialoptions.go#L464-L488 .
I am using Dirk v1.2.1-rc.1, so I should have a gRPC version that was released after gRPC issue 6250 was closed. I am not sure why the fix isn't effective. Maybe the gRPC Go people only fixed the dialer side and left the same problem unfixed for accepted sockets?
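To illustrate the accept-side suspicion: in Go's standard library, the keepalive behaviour of accepted TCP connections is controlled by net.ListenConfig.KeepAlive. When that field is left at zero, Go enables keepalive probes on accepted sockets with its own default period, which to my knowledge is 15 seconds and would match the interval seen in the capture. A negative value disables the Go-managed probes. The sketch below uses only the standard library; the helper name newListener is made up for illustration, and this is not how Dirk or grpc-go actually create their listeners.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

// newListener is a hypothetical helper (not part of Dirk or grpc-go).
// It opens a TCP listener whose accepted connections use the given
// keepalive period. Zero keeps Go's default; a negative value disables
// Go-managed keepalive probes on accepted sockets.
func newListener(addr string, keepAlive time.Duration) (net.Listener, error) {
	lc := net.ListenConfig{KeepAlive: keepAlive}
	return lc.Listen(context.Background(), "tcp", addr)
}

func main() {
	// Port 0 asks the OS for a free port; a real server would bind a fixed one.
	ln, err := newListener("127.0.0.1:0", -1)
	if err != nil {
		fmt.Println("listen failed:", err)
		return
	}
	defer ln.Close()
	fmt.Println("listening on", ln.Addr(), "with Go-managed keepalive disabled")
}
```

A listener built this way could in principle be handed to a gRPC server, leaving liveness checks to HTTP/2 pings instead of TCP probes. Whether that avoids the resets described here is untested.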
As a result, my Hoodi validator is missing around 1/5 of its attestations, making the setup unfit for production.
$ journalctl -u podman-vouch-N4-I1 -S 00:00 -U 08:00 | grep "connection reset by peer" | wc
117 3129 39237