Dirk/Vouch gRPC settings unreliable by default #81

@catwith1hat

Summary

A single unanswered TCP keepalive probe sent by Dirk breaks the Vouch gRPC connection, causing a large number of missed attestations. This is most likely the result of an incomplete fix for grpc-go issue 6250. The workaround is unclear.

Details

The attached tcpdump.txt was captured on the Dirk server side (10.22.0.12), with Dirk bound to port 8304. Approximately every 15 seconds, a TCP keepalive packet of length 0 is sent from port 8304 (Dirk) to the connected Vouch client on 10.22.0.4. Usually, the Vouch side promptly responds with another 0-length packet.

At timestamp 20:45:44.390399 Dirk sends another keepalive probe, which Vouch fails to answer. After a timeout of another 15 seconds, Dirk sends a RST packet. The same happens at timestamp 20:47:48.294498, with the Dirk side sending a RST after Vouch failed to reply 15 seconds prior. It looks like Vouch then tries to send another data packet at 20:48:30.004706 and receives another RST.

This is very likely grpc/grpc-go#6250 . See also https://github.com/grpc/grpc-go/blob/master/dialoptions.go#L464-L488 .

I am using Dirk v1.2.1-rc.1, so I should have a gRPC version that was released after grpc-go issue 6250 was closed. I am not sure why the fix isn't effective. Maybe the gRPC Go maintainers only fixed the dialer side and left the same problem unfixed for accepted (server-side) sockets?
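To illustrate what an accept-side fix could look like, here is a minimal sketch using only the Go standard library. It assumes the server creates its own listener, and the helper name listenWithTCPKeepAlive is hypothetical; I have not checked how Dirk actually sets up its listener or whether it exposes such a setting. In Go's net package, a zero KeepAlive in net.ListenConfig leaves TCP keepalives enabled with the runtime default, and a negative value disables them on accepted connections.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

// listenWithTCPKeepAlive is a hypothetical helper (not part of Dirk or grpc-go).
// It opens a TCP listener whose accepted connections use the given keepalive
// period. A negative period disables TCP keepalives on accepted connections;
// zero keeps the Go runtime default.
func listenWithTCPKeepAlive(addr string, period time.Duration) (net.Listener, error) {
	lc := net.ListenConfig{KeepAlive: period}
	return lc.Listen(context.Background(), "tcp", addr)
}

func main() {
	// Port 0 asks the OS for any free port; a real server would use its
	// configured address instead.
	ln, err := listenWithTCPKeepAlive("127.0.0.1:0", -1)
	if err != nil {
		fmt.Println("listen failed:", err)
		return
	}
	defer ln.Close()
	fmt.Println("listening without TCP keepalives on", ln.Addr())
}
```

A listener created this way could be handed to a gRPC server's Serve method, leaving liveness detection to gRPC-level keepalive pings rather than kernel TCP probes. Whether that actually avoids the resets seen in the capture is untested.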

As a result, my Hoodi validator is missing around 1/5 of its attestations, making the setup unfit for production.

$ journalctl -u podman-vouch-N4-I1 -S 00:00 -U 08:00 | grep "connection reset by peer" | wc
    117    3129   39237
