DNS lookup timeouts due to races in conntrack #3287
Description
What happened?
We are experiencing random 5-second DNS timeouts in our Kubernetes cluster.
How to reproduce it?
It is reproducible by requesting almost any in-cluster service and observing that, periodically (in our case, 1 out of every 50 to 100 requests), we get a 5-second delay. It always happens during the DNS lookup.
Anything else we need to know?
We believe this is the result of a kernel-level SNAT race condition that is described quite well here:
https://tech.xing.com/a-reason-for-unexplained-connection-timeouts-on-kubernetes-docker-abd041cf7e02
The problem also happens with non-weave CNI implementations, and is (ironically) not really a weave issue at all. It becomes a weave issue, however, because the fix is to set a flag on the masquerading rules, and those rules are under no one's control except weave's.
What we need is the ability to apply the NF_NAT_RANGE_PROTO_RANDOM_FULLY flag to the masquerading rules that weave sets up. In the post above, Flannel was in use, and the fix was applied there instead.
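For context, that kernel flag is exposed in userspace iptables (>= 1.6.2) as the `--random-fully` option on the MASQUERADE target. A rough sketch of what an adjusted rule could look like, assuming weave's default allocation range of 10.32.0.0/12 (this is an illustration, not weave's actual rule):

```shell
# Illustration only: masquerade traffic leaving the pod network with full
# source-port randomization (NF_NAT_RANGE_PROTO_RANDOM_FULLY).
# 10.32.0.0/12 is weave's default IP allocation range; adjust to your cluster.
iptables -t nat -A POSTROUTING -s 10.32.0.0/12 ! -d 10.32.0.0/12 \
    -j MASQUERADE --random-fully
```

Whether the flag is present on a node can be checked with something like `iptables -t nat -S POSTROUTING | grep random-fully`.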
We searched for this issue and didn't see that anyone had asked for it before. We're also unaware of any existing setting that allows this flag to be applied today; if one exists, please let us know.