This repository was archived by the owner on Jun 20, 2024. It is now read-only.

DNS lookup timeouts due to races in conntrack #3287

Open
@dcowden

Description


What happened?

We are experiencing random 5-second DNS timeouts in our Kubernetes cluster.

How to reproduce it?

It is reproducible by requesting just about any in-cluster service and observing that periodically (in our case, 1 out of every 50 to 100 requests) we get a 5-second delay. The delay always occurs during the DNS lookup.

Anything else we need to know?

We believe this is caused by a kernel-level SNAT race condition in conntrack, which is described quite well here:

https://tech.xing.com/a-reason-for-unexplained-connection-timeouts-on-kubernetes-docker-abd041cf7e02
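To make the race concrete, here is a toy model in Go. It is a deliberate simplification and not the kernel's actual NAT code: it assumes two connections from different container IPs, both using source port 32000 (a hypothetical value), are masqueraded at the same moment, and that each port allocation reads the same snapshot of in-use ports before either one commits. A "keep the original port if it looks free" allocator then hands both connections the same translated port every time, while a fully random allocator (the behavior NF_NAT_RANGE_PROTO_RANDOM_FULLY is meant to provide) collides only rarely.

```go
package main

import (
	"fmt"
	"math/rand"
)

// Hypothetical SNAT port range for this toy model.
const (
	portLow  = 1024
	portHigh = 65535
)

// pickPreferred (toy): keep the preferred port if the snapshot says it is
// free, otherwise walk upward (wrapping around) to the next free port.
func pickPreferred(used map[int]bool, preferred int) int {
	for i := 0; i <= portHigh-portLow; i++ {
		p := portLow + (preferred-portLow+i)%(portHigh-portLow+1)
		if !used[p] {
			return p
		}
	}
	return -1 // every port in the range is taken
}

// pickRandomFully (toy): draw ports uniformly at random until a free one is
// found, giving up after a bounded number of attempts.
func pickRandomFully(rng *rand.Rand, used map[int]bool) int {
	for attempt := 0; attempt < 1000; attempt++ {
		p := portLow + rng.Intn(portHigh-portLow+1)
		if !used[p] {
			return p
		}
	}
	return -1
}

// collisionRate races two connections that both originate from source port
// 32000 and returns the fraction of trials in which both are assigned the
// same translated port (the conntrack insert that loses would be dropped).
func collisionRate(trials int, randomFully bool, seed int64) float64 {
	rng := rand.New(rand.NewSource(seed))
	collisions := 0
	for i := 0; i < trials; i++ {
		// Both allocations read the same snapshot: neither sees the other's choice.
		snapshot := map[int]bool{}
		var a, b int
		if randomFully {
			a, b = pickRandomFully(rng, snapshot), pickRandomFully(rng, snapshot)
		} else {
			a, b = pickPreferred(snapshot, 32000), pickPreferred(snapshot, 32000)
		}
		if a == b {
			collisions++
		}
	}
	return float64(collisions) / float64(trials)
}

func main() {
	fmt.Printf("preferred-port allocator: %.4f collision rate\n", collisionRate(10000, false, 1))
	fmt.Printf("random-fully allocator:   %.4f collision rate\n", collisionRate(10000, true, 1))
}
```

The first line always prints 1.0000 and the second is close to zero. The 100% figure is an artifact of forcing the race on every trial; in a real cluster the window is narrow, which fits the intermittent 1-in-50-to-100 failure rate we see.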

The problem also happens with non-Weave CNI implementations, so (ironically) it is not really a Weave issue. However, it becomes a Weave issue because the fix is to set a flag on the masquerading rules, and those rules are created and controlled by Weave alone.

What we need is the ability to apply the NF_NAT_RANGE_PROTO_RANDOM_FULLY flag to the masquerading rules that Weave sets up. In the post above, Flannel was in use, so the fix was applied there instead.

We searched the existing issues and didn't see that anyone had asked for this. We're also unaware of any setting that allows this flag to be set today; if that's possible, please let us know.
