Description
There is a longstanding issue over at distribution/distribution#785 where users reported connection resets trying to push to an AWS-hosted registry from inside the AWS network. After months, we've finally narrowed this down to a bad interaction between spurious TCP retransmits and the NAT rules that Docker sets up for bridge networking.
Here is a summary of what happens:
- For some reason, when an AWS EC2 machine connects to itself using its external-facing IP address, there are occasional packets with sequence numbers and timestamps that are far behind the rest.
- Normally these packets would be ignored as spurious retransmits. However, because the packets fall outside the TCP window, Linux's conntrack module marks them invalid, and their destination addresses do not get rewritten by DNAT.
- The packets are eventually interpreted as packets destined to the actual address/port in the IP/TCP headers. Since there is no flow matching these, the host sends a RST.
- The RST terminates the actual NAT'd connection, since its source address and port matches the NAT'd connection.
I think it would be hugely helpful for libnetwork to include a workaround for this. It has affected a lot of users trying to use the registry in AWS, and it presumably affects other Dockerized applications as well. While I'll reach out to AWS to point out the spurious retransmits, I don't know if they'll be able to fix them, and there may also be other environments with similar issues.
I've found two possible workarounds:
- Turn on conntrack's "be liberal" flag:
echo 1 > /proc/sys/net/ipv4/netfilter/ip_conntrack_tcp_be_liberal
. This causes conntrack/NAT to treat packets outside the TCP window as part of the flow being tracked, instead of marking them invalid and causing them to be handled by the host. - Add a rule to drop invalid packets instead of allowing them to trigger RSTs:
iptables -I INPUT -m conntrack --ctstate INVALID -j DROP
Both of these can potentially affect non-Docker traffic. The former causes NAT to forward packets that it would otherwise err on the side of not forwarding, which seems relatively harmless, but it's a system-level setting, so it's not limited to Docker flows. The latter would drop any packets that conntrack deems invalid, system-wide, unless we added specific destination filters for the addresses/ports that Docker set up NAT rules for, which could add overhead.
It may be too late to hope for a workaround to be included in Docker 1.11, but anything we can do on this front will really improve the lives of Docker users on AWS.