Amplification of pod traffic #3604
Description
I suspect this is "works as designed"; however, the side effects become too great as the cluster grows.
Our peer connection limit is left at the default of 100 and we have 58 nodes. We could reduce that to help minimize the effect, but that will only take us so far.
What you expected to happen?
Less broadcast traffic, so that links do not get saturated.
What happened?
Weave forwards traffic whose destination it does not know to all of its peers. As the cluster grows, the amount of amplified traffic grows with it.
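To illustrate the behavior being described, here is a minimal learning-switch sketch (my own illustrative code with hypothetical names, not Weave's actual implementation): frames whose destination MAC is not yet in the table are flooded to every other peer, so each unknown-destination frame becomes N-1 copies on the wire.

```go
package main

import "fmt"

// Switch is a toy learning switch: it maps MAC addresses to the peer
// they were last seen behind, and floods when the destination is unknown.
type Switch struct {
	table map[string]string // MAC address -> peer it was learned from
	peers []string
}

// Forward returns the list of peers a frame is sent to. It learns the
// source MAC on every frame; a frame for an unknown destination MAC is
// copied to every peer except the one it arrived from.
func (s *Switch) Forward(srcMAC, dstMAC, fromPeer string) []string {
	s.table[srcMAC] = fromPeer // learn the source so replies can be unicast
	if peer, ok := s.table[dstMAC]; ok {
		return []string{peer} // known destination: one copy
	}
	var out []string
	for _, p := range s.peers {
		if p != fromPeer {
			out = append(out, p) // unknown destination: flood
		}
	}
	return out
}

func main() {
	s := &Switch{
		table: map[string]string{},
		peers: []string{"peer-1", "peer-2", "peer-3"},
	}
	fmt.Println(len(s.Forward("aa", "bb", "peer-1"))) // 2: flooded to both other peers
	s.Forward("bb", "aa", "peer-2")                   // "bb" is learned here
	fmt.Println(len(s.Forward("aa", "bb", "peer-1"))) // 1: now unicast
}
```

If the table is cleared (e.g. on a route change), every destination is briefly "unknown" again and the flooding path above is taken until the table is re-populated.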
How to reproduce it?
I think pod scheduling activity, and perhaps some periodic gossip/discovery phase, triggers a peer to clear its table and then re-populate it. Based on log output, we're seeing roughly 1-2 peers discovered per second. For a cluster of our size (and growing), that's a relatively large window for traffic to be amplified.
- route changes triggering route invalidation
- all flows being deleted, leaving a big window for amplification?
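As a back-of-envelope illustration of why this scales badly (my own arithmetic, not taken from Weave's code): while the tables are empty, each unknown-destination frame is copied to every other peer, multiplying on-wire traffic by roughly (peers - 1).

```go
package main

import "fmt"

// floodCopies is the worst-case number of copies of a single frame
// while forwarding tables are empty: one copy to every other peer.
func floodCopies(peers int) int {
	return peers - 1
}

func main() {
	// On a 58-node cluster, one unknown-destination frame can become
	// up to 57 copies until the tables are re-populated.
	fmt.Println(floodCopies(58)) // prints 57
}
```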
Anything else we need to know?
On bare-metal provisioned with kubeadm
Versions:
$ weave version
weave 2.5.1
$ docker version
Client:
Version: 18.06.2-ce
API version: 1.38
Go version: go1.10.3
Git commit: 6d37f41
Built: Sun Feb 10 03:48:06 2019
OS/Arch: linux/amd64
Experimental: false
Server:
Engine:
Version: 18.06.2-ce
API version: 1.38 (minimum version 1.12)
Go version: go1.10.3
Git commit: 6d37f41
Built: Sun Feb 10 03:46:30 2019
OS/Arch: linux/amd64
Experimental: false
$ uname -a
Linux r26c4na 4.4.0-142-generic #168-Ubuntu SMP Wed Jan 16 21:00:45 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.2", GitCommit:"cff46ab41ff0bb44d8584413b598ad8360ec1def", GitTreeState:"clean", BuildDate:"2019-01-13T23:16:01Z", GoVersion:"go1.11.4", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.2", GitCommit:"cff46ab41ff0bb44d8584413b598ad8360ec1def", GitTreeState:"clean", BuildDate:"2019-01-10T23:28:14Z", GoVersion:"go1.11.4", Compiler:"gc", Platform:"linux/amd64"}
Logs:
Our biggest clue was seeing traffic spike in unrelated pods during a large hourly workload elsewhere in the cluster, sometimes saturating the link.
peer-1: Discovered remote MAC ...
peer-2: Discovered remote MAC ...
peer-1: Captured frame from MAC ... associated with peer-2
Evidence that peers receive broadcast traffic during the window in which they populate their tables.
Network:
Our network seems fine: no packet loss, low latency, bonded interfaces.
Activity