Weave Net Daemonset fails to restart pod due to existing dummy interface #3414
Description
What you expected to happen?
The Weave Net DaemonSet, which controls the weave-net pod on each node, should be able to restart the pod on a node when it fails or is stopped for any reason.
What happened?
The weave-net pod goes into Error and then CrashLoopBackOff status and cannot recover until I terminate that node.
How to reproduce it?
SSH into a node and use the docker command to kill the weave-net container. Of course, this is just to reproduce the problem. On our production cluster, weave-net sometimes crashes on a node and we don't know why.
The logs show that weave-net fails to create the dummy interface:
FATA: 2018/09/26 05:08:04.369497 creating dummy interface: file exists
I did a small investigation, and it looks like the bug comes from net/bridge.go, in the function initPrep, at these lines:
dummy := &netlink.Dummy{LinkAttrs: netlink.NewLinkAttrs()}
dummy.LinkAttrs.Name = "vethwedu"
if err = netlink.LinkAdd(dummy); err != nil {
	return errors.Wrap(err, "creating dummy interface")
}
During startup, weave-net creates a dummy interface. When my pod restarts, that interface already exists, as confirmed by the ip link | grep vethwedu command:
96782: vethwedu: <BROADCAST,NOARP> mtu 1376 qdisc noop state DOWN mode DEFAULT group default qlen 1000
It looks like the previous weave-net session failed to delete this dummy interface, or was killed before it could delete it. When I delete the dummy manually with ip link delete vethwedu, the pod starts and runs normally again.
Adding a small check that deletes the dummy if it already exists, before creating a new one, would solve this problem. Is this a good solution? If so, I'll open a PR.
// Remove a stale dummy left over from a previous run, if there is one.
if existingDummy, err := netlink.LinkByName("vethwedu"); err == nil {
	if err := netlink.LinkDel(existingDummy); err != nil {
		return errors.Wrap(err, "deleting existing dummy interface")
	}
}
//...
Anything else we need to know?
I run our Kubernetes cluster on AWS, using KOPS.
Versions:
$ weave version: 2.4.1
$ docker version
Client:
Version: 17.03.2-ce
API version: 1.27
Go version: go1.7.5
Git commit: f5ec1e2
Built: Tue Jun 27 02:09:56 2017
OS/Arch: linux/amd64
$ uname -a
Linux ip-172-50-52-229 4.4.121-k8s #1 SMP Sun Mar 11 19:39:47 UTC 2018 x86_64 GNU/Linux
$ kubectl version
Server Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.8", GitCommit:"c138b85178156011dc934c2c9f4837476876fb07", GitTreeState:"clean", BuildDate:"2018-05-21T18:53:18Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}
Logs:
$ kubectl logs -n kube-system <weave-net-pod> weave
DEBU: 2018/09/26 05:08:04.266053 [kube-peers] Checking peer "7a:1f:1c:b2:b7:7e" against list &{[{ea:43:25:46:a3:95 ip-172-50-61-167.ap-southeast-2.compute.internal} {16:ac:8c:71:55:20 ip-172-50-80-215.ap-southeast-2.compute.internal} {3e:af:c2:26:08:00 ip-172-50-110-136.ap-southeast-2.compute.internal} {2e:04:f7:c2:42:71 ip-172-50-32-17.ap-southeast-2.compute.internal} {32:4d:7c:65:31:8d ip-172-50-50-186.ap-southeast-2.compute.internal} {ae:92:06:fd:b3:e7 ip-172-50-93-2.ap-southeast-2.compute.internal} {7e:69:df:8f:8f:17 ip-172-50-41-197.ap-southeast-2.compute.internal} {62:aa:0a:e8:65:96 ip-172-50-109-73.ap-southeast-2.compute.internal} {f6:30:3e:1d:4b:8b ip-172-50-113-192.ap-southeast-2.compute.internal} {7a:1f:1c:b2:b7:7e ip-172-50-52-229.ap-southeast-2.compute.internal} {12:b7:c5:3d:f4:82 ip-172-50-67-61.ap-southeast-2.compute.internal} {aa:0d:6b:9c:56:9b ip-172-50-44-14.ap-southeast-2.compute.internal} {82:e9:f1:ce:c5:29 ip-172-50-58-155.ap-southeast-2.compute.internal} {26:a8:11:0d:76:e2 ip-172-50-33-242.ap-southeast-2.compute.internal}]}
INFO: 2018/09/26 05:08:04.297726 Command line options: map[ipalloc-range:100.96.0.0/11 port:6783 docker-api: expect-npc:true host-root:/host nickname:ip-172-50-52-229.ap-southeast-2.compute.internal conn-limit:100 db-prefix:/weavedb/weave-net http-addr:127.0.0.1:6784 metrics-addr:0.0.0.0:6782 name:7a:1f:1c:b2:b7:7e datapath:datapath ipalloc-init:consensus=14 no-dns:true]
INFO: 2018/09/26 05:08:04.297772 weave 2.3.0
FATA: 2018/09/26 05:08:04.369497 creating dummy interface: file exists