weave-npc blocking connections with valid network policy after a period of time (2.6.0) #3764
Description
What you expected to happen?
Similar to #3761, we are seeing traffic being blocked by weave-npc, but we are using network policies. I would expect NPC not to block traffic when a valid network policy is in place.
What happened?
Consistently, about once every 1-2 weeks, traffic gets blocked between pods inside a namespace where it was previously working fine. When we debugged the issue, we saw that the ipsets on the host did not have valid entries for the pods. After we restarted weave on the host, the ipsets were repopulated and traffic flowed again.
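The ipset check described above can be sketched roughly as follows. This is only an illustration, not the exact procedure we ran: it assumes the output of `ipset save` on the affected host has been captured as text, that the weave-npc sets can be recognised by a `weave-` name prefix (an assumption), and that the pod IPs were collected separately. The function names and the sample data are hypothetical.

```python
def parse_ipset_save(text):
    """Map each set name to its member entries, based on `ipset save` output."""
    sets = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[0] == "create":
            # Register the set even if it turns out to have no members.
            sets.setdefault(parts[1], set())
        elif len(parts) >= 3 and parts[0] == "add":
            sets.setdefault(parts[1], set()).add(parts[2])
    return sets


def missing_pod_ips(pod_ips, ipsets, prefix="weave-"):
    """Return pod IPs that appear in no set whose name starts with `prefix`.

    The "weave-" prefix is an assumption about how weave-npc names its sets.
    """
    known = set()
    for name, entries in ipsets.items():
        if name.startswith(prefix):
            known |= entries
    return sorted(ip for ip in pod_ips if ip not in known)


# Hypothetical sample; real input would come from `ipset save` on the node.
SAMPLE = """\
create weave-example1 hash:ip family inet hashsize 1024 maxelem 65536 comment
add weave-example1 10.32.0.5 comment "namespace: default, pod: a"
create weave-example2 hash:ip family inet hashsize 1024 maxelem 65536 comment
"""

if __name__ == "__main__":
    ipsets = parse_ipset_save(SAMPLE)
    print(missing_pod_ips(["10.32.0.5", "10.32.0.7"], ipsets))
```

Any pod IP this reports as missing is a candidate for the kind of stale or empty ipset entry we observed before restarting weave.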
How to reproduce it?
We do not have a reliable way to reproduce this issue, but we are beginning to see it nearly every week within one specific cluster.
Anything else we need to know?
Cloud provider: AWS
Custom-built cluster using in-house automation.
Versions:
# ./weave version
weave script 2.6.0
# docker version
Client: Docker Engine - Community
Version: 19.03.5
API version: 1.40
Go version: go1.12.12
Git commit: 633a0ea838
Built: Wed Nov 13 07:29:52 2019
OS/Arch: linux/amd64
Experimental: false
Server: Docker Engine - Community
Engine:
Version: 19.03.5
API version: 1.40 (minimum version 1.12)
Go version: go1.12.12
Git commit: 633a0ea838
Built: Wed Nov 13 07:28:22 2019
OS/Arch: linux/amd64
Experimental: false
containerd:
Version: 1.2.10
GitCommit: b34a5c8af56e510852c35414db4c1f4fa6172339
runc:
Version: 1.0.0-rc8+dev
GitCommit: 3e425f80a8c931f88e6d94a8c831b9d5aa481657
docker-init:
Version: 0.18.0
GitCommit: fec3683
# uname -a
Linux ip-10-0-173-150 4.15.0-1056-aws #58-Ubuntu SMP Tue Nov 26 15:14:34 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.2", GitCommit:"f6278300bebbb750328ac16ee6dd3aa7d3549568", GitTreeState:"clean", BuildDate:"2019-08-05T16:54:35Z", GoVersion:"go1.12.7", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.6", GitCommit:"7015f71e75f670eb9e7ebd4b5749639d42e20079", GitTreeState:"clean", BuildDate:"2019-11-13T11:11:50Z", GoVersion:"go1.12.12", Compiler:"gc", Platform:"linux/amd64"}
Logs:
Unfortunately, these logs do not include the weave logs from before the restart. When we run into this issue again (in a week or so), we will capture those logs and update this issue.
https://gist.github.com/naemono/31df744c7ee6b48dba7b554e06553f4b
When this issue is happening, we begin to see a spike in weavenpc_blocked_connections_total in Prometheus.
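As a rough illustration of how such a spike could be detected from raw counter samples, here is a minimal sketch. It assumes the samples are `(timestamp_seconds, counter_value)` pairs scraped for weavenpc_blocked_connections_total; the function names and the threshold are hypothetical, and a drop in the counter is treated as a reset (for example after a weave restart).

```python
def blocked_rates(samples):
    """Per-second increase between consecutive (timestamp, counter) samples."""
    rates = []
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        if t1 <= t0:
            continue  # skip duplicate or out-of-order samples
        # A counter that went down is assumed to have restarted from zero.
        delta = v1 - v0 if v1 >= v0 else v1
        rates.append((t1, delta / (t1 - t0)))
    return rates


def spike_times(rates, threshold):
    """Timestamps at which the blocked-connection rate exceeds `threshold`."""
    return [t for t, rate in rates if rate > threshold]


if __name__ == "__main__":
    # Hypothetical samples: steady trickle, a burst, then a counter reset.
    samples = [(0, 0), (60, 6), (120, 606), (180, 3)]
    print(spike_times(blocked_rates(samples), threshold=1.0))
```

An alert built on a check like this could flag the affected host before anyone notices the blocked traffic.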