Describe the problem
A Netbird peer has internet connectivity and can talk to some of its other peers, but over time the number of available peers degrades. This also seems to harm network routes. Restarting the Netbird service temporarily restores full connectivity.
I've created an Ansible playbook to ping the peers on the network. I run the test from a Linux server that is a peer on the Netbird network and has a reliable half-gigabit connection to the Internet. Several of the peers it tries to reach are expected to be offline at any given time, but others should always be accessible (they are in quality datacenters with redundant connections).
I've tried the setup with both Network Monitoring enabled and disabled. The problem remains the same.
Here's a table of results from those Ansible ping tests, all run from the peer doing the testing (Independence). See how things vary across four runs: right after a fresh Netbird restart late last night, a few hours later, this morning, and again after another fresh Netbird restart moments ago:
Host | After NB Restart | Hours Later | Next Morning | Fresh NB Restart | Location |
---|---|---|---|---|---|
amurmaple.anon-ZDXFz.domain | ✅ | ❌ | ❌ | ✅ | Datacenter #2 (QB) |
beatrice.anon-ZDXFz.domain | ✅ | ❌ | ✅ | ✅ | Studio |
bigleafmaple.anon-ZDXFz.domain | ❌ | ❌ | ❌ | ❌ | Datacenter #2 (QB) |
boaz.anon-ZDXFz.domain | ❌ | ❌ | ❌ | ❌ | Office |
cyprus.anon-ZDXFz.domain | ✅ | ❌ | ✅ | ✅ | Datacenter #1 (CA) |
falstaff.anon-ZDXFz.domain | ❌ | ❌ | ❌ | ❌ | Home |
franklin.anon-ZDXFz.domain | ✅ | ❌ | ✅ | ✅ | Studio |
independence.anon-ZDXFz.domain | ✅ | ✅ | ✅ | ✅ | Studio |
juniper.anon-ZDXFz.domain | ✅ | ✅ | ✅ | ✅ | Datacenter #1 (CA) |
madison.anon-ZDXFz.domain | ✅ | ✅ | ✅ | ✅ | Home |
maple.anon-ZDXFz.domain | ✅ | ✅ | ✅ | ❌ | Datacenter #2 (QB) |
mesquite.anon-ZDXFz.domain | ✅ | ❌ | ❌ | ✅ | Datacenter #1 (CA) |
oberon.anon-ZDXFz.domain | ✅ | ✅ | ❌ | ✅ | Home |
rahab.anon-ZDXFz.domain | ❌ | ❌ | ❌ | ❌ | Office |
rosalind.anon-ZDXFz.domain | ✅ | ✅ | ✅ | ✅ | Studio |
spruce.anon-ZDXFz.domain | ✅ | ✅ | ❌ | ✅ | Datacenter #1 (CA) |
sugarmaple.anon-ZDXFz.domain | ✅ | ✅ | ✅ | ✅ | Datacenter #2 (QB) |
thomas.anon-ZDXFz.domain | ❌ | ❌ | ❌ | ❌ | Office |
touchstone.anon-ZDXFz.domain | ✅ | ✅ | ✅ | ✅ | Studio |
Notably, several of the failed peers, such as Mesquite and Amurmaple, are in datacenters, and their public connections to the Internet remain online even as they fail. Others, such as Franklin, are right next to the server doing the testing (Independence): those two are on the same switch on the same network. But not all systems on the same network fail (Touchstone is on the same network, for example), nor do all at the datacenter fail (Spruce is the bare-metal server on which Mesquite runs as a container). So I don't see an obvious "genre" of hosts that go down versus others that remain up.
Maple and Spruce also use an HA network route to access a service (192.168.5.140); Independence, Franklin, Beatrice and Touchstone are members of that route. The route intermittently becomes unavailable, but if I restart Netbird on either Maple or Spruce and on the four HA members, it begins working again.
You'll see Maple is listed as unavailable in the most recent test (coming from Independence), but if I ping it from Oberon, Maple remains available.
Note: All the "office" location systems are offline because of a power outage. Feel free to ignore those, but I wanted to include them in the table just in case such a situation might somehow "ripple" in an unexpected way. When the power is on there, they exhibit a similar situation, where Thomas will become unavailable frequently whereas Boaz remains available most of the time. But, that isn't perfectly consistent: sometimes it is Boaz that goes down and not Thomas.
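To make the degradation in the table easier to read, here is a minimal Python sketch that diffs the runs. It is not the Ansible playbook used for the tests; the variable names and the `lost` helper are hypothetical, and the sets are simply the ✅ entries transcribed from the first three columns of the table (short host names, domain suffix dropped).

```python
# Hypothetical helper, not part of the Ansible playbook: diff two ping runs.
# Each set holds the peers that answered (the ✅ cells) in one table column.
after_restart = {
    "amurmaple", "beatrice", "cyprus", "franklin", "independence",
    "juniper", "madison", "maple", "mesquite", "oberon", "rosalind",
    "spruce", "sugarmaple", "touchstone",
}
hours_later = {
    "independence", "juniper", "madison", "maple", "oberon",
    "rosalind", "spruce", "sugarmaple", "touchstone",
}
next_morning = {
    "beatrice", "cyprus", "franklin", "independence", "juniper",
    "madison", "maple", "rosalind", "sugarmaple", "touchstone",
}


def lost(before, after):
    """Return the peers that answered in `before` but not in `after`, sorted."""
    return sorted(before - after)


# Peers that dropped out within hours of the restart.
print(lost(after_restart, hours_later))
# Peers that came back by the next morning without a restart.
print(lost(next_morning, hours_later))
```

Between the first two runs this reports Amurmaple, Beatrice, Cyprus, Franklin and Mesquite as lost; Beatrice, Cyprus and Franklin are reachable again by the next morning, while Oberon and Spruce have dropped out by then.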
To Reproduce
Steps to reproduce the behavior:
- Ensure full communication with online peers. Run ping test.
- Wait several hours.
- Run ping test and notice servers at two distinct datacenters are no longer reachable.
- Run `service netbird restart` on the peer doing the test and note the peers that were offline are now reachable again. (Repeat ad infinitum.)
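For anyone who wants to reproduce this without the Ansible playbook, the steps above can be approximated with a small Python sketch. This is an assumption-laden stand-in, not the actual test: `ping` and `watch` are hypothetical names, the `-c`/`-W` flags assume the Linux iputils `ping`, and the host list and interval are up to the tester.

```python
import subprocess
import time


def ping(host, timeout_s=2):
    """Return True if a single ICMP echo to `host` succeeds.

    Assumes the Linux iputils `ping` (-c count, -W timeout in seconds).
    """
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def watch(hosts, rounds, interval_s, probe=ping, sleep=time.sleep):
    """Probe every host once per round; return the unreachable hosts per round."""
    history = []
    for i in range(rounds):
        history.append([h for h in hosts if not probe(h)])
        if i < rounds - 1:
            sleep(interval_s)  # e.g. several hours between rounds
    return history
```

Calling something like `watch(peers, rounds=2, interval_s=4 * 3600)` on the testing peer, with `peers` set to the Netbird host names, would give the unreachable list right after a restart and again a few hours later. The `probe` and `sleep` parameters exist only so the loop can be exercised without real network traffic.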
Expected behavior
Peers remain able to talk to each other and to access high availability network routes even if one peer goes offline.
Are you using NetBird Cloud?
Self-hosted 0.44.0
NetBird version
0.44.0 on all but one peer.
Is any other VPN software installed?
No.
Debug output
File key:
1a6ecdff51f59139b215eb9feb49b9dd88a71ef56826d1bdb2744db0448f3680/bf0674ce-c1f1-47ef-a7f1-d4c2900eebe6
Additional context
I'm not sure whether this is related to the problems Netbird has resuming after sleep on my network's macOS clients; see issue #2454.
Have you tried these troubleshooting steps?
- [x] Reviewed client troubleshooting (if applicable)
- [x] Checked for newer NetBird versions
- [x] Searched for similar issues on GitHub (including closed ones)
- [x] Restarted the NetBird client
- [x] Disabled other VPN software
- [x] Checked firewall settings