Netbird Loses Connection to Peers #3852

Open
@trbutler

Describe the problem

A Netbird peer has internet connectivity and can talk to some of its other peers, but over time the number of available peers degrades. This also seems to harm network routes. Restarting the Netbird service temporarily restores full connectivity.

I've created an Ansible playbook to ping the peers on the network. I run the test from a Linux server that is a peer on the Netbird network and has a reliable half-gigabit connection to the Internet. Several of the peers it tries to reach are expected to be offline at any given time, but others should always be accessible (they are in quality datacenters with redundant connections).
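The playbook itself is not included here. As a rough stand-in, a minimal Python sketch of the same kind of reachability check might look like the following. The function names `ping_once` and `probe_peers` are made up for illustration, and the `ping` flags assume the Linux iputils version of `ping`.

```python
import subprocess
from typing import Callable, Dict, Iterable


def ping_once(host: str, timeout_s: int = 2) -> bool:
    """Send one ICMP echo request and report whether a reply arrived.

    Assumes Linux iputils ping: -c is the packet count, -W the reply timeout in seconds.
    """
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def probe_peers(
    hosts: Iterable[str],
    probe: Callable[[str], bool] = ping_once,
) -> Dict[str, bool]:
    """Map each host name to True (reachable) or False (unreachable).

    The probe is injectable so the logic can be exercised without a network.
    """
    return {host: probe(host) for host in hosts}
```

Calling `probe_peers([...])` with the peer host names right after a restart, and again some hours later, gives two snapshots that can be compared.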

I've tried the setup with both Network Monitoring enabled and disabled. The problem remains the same.

Here's a table of results from those Ansible ping tests. For example, see how things vary right after a fresh Netbird restart late last night on the peer doing the testing (Independence), a few hours later, this morning, and then again after another fresh restart of Netbird moments ago:

| Host | After NB Restart | Hours Later | Next Morning | Fresh NB Restart | Location |
|---|---|---|---|---|---|
| amurmaple.anon-ZDXFz.domain | | | | | Datacenter #2 (QB) |
| beatrice.anon-ZDXFz.domain | | | | | Studio |
| bigleafmaple.anon-ZDXFz.domain | | | | | Datacenter #2 (QB) |
| boaz.anon-ZDXFz.domain | | | | | Office |
| cyprus.anon-ZDXFz.domain | | | | | Datacenter #1 (CA) |
| falstaff.anon-ZDXFz.domain | | | | | Home |
| franklin.anon-ZDXFz.domain | | | | | Studio |
| independence.anon-ZDXFz.domain | | | | | Studio |
| juniper.anon-ZDXFz.domain | | | | | Datacenter #1 (CA) |
| madison.anon-ZDXFz.domain | | | | | Home |
| maple.anon-ZDXFz.domain | | | | | Datacenter #2 (QB) |
| mesquite.anon-ZDXFz.domain | | | | | Datacenter #1 (CA) |
| oberon.anon-ZDXFz.domain | | | | | Home |
| rahab.anon-ZDXFz.domain | | | | | Office |
| rosalind.anon-ZDXFz.domain | | | | | Studio |
| spruce.anon-ZDXFz.domain | | | | | Datacenter #1 (CA) |
| sugarmaple.anon-ZDXFz.domain | | | | | Datacenter #2 (QB) |
| thomas.anon-ZDXFz.domain | | | | | Office |
| touchstone.anon-ZDXFz.domain | | | | | Studio |

Notably, a number of the failed peers, such as Mesquite and Amurmaple, are the ones in datacenters, and their public connections to the Internet remain online even as they fail. Others, such as Franklin, are right next to the server doing the testing (Independence); those two are on the same switch on the same network. But not all systems on the same network fail (Touchstone is on the same network, for example), nor do all of those at the datacenter fail (Spruce is actually the bare-metal server that hosts Mesquite as a container). So I don't see any obvious "genre" of hosts that go down versus others that remain up.

Maple and Spruce also use an HA network route to access a service (192.168.5.140); Independence, Franklin, Beatrice and Touchstone are the members of that route. The route intermittently becomes unavailable, but if I restart Netbird on either Maple or Spruce and on the four HA members, it begins working again.

You'll see Maple is listed as unavailable in the most recent test (coming from Independence), but if I ping it from Oberon, Maple remains available.

Note: All the "office" location systems are offline because of a power outage. Feel free to ignore those, but I wanted to include them in the table just in case such a situation might somehow "ripple" in an unexpected way. When the power is on there, they exhibit a similar situation, where Thomas will become unavailable frequently whereas Boaz remains available most of the time. But, that isn't perfectly consistent: sometimes it is Boaz that goes down and not Thomas.

To Reproduce

Steps to reproduce the behavior:

  1. Ensure full communication with online peers. Run ping test.
  2. Wait several hours.
  3. Run ping test and notice servers at two distinct datacenters are no longer reachable.
  4. Run service netbird restart on the peer doing the test and note the peers that were offline are now reachable again. (Repeat ad infinitum.)
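To make step 3 concrete, here is a small Python sketch that compares two reachability snapshots and lists the peers that were up right after the restart but are no longer reachable later. The function name `degraded_peers` and the snapshot format (host name mapped to a boolean) are assumptions for illustration, and the host names in the comment are placeholders, not results from my network.

```python
from typing import Dict, List


def degraded_peers(after_restart: Dict[str, bool], later: Dict[str, bool]) -> List[str]:
    """Return hosts reachable in the first snapshot but not in the second.

    Hosts missing from the later snapshot are treated as unreachable.
    Hosts that were already down after the restart are ignored, since
    they may simply be offline.
    """
    return sorted(
        host
        for host, was_up in after_restart.items()
        if was_up and not later.get(host, False)
    )


# Placeholder data, not real results:
# degraded_peers({"peer-a": True, "peer-b": True}, {"peer-a": True, "peer-b": False})
# returns ["peer-b"]
```

A non-empty result a few hours after a restart, for peers known to be online, is the symptom described in this report.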

Expected behavior

Peers remain able to talk to each other and to access high availability network routes even if one peer goes offline.

Are you using NetBird Cloud?

Self-hosted 0.44.0

NetBird version

0.44.0 on all but one peer.

Is any other VPN software installed?

No.

Debug output

20250520.log.txt

File key:
1a6ecdff51f59139b215eb9feb49b9dd88a71ef56826d1bdb2744db0448f3680/bf0674ce-c1f1-47ef-a7f1-d4c2900eebe6

Additional context

I'm not sure if this has any relation to the problems Netbird has resuming after sleep on my network's macOS clients (see issue #2454).

Have you tried these troubleshooting steps?

  • [ x ] Reviewed client troubleshooting (if applicable)
  • [ x ] Checked for newer NetBird versions
  • [ x ] Searched for similar issues on GitHub (including closed ones)
  • [ x ] Restarted the NetBird client
  • [ x ] Disabled other VPN software
  • [ x ] Checked firewall settings
