Netbird Loses Connection to Peers #3852

**Describe the problem**

A Netbird peer has internet connectivity and can talk to some of its other peers, but over time the number of available peers degrades. This also seems to harm network routes. Restarting the Netbird service temporarily restores full connectivity.

I've created an Ansible playbook to ping the peers on the network. I run the test from a Linux server that is itself a peer on the Netbird network and has a reliable half-gigabit connection to the Internet. Several of the peers it tries to reach are expected to be offline at any given time, but others should always be accessible (they are in quality datacenters with redundant connections).
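For readers without the playbook, the same kind of check can be approximated with a short script. This is only a minimal sketch, not the Ansible playbook described above: the function names `ping_once` and `ping_sweep` are made up for illustration, and the `ping` flags assume the Linux iputils version of `ping`.

```python
import subprocess
from typing import Callable, Dict, Iterable


def ping_once(host: str, timeout_s: int = 2) -> bool:
    """Send one ICMP echo request; True if a reply arrives.

    Assumes Linux iputils ping: -c is the packet count,
    -W is the reply timeout in seconds.
    """
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def ping_sweep(
    hosts: Iterable[str],
    probe: Callable[[str], bool] = ping_once,
) -> Dict[str, bool]:
    """Probe each host once and return a hostname -> reachable map."""
    return {host: probe(host) for host in hosts}
```

Calling `ping_sweep([...])` with the peers' Netbird hostnames returns one boolean per host, which corresponds to one column of the results table. The `probe` parameter exists only so the sweep can be exercised without touching the network.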

I've tried the setup with both Network Monitoring enabled and disabled. The problem remains the same.

Here's a table of results from those Ansible ping tests at four points in time: right after a fresh Netbird restart late last night on the peer doing the testing (Independence), a few hours later, this morning, and again after another fresh Netbird restart moments ago:

| Host | After NB Restart | Hours Later | Next Morning | Fresh NB Restart | Location |
|-------------------------------|---------------|-----------------|-----------------|---------------|---------------------|
| amurmaple.anon-ZDXFz.domain   | ✅            | ❌              | ❌              | ✅            | Datacenter #2 (QB)  |
| beatrice.anon-ZDXFz.domain    | ✅            | ❌              | ✅              | ✅            | Studio              |
| bigleafmaple.anon-ZDXFz.domain| ❌            | ❌              | ❌              | ❌            | Datacenter #2 (QB)  |
| boaz.anon-ZDXFz.domain        | ❌            | ❌              | ❌              | ❌            | Office              |
| cyprus.anon-ZDXFz.domain      | ✅            | ❌              | ✅              | ✅            | Datacenter #1 (CA)  |
| falstaff.anon-ZDXFz.domain    | ❌            | ❌              | ❌              | ❌            | Home                |
| franklin.anon-ZDXFz.domain    | ✅            | ❌              | ✅              | ✅            | Studio              |
| independence.anon-ZDXFz.domain| ✅            | ✅              | ✅              | ✅            | Studio              |
| juniper.anon-ZDXFz.domain     | ✅            | ✅              | ✅              | ✅            | Datacenter #1 (CA)  |
| madison.anon-ZDXFz.domain     | ✅            | ✅              | ✅              | ✅            | Home                |
| maple.anon-ZDXFz.domain       | ✅            | ✅              | ✅              | ❌            | Datacenter #2 (QB)  |
| mesquite.anon-ZDXFz.domain    | ✅            | ❌              | ❌              | ✅            | Datacenter #1 (CA)  |
| oberon.anon-ZDXFz.domain      | ✅            | ✅              | ❌              | ✅            | Home                |
| rahab.anon-ZDXFz.domain       | ❌            | ❌              | ❌              | ❌            | Office              |
| rosalind.anon-ZDXFz.domain    | ✅            | ✅              | ✅              | ✅            | Studio              |
| spruce.anon-ZDXFz.domain      | ✅            | ✅              | ❌              | ✅            | Datacenter #1 (CA)  |
| sugarmaple.anon-ZDXFz.domain  | ✅            | ✅              | ✅              | ✅            | Datacenter #2 (QB)  |
| thomas.anon-ZDXFz.domain      | ❌            | ❌              | ❌              | ❌            | Office              |
| touchstone.anon-ZDXFz.domain  | ✅            | ✅              | ✅              | ✅            | Studio              |



Notably, a number of the failed peers, such as Mesquite and Amurmaple, are the ones in data centers, and their public connections to the Internet remain online even as they fail. Others, such as Franklin, are right next to the server doing the testing (Independence): those two are on the same switch on the same network. But not all systems on the same network fail (Touchstone is on the same network, for example), nor do all at the datacenter fail (Spruce is the bare-metal server on which Mesquite runs as a container). So I don't see an obvious "genre" of hosts that go down versus others that remain up.

Maple and Spruce also use an HA network route to access a service (192.168.5.140); Independence, Franklin, Beatrice and Touchstone are the members of that route. The route intermittently becomes unavailable, but if I restart Netbird on either Maple or Spruce _and_ on the four HA members, it begins working again.

You'll see Maple is listed as unavailable in the most recent test (coming from Independence), but if I ping it from Oberon, Maple remains available. 

Note: All the "office" location systems are offline because of a power outage. Feel free to ignore those, but I wanted to include them in the table just in case such a situation might somehow "ripple" in an unexpected way. When the power is on there, they exhibit a similar situation, where Thomas will become unavailable frequently whereas Boaz remains available most of the time. But, that isn't perfectly consistent: sometimes it is Boaz that goes down and not Thomas. 

**To Reproduce**

Steps to reproduce the behavior:
1. Ensure full communication with online peers. Run ping test.
2. Wait several hours.
3. Run ping test and notice servers at two distinct datacenters are no longer reachable. 
4. Run `service netbird restart` on the peer doing the test and note the peers that were offline are now reachable again. (Repeat ad infinitum.)
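The comparison between steps 1 and 3 can be made explicit by diffing two sets of ping results. The sketch below is illustrative only: `lost_peers` is a hypothetical helper, and the sample data is a hand-copied subset of the "After NB Restart" and "Hours Later" columns from the table above, using short hostnames.

```python
from typing import Dict, List


def lost_peers(before: Dict[str, bool], after: Dict[str, bool]) -> List[str]:
    """Return peers that were reachable in `before` but not in `after`.

    A peer missing from `after` is treated as unreachable.
    """
    return sorted(
        host for host, ok in before.items()
        if ok and not after.get(host, False)
    )


# Subset of the table: "After NB Restart" vs. "Hours Later".
after_restart = {
    "amurmaple": True,
    "boaz": False,      # offline the whole time (office power outage)
    "franklin": True,
    "juniper": True,
    "mesquite": True,
}
hours_later = {
    "amurmaple": False,
    "boaz": False,
    "franklin": False,
    "juniper": True,
    "mesquite": False,
}

print(lost_peers(after_restart, hours_later))
# prints ['amurmaple', 'franklin', 'mesquite']
```

Peers that were already offline at the first test, such as Boaz, are deliberately excluded, so the list contains only peers that were reachable and then lost.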

**Expected behavior**

Peers remain able to talk to each other and to access high availability network routes even if one peer goes offline.

**Are you using NetBird Cloud?**

Self-hosted 0.44.0 

**NetBird version**

0.44.0 on all but one peer.

**Is any other VPN software installed?**

No.

**Debug output**

[20250520.log.txt](https://github.com/user-attachments/files/20353443/20250520.log.txt)

File key: 
1a6ecdff51f59139b215eb9feb49b9dd88a71ef56826d1bdb2744db0448f3680/bf0674ce-c1f1-47ef-a7f1-d4c2900eebe6

**Additional context**

I'm not sure whether this is related to the problems Netbird has resuming after sleep on my network's macOS clients; see issue #2454.

**Have you tried these troubleshooting steps?**
- [x] Reviewed [client troubleshooting](https://docs.netbird.io/how-to/troubleshooting-client) (if applicable)
- [x] Checked for newer NetBird versions
- [x] Searched for similar issues on GitHub (including closed ones)
- [x] Restarted the NetBird client
- [x] Disabled other VPN software
- [x] Checked firewall settings


