Skip to content

[client] Feature/lazy connection #3379

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 108 commits into
base: main
Choose a base branch
from
Open

[client] Feature/lazy connection #3379

wants to merge 108 commits into from

Conversation

pappz
Copy link
Contributor

@pappz pappz commented Feb 24, 2025

Describe your changes

With the lazy connection feature, the peer will connect to target peers on-demand. The trigger can be any IP traffic.

This feature can be enabled with the NB_ENABLE_EXPERIMENTAL_LAZY_CONN environment variable.

When the engine receives a network map, it binds a free UDP port for every remote peer, and the system configures WireGuard endpoints for these ports. When traffic appears on a UDP socket, the system removes this listener and starts the peer connection procedure immediately.

Key changes

  • Fix slow netbird status -d command
  • Move from engine.go file to conn_mgr.go the peer connection related code
  • Refactor the iface interface usage and moved interface file next to the engine code
  • Add new command line flag and UI option to enable feature
  • The peer.Conn struct is reusable after it has been closed.

Issue ticket number and link

Checklist

  • Is it a bug fix
  • Is a typo/documentation fix
  • Is a feature enhancement
  • It is a refactor
  • Created tests that fail without the change (if possible)
  • Extended the README / documentation, if necessary

@Lauszus
Copy link

Lauszus commented Apr 24, 2025

@pappz let me know if you need help testing. I have some IoT devices were I had to disable Netbird again, as it was consuming too much data, so I'm really interesting in helping this PR along :)

pappz added 2 commits April 25, 2025 12:46
Check the minimum version number of the remote peer, and if the remote peer does not support the lazy connection, then always use a permanent connection.

Keep in mind just because the user updated the agent, the management server will not push a new network map with the new version information.
@pappz
Copy link
Contributor Author

pappz commented Apr 25, 2025

@pappz let me know if you need help testing. I have some IoT devices were I had to disable Netbird again, as it was consuming too much data, so I'm really interesting in helping this PR along :)

@Lauszus, thank you!
The feature is ready for review. In the meantime, I am working on bug fixes based on feedback. I would like to highlight that this feature also requires an update to the Management server; without it, we cannot establish a backward-compatible lazy connection. By default, this feature is disabled. You can enable it using an environment variable or command-line options.

@Lauszus
Copy link

Lauszus commented Apr 25, 2025

@pappz let me know if you need help testing. I have some IoT devices were I had to disable Netbird again, as it was consuming too much data, so I'm really interesting in helping this PR along :)

@Lauszus, thank you!

The feature is ready for review. In the meantime, I am working on bug fixes based on feedback. I would like to highlight that this feature also requires an update to the Management server; without it, we cannot establish a backward-compatible lazy connection. By default, this feature is disabled. You can enable it using an environment variable or command-line options.

No problem. I can enable it and test it on my side. I have ~50 devices where I can try it on.

pappz and others added 7 commits May 2, 2025 12:51
…ng (#3747)

Fix the exclude list handling. When the agent gets a new network map, it manages the changes in the excluded peers.

Cases:

Extend the exclude list with a new peer: start permanent peer connection
Remove a peer from the exclude list: we do not know the actual state of the peer connection, so we must suppose it is not in "idle". Start the inactivity listener for peer.
If a peer is newly created in the management server and it will be in exclude list (i.e. it is a router) then the manager can not start the connection immediately because the peer.Conn is not exist yet. It is normal and the engine code will create the peer and start the connection in the next step.
fix idle state managmenet
Validate the threshold env variable
If the remote peer initiates the connection, we just reset the inactivity listener, but do not start it. As a result, after a disconnection, our peer never switches back to the idle state.

For every peer, we instantiate an inactivity listener. It is a reusable timer, but it must start on a go routine after it has been stopped. The key challenge is to ensure that we never start the timer multiple times and always clean it up properly when the peer is in the idle state or it has been set to the exclusion list.
Copy link
Contributor

@lixmal lixmal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR description doesn't match code anymore (env vars)

if d.isClosed {
d.peerCfg.Log.Debugf("exit from activity listener")
} else {
d.peerCfg.Log.Errorf("failed to read from activity listener: %s", err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we not retry (continue) reading if there's an error != closed? Maybe with a ctx check in-between

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With it, we risk a high load loop. If it fails to read from UDP once, what makes us think it will succeed the second time?

Copy link
Contributor

@lixmal lixmal May 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With it, we risk a high load loop.

True

If it fails to read from UDP once, what makes us think it will succeed the second time?

It could be a temporary error, e.g. OS ran out of file descriptors.
But yeah, in general this should be rare. Maybe we can add a comment here explaining the implication for the time being?

pappz added 8 commits May 15, 2025 12:53
Add an option to the management proto to enable or disable the lazy connection feature on the peer side. Command-line options and environment variables override the management configuration.

- When the feature is enabled: Start the lazy connection manager and add all peers to the manager in the "active" st-ate.

- When the feature is disabled: Stop the lazy connection manager and establish connections to all peers in the store.

Fix deadlock

The conn_mgr is protected by a mutex. However, because some internal functions are called via callbacks from the lazy connection manager, a deadlock can occur. The solution is to completely eliminate the callbacks.
lixmal
lixmal previously approved these changes May 15, 2025
Copy link

Quality Gate Failed Quality Gate failed

Failed conditions
1 New issue
1 New Code Smells (required ≤ 0)

See analysis details on SonarQube Cloud

Catch issues before they fail your Quality Gate with our IDE extension SonarQube for IDE

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants