Skip to content

Conversation

@gamerslouis
Copy link
Contributor

@gamerslouis gamerslouis commented Jan 6, 2026

Describe your changes

NetBird clients rely on the management service and signal service to establish and maintain connections.
When the management service is temporary unavailable, a NetBird client is unable to re-establish existing connections after an ICE disconnect.

Introduce a new environment variable option: NB_KEEP_CONNECTION_ON_MANAGEMENT_DOWN.

When this flag is enabled and the management service is unavailable:

  • The client will not remove peers after an ICE disconnect.
  • If a network route cannot select a new peer, the client will retain the existing peer instead of removing it.

Issue ticket number and link

Stack

Checklist

  • Is it a bug fix
  • Is a typo/documentation fix
  • Is a feature enhancement
  • It is a refactor
  • Created tests that fail without the change (if possible)

By submitting this pull request, you confirm that you have read and agree to the terms of the Contributor License Agreement.

Documentation

Select exactly one:

  • I added/updated documentation for this change
  • Documentation is not needed for this change (explain why)

Docs PR URL (required if "docs added" is checked)

Paste the PR link from https://github.com/netbirdio/docs here:

Summary by CodeRabbit

  • Bug Fixes

    • Improved connection stability during management server maintenance by preserving active endpoints and routes under specific conditions, preventing unnecessary disconnections when the management service is temporarily unavailable.
  • New Features

    • Added configuration option to keep connections alive during management downtime scenarios.

✏️ Tip: You can customize this high-level summary in your review settings.

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Jan 6, 2026

📝 Walkthrough

Walkthrough

The changes introduce a feature flag NB_KEEP_CONNECTION_ON_MANAGEMENT_DOWN that conditionally preserves network endpoints and routes when the management connection is down. A new environment variable reader function is added, and endpoint/route removal logic is guarded in connection and route management code based on this flag and connection state.

Changes

Cohort / File(s) Summary
Environment Configuration
client/internal/peer/env.go
Adds exported constant EnvKeepConnectionOnMgmtDown and new function IsKeepConnectionOnMgmtDown() to read the feature flag from environment, with logging support.
Connection & Route Management
client/internal/peer/conn.go, client/internal/routemanager/client/client.go
Adds conditional guards that skip endpoint removal (onICEStateDisconnected, onRelayDisconnected) and route removal (recalculateRoutes) when the feature flag is enabled and management/signal states are not both connected.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Suggested reviewers

  • pappz

Poem

🐰 A flag arrives to save the day,
When management's far, connections stay!
Endpoints preserved, routes held tight,
Through cloudy downs, we keep the light. ✨

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and specifically describes the main feature: adding an option to prevent peer and route removal when management services are unavailable.
Description check ✅ Passed The description adequately explains the changes, motivation, and behavior of the new environment variable feature, with the checklist properly completed.
✨ Finishing touches
  • 📝 Generate docstrings

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@sonarqubecloud
Copy link

sonarqubecloud bot commented Jan 6, 2026

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Fix all issues with AI Agents
In @client/internal/peer/env.go:
- Line 24: The comment above the function incorrectly references
isConnectionKeepOnMgmtDown; update the comment to use the actual exported
function name IsKeepConnectionOnMgmtDown so the comment matches the symbol,
e.g., change the comment text to mention IsKeepConnectionOnMgmtDown to avoid the
name mismatch with the function.
🧹 Nitpick comments (4)
client/internal/routemanager/client/client.go (1)

339-343: Consider simplifying the boolean condition for readability.

The condition !(A && B) && C can be rewritten as (!A || !B) && C using De Morgan's law, which may be slightly more readable:

🔎 Suggested simplification
-		if !(w.statusRecorder.GetManagementState().Connected && w.statusRecorder.GetSignalState().Connected) && peer.IsKeepConnectionOnMgmtDown() {
+		if (!w.statusRecorder.GetManagementState().Connected || !w.statusRecorder.GetSignalState().Connected) && peer.IsKeepConnectionOnMgmtDown() {
			log.Warnf("No available routes for network [%v], keep current route %s", w.handler, w.currentChosen.Peer)
			return nil
		}
client/internal/peer/env.go (1)

24-33: Consider logging the environment variable value only once.

The function logs an info message every time it's called when the environment variable is set. Since this function is invoked from multiple locations (route recalculation, ICE disconnection, relay disconnection), it may produce duplicate log entries during connection flapping scenarios.

🔎 Suggested optimization using sync.Once
+var (
+	keepConnectionOnMgmtDownOnce sync.Once
+	keepConnectionOnMgmtDown     bool
+)
+
 // IsKeepConnectionOnMgmtDown checks if peers and routes should be kept when management connection is down
 func IsKeepConnectionOnMgmtDown() bool {
-	stickyOnManagementDownEnv := os.Getenv(EnvKeepConnectionOnMgmtDown)
-	if stickyOnManagementDownEnv == "" {
-		return false
-	}
-
-	log.Infof("peers will be kept on failure as %s is set to %s", EnvKeepConnectionOnMgmtDown, stickyOnManagementDownEnv)
-	return strings.ToLower(stickyOnManagementDownEnv) == "true"
+	keepConnectionOnMgmtDownOnce.Do(func() {
+		stickyOnManagementDownEnv := os.Getenv(EnvKeepConnectionOnMgmtDown)
+		if stickyOnManagementDownEnv != "" {
+			keepConnectionOnMgmtDown = strings.ToLower(stickyOnManagementDownEnv) == "true"
+			if keepConnectionOnMgmtDown {
+				log.Infof("peers will be kept on failure as %s is set to %s", EnvKeepConnectionOnMgmtDown, stickyOnManagementDownEnv)
+			}
+		}
+	})
+	return keepConnectionOnMgmtDown
 }
client/internal/peer/conn.go (2)

437-442: Consider simplifying the boolean condition for consistency and readability.

Similar to the route manager code, this condition could be rewritten for improved readability:

🔎 Suggested simplification
-		if (conn.statusRecorder.GetManagementState().Connected && conn.statusRecorder.GetSignalState().Connected) || !IsKeepConnectionOnMgmtDown() {
+		mgmtConnected := conn.statusRecorder.GetManagementState().Connected
+		signalConnected := conn.statusRecorder.GetSignalState().Connected
+		if (mgmtConnected && signalConnected) || !IsKeepConnectionOnMgmtDown() {
			if err := conn.config.WgConfig.WgInterface.RemoveEndpointAddress(conn.config.WgConfig.RemoteKey); err != nil {
				conn.Log.Errorf("failed to remove wg endpoint: %v", err)
			}
		}

536-542: Consider extracting the duplicated endpoint removal logic.

This guard condition and endpoint removal logic is identical to lines 437-442 in onICEStateDisconnected. Extracting it to a helper method would reduce duplication and make future changes easier to maintain.

🔎 Suggested refactor to extract common logic

Add a helper method to the Conn struct:

// removeEndpointIfAllowed removes the WireGuard endpoint address unless
// management services are unavailable and NB_KEEP_CONNECTION_ON_MANAGEMENT_DOWN is set
func (conn *Conn) removeEndpointIfAllowed() {
	mgmtConnected := conn.statusRecorder.GetManagementState().Connected
	signalConnected := conn.statusRecorder.GetSignalState().Connected
	
	if (mgmtConnected && signalConnected) || !IsKeepConnectionOnMgmtDown() {
		if err := conn.config.WgConfig.WgInterface.RemoveEndpointAddress(conn.config.WgConfig.RemoteKey); err != nil {
			conn.Log.Errorf("failed to remove wg endpoint: %v", err)
		}
	}
}

Then use it in both locations:

 		conn.currentConnPriority = conntype.None

-		// Prevent removing peer endpoint if management connection is down and NB_KEEP_CONNECTION_ON_MANAGEMENT_DOWN is set
-		if (conn.statusRecorder.GetManagementState().Connected && conn.statusRecorder.GetSignalState().Connected) || !IsKeepConnectionOnMgmtDown() {
-			if err := conn.config.WgConfig.WgInterface.RemoveEndpointAddress(conn.config.WgConfig.RemoteKey); err != nil {
-				conn.Log.Errorf("failed to remove wg endpoint: %v", err)
-			}
-		}
+		conn.removeEndpointIfAllowed()
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7bb4fc3 and e57423f.

📒 Files selected for processing (3)
  • client/internal/peer/conn.go
  • client/internal/peer/env.go
  • client/internal/routemanager/client/client.go
🧰 Additional context used
🧬 Code graph analysis (2)
client/internal/routemanager/client/client.go (2)
client/internal/peer/env.go (1)
  • IsKeepConnectionOnMgmtDown (25-33)
client/internal/engine.go (1)
  • Peer (217-220)
client/internal/peer/conn.go (1)
client/internal/peer/env.go (1)
  • IsKeepConnectionOnMgmtDown (25-33)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (14)
  • GitHub Check: release_ui_darwin
  • GitHub Check: release
  • GitHub Check: JS / Lint
  • GitHub Check: iOS / Build
  • GitHub Check: Android / Build
  • GitHub Check: FreeBSD Port / Build & Test
  • GitHub Check: release_ui
  • GitHub Check: Client / Unit
  • GitHub Check: Build Cache
  • GitHub Check: Windows
  • GitHub Check: Darwin
  • GitHub Check: Linux
  • GitHub Check: Client / Unit
  • GitHub Check: Client / Unit

return strings.EqualFold(os.Getenv(EnvKeyNBForceRelay), "true")
}

// isConnectionKeepOnMgmtDown checks if peers and routes should be kept when management connection is down
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Fix the function name mismatch in the comment.

The comment references isConnectionKeepOnMgmtDown but the actual function name is IsKeepConnectionOnMgmtDown.

🔎 Proposed fix
-// isConnectionKeepOnMgmtDown checks if peers and routes should be kept when management connection is down
+// IsKeepConnectionOnMgmtDown checks if peers and routes should be kept when management connection is down
 func IsKeepConnectionOnMgmtDown() bool {
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
// isConnectionKeepOnMgmtDown checks if peers and routes should be kept when management connection is down
// IsKeepConnectionOnMgmtDown checks if peers and routes should be kept when management connection is down
func IsKeepConnectionOnMgmtDown() bool {
🤖 Prompt for AI Agents
In @client/internal/peer/env.go at line 24, The comment above the function
incorrectly references isConnectionKeepOnMgmtDown; update the comment to use the
actual exported function name IsKeepConnectionOnMgmtDown so the comment matches
the symbol, e.g., change the comment text to mention IsKeepConnectionOnMgmtDown
to avoid the name mismatch with the function.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant