Skip to content

Run protect-subnets dispatcher script via no-wait.d#679

Closed
agners wants to merge 1 commit into
hassio-addons:mainfrom
agners:protect-subnets-no-wait
Closed

Run protect-subnets dispatcher script via no-wait.d#679
agners wants to merge 1 commit into
hassio-addons:mainfrom
agners:protect-subnets-no-wait

Conversation

@agners

@agners agners commented Apr 30, 2026

Copy link
Copy Markdown

Proposed Changes

The protect-subnets dispatcher script can block for several minutes waiting on Supervisor's API readiness. nm-dispatcher serializes Action() processing per device, which means a long-running script in dispatcher.d/ holds back NM's pre-up Action() reply and stalls the device in ip-check state. Combined with Supervisor waiting on the activation to complete, this manifests as a 10+ minute startup hang when any pre-up.d/ script is also present (e.g. nm-cloud-setup.sh from the Alpine networkmanager package).

Move the script into no-wait.d/ with a top-level symlink, per the NetworkManager-dispatcher(8) recommendation. Scripts in no-wait.d/ run in parallel and don't block the per-device queue, so other dispatcher events (notably pre-up) are no longer held up by our supervisor poll.

Related Issues

Fixes #678

Summary by CodeRabbit

  • New Features
    • Added NetworkManager integration to automatically protect subnet routes during network state transitions and DHCP address changes.
    • Includes safety mechanisms that halt the system if route protection fails, preventing unintended network loss.

The protect-subnets dispatcher script can block for several minutes
waiting on Supervisor's API readiness. nm-dispatcher serializes Action()
processing per device, which means a long-running script in dispatcher.d/
holds back NM's pre-up Action() reply and stalls the device in ip-check
state. Combined with Supervisor waiting on the activation to complete,
this manifests as a 10+ minute startup hang when any pre-up.d/ script
is also present (e.g. nm-cloud-setup.sh from the Alpine networkmanager
package).

Move the script into no-wait.d/ with a top-level symlink, per the
NetworkManager-dispatcher(8) recommendation. Scripts in no-wait.d/ run
in parallel and don't block the per-device queue, so other dispatcher
events (notably pre-up) are no longer held up by our supervisor poll.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@coderabbitai

coderabbitai Bot commented Apr 30, 2026

Copy link
Copy Markdown

Walkthrough

This pull request adds two new NetworkManager dispatcher files to fix a blocking behavior during Supervisor restarts. A dispatcher directive and a corresponding non-blocking script are introduced to handle subnet route protection without delaying NetworkManager lifecycle events.

Changes

Cohort / File(s) Summary
NetworkManager Dispatcher Configuration
tailscale/rootfs/etc/NetworkManager/dispatcher.d/protect-subnets
Single-line dispatcher directive registering the no-wait script path for subnet protection actions.
NetworkManager Dispatcher Script
tailscale/rootfs/etc/NetworkManager/dispatcher.d/no-wait.d/protect-subnets
Event handler script that reacts to NetworkManager lifecycle and DHCP address-change events by unprotecting and re-protecting subnet routes; includes conditional logic for DHCP changes and error handling with exit codes.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~5 minutes

Suggested labels

bugfix, priority-critical

Suggested reviewers

  • frenck

Poem

🐰 A hop and a skip through NetworkManager's flow,
No-wait scripts now protect the routes below,
No more blocking when Supervisor starts anew,
The Tailscale warren restarts faster—thank you! ✨

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately describes the main change: moving the protect-subnets dispatcher script into no-wait.d/.
Linked Issues check ✅ Passed The PR correctly implements the fix for issue #678 by moving the protect-subnets script to no-wait.d/ to prevent blocking NetworkManager's per-device queue.
Out of Scope Changes check ✅ Passed All changes are directly related to fixing the blocking dispatcher script issue; no extraneous modifications detected.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share
Review rate limit: 0/1 reviews remaining, refill in 60 minutes.

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tailscale/rootfs/etc/NetworkManager/dispatcher.d/no-wait.d/protect-subnets`:
- Around line 29-32: The command substitutions comparing protect-subnet-routes
and unprotect-subnet-routes ignore failures and also use an invalid argument
"tested" for unprotect-subnet-routes; update the logic to invoke both commands
with the supported "test" argument (or no argument for unprotect if you prefer
real mode), capture each command's exit status before comparing outputs, and
only perform the comparison if both commands succeeded; also replace the
incorrect "tested" invocation with "test" (or remove the extra argument) and log
or handle errors when either command fails.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: f3a78822-78a3-46b0-96c9-6c725c9748cb

📥 Commits

Reviewing files that changed from the base of the PR and between 7a10013 and dbd6450.

📒 Files selected for processing (3)
  • tailscale/rootfs/etc/NetworkManager/dispatcher.d/no-wait.d/protect-subnets
  • tailscale/rootfs/etc/NetworkManager/dispatcher.d/protect-subnets
  • tailscale/rootfs/etc/NetworkManager/dispatcher.d/protect-subnets

@agners

agners commented Apr 30, 2026

Copy link
Copy Markdown
Author

This should be fairly safe. But what long term might be a better idea is to have the script touch a file or something and have the/a s6 service do the unprotect/protect things. But that is a bit bigger rework.

@lmagyar

lmagyar commented May 1, 2026

Copy link
Copy Markdown
Collaborator

This should be fairly safe. But what long term might be a better idea is to have the script touch a file or something and have the/a s6 service do the unprotect/protect things. But that is a bit bigger rework.

As I see, this would soft brick 1000s of devices. Marking it draft, give me some time to analyse it...

@lmagyar lmagyar marked this pull request as draft May 1, 2026 08:44
@lmagyar

lmagyar commented May 1, 2026

Copy link
Copy Markdown
Collaborator

As I see:

  • the "when any pre script is present, that can potentially deadlock NM and SU, becasue it makes blocking all the other non-blocking non-pre scripts" issue is a bug in NM
  • currently NM-dispatcher inside this app/add-on runs scripts sequentially, without (dead)blocking NM/SU
  • the PR doesn't address the recommendation "If your script might take arbitrarily long to complete, you should spawn a child process and have the parent return immediately"
    • or it is not a problem with no-wait scripts??? I really don't know
  • the PR will change this sequentiallity to parallel script execution
    • the subnet protection scripts are not designed for parallel execution, and due to the slow sleep 5 inside them, there is a non-zero chance they will manipulate ip rule-s in parallel, leading to an unpredictable mess in routing, and if this protection is not active, devices will be soft-bricked
    • background: if eg. 192.168.1.x is the local LAN subnet and TS has a subent route for this on another location (quite typical "misconfiguration", achievable with 1 click on TS's admin page, enabling the "other" subnet route), TS here will route all traffic, even HA's local IP's traffic to the other site, and HA will lose it's accessibility immediately - yes, this can be reversed with the same 1 click (disabling that colliding subnet route), but people will not know the root cause; this is a known issue with TS, known for years, they refuse to add a flag to protect local subnets, or any viable solution

So:

  • option 1 - do nothing:
    • I don't see why we should not let NM-dispatcher inside this app/add-on to run scripts sequentially? it has no effect on SU (if the NM bug is not triggered)
    • Though I should check any timeout's inside NM-dispatcher.
  • option 2 - modify this PR:
    • If we make the NM-dispatcher scripts parallel, we should create a local queue for these events, or at least a mutually exclusive lock (maybe the order is not important, but we must prevent parallel execution)
    • option 2A - locking
      • we should spawn the whole current protect-subnets NM-dispatcher script into a child process
      • s6-setlock can be used for mutually exclusive lock at the beginning of these child processes
      • hmmm, quite simple solution, they will start parallel, but won't do anything parallel
      • though I'm not 100% sure the order is not important, dhcp changes do less than interface up/down events, I can't come up with a counter-example now, but I'm not sure it can't be created
    • option 2B - queueing
      • s6 fifodir or a plain pipe into a new longrun service, and that service processes sequentially the events pushed through this pipe from a minimal NM-dispatcher script
      • this way we achive the same sequentiallity as we have now out of the box from NM-dispatcher

I like the simplicity of option 2A (child process & locking), but I'm not 100% sure it is robust enough because the event order will be changed. :(

So if option 1 (do nothing) is a no-go (I'm stil not conviced we must change it), I vote for option 2B (replicate the event queue ourselves).

@lmagyar

lmagyar commented May 2, 2026

Copy link
Copy Markdown
Collaborator

Nevermind, I've started to implement the internal queue version.

@lmagyar

lmagyar commented May 3, 2026

Copy link
Copy Markdown
Collaborator

I've created PR #680 with a new separate s6 listener service to execute the slow things.

@github-actions

github-actions Bot commented Jun 3, 2026

Copy link
Copy Markdown

There hasn't been any activity on this pull request recently. This pull request has been automatically marked as stale because of that and will be closed if no further activity occurs within 7 days. Thank you for your contributions.

@github-actions github-actions Bot added the stale There has not been activity on this issue or PR for quite some time. label Jun 3, 2026
@github-actions github-actions Bot closed this Jun 11, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

stale There has not been activity on this issue or PR for quite some time.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Tailscale add-on causes Supervisor restart to block 10 minutes

2 participants