Run protect-subnets dispatcher script via no-wait.d #679
The protect-subnets dispatcher script can block for several minutes waiting on Supervisor's API readiness. nm-dispatcher serializes Action() processing per device, which means a long-running script in dispatcher.d/ holds back NM's pre-up Action() reply and stalls the device in ip-check state. Combined with Supervisor waiting on the activation to complete, this manifests as a 10+ minute startup hang when any pre-up.d/ script is also present (e.g. nm-cloud-setup.sh from the Alpine networkmanager package).

Move the script into no-wait.d/ with a top-level symlink, per the NetworkManager-dispatcher(8) recommendation. Scripts in no-wait.d/ run in parallel and don't block the per-device queue, so other dispatcher events (notably pre-up) are no longer held up by our supervisor poll.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Walkthrough
This pull request adds two new NetworkManager dispatcher files to fix a blocking behavior during Supervisor restarts. A dispatcher directive and a corresponding non-blocking script are introduced to handle subnet route protection without delaying NetworkManager lifecycle events.
Estimated code review effort: 🎯 1 (Trivial) | ⏱️ ~5 minutes
🚥 Pre-merge checks: ✅ 5 passed
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tailscale/rootfs/etc/NetworkManager/dispatcher.d/no-wait.d/protect-subnets`:
- Around line 29-32: The command substitutions comparing protect-subnet-routes
and unprotect-subnet-routes ignore failures and also use an invalid argument
"tested" for unprotect-subnet-routes; update the logic to invoke both commands
with the supported "test" argument (or no argument for unprotect if you prefer
real mode), capture each command's exit status before comparing outputs, and
only perform the comparison if both commands succeeded; also replace the
incorrect "tested" invocation with "test" (or remove the extra argument) and log
or handle errors when either command fails.
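A minimal sketch of the pattern this comment asks for, assuming both commands accept the `test` argument as the comment states. The two functions below are hypothetical stand-ins for `protect-subnet-routes` and `unprotect-subnet-routes`, which are not reproduced here, and the return codes are illustrative only.

```shell
# Hypothetical stand-ins for the real commands shipped in the add-on.
protect_subnet_routes()   { echo "10.0.0.0/24"; }
unprotect_subnet_routes() { echo "10.0.0.0/24"; }

# Returns 0 if both commands succeed and agree, 1 if they succeed but
# differ, and 2 if either command fails (outputs are then not compared).
compare_routes() {
  if ! protected="$(protect_subnet_routes test)"; then
    echo "protect-subnet-routes failed" >&2
    return 2
  fi
  if ! unprotected="$(unprotect_subnet_routes test)"; then
    echo "unprotect-subnet-routes failed" >&2
    return 2
  fi
  [ "$protected" = "$unprotected" ]
}
```

Testing each assignment on its own line is what exposes the command's exit status; an inline comparison such as `[ "$(a)" = "$(b)" ]` would discard it.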
📒 Files selected for processing (3)
- tailscale/rootfs/etc/NetworkManager/dispatcher.d/no-wait.d/protect-subnets
- tailscale/rootfs/etc/NetworkManager/dispatcher.d/protect-subnets
- tailscale/rootfs/etc/NetworkManager/dispatcher.d/protect-subnets
This should be fairly safe. Long term, though, a better idea might be to have the script touch a file or something and have an s6 service do the unprotect/protect work. But that is a somewhat bigger rework.
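The flag-file idea could look roughly like the sketch below. This is an illustration of the suggestion only: the flag path, the function names, and the placeholder message are all hypothetical, and it does not describe how any actual service in the add-on is implemented.

```shell
# Hypothetical flag path; a real setup would pick a fixed runtime location.
flag="${TMPDIR:-/tmp}/protect-subnets.pending.$$"

# Dispatcher side: record that work is pending and return immediately,
# so nm-dispatcher is never blocked on the slow part.
request_reprotect() {
  : > "$flag"
}

# Service side: one pass of the loop a long-running service might execute.
# Returns 0 if a request was handled, 1 if there was nothing to do.
handle_pending() {
  [ -e "$flag" ] || return 1
  # Clear the flag first so a request that arrives mid-run is not lost.
  rm -f "$flag"
  echo "re-applying subnet route protection"  # placeholder for the slow work
}
```

The dispatcher script would then only call `request_reprotect`, while the service calls `handle_pending` periodically or when woken by a file watcher.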
As I see it, this would soft-brick thousands of devices. Marking it as draft; give me some time to analyse it...
As I see:
So:
I like the simplicity of option 2A (child process & locking), but I'm not 100% sure it is robust enough, because the event order will change. :( So if option 1 (do nothing) is a no-go (I'm still not convinced we must change it), I vote for option 2B (replicate the event queue ourselves).
Never mind, I've started to implement the internal queue version.
I've created PR #680 with a new separate s6 listener service to execute the slow things.
There hasn't been any activity on this pull request recently. This pull request has been automatically marked as stale because of that and will be closed if no further activity occurs within 7 days. Thank you for your contributions.
Proposed Changes
The protect-subnets dispatcher script can block for several minutes waiting on Supervisor's API readiness. nm-dispatcher serializes Action() processing per device, which means a long-running script in dispatcher.d/ holds back NM's pre-up Action() reply and stalls the device in ip-check state. Combined with Supervisor waiting on the activation to complete, this manifests as a 10+ minute startup hang when any pre-up.d/ script is also present (e.g. nm-cloud-setup.sh from the Alpine networkmanager package).
Move the script into no-wait.d/ with a top-level symlink, per the NetworkManager-dispatcher(8) recommendation. Scripts in no-wait.d/ run in parallel and don't block the per-device queue, so other dispatcher events (notably pre-up) are no longer held up by our supervisor poll.
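The layout described above can be sketched as follows. This is a minimal illustration that builds the tree in a scratch directory rather than under /etc/NetworkManager; the stand-in script body and the variable names are assumptions, and only the directory names and the symlink arrangement come from the description.

```shell
# Scratch directory standing in for /etc/NetworkManager (sketch only).
root="$(mktemp -d)"
d="$root/dispatcher.d"
mkdir -p "$d/no-wait.d"

# Placeholder for the real protect-subnets script.
printf '#!/bin/sh\nexit 0\n' > "$d/no-wait.d/protect-subnets"
chmod 0755 "$d/no-wait.d/protect-subnets"

# The top-level entry is a relative symlink pointing into no-wait.d/.
ln -s no-wait.d/protect-subnets "$d/protect-subnets"

link_target="$(readlink "$d/protect-subnets")"
echo "$link_target"
```

A relative link target keeps the arrangement valid when the tree is copied from a rootfs overlay into the image.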
Related Issues
Fixes #678