NAS-140167 / 27.0.0-BETA.1 / Fix TNC sync_interface_ips empty IPs and repeated concurrent calls#18448
src/middlewared/middlewared/plugins/truenas_connect/hostname.py
This commit fixes an issue where sync_interface_ips could send empty IPs to the TNC account-service (causing 400 errors) and where concurrent netlink events would each independently hit the TNC API with the same payload.

When the HTTP call failed due to empty IPs, the cache was never populated, so every subsequent netlink event retried the same failing call, creating an infinite retry storm. Additionally, a single DHCP renewal would fire 3-5 netlink events, each scheduling a call_later(5), all passing the cache check simultaneously, and all hitting the TNC API concurrently with identical payloads.

An asyncio.Lock now serializes concurrent sync_interface_ips calls so that only the first performs the HTTP sync while subsequent calls hit the cache and return early. An empty-IP guard skips the HTTP call when no IPs are available (static and dynamic combined) but still caches the result to prevent retry storms.
yocalebo left a comment:
It no longer makes sense to have these as coroutines. Furthermore:
- You're holding a lock for ALL operations in sync_interface_ips. That doesn't seem right: any other coroutine task that has been scheduled will queue up behind it, growing without bound while waiting for the lock to be dropped.
- If you're wrapping a lock around the entirety of the sync_interface_ips function, then there is no reason to keep it a coroutine.
I've continued reviewing this and have so many questions. We need to spend a bit more time here on a proper solution.
- The 5-second call_later in update_ips is fragile.
  The delay is a workaround for Docker interfaces not being registered in interface.internal_interfaces when the netlink event fires. 5 seconds is arbitrary and doesn't guarantee Docker is ready. A better approach would be to debounce: cancel any pending call on each new event so only one sync fires after the storm settles:

  ```python
  global _pending_sync
  if _pending_sync is not None:
      _pending_sync.cancel()
  _pending_sync = asyncio.get_event_loop().call_later(5, ...)
  ```

  This at least collapses a burst of netlink events into a single call rather than scheduling N independent ones.
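A runnable sketch of that debounce idea, with illustrative names (on_netlink_event and _do_sync are stand-ins, and the delays are shortened so the example finishes quickly):

```python
import asyncio

_pending_sync = None
sync_count = 0  # counts how many syncs actually fired


def _do_sync():
    # Stand-in for the real sync_interface_ips trigger.
    global sync_count
    sync_count += 1


def on_netlink_event(loop, delay=0.2):
    # Debounce: each new event cancels the previously scheduled timer,
    # so only the timer armed by the LAST event in a burst ever fires.
    global _pending_sync
    if _pending_sync is not None:
        _pending_sync.cancel()  # cancelling an already-fired timer is a no-op
    _pending_sync = loop.call_later(delay, _do_sync)


async def main():
    loop = asyncio.get_running_loop()
    for _ in range(5):             # simulate a 5-event netlink storm
        on_netlink_event(loop)
        await asyncio.sleep(0.02)  # events arrive within the debounce window
    await asyncio.sleep(0.5)       # let the final timer fire
    return sync_count


print(asyncio.run(main()))  # 1: the burst collapsed into a single sync
```

loop.call_later returns a TimerHandle, so keeping the last handle and calling cancel() on it is all the bookkeeping the debounce needs.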
- failover.is_single_master_node is only checked in handle_update_ips, not in sync_interface_ips. sync_interface_ips is also called directly from post_install.py with no failover guard, which means it can run on standby nodes. Is this expected? If so, why?
- handle_update_ips fetches tn_connect.config to check status, use_all_interfaces, and interfaces; then sync_interface_ips fetches it again to check the same fields. These guard checks should be consolidated into sync_interface_ips so the config is fetched only once.
- The asyncio.Lock wrapping the entire function body is inefficient.
  When the lock is held, subsequent callers block and wait rather than returning early. After the first caller finishes and populates the cache, the queued callers each re-fetch config and IPs just to discover the cache is populated. Instead, bail out if the lock is already held; if someone is already syncing, there's no point queuing:

  ```python
  if _sync_lock.locked():
      return
  async with _sync_lock:
      ...
  ```
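The bail-out-if-busy pattern can be demonstrated end to end; again the names are illustrative and the sleep stands in for the HTTP call:

```python
import asyncio

_sync_lock = asyncio.Lock()
syncs = 0    # callers that actually performed the sync
skipped = 0  # callers that bailed out because a sync was in flight


async def sync_interface_ips():
    global syncs, skipped
    if _sync_lock.locked():
        skipped += 1
        return  # someone is already syncing; no point waiting in line
    async with _sync_lock:
        await asyncio.sleep(0.01)  # stand-in for the HTTP call to TNC
        syncs += 1


async def main():
    await asyncio.gather(*(sync_interface_ips() for _ in range(4)))
    return syncs, skipped


print(asyncio.run(main()))  # (1, 3): one sync, three bailed out
```

The trade-off is that a bailed-out caller does no work at all, which is exactly what's wanted here since the first caller's sync covers the identical payload.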
- The cache check should happen before acquiring the lock. Move the cache comparison ahead of _sync_lock so callers that arrive after the cache is populated return immediately, without any lock contention.
- Let's update the Docker comment in handle_update_ips. It's poorly worded and took me way too long to understand what it's actually describing.

  ```python
  # Delay handling to work around a race condition where Docker triggers IP
  # address change events before its bridge interfaces (br-*) are visible to
  # interface.internal_interfaces. Without the delay, the new interface isn't
  # filtered out and causes an unnecessary IP sync with TNC.
  ```
- FINALLY, since our networking API doesn't allow creating bridge interfaces with hyphens in the name, can we just check to see if the network interface starts with "br-"? If that's the case, then all of this logic can be greatly simplified....
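If the "br-" prefix check pans out, the filter reduces to a one-liner. A hypothetical sketch (the helper name is illustrative, and it assumes the reviewer's premise that user-created bridges can never contain a hyphen):

```python
def is_docker_bridge(ifname: str) -> bool:
    # Docker names its bridge interfaces "br-<network id>"; the TrueNAS
    # networking API disallows hyphens in bridge names, so the prefix alone
    # identifies Docker-managed bridges.
    return ifname.startswith("br-")


# Only the Docker bridge is filtered; user bridges like "br0" pass through.
print([n for n in ["br-abc123", "eth0", "br0", "docker0"] if is_docker_bridge(n)])
# ['br-abc123']
```

Note "docker0" itself would not match this prefix, so it would still need to come from interface.internal_interfaces if it must be excluded too.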
@yocalebo answering the queries in the order they were raised: