NAS-140167 / 27.0.0-BETA.1 / Fix TNC sync_interface_ips empty IPs and repeated concurrent calls#18448

Open
sonicaj wants to merge 1 commit into master from NAS-140167

Conversation

@sonicaj sonicaj commented Mar 12, 2026

This commit fixes an issue where sync_interface_ips could send empty IPs to the TNC account-service (causing 400 errors) and where concurrent netlink events would each independently hit the TNC API with the same payload.

When the HTTP call failed due to empty IPs, the cache was never populated, so every subsequent netlink event retried the same failing call — creating an infinite retry storm. Additionally, a single DHCP renewal would fire 3-5 netlink events, each scheduling a call_later(5), all passing the cache check simultaneously, and all hitting the TNC API concurrently with identical payloads.

An asyncio.Lock serializes concurrent sync_interface_ips calls so only the first performs the HTTP sync while subsequent calls hit the cache and return early. An empty IP guard skips the HTTP call when no IPs are available (static + dynamic combined) but still caches the result to prevent retry storms.
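
As a rough illustration of the pattern described above (not the actual middleware code; TNCIPSync, fetch_current_ips, and push_to_tnc are hypothetical names standing in for the real config/IP fetch and the TNC HTTP call), the lock-plus-cache-plus-empty-guard logic might look like:

```python
import asyncio

class TNCIPSync:
    """Hypothetical sketch of the fix: a lock serializes concurrent calls,
    a cache short-circuits repeat syncs, and an empty-IP guard caches the
    result without calling the TNC API."""

    def __init__(self, fetch_current_ips, push_to_tnc):
        self._lock = asyncio.Lock()
        self._cached_ips = None
        self._fetch = fetch_current_ips
        self._push = push_to_tnc

    async def sync_interface_ips(self):
        async with self._lock:  # serialize concurrent netlink-triggered calls
            ips = await self._fetch()
            if ips == self._cached_ips:
                return "cache-hit"  # later callers return early
            if not ips:
                # Empty-IP guard: skip the HTTP call (it would 400), but
                # still cache the result so later events don't retry forever.
                self._cached_ips = ips
                return "skipped-empty"
            await self._push(ips)  # the HTTP sync to the TNC account-service
            self._cached_ips = ips
            return "synced"
```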

@sonicaj sonicaj requested review from a team and ZackaryWelch March 12, 2026 19:21
@sonicaj sonicaj self-assigned this Mar 12, 2026
@bugclerk bugclerk changed the title Fix TNC sync_interface_ips empty IPs and repeated concurrent calls NAS-140167 / 27.0.0-BETA.1 / Fix TNC sync_interface_ips empty IPs and repeated concurrent calls Mar 12, 2026

@yocalebo yocalebo left a comment

With this change it no longer makes sense to have these as coroutines. Furthermore:

  1. You're holding a lock for ALL operations in sync_interface_ips. That doesn't seem right.
  2. Any other coroutine task that has been scheduled will queue up indefinitely, growing without bound, while waiting for this lock to be released.
  3. If you're wrapping a lock around the entirety of the sync_interface_ips function, then there is no reason to keep this as a coroutine.

@yocalebo yocalebo left a comment

I've continued reviewing this and have many questions. We need to spend a bit more time here on a proper solution.

  1. The 5-second call_later in update_ips is fragile.
     The delay is a workaround for Docker interfaces not being registered in interface.internal_interfaces when the netlink event fires. 5 seconds is arbitrary and doesn't guarantee Docker is ready. A better approach would be to debounce: cancel any pending call on each new event so only one sync fires after the storm settles:

     global _pending_sync
     if _pending_sync is not None:
         _pending_sync.cancel()
     _pending_sync = asyncio.get_event_loop().call_later(5, ...)

     This at least collapses a burst of netlink events into a single call rather than scheduling N independent ones.

  2. failover.is_single_master_node is only checked in handle_update_ips, not in sync_interface_ips. sync_interface_ips is also called directly from post_install.py with no failover guard, which means it can run on standby nodes. Is this expected? If so, why?

  3. handle_update_ips fetches tn_connect.config to check status, use_all_interfaces, and interfaces; sync_interface_ips then fetches it again to check the same fields. These guard checks should be consolidated into sync_interface_ips so the config is fetched once.

  4. The asyncio.Lock wrapping the entire function body is inefficient.
     When the lock is held, subsequent callers block and wait rather than returning early. After the first caller finishes and populates the cache, the queued callers each re-fetch config and IPs just to discover the cache is populated. Instead, bail out if the lock is already held; if someone is already syncing, there's no point queuing:

     if _sync_lock.locked():
         return
     async with _sync_lock:
         ...

  5. The cache check should happen before acquiring the lock. Move the cache comparison before _sync_lock so callers that arrive after the cache is populated return immediately, without any lock contention.

  6. Let's update the docker comment in handle_update_ips. It's poorly worded and took me way too long to understand what it's actually describing. Something like:

     # Delay handling to work around a race condition where Docker triggers IP
     # address change events before its bridge interfaces (br-*) are visible to
     # interface.internal_interfaces. Without the delay, the new interface isn't
     # filtered out and causes an unnecessary IP sync with TNC.

  7. Finally, since our networking API doesn't allow creating bridge interfaces with hyphens in the name, can we just check whether the network interface starts with "br-"? If so, all of this logic can be greatly simplified.
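
The debounce suggested in point 1 can be sketched end-to-end. This is a minimal runnable illustration, with _pending_sync, on_netlink_event, and do_sync as hypothetical names and the delay shortened from 5 seconds so the burst-collapsing behavior is easy to observe:

```python
import asyncio

# Runnable sketch of the debounce idea from point 1: every new netlink-style
# event cancels the previously scheduled sync, so a burst collapses into a
# single call. Names are illustrative, not the actual middleware identifiers.
_pending_sync = None

def on_netlink_event(loop, do_sync, delay=0.2):
    global _pending_sync
    if _pending_sync is not None:
        _pending_sync.cancel()  # drop the sync queued by the previous event
    _pending_sync = loop.call_later(delay, do_sync)

async def demo():
    loop = asyncio.get_running_loop()
    fired = []
    for _ in range(5):  # simulated burst of 5 netlink events
        on_netlink_event(loop, lambda: fired.append(1))
        await asyncio.sleep(0.01)  # events arrive well within the delay
    await asyncio.sleep(0.3)  # let the single surviving callback fire
    return len(fired)
```

Running `asyncio.run(demo())` shows the burst of five events producing a single sync callback.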

@ZackaryWelch ZackaryWelch self-requested a review March 13, 2026 13:27

sonicaj commented Mar 15, 2026

@yocalebo answering queries in the order they have been raised:

  1. Debouncing is a good call; however, we still need to keep the lock, since the two mechanisms solve different problems.

  2. Post install is only called when the node is first set up, and at that point no failover is configured; we are referring to the first boot after an installation. Will still consolidate that check, as it's more consistent.

  3. We can fix querying tnc config from db twice.

  4. The blocking approach is intentional here, rather than bail-if-locked. Bailing when the lock is held can miss updates in this situation:

    a. Caller A gets the lock, sees the IPs as [1.1.1.1] and starts syncing to TNC.
    b. The IP changes to [2.2.2.2], which triggers a new netlink event.
    c. Caller B gets there, sees the lock and quickly leaves.
    d. Caller A finishes and caches [1.1.1.1].
    e. Now, the system has old cached IPs — [2.2.2.2] never gets updated.

    With blocking, Caller B waits, then checks the IPs again (now [2.2.2.2]), sees the cache doesn't match and syncs the update. However, if we add the debounce from point 1, this situation becomes less likely because the burst is smoothed out. But for non-burst events (like a manual IP change soon after a Docker event), blocking is safer.

  5. This is a TOCTOU (time-of-check-to-time-of-use) problem. To compare caches, we need to get the current interface IP addresses first (config fetched at hostname.py:58, IPs at lines 61-64). These same IP addresses are then written to the datastore (lines 84-86) and decide whether the HTTP sync should happen (line 93). The lock makes sure this read-check-write sequence happens atomically.

    If we move the check outside without copying it inside, a coroutine can read the IP addresses outside the lock. But when it gets the lock, the IP addresses have changed, and it writes old data to the datastore and syncs old data to TNC. For example: coroutine A and B both read IP addresses as [2.2.2.2] outside the lock. A syncs and caches. Meanwhile, the IP addresses change to [3.3.3.3]. B gets the lock but uses its old [2.2.2.2] — writing old IP addresses to both the datastore and TNC.

    If we do double-checked locking (check outside + re-fetch and re-check inside), everything works correctly, but in the burst case — which is exactly what we are trying to improve — all callers see a cache miss on the outside check (no sync has finished yet). They queue up on the lock anyway and then each re-fetches everything inside. This means we end up making double the async calls compared to the current way, but it doesn't give us any extra benefits. The only time it helps is when there is a single, isolated event in steady state where the cache is already full, and it saves one lock that is not being used. But with debouncing from point 1 on top, lock contention drops to almost zero, so it is not worth the extra complexity.

  6. Definitely, will improve the comment.

  7. Agreed, this check is useful; I'll add it. That doesn't mean the above suggestions are unnecessary; I think they still have their place.
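
The missed-update scenario from point 4 can be reproduced in a few lines. This is a sketch with hypothetical names, showing how a blocking lock lets caller B re-read the changed IPs inside the lock and sync them, whereas a bail-if-locked caller B would have left the stale [1.1.1.1] cached:

```python
import asyncio

# Sketch of the race from point 4: caller A starts syncing ["1.1.1.1"], the
# address changes to ["2.2.2.2"] mid-sync, and caller B (blocking on the
# lock rather than bailing) re-reads the new address inside the lock and
# syncs it. All names here are illustrative.
async def run_scenario():
    lock = asyncio.Lock()
    state = {"ips": ["1.1.1.1"], "cache": None, "synced": []}

    async def sync():
        async with lock:
            ips = list(state["ips"])  # read the current IPs inside the lock
            if ips == state["cache"]:
                return
            await asyncio.sleep(0.01)  # simulated HTTP call to TNC
            state["synced"].append(ips)
            state["cache"] = ips

    a = asyncio.create_task(sync())  # caller A begins syncing 1.1.1.1
    await asyncio.sleep(0.005)
    state["ips"] = ["2.2.2.2"]  # IP changes while A holds the lock
    b = asyncio.create_task(sync())  # caller B waits instead of bailing
    await asyncio.gather(a, b)
    return state["synced"]
```

Both addresses end up synced in order; with bail-if-locked, B would have returned early and [2.2.2.2] would never reach TNC.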
