Skip to content

Commit 42a21f2

Browse files
committed
[Darwin] Coalesce mDNS resolve cancel/restart so in-flight results aren't dropped
User-visible failure: on Darwin, every reconnect to a Matter node after ChipDnssdResolveNoLongerNeeded shows ~1s of extra NodeID-resolve latency. Inbound mDNS resolve answers already queued on the dnssd socket get discarded, so the next resolve has to start from scratch. Root cause: when the consumer counter drops to zero we immediately call Finalize -> DNSServiceRefDeallocate. Per the dnssd contract, DNSServiceRefDeallocate discards any events queued on that connection but not yet read. A second observation from the mDNS owner is that "starting and stopping queries doesn't query harder" -- a tight cancel-then-restart for the same instance name is strictly worse than letting the existing query run. Fix: introduce a per-ResolveContext deferred-teardown window (default 500ms) before the actual DNSServiceRefDeallocate. Inside the window: a queued read indicator dispatches the result through DispatchSuccess (which cancels the timer); a new ChipDnssdResolve for the same instance name reuses the existing context and bumps the counter back to 1, skipping DNSServiceCreateConnection / DNSServiceResolve entirely; otherwise the timer fires OnResolveDeferredTeardown -> Finalize(CHIP_ERROR_CANCELLED), preserving the existing failure-path contract upper layers rely on. Carve-out: delegate-based ResolveContexts (callback == nullptr, used by MTRCommissionableBrowser) are NOT subject to deferred teardown. The browser churns OnBrowseAdd/OnBrowseRemove for the same instance name on the order of microseconds while a device is being discovered; holding the underlying DNSServiceRef alive across that churn starves DNSServiceGetAddrInfo of a chance to deliver before the next remove arrives, which manifested as MTRCommissionableBrowserTests/test005 timing out under TSAN. The NodeID reconnect bug this PR fixes is on the callback-based path. Blast radius is confined to the Darwin dnssd platform layer. Tests in src/platform/tests/TestDnssd.cpp pin: - ReusesContextWithinDeferredWindow (callback-based coalescing) - DelegateBasedResolveIsNotDeferred (delegate-based synchronous teardown) - CancelStillPropagatesIfNoInFlightResult (timer fires once if no follow-up) - Multi-sibling rescue, scope mismatch refusal, mismatched-callback rebind, shared-counter invariants, repeated toggle within window, etc. rdar://176263876
1 parent 8a162c6 commit 42a21f2

4 files changed

Lines changed: 1839 additions & 11 deletions

File tree

src/platform/Darwin/dnssd/DnssdContexts.cpp

Lines changed: 18 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -525,13 +525,27 @@ ResolveContext::ResolveContext(DiscoverNodeDelegate * delegate, chip::Inet::IPAd
525525
void ResolveContext::DispatchFailure(const char * errorStr, CHIP_ERROR err)
526526
{
527527
ChipLogError(Discovery, "Mdns: Resolve failure (%s)", errorStr);
528+
// If a deferred teardown timer was scheduled against us, cancel it so it
529+
// doesn't fire after we are deleted.
530+
CancelDeferredTeardownIfScheduled(this);
528531
// Remove before dispatching, so calls back into
529532
// ChipDnssdResolveNoLongerNeeded don't find us and try to also remove us.
530533
bool needDelete = MdnsContexts::GetInstance().RemoveWithoutDeleting(this);
531534

532535
if (nullptr == callback)
533536
{
534-
// Nothing to do.
537+
// Delegate-based ResolveContext (no DnssdResolveCallback). The
538+
// deferred-teardown rescue/coalesce path can dispatch
539+
// CHIP_ERROR_CANCELLED here long after the consumer thinks the
540+
// resolve has gone away, but the delegate API has no failure-signal
541+
// hook. Log so the silent drop is at least observable in the field.
542+
if (err != CHIP_NO_ERROR)
543+
{
544+
ChipLogError(Discovery,
545+
"Mdns: Delegate-based resolve %s failed with %" CHIP_ERROR_FORMAT
546+
" -- no delegate failure callback exists; silently dropping",
547+
instanceName.c_str(), err.Format());
548+
}
535549
}
536550
else
537551
{
@@ -546,6 +560,9 @@ void ResolveContext::DispatchFailure(const char * errorStr, CHIP_ERROR err)
546560

547561
void ResolveContext::DispatchSuccess()
548562
{
563+
// If a deferred teardown timer was scheduled against us, cancel it so it
564+
// doesn't fire after we are deleted.
565+
CancelDeferredTeardownIfScheduled(this);
549566
// Remove before dispatching, so calls back into
550567
// ChipDnssdResolveNoLongerNeeded don't find us and try to also remove us.
551568
bool needDelete = MdnsContexts::GetInstance().RemoveWithoutDeleting(this);

0 commit comments

Comments
 (0)