Skip to content

Commit 2867bd3

Browse files
woody-appleclaude
andcommitted
fixup: atomic teardown-delay global; finalize non-primary rescued siblings
Round-2 review of the resolve-coalesce fix surfaced two latent bugs the TSAN job and existing tests don't directly exercise: 1. gResolveDeferredTeardownDelay is read on the Matter event-loop thread (ScheduleDeferredTeardownForResolves) and written from gtest fixture threads (Set/TearDown via SetResolveDeferredTeardownDelay). TSAN correctly flags this as a data race. Switch the storage to std::atomic<uint32_t> with relaxed ordering -- value freshness is fine, we just need the read/write itself to be torn-free and synchronized. 2. The Resolve() rescue paths cancel the deferred-teardown timer on every eligible sibling ResolveContext for an instance name but only promote the FIRST as primaryCtx. The remaining siblings stayed alive in MdnsContexts with their own DNSServiceRefs and (after the rebind) would each receive and dispatch the same logical resolve result -- producing duplicate callback invocations and DNSServiceRef leaks until the next ChipDnssdResolveNoLongerNeeded swept them up. Collect non- primary siblings during the scan and Finalize(CHIP_ERROR_CANCELLED) them after the rebind+counter-bump commits, restoring the single- canonical-context invariant the rest of the rescue path relies on. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
1 parent 46eb253 commit 2867bd3

1 file changed

Lines changed: 43 additions & 4 deletions

File tree

src/platform/Darwin/dnssd/DnssdImpl.cpp

Lines changed: 43 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -26,6 +26,7 @@
2626
#include <lib/support/logging/CHIPLogging.h>
2727
#include <platform/Darwin/UserDefaults.h>
2828

29+
#include <atomic>
2930
#include <set>
3031
#include <string>
3132

@@ -52,7 +53,11 @@ constexpr char kSRPDot[] = "default.service.arpa.";
5253
// not linger.
5354
constexpr chip::System::Clock::Milliseconds32 kDefaultResolveDeferredTeardownDelay = chip::System::Clock::Milliseconds32(500);
5455

55-
chip::System::Clock::Milliseconds32 gResolveDeferredTeardownDelay = kDefaultResolveDeferredTeardownDelay;
56+
// Stored as a raw millisecond count in std::atomic<uint32_t> so the test
57+
// thread's SetResolveDeferredTeardownDelay does not race with the Matter
58+
// event-loop thread's read in ScheduleDeferredTeardownForResolves. TSAN
59+
// flags the unsynchronized non-atomic global access as a data race.
60+
std::atomic<uint32_t> gResolveDeferredTeardownDelayMs{ kDefaultResolveDeferredTeardownDelay.count() };
5661

5762
bool IsSupportedProtocol(DnssdServiceProtocol protocol)
5863
{
@@ -425,6 +430,7 @@ static CHIP_ERROR Resolve(void * context, DnssdResolveCallback callback, uint32_
425430
},
426431
existingResolves);
427432
ResolveContext * primaryCtx = nullptr;
433+
std::vector<ResolveContext *> nonPrimaryToFinalize;
428434
for (auto * item : existingResolves)
429435
{
430436
auto * existingCtx = static_cast<ResolveContext *>(item);
@@ -443,6 +449,18 @@ static CHIP_ERROR Resolve(void * context, DnssdResolveCallback callback, uint32_
443449
{
444450
primaryCtx = existingCtx;
445451
}
452+
else
453+
{
454+
// Non-primary sibling: only ONE rescued context can serve the new
455+
// caller. Leaving the rest alive in MdnsContexts (with their own
456+
// DNSServiceRefs and now-shared-with-primary callback/context after
457+
// the rebind below) would cause duplicate dispatches for a single
458+
// logical Resolve and leak DNSServiceRefs until the next
459+
// ChipDnssdResolveNoLongerNeeded. Finalize them after we commit
460+
// the rescue; we can't Finalize here because Finalize calls
461+
// RemoveWithoutDeleting which would invalidate `existingResolves`.
462+
nonPrimaryToFinalize.push_back(existingCtx);
463+
}
446464
}
447465
if (primaryCtx != nullptr)
448466
{
@@ -458,6 +476,14 @@ static CHIP_ERROR Resolve(void * context, DnssdResolveCallback callback, uint32_
458476
primaryCtx->callback = callback;
459477
primaryCtx->context = context;
460478
}
479+
// Finalize non-primary rescued siblings with CHIP_ERROR_CANCELLED so
480+
// they don't leak. Their counter is shared with primaryCtx and is
481+
// still 0 at this point (Finalize doesn't touch the counter); we bump
482+
// exactly once below for this logical Resolve.
483+
for (auto * sibling : nonPrimaryToFinalize)
484+
{
485+
TEMPORARY_RETURN_IGNORED sibling->Finalize(CHIP_ERROR_CANCELLED);
486+
}
461487
// Bump the shared consumer counter exactly once for this logical
462488
// Resolve call, regardless of how many siblings were rescued.
463489
(*primaryCtx->consumerCounter)++;
@@ -486,6 +512,7 @@ static CHIP_ERROR Resolve(DiscoverNodeDelegate * delegate, uint32_t interfaceId,
486512
},
487513
existingResolves);
488514
ResolveContext * primaryCtx = nullptr;
515+
std::vector<ResolveContext *> nonPrimaryToFinalize;
489516
for (auto * item : existingResolves)
490517
{
491518
auto * existingCtx = static_cast<ResolveContext *>(item);
@@ -504,6 +531,12 @@ static CHIP_ERROR Resolve(DiscoverNodeDelegate * delegate, uint32_t interfaceId,
504531
{
505532
primaryCtx = existingCtx;
506533
}
534+
else
535+
{
536+
// See callback overload for rationale: leftover siblings would
537+
// cause duplicate dispatches and DNSServiceRef leaks.
538+
nonPrimaryToFinalize.push_back(existingCtx);
539+
}
507540
}
508541
if (primaryCtx != nullptr)
509542
{
@@ -514,6 +547,10 @@ static CHIP_ERROR Resolve(DiscoverNodeDelegate * delegate, uint32_t interfaceId,
514547
StringOrNullMarker(name));
515548
primaryCtx->context = delegate;
516549
}
550+
for (auto * sibling : nonPrimaryToFinalize)
551+
{
552+
TEMPORARY_RETURN_IGNORED sibling->Finalize(CHIP_ERROR_CANCELLED);
553+
}
517554
(*primaryCtx->consumerCounter)++;
518555
return CHIP_NO_ERROR;
519556
}
@@ -570,12 +607,12 @@ void OnResolveDeferredTeardown(chip::System::Layer * /* aLayer */, void * aAppSt
570607

571608
chip::System::Clock::Milliseconds32 GetResolveDeferredTeardownDelay()
572609
{
573-
return gResolveDeferredTeardownDelay;
610+
return chip::System::Clock::Milliseconds32(gResolveDeferredTeardownDelayMs.load(std::memory_order_relaxed));
574611
}
575612

576613
void SetResolveDeferredTeardownDelay(chip::System::Clock::Milliseconds32 delay)
577614
{
578-
gResolveDeferredTeardownDelay = delay;
615+
gResolveDeferredTeardownDelayMs.store(delay.count(), std::memory_order_relaxed);
579616
}
580617

581618
namespace {
@@ -601,7 +638,9 @@ void ScheduleDeferredTeardownForResolves(const std::vector<GenericContext *> & r
601638
continue;
602639
}
603640
rctx->deferredTeardownScheduled = true;
604-
auto err = chip::DeviceLayer::SystemLayer().StartTimer(gResolveDeferredTeardownDelay, OnResolveDeferredTeardown, rctx);
641+
auto err = chip::DeviceLayer::SystemLayer().StartTimer(
642+
chip::System::Clock::Milliseconds32(gResolveDeferredTeardownDelayMs.load(std::memory_order_relaxed)),
643+
OnResolveDeferredTeardown, rctx);
605644
if (err != CHIP_NO_ERROR)
606645
{
607646
// SystemLayer::StartTimer returns CriticalFailure, which has no Format(); log

0 commit comments

Comments
 (0)