First Thread subscription fails immediately when accessory not yet operational#72268
Conversation
There was a problem hiding this comment.
Pull request overview
This PR improves cold-start behavior for Thread devices on Darwin by deferring the first subscription attempt (when there is no cached subscription IP) until either an operational advertisement is observed or a 1-second watchdog fires, avoiding an immediate “No route to host” failure and wasted backoff time.
Changes:
- Add a one-shot “first Thread subscribe” deferral gate in
MTRDevice_Concretethat waits for operational advertisement or a 1s watchdog before scheduling subscription pool work. - Cancel/clear deferral state on invalidate and controller suspend, and add an early-cancel path in
nodeMayBeAdvertisingOperationalto trigger subscription immediately. - Add XCTest coverage for watchdog firing, advertisement cancellation, re-entry behavior, suspend/invalidate/remove safety, and timing.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| src/darwin/Framework/CHIP/MTRDevice_Concrete.mm | Implements first-Thread-subscribe deferral, watchdog scheduling, and advertisement-driven cancelation; clears state on invalidate/suspend. |
| src/darwin/Framework/CHIPTests/MTRDeviceTests.m | Adds regression tests validating the deferral gate, watchdog timing, and race/safety behaviors. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Code Review
This pull request implements a deferral mechanism for the initial subscription attempt on Thread devices when no cached IP address is available, mitigating 'No-route-to-host' failures during cold starts by waiting for an operational mDNS advertisement or a 1-second watchdog timer. It also introduces comprehensive regression tests. The feedback recommends accessing backing instance variables directly (e.g., _deferringFirstThreadSubscription) instead of using properties via self. while holding _lock to avoid KVO re-entrancy or deadlocks. Additionally, it suggests dispatching nodeMayBeAdvertisingOperational calls to the Matter queue in several test cases to prevent potential assertion failures on threads where the chip stack lock is required.
| if (self.deferringFirstThreadSubscription) { | ||
| MTR_LOG("%@ - first Thread subscribe deferral still in flight; ignoring re-entry (%@)", self, reason); | ||
| return; | ||
| } |
There was a problem hiding this comment.
Inside class implementations, especially when holding a lock (_lock), it is highly recommended to access backing instance variables directly (e.g., _deferringFirstThreadSubscription) instead of using property getters/setters via self.. This avoids potential Key-Value Observing (KVO) re-entrancy or deadlocks if external observers are registered, and bypasses the overhead of Objective-C message dispatch inside critical sections.
if (_deferringFirstThreadSubscription) {
MTR_LOG("%@ - first Thread subscribe deferral still in flight; ignoring re-entry (%@)", self, reason);
return;
}
| BOOL shouldDeferForThreadColdstart = (!_lastSubscriptionIPAddress.has_value() | ||
| && !self.deferringFirstThreadSubscription | ||
| && !self.hasDeferredFirstThreadSubscription); | ||
| if (shouldDeferForThreadColdstart) { | ||
| MTR_LOG("%@ - deferring first Thread subscribe up to 1s; waiting for operational advertisement", self); | ||
| self.deferringFirstThreadSubscription = YES; | ||
| self.hasDeferredFirstThreadSubscription = YES; |
There was a problem hiding this comment.
For the same reasons of avoiding KVO re-entrancy under _lock and improving performance, please use direct instance variable access (_deferringFirstThreadSubscription and _hasDeferredFirstThreadSubscription) here.
BOOL shouldDeferForThreadColdstart = (!_lastSubscriptionIPAddress.has_value()
&& !_deferringFirstThreadSubscription
&& !_hasDeferredFirstThreadSubscription);
if (shouldDeferForThreadColdstart) {
MTR_LOG("%@ - deferring first Thread subscribe up to 1s; waiting for operational advertisement", self);
_deferringFirstThreadSubscription = YES;
_hasDeferredFirstThreadSubscription = YES;
| if (!self.deferringFirstThreadSubscription) { | ||
| // Operational advertisement already triggered the subscribe. | ||
| return; | ||
| } | ||
| self.deferringFirstThreadSubscription = NO; |
There was a problem hiding this comment.
| self.deferringFirstThreadSubscription = NO; | ||
| self.hasDeferredFirstThreadSubscription = NO; |
| if (self.deferringFirstThreadSubscription) { | ||
| self.deferringFirstThreadSubscription = NO; |
| self.deferringFirstThreadSubscription = NO; | ||
| self.hasDeferredFirstThreadSubscription = NO; |
| SEL nodeMayBeAdvertisingOperationalSel = NSSelectorFromString(@"nodeMayBeAdvertisingOperational"); | ||
| XCTAssertTrue([device respondsToSelector:nodeMayBeAdvertisingOperationalSel]); | ||
| #pragma clang diagnostic push | ||
| #pragma clang diagnostic ignored "-Warc-performSelector-leaks" | ||
| [device performSelector:nodeMayBeAdvertisingOperationalSel]; | ||
| [device performSelector:nodeMayBeAdvertisingOperationalSel]; | ||
| [device performSelector:nodeMayBeAdvertisingOperationalSel]; | ||
| #pragma clang diagnostic pop |
There was a problem hiding this comment.
In test062, you correctly noted that nodeMayBeAdvertisingOperational requires the chip stack lock to be held (via assertChipStackLockedByCurrentThread()) and thus dispatched the call to the Matter queue using asyncDispatchToMatterQueue:. However, in test052 (and similarly in test055, test056, and test059), nodeMayBeAdvertisingOperational is called directly on the test thread. This can lead to assertion failures or crashes in environments where assertChipStackLockedByCurrentThread() is active (such as debug or CI builds). To ensure test robustness and consistency, consider dispatching these calls to the Matter queue using asyncDispatchToMatterQueue: and waiting for them to complete, just like in test062.
| SEL nodeMayBeAdvertisingOperationalSel = NSSelectorFromString(@"nodeMayBeAdvertisingOperational"); | |
| XCTAssertTrue([device respondsToSelector:nodeMayBeAdvertisingOperationalSel]); | |
| #pragma clang diagnostic push | |
| #pragma clang diagnostic ignored "-Warc-performSelector-leaks" | |
| [device performSelector:nodeMayBeAdvertisingOperationalSel]; | |
| [device performSelector:nodeMayBeAdvertisingOperationalSel]; | |
| [device performSelector:nodeMayBeAdvertisingOperationalSel]; | |
| #pragma clang diagnostic pop | |
| SEL nodeMayBeAdvertisingOperationalSel = NSSelectorFromString(@"nodeMayBeAdvertisingOperational"); | |
| XCTAssertTrue([device respondsToSelector:nodeMayBeAdvertisingOperationalSel]); | |
| for (NSString * label in @[ @"1", @"2", @"3" ]) { | |
| XCTestExpectation * fired = [self expectationWithDescription:[NSString stringWithFormat:@"adv-%@ dispatched", label]]; | |
| [device.deviceController asyncDispatchToMatterQueue:^{ | |
| #pragma clang diagnostic push | |
| #pragma clang diagnostic ignored "-Warc-performSelector-leaks" | |
| [device performSelector:nodeMayBeAdvertisingOperationalSel]; | |
| #pragma clang diagnostic pop | |
| [fired fulfill]; | |
| } errorHandler:^(NSError * _Nonnull error) { | |
| XCTFail(@"adv-%@: failed to dispatch to matter queue: %@", label, error); | |
| [fired fulfill]; | |
| }]; | |
| [self waitForExpectations:@[ fired ] timeout:5]; | |
| } |
There was a problem hiding this comment.
(FYI: This is Claude) Validated against code: false positive. test052 (line 6540), test055 (line 6699), test056 (line 6804), and test059 (line 7143) all already invoke nodeMayBeAdvertisingOperational via [device.deviceController asyncDispatchToMatterQueue:^{...}] with an XCTestExpectation, exactly the pattern you suggest. test062 isn't unique; it shares that pattern with the rest. The two genuine direct-thread call sites were test077_AdvertisementClearsDeferral and test079_OneShotLatch_PreventsReDeferral — those have just been fixed in ee3a559.
|
PR #72268: Size comparison from ff181cb to 7cbb460 Full report (35 builds for bl602, bl616, bl702, bl702l, cc13x4_26x4, cc32xx, efr32, esp32, nrfconnect, psoc6, qpg, realtek, stm32, telink)
|
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## master #72268 +/- ##
==========================================
+ Coverage 55.52% 55.57% +0.04%
==========================================
Files 1630 1630
Lines 111127 111223 +96
Branches 13418 13409 -9
==========================================
+ Hits 61706 61812 +106
+ Misses 49421 49411 -10 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
7cbb460 to
87ebede
Compare
|
PR #72268: Size comparison from 4894db9 to c607ebe Full report (21 builds for bl602, bl616, bl702, bl702l, cc13x4_26x4, cc32xx, nrfconnect, psoc6, qpg, realtek, stm32)
|
c607ebe to
447c1ab
Compare
| @"Latch must be set the moment the deferral arms"); | ||
|
|
||
| // Fire the advertisement well before the 1s watchdog can reach its | ||
| // deadline. nodeMayBeAdvertisingOperational is declared on | ||
| // MTRDevice_Concrete; dispatch via the runtime to avoid C++ headers. | ||
| SEL nodeMayBeAdvertisingOperationalSel = NSSelectorFromString(@"nodeMayBeAdvertisingOperational"); |
There was a problem hiding this comment.
(FYI: This is Claude) Fixed in ee3a559: test077_AdvertisementClearsDeferral (renamed from test_AdvertisementClearsDeferral) now wraps the nodeMayBeAdvertisingOperational invocation in [device.deviceController asyncDispatchToMatterQueue:^{...}] with an XCTestExpectation, matching the pattern in test052/055/056/059/062. Confirmed assertChipStackLockedByCurrentThread() at MTRDevice_Concrete.mm:1547 — your read was correct.
| // First cycle: arm the gate and clear it via advertisement. | ||
| [device setDelegate:delegate queue:queue]; | ||
| XCTAssertEqualObjects([device valueForKey:@"deferringFirstThreadSubscription"], @YES); | ||
| XCTAssertEqualObjects([device valueForKey:@"hasDeferredFirstThreadSubscription"], @YES); | ||
|
|
There was a problem hiding this comment.
(FYI: This is Claude) Fixed in ee3a559: test079_OneShotLatch_PreventsReDeferral (renamed from test_OneShotLatch_PreventsReDeferral) now wraps the nodeMayBeAdvertisingOperational invocation in [device.deviceController asyncDispatchToMatterQueue:^{...}] with an XCTestExpectation. Confirmed assertChipStackLockedByCurrentThread() at MTRDevice_Concrete.mm:1547 — your read was correct.
24826ed to
447c1ab
Compare
|
PR #72268: Size comparison from 4894db9 to 4121a4d Full report (35 builds for bl602, bl616, bl702, bl702l, cc13x4_26x4, cc32xx, efr32, esp32, nrfconnect, psoc6, qpg, realtek, stm32, telink)
|
4121a4d to
b851b87
Compare
| // Fire the advertisement well before the 1s watchdog can reach its | ||
| // deadline. nodeMayBeAdvertisingOperational is declared on | ||
| // MTRDevice_Concrete; dispatch via the runtime to avoid C++ headers. | ||
| SEL nodeMayBeAdvertisingOperationalSel = NSSelectorFromString(@"nodeMayBeAdvertisingOperational"); | ||
| XCTAssertTrue([device respondsToSelector:nodeMayBeAdvertisingOperationalSel]); | ||
| #pragma clang diagnostic push | ||
| #pragma clang diagnostic ignored "-Warc-performSelector-leaks" | ||
| [device performSelector:nodeMayBeAdvertisingOperationalSel]; | ||
| #pragma clang diagnostic pop |
There was a problem hiding this comment.
(FYI: This is Claude) Fixed in ee3a559: test077_AdvertisementClearsDeferral (renamed from test_AdvertisementClearsDeferral) now wraps the nodeMayBeAdvertisingOperational invocation in [device.deviceController asyncDispatchToMatterQueue:^{...}] with an XCTestExpectation, matching the pattern in test052/055/056/059/062. Confirmed assertChipStackLockedByCurrentThread() at MTRDevice_Concrete.mm:1547 — your read was correct.
| SEL nodeMayBeAdvertisingOperationalSel = NSSelectorFromString(@"nodeMayBeAdvertisingOperational"); | ||
| #pragma clang diagnostic push | ||
| #pragma clang diagnostic ignored "-Warc-performSelector-leaks" | ||
| [device performSelector:nodeMayBeAdvertisingOperationalSel]; | ||
| #pragma clang diagnostic pop |
There was a problem hiding this comment.
(FYI: This is Claude) Fixed in ee3a559: test079_OneShotLatch_PreventsReDeferral (renamed from test_OneShotLatch_PreventsReDeferral) now wraps the nodeMayBeAdvertisingOperational invocation in [device.deviceController asyncDispatchToMatterQueue:^{...}] with an XCTestExpectation. Confirmed assertChipStackLockedByCurrentThread() at MTRDevice_Concrete.mm:1547 — your read was correct.
| - (void)test_NonThreadDevice_NoDeferral | ||
| { | ||
| __auto_type * staleDevice = [MTRDevice deviceWithNodeID:@(kDeviceId1) controller:sController]; | ||
| [sController removeDevice:staleDevice]; | ||
|
|
There was a problem hiding this comment.
(FYI: This is Claude) Fixed in ee3a559: renamed all 13 newly added test_* methods to follow the existing testNNN_ convention — test074_NonThreadDevice_NoDeferral through test086_advertisement_during_deferral_with_removed_delegate (test073 was the previous high-water mark). Cross-references in comments and an NSLog tag updated.
b851b87 to
9651d6b
Compare
9651d6b to
f339612
Compare
|
PR #72268: Size comparison from 8a162c6 to 3e69187 Full report (21 builds for bl602, bl616, bl702, bl702l, cc13x4_26x4, cc32xx, nrfconnect, psoc6, qpg, realtek, stm32)
|
3e69187 to
e57cae6
Compare
|
PR #72268: Size comparison from 8a162c6 to afe3497 Full report (5 builds for cc32xx, realtek, stm32)
|
|
PR #72268: Size comparison from 8a162c6 to 629c150 Full report (35 builds for bl602, bl616, bl702, bl702l, cc13x4_26x4, cc32xx, efr32, esp32, nrfconnect, psoc6, qpg, realtek, stm32, telink)
|
|
PR #72268: Size comparison from 8a162c6 to c5a3f14 Full report (35 builds for bl602, bl616, bl702, bl702l, cc13x4_26x4, cc32xx, efr32, esp32, nrfconnect, psoc6, qpg, realtek, stm32, telink)
|
|
PR #72268: Size comparison from 8a162c6 to ab24387 Full report (21 builds for bl602, bl616, bl702, bl702l, cc13x4_26x4, cc32xx, nrfconnect, psoc6, qpg, realtek, stm32)
|
|
PR #72268: Size comparison from 8a162c6 to cad2fb3 Full report (16 builds for cc13x4_26x4, cc32xx, nrfconnect, psoc6, qpg, realtek, stm32)
|
|
PR #72268: Size comparison from 8a162c6 to 9a86b68 Full report (21 builds for bl602, bl616, bl702, bl702l, cc13x4_26x4, cc32xx, nrfconnect, psoc6, qpg, realtek, stm32)
|
|
PR #72268: Size comparison from 8a162c6 to 56dc8de Full report (28 builds for bl602, bl616, bl702, bl702l, cc32xx, efr32, nrfconnect, psoc6, realtek, stm32, telink)
|
andy31415
left a comment
There was a problem hiding this comment.
(FYI: this is Gemini reviewing)
Thank you for this contribution. During our automated static analysis of PR #72268, we identified three minor actionable findings concerning comment-code inconsistencies, test suite execution coverage, and lock-scope property access risks in MTRDevice_Concrete.mm.
| // Matter-queue snapshot to decide rather than touching readClient | ||
| // here on self.queue (which would violate | ||
| // assertChipStackLockedByCurrentThread()). | ||
| if (hadLiveReadClient) { |
There was a problem hiding this comment.
(FYI: this is Gemini)
Comment vs Code early-return inconsistency
The inline comment explicitly states that if the invariant is violated, we should NOT silently drop the advertisement signal. However, the code performs an early return; (line 1548), skipping the ensure-subscription path entirely. Please update the code to let it fall through, or update the comment if dropping the signal is indeed intended.
There was a problem hiding this comment.
(FYI: This is Claude) Addressed in 5e4367b.
The early-return is the intended behavior here: when the deferral is armed but a live ReadClient already exists, the live ReadClient owns the resubscribe path (CASE retries + the SDK's own operational-advertisement handling), so kicking off another _ensureSubscription would race against it. The comment used to claim the opposite ("we do NOT silently drop the advertisement signal") — that was wrong. The comment now states the drop is intentional and explains why, with the loud MTR_LOG_ERROR retained so the invariant violation stays surfaced for diagnosability. Thanks for catching the mismatch.
| // firing nodeMayBeAdvertisingOperational must NOT stand up a subscription. | ||
| // The advertisement-driven async block on self.queue must observe the | ||
| // missing delegate via the new _delegateExists guard and abort. | ||
| - (void)test086_advertisement_during_deferral_with_removed_delegate |
There was a problem hiding this comment.
(FYI: this is Gemini)
Unfinished regression test case test086
The test sets up the deferral state and asserts that the deferral is armed, but then immediately calls invalidate and cleans up. It never actually drives the operational advertisement path (nodeMayBeAdvertisingOperational) or waits for the 1s watchdog to fire, leaving the core assertion logic unexecuted.
There was a problem hiding this comment.
(FYI: This is Claude) Addressed in 5e4367b.
test086 now has a real body that exercises the _delegateExists guard end-to-end:
- Arms the deferral on a fresh Thread device.
- Calls
removeDelegate:(without invalidate) so the device has no observers but the deferral remains armed. - Dispatches
-nodeMayBeAdvertisingOperationalto the Matter queue viaasyncDispatchToMatterQueue:(same wrapper pattern test077 / test079 use, since the selector assertsassertChipStackLockedByCurrentThread()). - Drains the device queue, then asserts
unitTestDeferringFirstThreadSubscription == NO— i.e. the guard atMTRDevice_Concrete.mm:1532fired, cleared the flag, bumped the watchdog generation, and returned without standing up a subscription. - Waits past the 1s watchdog deadline and re-asserts the flag is still NO, so we know no phantom subscription leaked behind the guard.
Renamed to test086_AdvertisementAfterDelegateRemoved_AbandonsViaDelegateExistsGuard to reflect what it actually exercises. Thanks for the catch — the test was indeed unfinished as written.
| } inNanoseconds:0 description:description]; | ||
| // Self-clear the stored handle on fire; we're the only | ||
| // consumer and there's nothing left to cancel. | ||
| self.firstThreadSubscribeWatchdogBlock = nil; |
There was a problem hiding this comment.
(FYI: this is Gemini)
ObjC Property Setter Triggering KVO Under Lock
Setting self.firstThreadSubscribeWatchdogBlock = nil while holding an active _lock scope synchronously triggers Objective-C Key-Value Observing (KVO) notification callbacks. This introduces a severe risk of re-entrant deadlocks if any observer makes calls back into the device object. Consider assigning directly to the underlying instance variable instead.
There was a problem hiding this comment.
(FYI: This is Claude on Woody's behalf) Fixed in 381c20c — switched the lone self.firstThreadSubscribeWatchdogBlock = nil write to the backing ivar self->_firstThreadSubscribeWatchdogBlock = nil, matching the pattern used by every other write to this state (the other six call sites already used the ivar; this one was the lone holdout). Comment expanded to document why we avoid the property setter under _lock. Good catch — a synchronous KVO observer that re-entered MTRDevice would have deadlocked on _lock.
| [device setDelegate:delegate queue:queue]; | ||
| NSNumber * gen1 = [device valueForKey:@"firstThreadSubscribeWatchdogGeneration"]; | ||
| XCTAssertNotNil(gen1, @"firstThreadSubscribeWatchdogGeneration must be exposed via KVC"); | ||
| XCTAssertGreaterThan([gen1 unsignedLongLongValue], 0ULL, |
| @"Arm 1: setDelegate must arm the deferral"); | ||
| NSNumber * gen1 = [device valueForKey:@"firstThreadSubscribeWatchdogGeneration"]; | ||
| XCTAssertGreaterThan([gen1 unsignedLongLongValue], 0ULL, | ||
| @"Arm 1: generation must advance from 0 on fresh arm"); | ||
|
|
| @"Pre-reentry: latch must be set"); | ||
|
|
||
| NSNumber * genBefore = [device valueForKey:@"firstThreadSubscribeWatchdogGeneration"]; | ||
|
|
||
| // Re-enter _ensureSubscriptionForExistingDelegates during the deferral |
|
Quick FYI for anyone parsing the Copilot review comments on
The property declaration at Pinging because Copilot left the comments on the file but pointed at unrelated lines — wanted the human reviewers to know before they chased the wrong thread. |
|
PR #72268: Size comparison from cb6d219 to 5e4367b Full report (35 builds for bl602, bl616, bl702, bl702l, cc13x4_26x4, cc32xx, efr32, esp32, nrfconnect, psoc6, qpg, realtek, stm32, telink)
|
|
=== Posted from Claude ===
|
|
PR #72268: Size comparison from 3e1a6f8 to 7ae3532 Full report (21 builds for bl602, bl616, bl702, bl702l, cc13x4_26x4, cc32xx, nrfconnect, psoc6, qpg, realtek, stm32)
|
|
PR #72268: Size comparison from 7f8ee55 to 3c3b5ed Full report (35 builds for bl602, bl616, bl702, bl702l, cc13x4_26x4, cc32xx, efr32, esp32, nrfconnect, psoc6, qpg, realtek, stm32, telink)
|
|
PR #72268: Size comparison from 706b9c6 to 20a750c Full report (6 builds for cc32xx, nrfconnect, realtek, stm32)
|
|
PR #72268: Size comparison from c55b996 to 7e0206e Full report (35 builds for bl602, bl616, bl702, bl702l, cc13x4_26x4, cc32xx, efr32, esp32, nrfconnect, psoc6, qpg, realtek, stm32, telink)
|
|
PR #72268: Size comparison from c55b996 to f934110 Full report (21 builds for bl602, bl616, bl702, bl702l, cc13x4_26x4, cc32xx, nrfconnect, psoc6, qpg, realtek, stm32)
|
|
@andy31415 thanks for forwarding the Gemini FYI comments. I took a pass at the three you flagged and all three appear to be stale against the current tip (f934110):
Re-requesting your review in case you want to take another pass — happy to re-engage Gemini if you want me to trigger a fresh review on the current tip. |
|
PR #72268: Size comparison from c55b996 to a52ecba Full report (32 builds for bl602, bl616, bl702, bl702l, cc13x4_26x4, cc32xx, esp32, nrfconnect, psoc6, qpg, realtek, stm32, telink)
|
|
PR #72268: Size comparison from c55b996 to 6d6b8ea Full report (35 builds for bl602, bl616, bl702, bl702l, cc13x4_26x4, cc32xx, efr32, esp32, nrfconnect, psoc6, qpg, realtek, stm32, telink)
|
|
PR #72268: Size comparison from f1767a8 to a487f04 Full report (21 builds for bl602, bl616, bl702, bl702l, cc13x4_26x4, cc32xx, nrfconnect, psoc6, qpg, realtek, stm32)
|
|
PR #72268: Size comparison from f1767a8 to de59da1 Full report (35 builds for bl602, bl616, bl702, bl702l, cc13x4_26x4, cc32xx, efr32, esp32, nrfconnect, psoc6, qpg, realtek, stm32, telink)
|
|
PR #72268: Size comparison from 652855f to e3ea852 Full report (5 builds for cc32xx, realtek, stm32)
|
|
PR #72268: Size comparison from 9cf1485 to 0801863 Full report (25 builds for bl602, bl616, bl702, bl702l, cc13x4_26x4, cc32xx, efr32, esp32, nrfconnect, psoc6, qpg, realtek, stm32)
|
…sement seen
On a cold start where the Home app foregrounds before the Thread interface
has finished coming up, _delegateAdded -> _ensureSubscriptionForExistingDelegates
-> _setupSubscriptionWithReason fires the very first SubscribeRequest before
threadradiod's [WED_START] completes. The send fails immediately with
"OS Error 0x02000041: No route to host" (errno 65) from
UDPEndPointImplSockets.cpp. We then back off and retry, wasting 3-6 seconds
of coldstart on a doomed attempt.
Gate the first subscription attempt: if _deviceUsesThread is true and we have
no cached _lastSubscriptionIPAddress, defer scheduling the pool work until
either (a) nodeMayBeAdvertisingOperational fires for this device -- the strong
signal that the Thread interface is up and the accessory is reachable -- or
(b) a 1-second watchdog elapses -- the safety net for devices that never
advertise during the deferral window.
Implementation:
- New deferringFirstThreadSubscription / hasDeferredFirstThreadSubscription
properties on MTRDevice_Concrete. The first is YES only during the
deferral window; the second is a one-shot latch so subsequent re-entries
of _ensureSubscriptionForExistingDelegates (notably the one
nodeMayBeAdvertisingOperational triggers after clearing the deferral)
don't re-enter the deferral path.
- _ensureSubscriptionForExistingDelegates factors the existing pool-work
scheduling out into a local block scheduleSubscriptionPoolWork that the
deferral watchdog reuses verbatim.
- Gate short-circuits when we have evidence the device is already
reachable: a cached _lastSubscriptionIPAddress, or a most-recent report
timestamp within the last 60s. Either signal means the Thread interface
is up and there is no benefit to paying the 1s watchdog penalty.
- nodeMayBeAdvertisingOperational, when it observes a pending first-Thread
deferral, clears the flag and re-enters _ensureSubscriptionForExistingDelegates
with reason "operational advertisement seen" instead of going through the
normal _triggerResubscribeWithReason path (which would no-op because we
aren't yet in MTRInternalDeviceStateSubscribing).
- 1s watchdog runs on the device queue, takes _lock, double-checks the
deferral flag (the advertisement path may have cleared it racingly), and
if still set proceeds with the original scheduling.
- A monotonically-increasing firstThreadSubscribeWatchdogGeneration counter
nonces the watchdog block. dispatch_after is uncancellable, so a stale
watchdog from a prior arm cycle (e.g. after controllerSuspended ->
controllerResumed re-armed the deferral) would otherwise fire against a
freshly re-armed deferral and prematurely collapse the 1s safety window.
Every clear path (advertisement, invalidate, controllerSuspended) and
every fresh arm bumps the generation; the watchdog block captures
armedGeneration and bails on mismatch.
Only Thread devices are affected; Wi-Fi/Ethernet devices retain the existing
immediate behavior. In the worst case (device that never advertises), the
first subscribe is delayed by exactly 1s vs. today's "fire immediately, fail
in 3-6s, then retry" behavior -- still a substantial coldstart win.
Adds behavior tests covering the deferral, the advertisement-clears-deferral
path, the 1s watchdog, the invalidate-during-deferral safety path, the
stale-watchdog-vs-re-armed-deferral race, and hardening tests for the
Thread-interface-ready short-circuit and latch persistence across
removeDelegate/setDelegate cycles on the same instance.
The cached-IP property is std::optional<chip::Inet::IPAddress> and is not
KVC-bridgeable; tests use DEBUG-only helpers
(unitTestSetLastSubscriptionIPAddressFromString:,
unitTestHasCachedLastSubscriptionIPAddress) that parse via
chip::Inet::IPAddress::FromString and inspect has_value() under _lock.
Test infrastructure: snapshot atomic block-property to local before
nil-check + invoke in MTRDeviceTestDelegate.m. Defensive fix for the
classic ObjC race where the property getter's block-copy semantics race
the writer's reset to nil -- addresses upstream project-chip#39292 root cause pattern
for all sites without disabling coverage. Each of the 16 dispatch sites
(onReachable, onNotReachable, onInternalStateChanged,
onAttributeDataReceived, onEventDataReceived, onReportBegin, onReportEnd,
onDeviceCachePrimed, onDeviceConfigurationChanged,
onSubscriptionPoolDequeue, onSubscriptionPoolWorkComplete,
onClusterDataPersisted, onSubscriptionCallbackDelete, onSubscriptionReset,
onUTCTimeSet, onTimeSynchronizationLossDetected) now performs a single
atomic load + retain into a local before the nil-check, matching the
pattern Jeff Tung used to fix the same hazard in project-chip#41411.
Summary
SubscribeRequestfromMTRDeviceis sent before threadradiod's[WED_START]completes. The send fails immediately withOS Error 0x02000041: No route to host(errno 65) fromUDPEndPointImplSockets.cpp, then we back off and retry, wasting 3-6 seconds of cold-start time on a doomed attempt._deviceUsesThreadis true and we have no cached_lastSubscriptionIPAddress, defer scheduling the pool work until either (a)nodeMayBeAdvertisingOperationalfires for this device — the strong signal that the Thread interface is up and the accessory is reachable — or (b) a 1-second watchdog elapses as a safety net for devices that never advertise during the deferral window.Implementation notes
deferringFirstThreadSubscription/hasDeferredFirstThreadSubscriptionproperties onMTRDevice_Concrete. The first is YES only during the deferral window; the second is a one-shot latch so subsequent re-entries of_ensureSubscriptionForExistingDelegates(notably the onenodeMayBeAdvertisingOperationaltriggers after clearing the deferral) don't re-enter the deferral path._ensureSubscriptionForExistingDelegatesfactors the existing pool-work scheduling out into a local block that the deferral watchdog reuses verbatim.nodeMayBeAdvertisingOperational, when it observes a pending first-Thread deferral, clears the flag and re-enters_ensureSubscriptionForExistingDelegateswith reason "operational advertisement seen" instead of going through the normal_triggerResubscribeWithReasonpath (which would no-op because we aren't yet inMTRInternalDeviceStateSubscribing)._lock, double-checks the deferral flag (the advertisement path may have cleared it racingly), and if still set proceeds with the original scheduling.Testing
MTRDeviceTests.mcover:nodeMayBeAdvertisingOperationalre-enters and schedules.