Skip to content

Commit 447c1ab

Browse files
committed
[MTRDevice] Defer first Thread subscription until operational advertisement seen
On a cold start where the Home app foregrounds before the Thread interface has finished coming up, _delegateAdded -> _ensureSubscriptionForExistingDelegates -> _setupSubscriptionWithReason fires the very first SubscribeRequest before threadradiod's [WED_START] completes. The send fails immediately with "OS Error 0x02000041: No route to host" (errno 65) from UDPEndPointImplSockets.cpp. We then back off and retry, wasting 3-6 seconds of coldstart on a doomed attempt. Gate the first subscription attempt: if _deviceUsesThread is true and we have no cached _lastSubscriptionIPAddress, defer scheduling the pool work until either (a) nodeMayBeAdvertisingOperational fires for this device -- the strong signal that the Thread interface is up and the accessory is reachable -- or (b) a 1-second watchdog elapses -- the safety net for devices that never advertise during the deferral window. Implementation: - New deferringFirstThreadSubscription / hasDeferredFirstThreadSubscription properties on MTRDevice_Concrete. The first is YES only during the deferral window; the second is a one-shot latch so subsequent re-entries of _ensureSubscriptionForExistingDelegates (notably the one nodeMayBeAdvertisingOperational triggers after clearing the deferral) don't re-enter the deferral path. - _ensureSubscriptionForExistingDelegates factors the existing pool-work scheduling out into a local block scheduleSubscriptionPoolWork that the deferral watchdog reuses verbatim. - nodeMayBeAdvertisingOperational, when it observes a pending first-Thread deferral, clears the flag and re-enters _ensureSubscriptionForExistingDelegates with reason "operational advertisement seen" instead of going through the normal _triggerResubscribeWithReason path (which would no-op because we aren't yet in MTRInternalDeviceStateSubscribing). - 1s watchdog runs on the device queue, takes _lock, double-checks the deferral flag (the advertisement path may have cleared it racingly), and if still set proceeds with the original scheduling. - A monotonically-increasing firstThreadSubscribeWatchdogGeneration counter nonces the watchdog block. dispatch_after is uncancellable, so a stale watchdog from a prior arm cycle (e.g. after controllerSuspended -> controllerResumed re-armed the deferral) would otherwise fire against a freshly re-armed deferral and prematurely collapse the 1s safety window. Every clear path (advertisement, invalidate, controllerSuspended) and every fresh arm bumps the generation; the watchdog block captures armedGeneration and bails on mismatch. Only Thread devices are affected; Wi-Fi/Ethernet devices retain the existing immediate behavior. In the worst case (device that never advertises), the first subscribe is delayed by exactly 1s vs. today's "fire immediately, fail in 3-6s, then retry" behavior -- still a substantial coldstart win. Adds behavior tests covering the deferral, the advertisement-clears-deferral path, the 1s watchdog, the invalidate-during-deferral safety path, and the stale-watchdog-vs-re-armed-deferral race.
1 parent efc29a8 commit 447c1ab

2 files changed

Lines changed: 2605 additions & 15 deletions

File tree

src/darwin/Framework/CHIP/MTRDevice_Concrete.mm

Lines changed: 174 additions & 15 deletions
Original file line numberDiff line numberDiff line change
@@ -345,6 +345,30 @@ @interface MTRDevice_Concrete ()
345345
@property (nonatomic) MTRInternalDeviceState internalDeviceState;
346346
@property (nonatomic) BOOL doingCASEAttemptForDeviceMayBeReachable;
347347

348+
// Set to YES while we are waiting to fire our very first subscribe attempt on
349+
// a Thread device for which we have no cached IP address. The actual subscribe
350+
// fires when either (a) an operational mDNS advertisement arrives for this
351+
// device, or (b) a 1-second watchdog elapses, whichever comes first. This
352+
// avoids racing the very first SubscribeRequest against threadradiod
353+
// [WED_START], which would otherwise produce a No-route-to-host failure and
354+
// burn 3-6 seconds of coldstart on a retry.
355+
@property (nonatomic) BOOL deferringFirstThreadSubscription;
356+
357+
// Whether we have already gone through the first-Thread-subscribe deferral
358+
// path for this MTRDevice instance. Used so we only defer once: subsequent
359+
// re-entries (e.g. the one nodeMayBeAdvertisingOperational triggers after
360+
// clearing the deferral) must not defer again.
361+
@property (nonatomic) BOOL hasDeferredFirstThreadSubscription;
362+
363+
// Monotonically-increasing generation counter for the first-Thread-subscribe
364+
// deferral watchdog. The 1-second watchdog is scheduled via dispatch_after,
365+
// which is uncancellable; a stale watchdog from a prior deferral cycle could
366+
// otherwise fire against a freshly re-armed deferral and prematurely collapse
367+
// the safety window. Each arm captures the current generation, and every
368+
// path that clears the deferral flag increments this counter so a stale
369+
// watchdog observes a generation mismatch and exits without doing work.
370+
@property (nonatomic) uint64_t firstThreadSubscribeWatchdogGeneration;
371+
348372
#define MTRDEVICE_SUBSCRIPTION_ATTEMPT_MIN_WAIT_SECONDS (1)
349373
#define MTRDEVICE_SUBSCRIPTION_ATTEMPT_MAX_WAIT_SECONDS (3600)
350374
@property (nonatomic) uint32_t lastSubscriptionAttemptWait;
@@ -1067,27 +1091,121 @@ - (void)_ensureSubscriptionForExistingDelegates:(NSString *)reason
10671091
mtr_weakify(self);
10681092
if ([self _deviceUsesThread]) {
10691093
MTR_LOG(" => %@ - device is a thread device, scheduling in pool", self);
1094+
1095+
// If a first-Thread-subscribe deferral is already in flight, do
1096+
// nothing on re-entry. The watchdog (or operational advertisement)
1097+
// that originally armed the gate is still on the hook to schedule
1098+
// the subscribe; coming back through here during the 1s window
1099+
// would otherwise either skip the gate (because
1100+
// hasDeferredFirstThreadSubscription is already YES, so
1101+
// shouldDeferForThreadColdstart evaluates to NO) and fall through
1102+
// to scheduleSubscriptionPoolWork(), bypassing the deferral.
1103+
if (_deferringFirstThreadSubscription) {
1104+
MTR_LOG("%@ - first Thread subscribe deferral still in flight; ignoring re-entry (%@)", self, reason);
1105+
return;
1106+
}
1107+
10701108
NSString * description = [NSString stringWithFormat:@"MTRDevice setDelegate first subscription / controller resume (%p)", self];
1071-
[self _scheduleSubscriptionPoolWork:^{
1109+
void (^scheduleSubscriptionPoolWork)(void) = ^{
10721110
mtr_strongify(self);
1073-
VerifyOrReturn(self, MTR_LOG_DEBUG("_ensureSubscriptionForExistingDelegates _scheduleSubscriptionPoolWork called back with nil MTRDevice"));
1074-
1075-
[self->_deviceController asyncDispatchToMatterQueue:^{
1111+
if (!self) {
1112+
return;
1113+
}
1114+
// _scheduleSubscriptionPoolWork: requires _lock be held by the
1115+
// caller. Most callers below already do this; the watchdog
1116+
// path takes the lock explicitly before invoking this block.
1117+
os_unfair_lock_assert_owner(&self->_lock);
1118+
[self _scheduleSubscriptionPoolWork:^{
10761119
mtr_strongify(self);
1077-
VerifyOrReturn(self, MTR_LOG_DEBUG("_ensureSubscriptionForExistingDelegates asyncDispatchToMatterQueue called back with nil MTRDevice"));
1120+
VerifyOrReturn(self, MTR_LOG_DEBUG("_ensureSubscriptionForExistingDelegates _scheduleSubscriptionPoolWork called back with nil MTRDevice"));
1121+
1122+
[self->_deviceController asyncDispatchToMatterQueue:^{
1123+
mtr_strongify(self);
1124+
VerifyOrReturn(self, MTR_LOG_DEBUG("_ensureSubscriptionForExistingDelegates asyncDispatchToMatterQueue called back with nil MTRDevice"));
1125+
1126+
std::lock_guard lock(self->_lock);
1127+
[self _setupSubscriptionWithReason:[NSString stringWithFormat:@"%@ and scheduled subscription is happening", reason]];
1128+
} errorHandler:^(NSError * _Nonnull error) {
1129+
mtr_strongify(self);
1130+
VerifyOrReturn(self, MTR_LOG_DEBUG("_ensureSubscriptionForExistingDelegates asyncDispatchToMatterQueue errored with nil MTRDevice"));
1131+
1132+
// If controller is not running, clear work item from the subscription queue
1133+
MTR_LOG_ERROR("%@ could not dispatch to matter queue for resubscription - error %@", self, error);
1134+
std::lock_guard lock(self->_lock);
1135+
[self _clearSubscriptionPoolWork];
1136+
}];
1137+
} inNanoseconds:0 description:description];
1138+
};
10781139

1079-
std::lock_guard lock(self->_lock);
1080-
[self _setupSubscriptionWithReason:[NSString stringWithFormat:@"%@ and scheduled subscription is happening", reason]];
1081-
} errorHandler:^(NSError * _Nonnull error) {
1140+
// Coldstart gate: if this is our first subscription attempt for a
1141+
// Thread device and we have no cached IP address yet, the Thread
1142+
// interface is very likely still coming up. Firing immediately
1143+
// would produce a No-route-to-host (errno 65) from the first
1144+
// SubscribeRequest before threadradiod's [WED_START] completes,
1145+
// and we'd burn 3-6s of coldstart on a wasted attempt. Defer
1146+
// until either an operational advertisement arrives for this
1147+
// device (the strong signal that the interface is up) or a 1s
1148+
// watchdog elapses (the safety net), whichever comes first.
1149+
//
1150+
// Short-circuit the gate if we have evidence the device is
1151+
// already reachable: a recent report (within the last 60s) means
1152+
// the Thread interface is up and the device is talking to us, so
1153+
// there is no benefit to paying the 1s watchdog penalty (this is
1154+
// the warm-Thread cold-app-launch path).
1155+
NSDate * lastReportTime = [_mostRecentReportTimes lastObject];
1156+
BOOL hasRecentReport = (lastReportTime != nil
1157+
&& -[lastReportTime timeIntervalSinceNow] < 60.0);
1158+
BOOL shouldDeferForThreadColdstart = (!_lastSubscriptionIPAddress.has_value()
1159+
&& !hasRecentReport
1160+
&& !_deferringFirstThreadSubscription
1161+
&& !_hasDeferredFirstThreadSubscription);
1162+
if (shouldDeferForThreadColdstart) {
1163+
MTR_LOG("%@ - deferring first Thread subscribe up to 1s; waiting for operational advertisement", self);
1164+
_deferringFirstThreadSubscription = YES;
1165+
_hasDeferredFirstThreadSubscription = YES;
1166+
1167+
// Bump the generation so any stale watchdog from a prior arm
1168+
// (cleared and re-armed within the same MTRDevice lifetime —
1169+
// e.g. controllerSuspended -> controllerResumed) cannot fire
1170+
// against this fresh arm. The new watchdog block captures
1171+
// armedGeneration and bails on mismatch.
1172+
uint64_t armedGeneration = ++_firstThreadSubscribeWatchdogGeneration;
1173+
1174+
static constexpr int64_t kFirstThreadSubscribeWatchdogNs = static_cast<int64_t>(NSEC_PER_SEC);
1175+
dispatch_after(dispatch_time(DISPATCH_TIME_NOW, kFirstThreadSubscribeWatchdogNs), self.queue, ^{
10821176
mtr_strongify(self);
1083-
VerifyOrReturn(self, MTR_LOG_DEBUG("_ensureSubscriptionForExistingDelegates asyncDispatchToMatterQueue errored with nil MTRDevice"));
1084-
1085-
// If controller is not running, clear work item from the subscription queue
1086-
MTR_LOG_ERROR("%@ could not dispatch to matter queue for resubscription - error %@", self, error);
1177+
if (!self) {
1178+
return;
1179+
}
10871180
std::lock_guard lock(self->_lock);
1088-
[self _clearSubscriptionPoolWork];
1089-
}];
1090-
} inNanoseconds:0 description:description];
1181+
if (self->_firstThreadSubscribeWatchdogGeneration != armedGeneration) {
1182+
// A subsequent arm/clear cycle bumped the generation;
1183+
// this watchdog is stale and must not act.
1184+
return;
1185+
}
1186+
if (!self->_deferringFirstThreadSubscription) {
1187+
// Operational advertisement already triggered the subscribe.
1188+
return;
1189+
}
1190+
self->_deferringFirstThreadSubscription = NO;
1191+
// Defensive: invalidate / controllerSuspended could have
1192+
// landed during the 1s window and torn down the
1193+
// subscription path even though they also clear
1194+
// deferringFirstThreadSubscription (so the early-return
1195+
// above usually catches this). Belt-and-suspenders: re-
1196+
// check that subscriptions are still allowed before
1197+
// scheduling the pool work.
1198+
if (![self _subscriptionsAllowed]) {
1199+
MTR_LOG("%@ - first Thread subscribe watchdog: subscriptions no longer allowed; aborting", self);
1200+
return;
1201+
}
1202+
MTR_LOG("%@ - first Thread subscribe watchdog fired; proceeding without operational advertisement", self);
1203+
scheduleSubscriptionPoolWork();
1204+
});
1205+
return;
1206+
}
1207+
1208+
scheduleSubscriptionPoolWork();
10911209
} else {
10921210
[_deviceController asyncDispatchToMatterQueue:^{
10931211
mtr_strongify(self);
@@ -1119,6 +1237,18 @@ - (void)invalidate
11191237
// attempt, since we now have no delegate.
11201238
_reattemptingSubscription = NO;
11211239

1240+
// Clear the first-Thread-subscribe deferral state under _lock so that any
1241+
// already-armed watchdog short-circuits at its early-return when it
1242+
// observes deferringFirstThreadSubscription == NO, and so a future
1243+
// re-attach starts from a clean slate.
1244+
_deferringFirstThreadSubscription = NO;
1245+
_hasDeferredFirstThreadSubscription = NO;
1246+
// Bump the generation so that any in-flight watchdog from before
1247+
// invalidate is unambiguously stale. A fresh MTRDevice instance always
1248+
// starts at generation 0; this only matters within the lifetime of a
1249+
// single instance (invalidate followed by re-use is uncommon but legal).
1250+
++_firstThreadSubscribeWatchdogGeneration;
1251+
11221252
// Clear subscription pool work item if it's in progress, to avoid forever
11231253
// taking up a slot in the controller's work queue.
11241254
[self _clearSubscriptionPoolWork];
@@ -1152,6 +1282,22 @@ - (void)nodeMayBeAdvertisingOperational
11521282

11531283
MTR_LOG("%@ saw new operational advertisement", self);
11541284

1285+
// If we were deferring our very first Thread subscribe waiting for exactly
1286+
// this signal, clear the deferral flag and trigger the subscribe via the
1287+
// normal path immediately rather than waiting for the 1s watchdog.
1288+
{
1289+
std::lock_guard lock(_lock);
1290+
if (_deferringFirstThreadSubscription) {
1291+
_deferringFirstThreadSubscription = NO;
1292+
// Bump the watchdog generation so any in-flight stale watchdog
1293+
// does not fire against a future re-arm of the deferral.
1294+
++_firstThreadSubscribeWatchdogGeneration;
1295+
MTR_LOG("%@ - operational advertisement cleared first-Thread-subscribe deferral", self);
1296+
[self _ensureSubscriptionForExistingDelegates:@"operational advertisement seen"];
1297+
return;
1298+
}
1299+
}
1300+
11551301
[self _triggerResubscribeWithReason:@"operational advertisement seen"
11561302
nodeLikelyReachable:YES];
11571303
}
@@ -5164,6 +5310,19 @@ - (void)controllerSuspended
51645310
self.suspended = YES;
51655311
[self _resetSubscriptionWithReasonString:@"Controller suspended"];
51665312

5313+
// Clear any pending first-Thread-subscribe deferral so an already-armed
5314+
// watchdog short-circuits at its early-return rather than scheduling a
5315+
// subscribe that _subscriptionsAllowed will refuse anyway. The latch is
5316+
// also cleared so that a controllerResumed -> _ensureSubscriptionForExistingDelegates
5317+
// re-entry can re-arm the deferral path against a fresh resume.
5318+
_deferringFirstThreadSubscription = NO;
5319+
_hasDeferredFirstThreadSubscription = NO;
5320+
// Bump the generation so any in-flight watchdog from before suspend is
5321+
// unambiguously stale — this is the canonical re-arm path
5322+
// (suspend -> resume) where a stale watchdog could otherwise collapse
5323+
// the safety window of the freshly-armed deferral.
5324+
++_firstThreadSubscribeWatchdogGeneration;
5325+
51675326
// Ensure that any pre-existing resubscribe attempts we control don't try to
51685327
// do anything.
51695328
_reattemptingSubscription = NO;

0 commit comments

Comments
 (0)