Skip to content

Commit 87ebede

Browse files
committed
[MTRDevice] Defer first Thread subscription until operational advertisement seen
On a cold start where the Home app foregrounds before the Thread interface has finished coming up, _delegateAdded -> _ensureSubscriptionForExistingDelegates -> _setupSubscriptionWithReason fires the very first SubscribeRequest before threadradiod's [WED_START] completes. The send fails immediately with "OS Error 0x02000041: No route to host" (errno 65) from UDPEndPointImplSockets.cpp. We then back off and retry, wasting 3-6 seconds of coldstart on a doomed attempt. Gate the first subscription attempt: if _deviceUsesThread is true and we have no cached _lastSubscriptionIPAddress, defer scheduling the pool work until either (a) nodeMayBeAdvertisingOperational fires for this device -- the strong signal that the Thread interface is up and the accessory is reachable -- or (b) a 1-second watchdog elapses -- the safety net for devices that never advertise during the deferral window. Implementation: - New deferringFirstThreadSubscription / hasDeferredFirstThreadSubscription properties on MTRDevice_Concrete. The first is YES only during the deferral window; the second is a one-shot latch so subsequent re-entries of _ensureSubscriptionForExistingDelegates (notably the one nodeMayBeAdvertisingOperational triggers after clearing the deferral) don't re-enter the deferral path. - _ensureSubscriptionForExistingDelegates factors the existing pool-work scheduling out into a local block scheduleSubscriptionPoolWork that the deferral watchdog reuses verbatim. - nodeMayBeAdvertisingOperational, when it observes a pending first-Thread deferral, clears the flag and re-enters _ensureSubscriptionForExistingDelegates with reason "operational advertisement seen" instead of going through the normal _triggerResubscribeWithReason path (which would no-op because we aren't yet in MTRInternalDeviceStateSubscribing). - 1s watchdog runs on the device queue, takes _lock, double-checks the deferral flag (the advertisement path may have cleared it racingly), and if still set proceeds with the original scheduling. Only Thread devices are affected; Wi-Fi/Ethernet devices retain the existing immediate behavior. In the worst case (device that never advertises), the first subscribe is delayed by exactly 1s vs. today's "fire immediately, fail in 3-6s, then retry" behavior -- still a substantial coldstart win. Adds behavior tests covering the deferral, the advertisement-clears-deferral path, the 1s watchdog, and the invalidate-during-deferral safety path. rdar://168903097
1 parent efc29a8 commit 87ebede

2 files changed

Lines changed: 2367 additions & 15 deletions

File tree

src/darwin/Framework/CHIP/MTRDevice_Concrete.mm

Lines changed: 144 additions & 15 deletions
Original file line numberDiff line numberDiff line change
@@ -345,6 +345,21 @@ @interface MTRDevice_Concrete ()
345345
@property (nonatomic) MTRInternalDeviceState internalDeviceState;
346346
@property (nonatomic) BOOL doingCASEAttemptForDeviceMayBeReachable;
347347

348+
// Set to YES while we are waiting to fire our very first subscribe attempt on
349+
// a Thread device for which we have no cached IP address. The actual subscribe
350+
// fires when either (a) an operational mDNS advertisement arrives for this
351+
// device, or (b) a 1-second watchdog elapses, whichever comes first. This
352+
// avoids racing the very first SubscribeRequest against threadradiod
353+
// [WED_START], which would otherwise produce a No-route-to-host failure and
354+
// burn 3-6 seconds of coldstart on a retry.
355+
@property (nonatomic) BOOL deferringFirstThreadSubscription;
356+
357+
// Whether we have already gone through the first-Thread-subscribe deferral
358+
// path for this MTRDevice instance. Used so we only defer once: subsequent
359+
// re-entries (e.g. the one nodeMayBeAdvertisingOperational triggers after
360+
// clearing the deferral) must not defer again.
361+
@property (nonatomic) BOOL hasDeferredFirstThreadSubscription;
362+
348363
#define MTRDEVICE_SUBSCRIPTION_ATTEMPT_MIN_WAIT_SECONDS (1)
349364
#define MTRDEVICE_SUBSCRIPTION_ATTEMPT_MAX_WAIT_SECONDS (3600)
350365
@property (nonatomic) uint32_t lastSubscriptionAttemptWait;
@@ -1067,27 +1082,109 @@ - (void)_ensureSubscriptionForExistingDelegates:(NSString *)reason
10671082
mtr_weakify(self);
10681083
if ([self _deviceUsesThread]) {
10691084
MTR_LOG(" => %@ - device is a thread device, scheduling in pool", self);
1085+
1086+
// If a first-Thread-subscribe deferral is already in flight, do
1087+
// nothing on re-entry. The watchdog (or operational advertisement)
1088+
// that originally armed the gate is still on the hook to schedule
1089+
// the subscribe; coming back through here during the 1s window
1090+
// would otherwise either skip the gate (because
1091+
// hasDeferredFirstThreadSubscription is already YES, so
1092+
// shouldDeferForThreadColdstart evaluates to NO) and fall through
1093+
// to scheduleSubscriptionPoolWork(), bypassing the deferral.
1094+
if (_deferringFirstThreadSubscription) {
1095+
MTR_LOG("%@ - first Thread subscribe deferral still in flight; ignoring re-entry (%@)", self, reason);
1096+
return;
1097+
}
1098+
10701099
NSString * description = [NSString stringWithFormat:@"MTRDevice setDelegate first subscription / controller resume (%p)", self];
1071-
[self _scheduleSubscriptionPoolWork:^{
1100+
void (^scheduleSubscriptionPoolWork)(void) = ^{
10721101
mtr_strongify(self);
1073-
VerifyOrReturn(self, MTR_LOG_DEBUG("_ensureSubscriptionForExistingDelegates _scheduleSubscriptionPoolWork called back with nil MTRDevice"));
1074-
1075-
[self->_deviceController asyncDispatchToMatterQueue:^{
1102+
if (!self) {
1103+
return;
1104+
}
1105+
// _scheduleSubscriptionPoolWork: requires _lock be held by the
1106+
// caller. Most callers below already do this; the watchdog
1107+
// path takes the lock explicitly before invoking this block.
1108+
os_unfair_lock_assert_owner(&self->_lock);
1109+
[self _scheduleSubscriptionPoolWork:^{
10761110
mtr_strongify(self);
1077-
VerifyOrReturn(self, MTR_LOG_DEBUG("_ensureSubscriptionForExistingDelegates asyncDispatchToMatterQueue called back with nil MTRDevice"));
1111+
VerifyOrReturn(self, MTR_LOG_DEBUG("_ensureSubscriptionForExistingDelegates _scheduleSubscriptionPoolWork called back with nil MTRDevice"));
1112+
1113+
[self->_deviceController asyncDispatchToMatterQueue:^{
1114+
mtr_strongify(self);
1115+
VerifyOrReturn(self, MTR_LOG_DEBUG("_ensureSubscriptionForExistingDelegates asyncDispatchToMatterQueue called back with nil MTRDevice"));
1116+
1117+
std::lock_guard lock(self->_lock);
1118+
[self _setupSubscriptionWithReason:[NSString stringWithFormat:@"%@ and scheduled subscription is happening", reason]];
1119+
} errorHandler:^(NSError * _Nonnull error) {
1120+
mtr_strongify(self);
1121+
VerifyOrReturn(self, MTR_LOG_DEBUG("_ensureSubscriptionForExistingDelegates asyncDispatchToMatterQueue errored with nil MTRDevice"));
1122+
1123+
// If controller is not running, clear work item from the subscription queue
1124+
MTR_LOG_ERROR("%@ could not dispatch to matter queue for resubscription - error %@", self, error);
1125+
std::lock_guard lock(self->_lock);
1126+
[self _clearSubscriptionPoolWork];
1127+
}];
1128+
} inNanoseconds:0 description:description];
1129+
};
10781130

1079-
std::lock_guard lock(self->_lock);
1080-
[self _setupSubscriptionWithReason:[NSString stringWithFormat:@"%@ and scheduled subscription is happening", reason]];
1081-
} errorHandler:^(NSError * _Nonnull error) {
1131+
// Coldstart gate: if this is our first subscription attempt for a
1132+
// Thread device and we have no cached IP address yet, the Thread
1133+
// interface is very likely still coming up. Firing immediately
1134+
// would produce a No-route-to-host (errno 65) from the first
1135+
// SubscribeRequest before threadradiod's [WED_START] completes,
1136+
// and we'd burn 3-6s of coldstart on a wasted attempt. Defer
1137+
// until either an operational advertisement arrives for this
1138+
// device (the strong signal that the interface is up) or a 1s
1139+
// watchdog elapses (the safety net), whichever comes first.
1140+
//
1141+
// Short-circuit the gate if we have evidence the device is
1142+
// already reachable: a recent report (within the last 60s) means
1143+
// the Thread interface is up and the device is talking to us, so
1144+
// there is no benefit to paying the 1s watchdog penalty (this is
1145+
// the warm-Thread cold-app-launch path).
1146+
NSDate * lastReportTime = [_mostRecentReportTimes lastObject];
1147+
BOOL hasRecentReport = (lastReportTime != nil
1148+
&& -[lastReportTime timeIntervalSinceNow] < 60.0);
1149+
BOOL shouldDeferForThreadColdstart = (!_lastSubscriptionIPAddress.has_value()
1150+
&& !hasRecentReport
1151+
&& !_deferringFirstThreadSubscription
1152+
&& !_hasDeferredFirstThreadSubscription);
1153+
if (shouldDeferForThreadColdstart) {
1154+
MTR_LOG("%@ - deferring first Thread subscribe up to 1s; waiting for operational advertisement", self);
1155+
_deferringFirstThreadSubscription = YES;
1156+
_hasDeferredFirstThreadSubscription = YES;
1157+
1158+
static constexpr int64_t kFirstThreadSubscribeWatchdogNs = static_cast<int64_t>(NSEC_PER_SEC);
1159+
dispatch_after(dispatch_time(DISPATCH_TIME_NOW, kFirstThreadSubscribeWatchdogNs), self.queue, ^{
10821160
mtr_strongify(self);
1083-
VerifyOrReturn(self, MTR_LOG_DEBUG("_ensureSubscriptionForExistingDelegates asyncDispatchToMatterQueue errored with nil MTRDevice"));
1084-
1085-
// If controller is not running, clear work item from the subscription queue
1086-
MTR_LOG_ERROR("%@ could not dispatch to matter queue for resubscription - error %@", self, error);
1161+
if (!self) {
1162+
return;
1163+
}
10871164
std::lock_guard lock(self->_lock);
1088-
[self _clearSubscriptionPoolWork];
1089-
}];
1090-
} inNanoseconds:0 description:description];
1165+
if (!self->_deferringFirstThreadSubscription) {
1166+
// Operational advertisement already triggered the subscribe.
1167+
return;
1168+
}
1169+
self->_deferringFirstThreadSubscription = NO;
1170+
// Defensive: invalidate / controllerSuspended could have
1171+
// landed during the 1s window and torn down the
1172+
// subscription path even though they also clear
1173+
// deferringFirstThreadSubscription (so the early-return
1174+
// above usually catches this). Belt-and-suspenders: re-
1175+
// check that subscriptions are still allowed before
1176+
// scheduling the pool work.
1177+
if (![self _subscriptionsAllowed]) {
1178+
MTR_LOG("%@ - first Thread subscribe watchdog: subscriptions no longer allowed; aborting", self);
1179+
return;
1180+
}
1181+
MTR_LOG("%@ - first Thread subscribe watchdog fired; proceeding without operational advertisement", self);
1182+
scheduleSubscriptionPoolWork();
1183+
});
1184+
return;
1185+
}
1186+
1187+
scheduleSubscriptionPoolWork();
10911188
} else {
10921189
[_deviceController asyncDispatchToMatterQueue:^{
10931190
mtr_strongify(self);
@@ -1119,6 +1216,13 @@ - (void)invalidate
11191216
// attempt, since we now have no delegate.
11201217
_reattemptingSubscription = NO;
11211218

1219+
// Clear the first-Thread-subscribe deferral state under _lock so that any
1220+
// already-armed watchdog short-circuits at its early-return when it
1221+
// observes deferringFirstThreadSubscription == NO, and so a future
1222+
// re-attach starts from a clean slate.
1223+
_deferringFirstThreadSubscription = NO;
1224+
_hasDeferredFirstThreadSubscription = NO;
1225+
11221226
// Clear subscription pool work item if it's in progress, to avoid forever
11231227
// taking up a slot in the controller's work queue.
11241228
[self _clearSubscriptionPoolWork];
@@ -1152,6 +1256,23 @@ - (void)nodeMayBeAdvertisingOperational
11521256

11531257
MTR_LOG("%@ saw new operational advertisement", self);
11541258

1259+
// If we were deferring our very first Thread subscribe waiting for exactly
1260+
// this signal, clear the deferral flag and trigger the subscribe via the
1261+
// normal path immediately rather than waiting for the 1s watchdog.
1262+
{
1263+
std::lock_guard lock(_lock);
1264+
if (_deferringFirstThreadSubscription) {
1265+
_deferringFirstThreadSubscription = NO;
1266+
MTR_LOG("%@ - operational advertisement cleared first-Thread-subscribe deferral", self);
1267+
[self _ensureSubscriptionForExistingDelegates:@"operational advertisement seen"];
1268+
return;
1269+
}
1270+
MTR_LOG("%@ - operational advertisement cleared first-Thread-subscribe deferral", self);
1271+
[self _ensureSubscriptionForExistingDelegates:@"operational advertisement seen"];
1272+
return;
1273+
}
1274+
}
1275+
11551276
[self _triggerResubscribeWithReason:@"operational advertisement seen"
11561277
nodeLikelyReachable:YES];
11571278
}
@@ -5164,6 +5285,14 @@ - (void)controllerSuspended
51645285
self.suspended = YES;
51655286
[self _resetSubscriptionWithReasonString:@"Controller suspended"];
51665287

5288+
// Clear any pending first-Thread-subscribe deferral so an already-armed
5289+
// watchdog short-circuits at its early-return rather than scheduling a
5290+
// subscribe that _subscriptionsAllowed will refuse anyway. The latch is
5291+
// also cleared so that a controllerResumed -> _ensureSubscriptionForExistingDelegates
5292+
// re-entry can re-arm the deferral path against a fresh resume.
5293+
_deferringFirstThreadSubscription = NO;
5294+
_hasDeferredFirstThreadSubscription = NO;
5295+
51675296
// Ensure that any pre-existing resubscribe attempts we control don't try to
51685297
// do anything.
51695298
_reattemptingSubscription = NO;

0 commit comments

Comments
 (0)