Skip to content

Commit 003a5ed

Browse files
committed
[MTRCommissioning] Add watchdog for client failure to advance past PASE
If a client (Home app, HomeSmartMatter appex, etc.) implements commissioning:paseSessionEstablishmentComplete: but then drops the flow without ever calling commissionNodeWithID: or cancelCommissioningForNodeID:, the MTRCommissioningOperation sits in isWaitingAfterPASEEstablished = YES forever. MTRDeviceController_Concrete keeps the operation alive via currentInternalCommissioning, so every subsequent commissioning attempt hits the busy check in startWithController and fails with CHIP_ERROR_BUSY (0xDB). The user's symptom is "Matter accessories refuse to connect" indefinitely until the daemon (or app) is restarted. Add a 5-minute dispatch_source_t watchdog armed when we set isWaitingAfterPASEEstablished = YES and cancelled when: - the property is set back to NO (via the new custom setter that fires whenever commissionNodeWithID: succeeds on the controller), - the operation is stopped, - any terminal _dispatchCommissioningError path runs, - or the operation deallocates. On expiry, the watchdog stops the in-flight commissioning via stopCommissioning:forCommissioningID: (which routes through StopPairing in both the legacy isInternallyCreated:YES flow AND the public-API flow) and routes a CHIP_ERROR_TIMEOUT through the standard _dispatchCommissioningError path so client state can settle back to "no commissioning in flight". Five minutes is comfortably longer than the commissionee fail-safe (60-180 seconds) but short enough that the user is unblocked within a single sit-down troubleshooting session. Also add a defensive secondary patch in setupCommissioningSessionWithPayload:newNodeID:error: -- if a stale internally-created commissioning is still parked at the post-PASE waiting state with a *different* setup payload, treat the new call as an implicit cancel of the previous one. Without this, the new call would have failed with CHIP_ERROR_BUSY because the controller still considered the prior commissioning in flight. The implicit cancel propagates any StopPairing failure back to the caller rather than silently starting a new commissioning that is about to hit the same busy state. The implicit cancel needs the prior operation's commissioningID, so a small read-only commissioningID accessor is added to MTRCommissioningOperation_Internal.h. Threading: - The isWaitingAfterPASEEstablished setter writes the flag synchronously under an os_unfair_lock so a caller that immediately reads it back observes the updated value, then bounces only the watchdog teardown side effect onto _delegateQueue (which is the queue that owns the dispatch_source_t timer). - The watchdog event_handler consults isWaitingAfterPASEEstablished as a late-fire guard: if the client has legitimately advanced past the post-PASE waiting state but a timer block was already posted to the queue ahead of the cancel, the handler swallows the late fire instead of routing a spurious CHIP_ERROR_TIMEOUT into a now-active flow. - -stop and _dispatchCommissioningError clear isWaitingAfterPASEEstablished synchronously before bouncing the watchdog cancel, so a timer event already enqueued on _delegateQueue is suppressed by the late-fire guard. - The fire path uses stopCommissioning:forCommissioningID: rather than cancelCommissioningForNodeID: so StopPairing runs on the CHIP layer in both the isInternallyCreated:YES and isInternallyCreated:NO flows, and checks the BOOL return so that a watchdog firing against a commissioning that has already been replaced does not surface a spurious timeout into the successor. - If dispatch_source_create fails when arming the watchdog, the establishment-done path now routes a CHIP_ERROR_NO_MEMORY through the standard failure path rather than entering the unbounded wait state silently. rdar://171083184
1 parent efc29a8 commit 003a5ed

6 files changed

Lines changed: 3427 additions & 3 deletions

File tree

src/darwin/Framework/CHIP/MTRCommissioningOperation.mm

Lines changed: 273 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -16,10 +16,12 @@
1616

1717
#import <Foundation/Foundation.h>
1818
#import <Matter/MTRDeviceAttestationDelegate.h>
19+
#import <os/lock.h>
1920

2021
#import "MTRCommissioningDelegate_Internal.h"
2122
#import "MTRCommissioningOperation.h"
2223
#import "MTRCommissioningOperation_Internal.h"
24+
#import "MTRCommissioningOperation_Test.h"
2325
#import "MTRDefines_Internal.h"
2426
#import "MTRDeviceControllerDelegate_Internal.h"
2527
#import "MTRDeviceController_Concrete.h"
@@ -36,6 +38,19 @@
3638
using namespace chip;
3739
using namespace chip::Tracing::DarwinFramework;
3840

41+
// Production interval after which the post-PASE watchdog fires. Five
42+
// minutes is comfortably longer than the commissionee fail-safe (60-180
43+
// seconds) but short enough that the user gets unblocked well within a
44+
// single sit-down troubleshooting session. Exposed via
45+
// MTRCommissioningOperation_Test.h so tests can assert against the same
46+
// value production uses.
47+
extern const NSTimeInterval kMTRPostPASEWatchdogInterval = 5 * 60;
48+
49+
// Test-only override for kMTRPostPASEWatchdogInterval. Non-zero values
50+
// take effect for all subsequently armed watchdogs in the current
51+
// process; production reads kMTRPostPASEWatchdogInterval directly.
52+
static NSTimeInterval sMTRPostPASEWatchdogIntervalForTesting = 0;
53+
3954
@interface MTRCommissioningOperationDeviceAttestationDelegate : NSObject <MTRDeviceAttestationDelegate>
4055
@property (nonatomic, weak) MTRCommissioningOperation * commissioningOperation;
4156
@end
@@ -49,6 +64,67 @@ @implementation MTRCommissioningOperation {
4964
id<MTRCommissioningDelegate> __weak _delegate;
5065
dispatch_queue_t _delegateQueue;
5166
MTRDeviceController_Concrete * __weak _controller;
67+
// Watchdog timer armed once PASE has been established to bound how long we
68+
// hold the controller in the "in-progress commissioning" state if the
69+
// client (Home app, HomeSmartMatter appex, etc.) drops the flow without
70+
// ever calling commissionNodeWithID: or cancelCommissioningForNodeID:.
71+
// Without this watchdog, _currentInternalCommissioning is never cleared
72+
// and every subsequent commissioning attempt hits CHIP_ERROR_BUSY (0xDB)
73+
// until the process restarts.
74+
//
75+
// _postPASEWatchdog is read/written only on _delegateQueue.
76+
dispatch_source_t _postPASEWatchdog;
77+
// Backing ivar for the isWaitingAfterPASEEstablished property. Reads and
78+
// writes are serialized by _stateLock so callers from any queue see a
79+
// consistent value immediately after the setter returns; the watchdog
80+
// teardown that runs as a side effect of clearing the flag is bounced
81+
// onto _delegateQueue (which is the only queue allowed to touch
82+
// _postPASEWatchdog).
83+
BOOL _isWaitingAfterPASEEstablished;
84+
os_unfair_lock _stateLock;
85+
// Test-only one-shot fault injection. When YES, the next call to
86+
// _armPostPASEWatchdog will simulate a dispatch_source_create failure
87+
// (return NO without arming the timer) and clear the flag. Read and
88+
// written only on _delegateQueue, mirroring _postPASEWatchdog itself.
89+
BOOL _forceNextArmFailureForTesting;
90+
}
91+
92+
@synthesize commissioningID = _commissioningID;
93+
94+
- (BOOL)isWaitingAfterPASEEstablished
95+
{
96+
os_unfair_lock_lock(&_stateLock);
97+
BOOL value = _isWaitingAfterPASEEstablished;
98+
os_unfair_lock_unlock(&_stateLock);
99+
return value;
100+
}
101+
102+
- (void)setIsWaitingAfterPASEEstablished:(BOOL)isWaitingAfterPASEEstablished
103+
{
104+
// Write the flag synchronously under _stateLock so that any caller that
105+
// immediately reads back isWaitingAfterPASEEstablished observes the
106+
// updated value (callers like MTRDeviceController_Concrete depend on
107+
// this caller-visible ordering). Only the watchdog teardown side
108+
// effect is bounced onto _delegateQueue, since _postPASEWatchdog itself
109+
// is _delegateQueue-owned.
110+
BOOL changedToNo = NO;
111+
os_unfair_lock_lock(&_stateLock);
112+
if (_isWaitingAfterPASEEstablished != isWaitingAfterPASEEstablished) {
113+
_isWaitingAfterPASEEstablished = isWaitingAfterPASEEstablished;
114+
changedToNo = !isWaitingAfterPASEEstablished;
115+
}
116+
os_unfair_lock_unlock(&_stateLock);
117+
118+
if (changedToNo) {
119+
// The client has either started commissioning (via
120+
// commissionNodeWithID:) or moved on; the watchdog no longer needs
121+
// to fire. The cancel must happen on _delegateQueue because that
122+
// is where _postPASEWatchdog is owned and where the timer's
123+
// event_handler is delivered.
124+
dispatch_async(_delegateQueue, ^{
125+
[self _cancelPostPASEWatchdog];
126+
});
127+
}
52128
}
53129

54130
- (instancetype)initWithParameters:(MTRCommissioningParameters *)parameters
@@ -97,13 +173,135 @@ - (instancetype)initWithParameters:(MTRCommissioningParameters *)parameters
97173
_isInternallyCreated = isInternallyCreated;
98174
_delegate = delegate;
99175
_delegateQueue = queue;
176+
_stateLock = OS_UNFAIR_LOCK_INIT;
100177

101178
// Don't hold on to the provided attestation delegate, which we never use.
102179
_parameters.deviceAttestationDelegate = nil;
103180

104181
return self;
105182
}
106183

184+
- (void)dealloc
185+
{
186+
// Defensive: ensure the watchdog timer doesn't outlive us. In all
187+
// non-dealloc code paths, _postPASEWatchdog is read/written only on
188+
// _delegateQueue. In dealloc itself the timer source's event_handler
189+
// cannot still be in flight (the handler retains self via mtr_strongify
190+
// for its duration), so a synchronous cancel here is safe regardless
191+
// of which queue is executing dealloc.
192+
if (_postPASEWatchdog) {
193+
dispatch_source_cancel(_postPASEWatchdog);
194+
_postPASEWatchdog = nil;
195+
}
196+
}
197+
198+
- (BOOL)_armPostPASEWatchdog
199+
{
200+
// _postPASEWatchdog is owned by _delegateQueue; production callers
201+
// invoke this from inside that queue.
202+
203+
// Test-only one-shot fault injection: simulate a
204+
// dispatch_source_create failure so callers can exercise the
205+
// "watchdog could not be armed -> bail out" path without having to
206+
// actually exhaust dispatch sources.
207+
if (_forceNextArmFailureForTesting) {
208+
_forceNextArmFailureForTesting = NO;
209+
MTR_LOG_ERROR("%@ post-PASE watchdog arm forced to fail (testing)", self);
210+
return NO;
211+
}
212+
213+
// Production interval lives in kMTRPostPASEWatchdogInterval; tests
214+
// can shorten it via setPostPASEWatchdogIntervalForTesting:.
215+
NSTimeInterval intervalSeconds = (sMTRPostPASEWatchdogIntervalForTesting > 0)
216+
? sMTRPostPASEWatchdogIntervalForTesting
217+
: kMTRPostPASEWatchdogInterval;
218+
219+
if (_postPASEWatchdog) {
220+
// Already armed; should not happen but defend against double-arm.
221+
return YES;
222+
}
223+
224+
_postPASEWatchdog = dispatch_source_create(DISPATCH_SOURCE_TYPE_TIMER, 0, 0, _delegateQueue);
225+
if (!_postPASEWatchdog) {
226+
MTR_LOG_ERROR("%@ failed to create post-PASE watchdog timer", self);
227+
return NO;
228+
}
229+
230+
dispatch_source_set_timer(_postPASEWatchdog,
231+
dispatch_time(DISPATCH_TIME_NOW, (int64_t) (intervalSeconds * NSEC_PER_SEC)),
232+
DISPATCH_TIME_FOREVER,
233+
1 * NSEC_PER_SEC);
234+
235+
mtr_weakify(self);
236+
dispatch_source_set_event_handler(_postPASEWatchdog, ^{
237+
mtr_strongify(self);
238+
if (!self) {
239+
return;
240+
}
241+
// Late-fire guard: if the client has already advanced past the
242+
// post-PASE waiting state (e.g. the client's commissionNodeWithID:
243+
// succeeded and the controller flipped isWaitingAfterPASEEstablished
244+
// back to NO) but the timer block had already been posted to the
245+
// queue ahead of the cancel, we must NOT route a spurious
246+
// CHIP_ERROR_TIMEOUT through the failure path -- doing so would
247+
// tear down a commissioning that is now legitimately in flight.
248+
if (!self.isWaitingAfterPASEEstablished) {
249+
return;
250+
}
251+
MTR_LOG_ERROR("%@ post-PASE watchdog fired -- client never advanced past paseSessionEstablishmentComplete; cancelling commissioning to release controller", self);
252+
[self _firePostPASEWatchdog];
253+
});
254+
255+
dispatch_resume(_postPASEWatchdog);
256+
return YES;
257+
}
258+
259+
- (void)_cancelPostPASEWatchdog
260+
{
261+
// All access to _postPASEWatchdog is serialized on _delegateQueue
262+
// in production paths; tests may invoke this helper directly so we
263+
// do not assert.
264+
265+
if (_postPASEWatchdog) {
266+
dispatch_source_cancel(_postPASEWatchdog);
267+
_postPASEWatchdog = nil;
268+
}
269+
}
270+
271+
- (void)_firePostPASEWatchdog
272+
{
273+
// The handler is delivered on _delegateQueue. Self-cancel by asking the
274+
// controller to stop commissioning for our commissioningID (which routes
275+
// through StopPairing in both internal and non-internal flows), then
276+
// notify the delegate via the standard error path so any UI / client
277+
// state can settle back to "no commissioning in flight".
278+
279+
[self _cancelPostPASEWatchdog];
280+
281+
MTRDeviceController_Concrete * strongController = _controller;
282+
if (strongController) {
283+
// Use stopCommissioning:forCommissioningID: rather than
284+
// cancelCommissioningForNodeID:. The latter only does anything for
285+
// the legacy isInternallyCreated:YES flow, but the watchdog is
286+
// armed for any client that implements
287+
// commissioning:paseSessionEstablishmentComplete:, regardless of
288+
// creation path. stopCommissioning: invokes StopPairing on the
289+
// CHIP layer in both cases so the controller actually releases
290+
// its in-progress commissioning slot.
291+
BOOL stopped = [strongController stopCommissioning:self forCommissioningID:_commissioningID];
292+
if (!stopped) {
293+
// Commissioning was already replaced by a successor; do NOT
294+
// surface a spurious CHIP_ERROR_TIMEOUT into a now-unrelated
295+
// commissioning. The replacement path already cleared our
296+
// delegate via commissioningDone:, so just log and bail.
297+
MTR_LOG("%@ post-PASE watchdog fired but commissioning was already replaced; suppressing spurious timeout", self);
298+
return;
299+
}
300+
}
301+
302+
[self _dispatchCommissioningError:[MTRError errorForCHIPErrorCode:CHIP_ERROR_TIMEOUT]];
303+
}
304+
107305
static inline void emitMetricForSetupPayload(NSString * payload)
108306
{
109307
std::vector<SetupPayload> payloads;
@@ -169,6 +367,18 @@ - (void)startWithController:(MTRDeviceController *)controller
169367

170368
- (BOOL)stop
171369
{
370+
// Clear the waiting-after-PASE flag synchronously so that any in-flight
371+
// watchdog event_handler block already sitting on _delegateQueue is
372+
// suppressed by the late-fire guard before its cancel arrives. Then
373+
// bounce the cancel of _postPASEWatchdog itself onto _delegateQueue so
374+
// we don't race with the timer's event_handler.
375+
os_unfair_lock_lock(&_stateLock);
376+
_isWaitingAfterPASEEstablished = NO;
377+
os_unfair_lock_unlock(&_stateLock);
378+
dispatch_async(_delegateQueue, ^{
379+
[self _cancelPostPASEWatchdog];
380+
});
381+
172382
MTRDeviceController_Concrete * strongController = _controller;
173383
if (!strongController) {
174384
// Nothing to do; controller is gone, so we are stopped no matter what.
@@ -208,6 +418,18 @@ - (void)_dispatchCommissioningError:(NSError *)error withMetrics:(MTRMetrics *)m
208418

209419
- (void)_dispatchCommissioningError:(NSError *)error forCommissioningID:(NSNumber *)commissioningID withMetrics:(MTRMetrics *)metrics
210420
{
421+
// Any terminal error path implies the watchdog (if armed) no longer needs
422+
// to fire. Clear the waiting-after-PASE flag synchronously so that any
423+
// already-enqueued watchdog event_handler is suppressed by the late-fire
424+
// guard, then bounce the dispatch_source cancel itself onto _delegateQueue
425+
// (this method may be invoked on the Matter / CHIP thread).
426+
os_unfair_lock_lock(&_stateLock);
427+
_isWaitingAfterPASEEstablished = NO;
428+
os_unfair_lock_unlock(&_stateLock);
429+
dispatch_async(_delegateQueue, ^{
430+
[self _cancelPostPASEWatchdog];
431+
});
432+
211433
MTRDeviceController_Concrete * strongController = _controller;
212434

213435
MTR_LOG("%@ Device commissioning failed with controller %@ metrics %@", self, strongController, metrics);
@@ -263,7 +485,32 @@ - (void)controller:(MTRDeviceController *)controller commissioningSessionEstabli
263485
// commissioning ourselves if not.
264486
if ([strongDelegate respondsToSelector:@selector(commissioning:paseSessionEstablishmentComplete:)]) {
265487
dispatch_async(_delegateQueue, ^{
266-
self.isWaitingAfterPASEEstablished = YES;
488+
// Arm the watchdog inside this block so that the arm and the
489+
// paired ivar write below are both serialized on _delegateQueue
490+
// with all subsequent _postPASEWatchdog mutations. A client
491+
// that handles paseSessionEstablishmentComplete: but then
492+
// loses interest -- without ever calling commissionNodeWithID:
493+
// or cancelCommissioningForNodeID: -- would otherwise
494+
// permanently wedge the controller's commissioning state.
495+
BOOL armed = [self _armPostPASEWatchdog];
496+
if (!armed) {
497+
// Failing to arm the watchdog means we cannot bound the
498+
// post-PASE wait at all -- that is the very wedge this
499+
// logic exists to prevent. Surface a no-memory error so
500+
// the client tears down rather than going into the
501+
// unbounded wait state.
502+
MTR_LOG_ERROR("%@ failed to arm post-PASE watchdog; aborting commissioning to avoid an unbounded wait", self);
503+
[self _dispatchCommissioningError:[MTRError errorForCHIPErrorCode:CHIP_ERROR_NO_MEMORY]];
504+
return;
505+
}
506+
// We are already on _delegateQueue, so write the backing ivar
507+
// directly under _stateLock rather than going through the
508+
// setter (which would re-enter the cancel path on a NO->YES
509+
// transition only -- but using the lock here keeps the
510+
// reader/writer contract uniform).
511+
os_unfair_lock_lock(&self->_stateLock);
512+
self->_isWaitingAfterPASEEstablished = YES;
513+
os_unfair_lock_unlock(&self->_stateLock);
267514
[strongDelegate commissioning:self paseSessionEstablishmentComplete:error];
268515
});
269516

@@ -502,3 +749,28 @@ - (void)deviceAttestationCompletedForController:(MTRDeviceController *)controlle
502749
}
503750

504751
@end
752+
753+
#pragma mark - PostPASEWatchdogTesting
754+
755+
@implementation MTRCommissioningOperation (PostPASEWatchdogTesting)
756+
757+
+ (void)setPostPASEWatchdogIntervalForTesting:(NSTimeInterval)interval
758+
{
759+
// Negative values are nonsensical; clamp to 0 (== "use production
760+
// interval"). The override is read on _delegateQueue when arming a
761+
// new watchdog; tests are expected to set this before kicking off
762+
// the commissioning flow they want to observe.
763+
sMTRPostPASEWatchdogIntervalForTesting = (interval > 0) ? interval : 0;
764+
}
765+
766+
- (void)setForceNextArmFailureForTesting:(BOOL)force
767+
{
768+
// _forceNextArmFailureForTesting is read/written only on
769+
// _delegateQueue (matching _postPASEWatchdog). Bounce onto that
770+
// queue so the test does not have to know which queue it is on.
771+
dispatch_async(_delegateQueue, ^{
772+
self->_forceNextArmFailureForTesting = force;
773+
});
774+
}
775+
776+
@end

src/darwin/Framework/CHIP/MTRCommissioningOperation_Internal.h

Lines changed: 18 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -38,14 +38,30 @@ NS_ASSUME_NONNULL_BEGIN
3838

3939
@property (nonatomic, readonly, assign) BOOL isInternallyCreated;
4040

41+
// The commissioning identifier this operation was constructed with (the
42+
// random "future node ID" we use to track the commissioning flow before the
43+
// commissionee has been assigned a real node ID). Exposed so callers like
44+
// MTRDeviceController_Concrete that need to drive cancellation against this
45+
// commissioning have something to pass to StopPairing without reaching into
46+
// private ivars.
47+
@property (nonatomic, readonly, copy) NSNumber * commissioningID;
48+
4149
// True if the commissioning is waiting to resume after PASE has been
4250
// established and the delegate chose to be notified about that.
4351
//
4452
// This is currently only true if isInternallyCreated, and is readwrite because
4553
// MTRDeviceController_Concrete helps maintain this state.
4654
//
47-
// This property should generally be written on client queues only, not on the
48-
// Matter queue.
55+
// Threading: reads and writes go through the property, which is internally
56+
// guarded by an os_unfair_lock so the value the setter just wrote is
57+
// observable to a synchronous reader on any queue. The paired post-PASE
58+
// watchdog timer (an internal implementation detail) is owned by the
59+
// operation's _delegateQueue; the setter clears the flag synchronously and
60+
// bounces the watchdog teardown side effect onto _delegateQueue. The
61+
// watchdog event_handler consults isWaitingAfterPASEEstablished as a
62+
// late-fire guard to avoid a spurious CHIP_ERROR_TIMEOUT when the client
63+
// has legitimately advanced past the post-PASE waiting state but a timer
64+
// block was already enqueued ahead of the cancel.
4965
@property (nonatomic, readwrite, assign) BOOL isWaitingAfterPASEEstablished;
5066

5167
@end

0 commit comments

Comments
 (0)