Skip to content

Commit 8227502

Browse files
committed
[MTRCommissioning] Add watchdog for client failure to advance past PASE
If a client implements commissioning:paseSessionEstablishmentComplete: but then drops the flow without ever calling commissionNodeWithID: or cancelCommissioningForNodeID:, the MTRCommissioningOperation sits in isWaitingAfterPASEEstablished = YES forever. MTRDeviceController_Concrete keeps the operation alive via currentInternalCommissioning, so every subsequent commissioning attempt hits the busy check in startWithController and fails with CHIP_ERROR_BUSY (0xDB). The user's symptom is "Matter accessories refuse to connect" indefinitely until the daemon (or app) is restarted. Add a 5-minute dispatch_source_t watchdog armed when we set isWaitingAfterPASEEstablished = YES and cancelled when: - the property is set back to NO (via the new custom setter that fires whenever commissionNodeWithID: succeeds on the controller), - the operation is stopped, - any terminal _dispatchCommissioningError path runs, - or the operation deallocates. On expiry, the watchdog stops the in-flight commissioning via stopCommissioning:forCommissioningID: (which routes through StopPairing in both the legacy isInternallyCreated:YES flow AND the public-API flow) and routes a CHIP_ERROR_TIMEOUT through the standard _dispatchCommissioningError path so client state can settle back to "no commissioning in flight". Five minutes is comfortably longer than the commissionee fail-safe (60-180 seconds) but short enough that the user is unblocked within a single sit-down troubleshooting session. Also add a defensive secondary patch in setupCommissioningSessionWithPayload:newNodeID:error: -- if a stale internally-created commissioning is still parked at the post-PASE waiting state with a *different* setup payload, treat the new call as an implicit cancel of the previous one. Without this, the new call would have failed with CHIP_ERROR_BUSY because the controller still considered the prior commissioning in flight. The implicit cancel propagates any StopPairing failure back to the caller rather than silently starting a new commissioning that is about to hit the same busy state. The implicit cancel needs the prior operation's commissioningID, so a small read-only commissioningID accessor is added to MTRCommissioningOperation_Internal.h. Threading: - The isWaitingAfterPASEEstablished setter writes the flag synchronously under an os_unfair_lock so a caller that immediately reads it back observes the updated value, then bounces only the watchdog teardown side effect onto _delegateQueue (which is the queue that owns the dispatch_source_t timer). - The watchdog event_handler consults isWaitingAfterPASEEstablished as a late-fire guard: if the client has legitimately advanced past the post-PASE waiting state but a timer block was already posted to the queue ahead of the cancel, the handler swallows the late fire instead of routing a spurious CHIP_ERROR_TIMEOUT into a now-active flow. - -stop runs stopCommissioning:forCommissioningID: BEFORE clearing isWaitingAfterPASEEstablished. A naive clear-then-stop ordering would let the watchdog event_handler observe (flag=NO, currentCommissioning=self) and suppress legitimate stop-driven cleanup; the controller-transition-first ordering ensures either the late-fire guard correctly suppresses (when the controller transition has already happened) or the fire path runs idempotently (since duplicate stopCommissioning: returns NO and the watchdog suppresses the timeout via the "replaced" check). - _dispatchCommissioningError clears isWaitingAfterPASEEstablished synchronously before bouncing the watchdog cancel, so a timer event already enqueued on _delegateQueue is suppressed by the late-fire guard. - The fire path uses stopCommissioning:forCommissioningID: rather than cancelCommissioningForNodeID: so StopPairing runs on the CHIP layer in both the isInternallyCreated:YES and isInternallyCreated:NO flows. In the fire path we distinguish two failure modes for the BOOL return: "replaced" (currentCommissioning != self -- a successor has taken the slot, suppress the timeout) versus "genuine StopPairing failure" (currentCommissioning == self but the C++ StopPairing returned an error, e.g. because the commissionee fail-safe (60-180s) has already expired before the 5-minute watchdog fires). In the genuine-failure case we MUST still escalate the timeout to the client -- otherwise the very wedge the watchdog exists to break is left in place silently. - If dispatch_source_create fails when arming the watchdog, the establishment-done path now routes a CHIP_ERROR_NO_MEMORY through the standard failure path rather than entering the unbounded wait state silently.
1 parent efc29a8 commit 8227502

6 files changed

Lines changed: 3856 additions & 5 deletions

File tree

src/darwin/Framework/CHIP/MTRCommissioningOperation.mm

Lines changed: 309 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -16,10 +16,12 @@
1616

1717
#import <Foundation/Foundation.h>
1818
#import <Matter/MTRDeviceAttestationDelegate.h>
19+
#import <os/lock.h>
1920

2021
#import "MTRCommissioningDelegate_Internal.h"
2122
#import "MTRCommissioningOperation.h"
2223
#import "MTRCommissioningOperation_Internal.h"
24+
#import "MTRCommissioningOperation_Test.h"
2325
#import "MTRDefines_Internal.h"
2426
#import "MTRDeviceControllerDelegate_Internal.h"
2527
#import "MTRDeviceController_Concrete.h"
@@ -36,6 +38,19 @@
3638
using namespace chip;
3739
using namespace chip::Tracing::DarwinFramework;
3840

41+
// Production interval after which the post-PASE watchdog fires. Five
42+
// minutes is comfortably longer than the commissionee fail-safe (60-180
43+
// seconds) but short enough that the user gets unblocked well within a
44+
// single sit-down troubleshooting session. Exposed via
45+
// MTRCommissioningOperation_Test.h so tests can assert against the same
46+
// value production uses.
47+
extern const NSTimeInterval kMTRPostPASEWatchdogInterval = 5 * 60;
48+
49+
// Test-only override for kMTRPostPASEWatchdogInterval. Non-zero values
50+
// take effect for all subsequently armed watchdogs in the current
51+
// process; production reads kMTRPostPASEWatchdogInterval directly.
52+
static NSTimeInterval sMTRPostPASEWatchdogIntervalForTesting = 0;
53+
3954
@interface MTRCommissioningOperationDeviceAttestationDelegate : NSObject <MTRDeviceAttestationDelegate>
4055
@property (nonatomic, weak) MTRCommissioningOperation * commissioningOperation;
4156
@end
@@ -49,6 +64,67 @@ @implementation MTRCommissioningOperation {
4964
id<MTRCommissioningDelegate> __weak _delegate;
5065
dispatch_queue_t _delegateQueue;
5166
MTRDeviceController_Concrete * __weak _controller;
67+
// Watchdog timer armed once PASE has been established to bound how long we
68+
// hold the controller in the "in-progress commissioning" state if the
69+
// client drops the flow without ever calling commissionNodeWithID: or
70+
// cancelCommissioningForNodeID:. Without this watchdog,
71+
// _currentInternalCommissioning is never cleared and every subsequent
72+
// commissioning attempt hits CHIP_ERROR_BUSY (0xDB) until the process
73+
// restarts.
74+
//
75+
// _postPASEWatchdog is read/written only on _delegateQueue.
76+
dispatch_source_t _postPASEWatchdog;
77+
// Backing ivar for the isWaitingAfterPASEEstablished property. Reads and
78+
// writes are serialized by _stateLock so callers from any queue see a
79+
// consistent value immediately after the setter returns; the watchdog
80+
// teardown that runs as a side effect of clearing the flag is bounced
81+
// onto _delegateQueue (which is the only queue allowed to touch
82+
// _postPASEWatchdog).
83+
BOOL _isWaitingAfterPASEEstablished;
84+
os_unfair_lock _stateLock;
85+
// Test-only one-shot fault injection. When YES, the next call to
86+
// _armPostPASEWatchdog will simulate a dispatch_source_create failure
87+
// (return NO without arming the timer) and clear the flag. Read and
88+
// written only on _delegateQueue, mirroring _postPASEWatchdog itself.
89+
BOOL _forceNextArmFailureForTesting;
90+
}
91+
92+
@synthesize commissioningID = _commissioningID;
93+
94+
- (BOOL)isWaitingAfterPASEEstablished
95+
{
96+
os_unfair_lock_lock(&_stateLock);
97+
BOOL value = _isWaitingAfterPASEEstablished;
98+
os_unfair_lock_unlock(&_stateLock);
99+
return value;
100+
}
101+
102+
- (void)setIsWaitingAfterPASEEstablished:(BOOL)isWaitingAfterPASEEstablished
103+
{
104+
// Write the flag synchronously under _stateLock so that any caller that
105+
// immediately reads back isWaitingAfterPASEEstablished observes the
106+
// updated value (callers like MTRDeviceController_Concrete depend on
107+
// this caller-visible ordering). Only the watchdog teardown side
108+
// effect is bounced onto _delegateQueue, since _postPASEWatchdog itself
109+
// is _delegateQueue-owned.
110+
BOOL changedToNo = NO;
111+
os_unfair_lock_lock(&_stateLock);
112+
if (_isWaitingAfterPASEEstablished != isWaitingAfterPASEEstablished) {
113+
_isWaitingAfterPASEEstablished = isWaitingAfterPASEEstablished;
114+
changedToNo = !isWaitingAfterPASEEstablished;
115+
}
116+
os_unfair_lock_unlock(&_stateLock);
117+
118+
if (changedToNo) {
119+
// The client has either started commissioning (via
120+
// commissionNodeWithID:) or moved on; the watchdog no longer needs
121+
// to fire. The cancel must happen on _delegateQueue because that
122+
// is where _postPASEWatchdog is owned and where the timer's
123+
// event_handler is delivered.
124+
dispatch_async(_delegateQueue, ^{
125+
[self _cancelPostPASEWatchdog];
126+
});
127+
}
52128
}
53129

54130
- (instancetype)initWithParameters:(MTRCommissioningParameters *)parameters
@@ -97,13 +173,158 @@ - (instancetype)initWithParameters:(MTRCommissioningParameters *)parameters
97173
_isInternallyCreated = isInternallyCreated;
98174
_delegate = delegate;
99175
_delegateQueue = queue;
176+
_stateLock = OS_UNFAIR_LOCK_INIT;
100177

101178
// Don't hold on to the provided attestation delegate, which we never use.
102179
_parameters.deviceAttestationDelegate = nil;
103180

104181
return self;
105182
}
106183

184+
- (void)dealloc
185+
{
186+
// Defensive: ensure the watchdog timer doesn't outlive us. In all
187+
// non-dealloc code paths, _postPASEWatchdog is read/written only on
188+
// _delegateQueue. In dealloc itself the timer source's event_handler
189+
// cannot still be in flight (the handler retains self via mtr_strongify
190+
// for its duration), so a synchronous cancel here is safe regardless
191+
// of which queue is executing dealloc.
192+
if (_postPASEWatchdog) {
193+
dispatch_source_cancel(_postPASEWatchdog);
194+
_postPASEWatchdog = nil;
195+
}
196+
}
197+
198+
- (BOOL)_armPostPASEWatchdog
199+
{
200+
// _postPASEWatchdog is owned by _delegateQueue; production callers
201+
// invoke this from inside that queue.
202+
203+
// Test-only one-shot fault injection: simulate a
204+
// dispatch_source_create failure so callers can exercise the
205+
// "watchdog could not be armed -> bail out" path without having to
206+
// actually exhaust dispatch sources.
207+
if (_forceNextArmFailureForTesting) {
208+
_forceNextArmFailureForTesting = NO;
209+
MTR_LOG_ERROR("%@ post-PASE watchdog arm forced to fail (testing)", self);
210+
return NO;
211+
}
212+
213+
// Production interval lives in kMTRPostPASEWatchdogInterval; tests
214+
// can shorten it via setPostPASEWatchdogIntervalForTesting:.
215+
NSTimeInterval intervalSeconds = (sMTRPostPASEWatchdogIntervalForTesting > 0)
216+
? sMTRPostPASEWatchdogIntervalForTesting
217+
: kMTRPostPASEWatchdogInterval;
218+
219+
if (_postPASEWatchdog) {
220+
// Already armed; should not happen but defend against double-arm.
221+
return YES;
222+
}
223+
224+
_postPASEWatchdog = dispatch_source_create(DISPATCH_SOURCE_TYPE_TIMER, 0, 0, _delegateQueue);
225+
if (!_postPASEWatchdog) {
226+
MTR_LOG_ERROR("%@ failed to create post-PASE watchdog timer", self);
227+
return NO;
228+
}
229+
230+
dispatch_source_set_timer(_postPASEWatchdog,
231+
dispatch_time(DISPATCH_TIME_NOW, (int64_t) (intervalSeconds * NSEC_PER_SEC)),
232+
DISPATCH_TIME_FOREVER,
233+
1 * NSEC_PER_SEC);
234+
235+
mtr_weakify(self);
236+
dispatch_source_set_event_handler(_postPASEWatchdog, ^{
237+
mtr_strongify(self);
238+
if (!self) {
239+
return;
240+
}
241+
// Late-fire guard: if the client has already advanced past the
242+
// post-PASE waiting state (e.g. the client's commissionNodeWithID:
243+
// succeeded and the controller flipped isWaitingAfterPASEEstablished
244+
// back to NO) but the timer block had already been posted to the
245+
// queue ahead of the cancel, we must NOT route a spurious
246+
// CHIP_ERROR_TIMEOUT through the failure path -- doing so would
247+
// tear down a commissioning that is now legitimately in flight.
248+
if (!self.isWaitingAfterPASEEstablished) {
249+
return;
250+
}
251+
MTR_LOG_ERROR("%@ post-PASE watchdog fired -- client never advanced past paseSessionEstablishmentComplete; cancelling commissioning to release controller", self);
252+
[self _firePostPASEWatchdog];
253+
});
254+
255+
dispatch_resume(_postPASEWatchdog);
256+
return YES;
257+
}
258+
259+
- (void)_cancelPostPASEWatchdog
260+
{
261+
// All access to _postPASEWatchdog is serialized on _delegateQueue
262+
// in production paths; tests may invoke this helper directly so we
263+
// do not assert.
264+
265+
if (_postPASEWatchdog) {
266+
dispatch_source_cancel(_postPASEWatchdog);
267+
_postPASEWatchdog = nil;
268+
}
269+
}
270+
271+
- (void)_firePostPASEWatchdog
272+
{
273+
// The handler is delivered on _delegateQueue. Self-cancel by asking the
274+
// controller to stop commissioning for our commissioningID (which routes
275+
// through StopPairing in both internal and non-internal flows), then
276+
// notify the delegate via the standard error path so any UI / client
277+
// state can settle back to "no commissioning in flight".
278+
279+
[self _cancelPostPASEWatchdog];
280+
281+
MTRDeviceController_Concrete * strongController = _controller;
282+
if (strongController) {
283+
// Distinguish two failure modes for stopCommissioning:forCommissioningID::
284+
//
285+
// (1) "Replaced": the commissioning slot has already been taken by a
286+
// successor MTRCommissioningOperation, so currentCommissioning
287+
// on the controller is no longer self. In that case we must NOT
288+
// dispatch a timeout, because the wedge has already been resolved
289+
// by the replacement and a timeout dispatched here would tear
290+
// down the unrelated successor's delegate state.
291+
//
292+
// (2) "Genuine StopPairing failure": currentCommissioning is still
293+
// self but the C++ StopPairing returned a non-OK error (e.g.
294+
// CHIP_ERROR_INVALID_DEVICE_DESCRIPTOR if the commissionee's
295+
// fail-safe (60-180s) has already expired before the 5-minute
296+
// watchdog fires). In this case the wedge persists, so we MUST
297+
// still escalate the timeout to the client -- otherwise the very
298+
// wedge the watchdog exists to break is left in place silently.
299+
//
300+
// stopCommissioning:forCommissioningID: returns NO in BOTH cases, so we
301+
// must distinguish them by checking currentCommissioning ourselves.
302+
BOOL replaced = (strongController.currentCommissioning != self);
303+
if (replaced) {
304+
// Replacement path already cleared our delegate via
305+
// commissioningDone:, so just log and bail.
306+
MTR_LOG("%@ post-PASE watchdog fired but commissioning was already replaced; suppressing spurious timeout", self);
307+
return;
308+
}
309+
310+
// Use stopCommissioning:forCommissioningID: rather than
311+
// cancelCommissioningForNodeID:. The latter only does anything for
312+
// the legacy isInternallyCreated:YES flow, but the watchdog is
313+
// armed for any client that implements
314+
// commissioning:paseSessionEstablishmentComplete:, regardless of
315+
// creation path. stopCommissioning: invokes StopPairing on the
316+
// CHIP layer in both cases so the controller actually releases
317+
// its in-progress commissioning slot. We deliberately ignore the
318+
// return value here: even if StopPairing failed at the CHIP layer
319+
// (e.g. fail-safe already expired), we still need to surface the
320+
// timeout to the client so it can settle back to "no commissioning
321+
// in flight" rather than waiting forever.
322+
(void) [strongController stopCommissioning:self forCommissioningID:_commissioningID];
323+
}
324+
325+
[self _dispatchCommissioningError:[MTRError errorForCHIPErrorCode:CHIP_ERROR_TIMEOUT]];
326+
}
327+
107328
static inline void emitMetricForSetupPayload(NSString * payload)
108329
{
109330
std::vector<SetupPayload> payloads;
@@ -170,12 +391,35 @@ - (void)startWithController:(MTRDeviceController *)controller
170391
- (BOOL)stop
171392
{
172393
MTRDeviceController_Concrete * strongController = _controller;
394+
BOOL stopResult;
173395
if (!strongController) {
174396
// Nothing to do; controller is gone, so we are stopped no matter what.
175-
return NO;
397+
stopResult = NO;
398+
} else {
399+
stopResult = [strongController stopCommissioning:self forCommissioningID:_commissioningID];
176400
}
177401

178-
return [strongController stopCommissioning:self forCommissioningID:_commissioningID];
402+
// Clear the waiting-after-PASE flag synchronously AFTER the controller
403+
// transition. If we cleared it before stopCommissioning:, the watchdog's
404+
// late-fire guard could observe (flag=NO, currentCommissioning=self) and
405+
// suppress a fire that legitimately needed to drive cleanup. By clearing
406+
// after the controller transition, either:
407+
// - the watchdog fires before our clear: it sees flag=YES and runs its
408+
// normal fire path, which is fine because stopCommissioning above is
409+
// idempotent (a duplicate stopCommissioning: returns NO and the
410+
// watchdog suppresses the timeout via the "replaced" check), OR
411+
// - the watchdog fires after our clear: the late-fire guard suppresses
412+
// it, also fine, since we have already done the controller transition.
413+
// Then bounce the cancel of _postPASEWatchdog itself onto _delegateQueue
414+
// so we don't race with the timer's event_handler.
415+
os_unfair_lock_lock(&_stateLock);
416+
_isWaitingAfterPASEEstablished = NO;
417+
os_unfair_lock_unlock(&_stateLock);
418+
dispatch_async(_delegateQueue, ^{
419+
[self _cancelPostPASEWatchdog];
420+
});
421+
422+
return stopResult;
179423
}
180424

181425
- (void)_earlyFailCommissioning:(CHIP_ERROR)error
@@ -208,6 +452,18 @@ - (void)_dispatchCommissioningError:(NSError *)error withMetrics:(MTRMetrics *)m
208452

209453
- (void)_dispatchCommissioningError:(NSError *)error forCommissioningID:(NSNumber *)commissioningID withMetrics:(MTRMetrics *)metrics
210454
{
455+
// Any terminal error path implies the watchdog (if armed) no longer needs
456+
// to fire. Clear the waiting-after-PASE flag synchronously so that any
457+
// already-enqueued watchdog event_handler is suppressed by the late-fire
458+
// guard, then bounce the dispatch_source cancel itself onto _delegateQueue
459+
// (this method may be invoked on the Matter / CHIP thread).
460+
os_unfair_lock_lock(&_stateLock);
461+
_isWaitingAfterPASEEstablished = NO;
462+
os_unfair_lock_unlock(&_stateLock);
463+
dispatch_async(_delegateQueue, ^{
464+
[self _cancelPostPASEWatchdog];
465+
});
466+
211467
MTRDeviceController_Concrete * strongController = _controller;
212468

213469
MTR_LOG("%@ Device commissioning failed with controller %@ metrics %@", self, strongController, metrics);
@@ -263,7 +519,32 @@ - (void)controller:(MTRDeviceController *)controller commissioningSessionEstabli
263519
// commissioning ourselves if not.
264520
if ([strongDelegate respondsToSelector:@selector(commissioning:paseSessionEstablishmentComplete:)]) {
265521
dispatch_async(_delegateQueue, ^{
266-
self.isWaitingAfterPASEEstablished = YES;
522+
// Arm the watchdog inside this block so that the arm and the
523+
// paired ivar write below are both serialized on _delegateQueue
524+
// with all subsequent _postPASEWatchdog mutations. A client
525+
// that handles paseSessionEstablishmentComplete: but then
526+
// loses interest -- without ever calling commissionNodeWithID:
527+
// or cancelCommissioningForNodeID: -- would otherwise
528+
// permanently wedge the controller's commissioning state.
529+
BOOL armed = [self _armPostPASEWatchdog];
530+
if (!armed) {
531+
// Failing to arm the watchdog means we cannot bound the
532+
// post-PASE wait at all -- that is the very wedge this
533+
// logic exists to prevent. Surface a no-memory error so
534+
// the client tears down rather than going into the
535+
// unbounded wait state.
536+
MTR_LOG_ERROR("%@ failed to arm post-PASE watchdog; aborting commissioning to avoid an unbounded wait", self);
537+
[self _dispatchCommissioningError:[MTRError errorForCHIPErrorCode:CHIP_ERROR_NO_MEMORY]];
538+
return;
539+
}
540+
// We are already on _delegateQueue, so write the backing ivar
541+
// directly under _stateLock rather than going through the
542+
// setter (which would re-enter the cancel path on a NO->YES
543+
// transition only -- but using the lock here keeps the
544+
// reader/writer contract uniform).
545+
os_unfair_lock_lock(&self->_stateLock);
546+
self->_isWaitingAfterPASEEstablished = YES;
547+
os_unfair_lock_unlock(&self->_stateLock);
267548
[strongDelegate commissioning:self paseSessionEstablishmentComplete:error];
268549
});
269550

@@ -502,3 +783,28 @@ - (void)deviceAttestationCompletedForController:(MTRDeviceController *)controlle
502783
}
503784

504785
@end
786+
787+
#pragma mark - PostPASEWatchdogTesting
788+
789+
@implementation MTRCommissioningOperation (PostPASEWatchdogTesting)
790+
791+
+ (void)setPostPASEWatchdogIntervalForTesting:(NSTimeInterval)interval
792+
{
793+
// Negative values are nonsensical; clamp to 0 (== "use production
794+
// interval"). The override is read on _delegateQueue when arming a
795+
// new watchdog; tests are expected to set this before kicking off
796+
// the commissioning flow they want to observe.
797+
sMTRPostPASEWatchdogIntervalForTesting = (interval > 0) ? interval : 0;
798+
}
799+
800+
- (void)setForceNextArmFailureForTesting:(BOOL)force
801+
{
802+
// _forceNextArmFailureForTesting is read/written only on
803+
// _delegateQueue (matching _postPASEWatchdog). Bounce onto that
804+
// queue so the test does not have to know which queue it is on.
805+
dispatch_async(_delegateQueue, ^{
806+
self->_forceNextArmFailureForTesting = force;
807+
});
808+
}
809+
810+
@end

0 commit comments

Comments
 (0)