Skip to content

Commit 3038774

Browse files
woody-appleterafin
authored andcommitted
[MTRCommissioning] Add watchdog for client failure to advance past PASE
If a client (CHIP framework consumer) implements commissioning:paseSessionEstablishmentComplete: but then never calls commissionNodeWithID: or cancelCommissioningForNodeID:, the MTRCommissioningOperation sits in isWaitingAfterPASEEstablished = YES forever. MTRDeviceController_Concrete keeps the operation alive via currentInternalCommissioning, so every subsequent commissioning attempt hits the busy check in startWithController and fails with CHIP_ERROR_BUSY. User-visible symptom: Matter accessories refuse to commission until the daemon (or app) is restarted. Add a 5-minute dispatch_source_t watchdog armed when isWaitingAfterPASEEstablished is set to YES, cancelled on legitimate advance, stop, or terminal error. On expiry, calls stopCommissioning:forCommissioningID: and routes CHIP_ERROR_TIMEOUT through the standard _dispatchCommissioningError path so client state can settle. Clear isWaitingAfterPASEEstablished BEFORE the controller stop call so -_cancelCommissioning does not synthesize a CANCELLED OnCommissioningComplete -- the wedge IS that no notifications arrived, so suppressing the synthetic CANCELLED avoids a redundant MATTER_LOG_METRIC_END(kMetricDeviceCommissioning) and a spurious CANCELLED on _delegateQueue ahead of the TIMEOUT we want to deliver. Defensive secondary fix in setupCommissioningSessionWithPayload:newNodeID:error:: if a stale internally-created commissioning is still parked at the post-PASE waiting state with a different setup payload, treat the new call as an implicit cancel of the previous one rather than failing the new call with BUSY. Perform the implicit cancel BEFORE MATTER_LOG_METRIC_BEGIN(kMetricDeviceCommissioning) and the metrics reset for the new attempt -- otherwise the synthetic CANCELLED end emitted by the bridge from the stale operation would close out the new attempt's metric, leaving it without a matching begin/end and skewing telemetry. This also keeps the cancel-failure early return from leaving kMetricDeviceCommissioning open across the returned NO. Threading hardening from review: * atomic test-only watchdog interval override so concurrent TSan setters on any queue race-free with the production read in _armPostPASEWatchdog * clamp non-finite (NaN / +/-INFINITY) test interval overrides to 0 (use production interval) so the multiply-by-NSEC_PER_SEC and cast to int64_t in -_armPostPASEWatchdog stays defined behavior * atomic stop via stopCommissioningAtomically:forCommissioningID:, closing the TOCTOU window between currentCommissioning read and StopPairing * synchronous _cancelPostPASEWatchdog in -stop so tests can observe the teardown immediately after -stop returns Honor the @Property (copy) declaration on -setupPayload by [copy]'ing the input in the designated initializer; mutable inputs were aliased. Regression coverage: * test_PostPASEWatchdogIntervalResetToZeroRestoresProductionInterval -- pins that +setPostPASEWatchdogIntervalForTesting:0 actually restores the production interval after a non-zero override. * test_ForceArmFailureFlag_ConcurrentSettersAndArmer_NoTSan -- TSan smoke test for the per-instance _forceNextArmFailureForTesting flag. * test_WatchdogCancel_IsSynchronousOnSelfFire -- pins the round-2 sync-cancel contract for the watchdog's own fire path; -_firePostPASEWatchdog must synchronously tear down _postPASEWatchdog at the top of the handler.
1 parent 8a162c6 commit 3038774

8 files changed

Lines changed: 5121 additions & 40 deletions

File tree

src/darwin/Framework/CHIP/MTRCommissioningOperation.mm

Lines changed: 471 additions & 5 deletions
Large diffs are not rendered by default.

src/darwin/Framework/CHIP/MTRCommissioningOperation_Internal.h

Lines changed: 32 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -38,16 +38,46 @@ NS_ASSUME_NONNULL_BEGIN
3838

3939
@property (nonatomic, readonly, assign) BOOL isInternallyCreated;
4040

41+
// The commissioning identifier this operation was constructed with (the
42+
// random "future node ID" we use to track the commissioning flow before the
43+
// commissionee has been assigned a real node ID). Exposed so callers like
44+
// MTRDeviceController_Concrete that need to drive cancellation against this
45+
// commissioning have something to pass to StopPairing without reaching into
46+
// private ivars.
47+
@property (nonatomic, readonly, copy) NSNumber * commissioningID;
48+
4149
// True if the commissioning is waiting to resume after PASE has been
4250
// established and the delegate chose to be notified about that.
4351
//
4452
// This is currently only true if isInternallyCreated, and is readwrite because
4553
// MTRDeviceController_Concrete helps maintain this state.
4654
//
47-
// This property should generally be written on client queues only, not on the
48-
// Matter queue.
55+
// Threading: reads and writes go through the property, which is internally
56+
// guarded by an os_unfair_lock so the value the setter just wrote is
57+
// observable to a synchronous reader on any queue. The paired post-PASE
58+
// watchdog timer (an internal implementation detail) is owned by the
59+
// operation's _delegateQueue; the setter clears the flag synchronously and
60+
// bounces the watchdog teardown side effect onto _delegateQueue. The
61+
// watchdog event_handler consults isWaitingAfterPASEEstablished as a
62+
// late-fire guard to avoid a spurious CHIP_ERROR_TIMEOUT when the client
63+
// has legitimately advanced past the post-PASE waiting state but a timer
64+
// block was already enqueued ahead of the cancel.
4965
@property (nonatomic, readwrite, assign) BOOL isWaitingAfterPASEEstablished;
5066

67+
// Synchronously tear down the post-PASE watchdog timer. Safe to call from
68+
// any queue (the helper is internally lock-guarded). Exposed for
69+
// MTRDeviceController_Concrete's implicit-cancel path in
70+
// -setupCommissioningSessionWithPayload:newNodeID:error:, which must guarantee
71+
// that no leftover watchdog timer block can fire on _delegateQueue AFTER the
72+
// new attempt's MATTER_LOG_METRIC_BEGIN(kMetricDeviceCommissioning) -- the
73+
// timer's event_handler routes through -_dispatchCommissioningError, which
74+
// would otherwise end the NEW attempt's metric span and corrupt telemetry.
75+
//
76+
// The setter on isWaitingAfterPASEEstablished bounces the watchdog teardown
77+
// onto _delegateQueue via dispatch_async, so simply flipping the flag does
78+
// not provide a synchronous guarantee. This helper does.
79+
- (void)_syncCancelPostPASEWatchdog;
80+
5181
@end
5282

5383
NS_ASSUME_NONNULL_END
Lines changed: 52 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,52 @@
1+
/**
2+
* Copyright (c) 2026 Project CHIP Authors
3+
*
4+
* Licensed under the Apache License, Version 2.0 (the "License");
5+
* you may not use this file except in compliance with the License.
6+
* You may obtain a copy of the License at
7+
*
8+
* http://www.apache.org/licenses/LICENSE-2.0
9+
*
10+
* Unless required by applicable law or agreed to in writing, software
11+
* distributed under the License is distributed on an "AS IS" BASIS,
12+
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13+
* See the License for the specific language governing permissions and
14+
* limitations under the License.
15+
*/
16+
17+
#import <Foundation/Foundation.h>
18+
#import <Matter/MTRCommissioningOperation.h>
19+
20+
NS_ASSUME_NONNULL_BEGIN
21+
22+
// Production interval after which the post-PASE watchdog fires. Exposed so
23+
// tests can assert against the same constant the production code uses.
24+
extern const NSTimeInterval kMTRPostPASEWatchdogInterval;
25+
26+
@interface MTRCommissioningOperation (PostPASEWatchdogTesting)
27+
28+
// Overrides the post-PASE watchdog interval for all subsequently armed
29+
// watchdogs in the current process. Pass 0 to restore the production
30+
// interval (kMTRPostPASEWatchdogInterval). Intended only for tests that
31+
// need to exercise the watchdog without waiting five minutes.
32+
+ (void)setPostPASEWatchdogIntervalForTesting:(NSTimeInterval)interval;
33+
34+
// One-shot fault injection: if set to YES, the next call to the internal
35+
// _armPostPASEWatchdog helper on this instance will simulate a
36+
// dispatch_source_create failure and return NO without arming the timer.
37+
// The flag is cleared as soon as it is observed, so each call to this
38+
// setter injects exactly one failure.
39+
- (void)setForceNextArmFailureForTesting:(BOOL)force;
40+
41+
// Test-only accessor for the _postPASEWatchdog dispatch_source ivar.
42+
// Returns the source under _stateLock so tests can verify synchronous
43+
// cancel semantics after -stop: returns (the cancel in -stop is now
44+
// synchronous; tests can read this immediately after -stop returns and
45+
// expect nil). Do NOT use this to drive production logic; this exists
46+
// purely so test_WatchdogCancel_IsSynchronousOnStop can pin the
47+
// synchronous-cancel contract.
48+
- (nullable dispatch_source_t)postPASEWatchdogForTesting;
49+
50+
@end
51+
52+
NS_ASSUME_NONNULL_END

src/darwin/Framework/CHIP/MTRDeviceController_Concrete.h

Lines changed: 38 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -255,6 +255,44 @@ NS_ASSUME_NONNULL_BEGIN
255255
*/
256256
- (BOOL)stopCommissioning:(MTRCommissioningOperation *)commissioning forCommissioningID:(NSNumber *)commissioningID;
257257

258+
/**
259+
* Outcome of a -stopCommissioningAtomically:forCommissioningID: call.
260+
*
261+
* Disambiguates the two failure modes of -stopCommissioning:forCommissioningID:
262+
* (which collapses both to NO) so the post-PASE watchdog fire path can
263+
* decide whether to escalate a CHIP_ERROR_TIMEOUT to the client.
264+
*/
265+
typedef NS_ENUM(NSInteger, MTRStopCommissioningOutcome) {
266+
/// We were the current commissioning and StopPairing succeeded (or no
267+
/// StopPairing was needed because we had not yet reached the C++ layer).
268+
MTRStopCommissioningOutcomeStopped,
269+
/// Another MTRCommissioningOperation has already taken our slot on the
270+
/// controller (currentCommissioning != self). Caller must NOT dispatch
271+
/// a terminal error in this case -- the wedge has already been resolved
272+
/// by the replacement and a timeout dispatched here would tear down the
273+
/// unrelated successor's delegate state.
274+
MTRStopCommissioningOutcomeReplaced,
275+
/// We were the current commissioning but StopPairing failed at the CHIP
276+
/// layer (e.g. the commissionee fail-safe expired before the watchdog
277+
/// fired). Caller MUST still surface a terminal error so the client can
278+
/// settle back to "no commissioning in flight".
279+
MTRStopCommissioningOutcomeFailedStillCurrent,
280+
};
281+
282+
/**
283+
* Atomic variant of -stopCommissioning:forCommissioningID: that distinguishes
284+
* the "replaced" and "failed-but-still-current" outcomes from the success
285+
* outcome. The replaced-vs-current check and the StopPairing call are
286+
* performed under the same controller lock, eliminating the TOCTOU window
287+
* between an external check of currentCommissioning and the subsequent stop.
288+
*
289+
* Used by the post-PASE watchdog fire path to suppress a spurious
290+
* CHIP_ERROR_TIMEOUT when the operation slot has already been taken by a
291+
* successor.
292+
*/
293+
- (MTRStopCommissioningOutcome)stopCommissioningAtomically:(MTRCommissioningOperation *)commissioning
294+
forCommissioningID:(NSNumber *)commissioningID;
295+
258296
/**
259297
* Notification that a commissioning operation is done.
260298
*/

src/darwin/Framework/CHIP/MTRDeviceController_Concrete.mm

Lines changed: 125 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -146,6 +146,19 @@ @implementation MTRDeviceController_Concrete {
146146
// Keep track of dns-sd resolution objects for shutdown-time cleanup.
147147
os_unfair_lock _deviceConnectivityMonitorLock;
148148
NSHashTable<MTRDeviceConnectivityMonitor *> * _weakSetOfDeviceConnectivityMonitors;
149+
150+
// Guards the read-currentCommissioning + StopPairing sequence in
151+
// -stopCommissioningAtomically:forCommissioningID:. Without this,
152+
// the post-PASE watchdog fire path has a TOCTOU between checking
153+
// `currentCommissioning != self` (the "replaced" gate) and the
154+
// subsequent StopPairing call: a concurrent
155+
// setupCommissioningSessionWithPayload: on another thread can swap
156+
// currentCommissioning between the two reads and the watchdog
157+
// either dispatches a spurious CHIP_ERROR_TIMEOUT to a now-detached
158+
// delegate or fails to release a still-current C++ commissioning
159+
// slot. This lock makes the check-and-act sequence atomic w.r.t.
160+
// any other writer of currentCommissioning that also acquires it.
161+
os_unfair_lock _currentCommissioningLock;
149162
}
150163

151164
// TODO: Figure out whether the work queue storage lives here or in the superclass
@@ -205,6 +218,7 @@ - (nullable instancetype)initWithFactory:(MTRDeviceControllerFactory *)factory
205218
_currentCommissioning = nil;
206219
_commissioningQueue = dispatch_queue_create("org.csa-iot.matter.framework.commissioning.workqueue", DISPATCH_QUEUE_SERIAL_WITH_AUTORELEASE_POOL);
207220
_deviceConnectivityMonitorLock = OS_UNFAIR_LOCK_INIT;
221+
_currentCommissioningLock = OS_UNFAIR_LOCK_INIT;
208222

209223
if (storageDelegate != nil) {
210224
if (storageDelegateQueue == nil) {
@@ -861,18 +875,87 @@ - (BOOL)setupCommissioningSessionWithPayload:(MTRSetupPayload *)payload
861875

862876
MTR_LOG("%@ Setting up commissioning session for device ID 0x%016llX with setup payload %@", self, newNodeID.unsignedLongLongValue, payload);
863877

878+
// Compute the pairing code first so we can compare against any stale
879+
// post-PASE commissioning before we touch metrics state.
880+
NSString * earlyPairingCode = [payload qrCodeString:nil];
881+
if (earlyPairingCode == nil) {
882+
earlyPairingCode = [payload manualEntryCode];
883+
}
884+
885+
// If we have a stale internally-created commissioning still parked at the
886+
// post-PASE waiting state for a *different* setup payload, treat the new
887+
// setupCommissioningSessionWithPayload: call as an implicit cancel of the
888+
// previous one. Otherwise the new call would fail with CHIP_ERROR_BUSY
889+
// (0xDB) because the controller still considers the prior commissioning
890+
// in flight, and the user would be stuck until a process restart.
891+
//
892+
// IMPORTANT: do this BEFORE resetting/beginning kMetricDeviceCommissioning
893+
// for the new attempt. -_cancelCommissioning routes through the bridge,
894+
// which calls MATTER_LOG_METRIC_END(kMetricDeviceCommissioning, ...) -- if
895+
// we had already begun the new attempt's metric, the synthetic CANCELLED
896+
// from the stale operation would close out the new attempt's metric,
897+
// leaving it without a matching begin/end and skewing telemetry. Also
898+
// bail out early on cancel failure here, before any new-attempt metric
899+
// state is touched, so we never leave kMetricDeviceCommissioning open
900+
// across a returned-NO failure.
901+
auto * staleCommissioning = self.currentCommissioning;
902+
if (staleCommissioning && staleCommissioning.isInternallyCreated
903+
&& staleCommissioning.isWaitingAfterPASEEstablished
904+
&& earlyPairingCode != nil
905+
&& ![staleCommissioning.setupPayload isEqualToString:earlyPairingCode]) {
906+
MTR_LOG("%@ implicitly cancelling stale post-PASE commissioning %@ before starting a new one with a different payload", self, staleCommissioning);
907+
908+
// Synchronously tear down the stale operation's post-PASE watchdog
909+
// BEFORE we begin the new attempt's metrics. The setter on
910+
// isWaitingAfterPASEEstablished:NO bounces the teardown onto
911+
// _delegateQueue via dispatch_async, which leaves a race window: a
912+
// watchdog timer block already on _delegateQueue could still fire
913+
// AFTER we have called MATTER_LOG_METRIC_BEGIN(kMetricDeviceCommissioning)
914+
// for the new attempt. That fire would route through
915+
// -_dispatchCommissioningError -> MATTER_LOG_METRIC_END, ending
916+
// the NEW attempt's metric span and corrupting telemetry -- exactly
917+
// the bug this PR exists to prevent. -_syncCancelPostPASEWatchdog
918+
// guarantees no further timer block can run.
919+
[staleCommissioning _syncCancelPostPASEWatchdog];
920+
921+
// Use the atomic stop variant so the "are we still the current
922+
// commissioning?" gate check and the StopPairing call happen under
923+
// a single controller lock. This closes the TOCTOU window between
924+
// reading self.currentCommissioning above and the StopPairing call
925+
// inside -_cancelCommissioning -- a concurrent setter on another
926+
// thread could otherwise swap the slot out from under us.
927+
MTRStopCommissioningOutcome outcome = [self stopCommissioningAtomically:staleCommissioning
928+
forCommissioningID:staleCommissioning.commissioningID];
929+
switch (outcome) {
930+
case MTRStopCommissioningOutcomeReplaced:
931+
case MTRStopCommissioningOutcomeStopped:
932+
// Replaced: another path already cleared the stale slot for us
933+
// (e.g. -_commissioningDone: ran while we were preparing). We
934+
// can proceed with the new attempt either way.
935+
// Stopped: the atomic stop succeeded; the stale slot is cleared
936+
// synchronously under _currentCommissioningLock. Proceed.
937+
break;
938+
case MTRStopCommissioningOutcomeFailedStillCurrent:
939+
// StopPairing failed at the CHIP layer; surface the error rather
940+
// than silently starting a new commissioning that is about to
941+
// hit the very CHIP_ERROR_BUSY we just failed to clear. No
942+
// new-attempt metrics have been begun yet, so there is nothing
943+
// to MATTER_LOG_METRIC_END here.
944+
MTR_LOG_ERROR("%@ failed to implicitly cancel stale commissioning %@", self, staleCommissioning);
945+
if (error) {
946+
*error = [MTRError errorForCHIPErrorCode:CHIP_ERROR_BUSY];
947+
}
948+
return NO;
949+
}
950+
}
951+
864952
[[MTRMetricsCollector sharedInstance] resetMetrics];
865953

866954
// Track overall commissioning
867955
MATTER_LOG_METRIC_BEGIN(kMetricDeviceCommissioning);
868956
emitMetricForSetupPayload(payload);
869957

870-
// Try to get a QR code if possible (because it has a better
871-
// discriminator, etc), then fall back to manual code if that fails.
872-
NSString * pairingCode = [payload qrCodeString:nil];
873-
if (pairingCode == nil) {
874-
pairingCode = [payload manualEntryCode];
875-
}
958+
NSString * pairingCode = earlyPairingCode;
876959
if (pairingCode == nil) {
877960
CHIP_ERROR errorCode = CHIP_ERROR_INVALID_ARGUMENT;
878961
MATTER_LOG_METRIC_END(kMetricDeviceCommissioning, errorCode);
@@ -1357,6 +1440,42 @@ - (BOOL)stopCommissioning:(MTRCommissioningOperation *)commissioning forCommissi
13571440
return [self _cancelCommissioning:currentCommissioning forNodeID:commissioningID error:nil];
13581441
}
13591442

1443+
- (MTRStopCommissioningOutcome)stopCommissioningAtomically:(MTRCommissioningOperation *)commissioning
1444+
forCommissioningID:(NSNumber *)commissioningID
1445+
{
1446+
// Hold _currentCommissioningLock across BOTH the "are we still the
1447+
// current commissioning?" gate check AND the StopPairing call so a
1448+
// concurrent setter of currentCommissioning cannot swap the slot out
1449+
// from under us between the two operations. This eliminates the
1450+
// TOCTOU window the watchdog fire path used to have when it did the
1451+
// check and the stop as two independent reads of currentCommissioning.
1452+
//
1453+
// The lock-order concern here is whether StopPairing (via
1454+
// syncRunOnWorkQueue on the Matter queue) can call back into a path
1455+
// that re-acquires _currentCommissioningLock and deadlocks. In
1456+
// practice it cannot: the only callbacks that mutate currentCommissioning
1457+
// (commissioningDone: -> _commissioningDone:) are delivered via
1458+
// dispatch_async on _commissioningQueue, so they cannot re-enter
1459+
// synchronously under this lock; they will simply queue behind us
1460+
// and acquire the lock after we have returned.
1461+
os_unfair_lock_lock(&_currentCommissioningLock);
1462+
if (self.currentCommissioning != commissioning) {
1463+
os_unfair_lock_unlock(&_currentCommissioningLock);
1464+
MTR_LOG("%@ stopCommissioningAtomically: commissioning %@ was already replaced; no-op", self, commissioning);
1465+
return MTRStopCommissioningOutcomeReplaced;
1466+
}
1467+
1468+
NSError * stopError = nil;
1469+
BOOL ok = [self _cancelCommissioning:commissioning forNodeID:commissioningID error:&stopError];
1470+
os_unfair_lock_unlock(&_currentCommissioningLock);
1471+
1472+
if (!ok) {
1473+
MTR_LOG_ERROR("%@ stopCommissioningAtomically: StopPairing failed for commissioning %@: %@", self, commissioning, stopError);
1474+
return MTRStopCommissioningOutcomeFailedStillCurrent;
1475+
}
1476+
return MTRStopCommissioningOutcomeStopped;
1477+
}
1478+
13601479
- (BOOL)_cancelCommissioning:(MTRCommissioningOperation *)commissioning forNodeID:(NSNumber *)nodeID error:(NSError * __autoreleasing *)error
13611480
{
13621481
BOOL isWaitingAfterPASEEstablished = commissioning.isWaitingAfterPASEEstablished;

0 commit comments

Comments
 (0)