Darwin: commissioning silently wedges when client never advances past PASE#72267
Darwin: commissioning silently wedges when client never advances past PASE#72267woody-apple wants to merge 1 commit into
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces a watchdog timer (_postPASEWatchdog) in MTRCommissioningOperation to prevent the commissioning controller from getting permanently wedged in a busy state if a client drops the flow after PASE is established. It also adds thread-safe state management for isWaitingAfterPASEEstablished using os_unfair_lock. The review feedback recommends simplifying the implementation by consistently using the custom setter setIsWaitingAfterPASEEstablished: instead of manual locking and direct ivar access in several places.
There was a problem hiding this comment.
Pull request overview
This PR addresses a commissioning “wedge” on Darwin where commissioning can get stuck indefinitely after PASE if the client never advances (never calls commissionNodeWithID: or cancelCommissioningForNodeID:), causing subsequent commissioning attempts to fail with CHIP_ERROR_BUSY. The fix adds a post-PASE watchdog to force teardown after a bounded wait and adds controller-side logic to implicitly cancel stale, parked internal commissionings when a new setup payload arrives.
Changes:
- Add a 5-minute post-PASE watchdog in
MTRCommissioningOperationthat cancels/tears down stuck commissionings and routes a timeout through the standard failure path. - Add an implicit-cancel path in
MTRDeviceController_Concrete setupCommissioningSessionWithPayload:newNodeID:error:to preempt stale internally-created post-PASE-waiting commissionings when a different payload is provided. - Add extensive
MTRPairingTestscoverage for watchdog arming/canceling/firing behavior and the implicit-cancel guard conditions.
Reviewed changes
Copilot reviewed 3 out of 4 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| src/darwin/Framework/CHIPTests/MTRPairingTests.m | Adds unit tests covering watchdog lifecycle, error delivery behavior, queue serialization, and implicit-cancel scenarios. |
| src/darwin/Framework/CHIP/MTRDeviceController_Concrete.mm | Adds implicit-cancel of stale post-PASE internal commissioning when a different setup payload arrives. |
| src/darwin/Framework/CHIP/MTRCommissioningOperation.mm | Implements post-PASE watchdog timer, state tracking/locking, late-fire guard, and teardown on stop/error paths. |
| src/darwin/Framework/CHIP/MTRCommissioningOperation_Internal.h | Exposes commissioningID and documents threading semantics for isWaitingAfterPASEEstablished. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
PR #72267: Size comparison from ff181cb to 7a8a038 Full report (35 builds for bl602, bl616, bl702, bl702l, cc13x4_26x4, cc32xx, efr32, esp32, nrfconnect, psoc6, qpg, realtek, stm32, telink)
|
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## master #72267 +/- ##
==========================================
+ Coverage 55.52% 55.57% +0.04%
==========================================
Files 1630 1630
Lines 111127 111223 +96
Branches 13418 13409 -9
==========================================
+ Hits 61706 61812 +106
+ Misses 49421 49411 -10 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
7a8a038 to
003a5ed
Compare
|
PR #72267: Size comparison from 4894db9 to 8fe6a63 Full report (21 builds for bl602, bl616, bl702, bl702l, cc13x4_26x4, cc32xx, nrfconnect, psoc6, qpg, realtek, stm32)
|
8fe6a63 to
8227502
Compare
|
PR #72267: Size comparison from 4894db9 to 23cfef4 Full report (35 builds for bl602, bl616, bl702, bl702l, cc13x4_26x4, cc32xx, efr32, esp32, nrfconnect, psoc6, qpg, realtek, stm32, telink)
|
8e87870 to
6565439
Compare
|
PR #72267: Size comparison from 8a162c6 to 6565439 Full report (12 builds for cc13x4_26x4, cc32xx, nrfconnect, qpg, realtek, stm32)
|
6565439 to
8c8e50e
Compare
|
PR #72267: Size comparison from 8a162c6 to 8c8e50e Full report (31 builds for bl602, bl616, bl702, bl702l, cc13x4_26x4, cc32xx, nrfconnect, psoc6, qpg, realtek, stm32, telink)
|
8c8e50e to
e7fb31b
Compare
e7fb31b to
6f71d8e
Compare
|
PR #72267: Size comparison from 8a162c6 to 6f71d8e Full report (35 builds for bl602, bl616, bl702, bl702l, cc13x4_26x4, cc32xx, efr32, esp32, nrfconnect, psoc6, qpg, realtek, stm32, telink)
|
edfd973 to
db0e402
Compare
|
PR #72267: Size comparison from 8a162c6 to db0e402 Full report (21 builds for bl602, bl616, bl702, bl702l, cc13x4_26x4, cc32xx, nrfconnect, psoc6, qpg, realtek, stm32)
|
db0e402 to
e687a9a
Compare
|
PR #72267: Size comparison from 8a162c6 to e687a9a Full report (21 builds for bl602, bl616, bl702, bl702l, cc13x4_26x4, cc32xx, nrfconnect, psoc6, qpg, realtek, stm32)
|
e687a9a to
a6e0d56
Compare
|
PR #72267: Size comparison from 8a162c6 to a6e0d56 Full report (21 builds for bl602, bl616, bl702, bl702l, cc13x4_26x4, cc32xx, nrfconnect, psoc6, qpg, realtek, stm32)
|
a6e0d56 to
0511e67
Compare
|
PR #72267: Size comparison from 8a162c6 to 0511e67 Full report (6 builds for bl602, bl616, bl702, bl702l, nrfconnect)
|
0511e67 to
db87237
Compare
db87237 to
3fe6c21
Compare
|
PR #72267: Size comparison from f1767a8 to 0dc9f83 Full report (31 builds for bl602, bl616, bl702, bl702l, cc13x4_26x4, cc32xx, nrfconnect, psoc6, qpg, realtek, stm32, telink)
|
aa59da3 to
328cc33
Compare
|
PR #72267: Size comparison from f1767a8 to 328cc33 Full report (35 builds for bl602, bl616, bl702, bl702l, cc13x4_26x4, cc32xx, efr32, esp32, nrfconnect, psoc6, qpg, realtek, stm32, telink)
|
328cc33 to
3038774
Compare
|
PR #72267: Size comparison from f1767a8 to 3038774 Full report (35 builds for bl602, bl616, bl702, bl702l, cc13x4_26x4, cc32xx, efr32, esp32, nrfconnect, psoc6, qpg, realtek, stm32, telink)
|
3038774 to
4dcc94a
Compare
|
PR #72267: Size comparison from 652855f to 4dcc94a Full report (5 builds for cc32xx, realtek, stm32)
|
59e5bde to
fd89f75
Compare
|
PR #72267: Size comparison from 9cf1485 to fd89f75 Full report (21 builds for bl602, bl616, bl702, bl702l, cc13x4_26x4, cc32xx, nrfconnect, psoc6, qpg, realtek, stm32)
|
If a client (CHIP framework consumer) implements commissioning:paseSessionEstablishmentComplete: but then never calls commissionNodeWithID: or cancelCommissioningForNodeID:, the MTRCommissioningOperation sits in isWaitingAfterPASEEstablished = YES forever. MTRDeviceController_Concrete keeps the operation alive via currentInternalCommissioning, so every subsequent commissioning attempt hits the busy check in startWithController and fails with CHIP_ERROR_BUSY. User-visible symptom: Matter accessories refuse to commission until the daemon (or app) is restarted. Add a 5-minute dispatch_source_t watchdog armed when isWaitingAfterPASEEstablished is set to YES, cancelled on legitimate advance, stop, or terminal error. On expiry, calls stopCommissioning:forCommissioningID: and routes CHIP_ERROR_TIMEOUT through the standard _dispatchCommissioningError path so client state can settle. Clear isWaitingAfterPASEEstablished BEFORE the controller stop call so -_cancelCommissioning does not synthesize a CANCELLED OnCommissioningComplete -- the wedge IS that no notifications arrived, so suppressing the synthetic CANCELLED avoids a redundant MATTER_LOG_METRIC_END(kMetricDeviceCommissioning) and a spurious CANCELLED on _delegateQueue ahead of the TIMEOUT we want to deliver. Defensive secondary fix in setupCommissioningSessionWithPayload:newNodeID:error:: if a stale internally-created commissioning is still parked at the post-PASE waiting state with a different setup payload, treat the new call as an implicit cancel of the previous one rather than failing the new call with BUSY. Perform the implicit cancel BEFORE MATTER_LOG_METRIC_BEGIN(kMetricDeviceCommissioning) and the metrics reset for the new attempt -- otherwise the synthetic CANCELLED end emitted by the bridge from the stale operation would close out the new attempt's metric, leaving it without a matching begin/end and skewing telemetry. This also keeps the cancel-failure early return from leaving kMetricDeviceCommissioning open across the returned NO. Threading hardening from review: * atomic test-only watchdog interval override so concurrent TSan setters on any queue race-free with the production read in _armPostPASEWatchdog * clamp non-finite (NaN / +/-INFINITY) test interval overrides to 0 (use production interval) so the multiply-by-NSEC_PER_SEC and cast to int64_t in -_armPostPASEWatchdog stays defined behavior * atomic stop via stopCommissioningAtomically:forCommissioningID:, closing the TOCTOU window between currentCommissioning read and StopPairing * synchronous _cancelPostPASEWatchdog in -stop so tests can observe the teardown immediately after -stop returns Honor the @Property (copy) declaration on -setupPayload by [copy]'ing the input in the designated initializer; mutable inputs were aliased. Regression coverage: * test_PostPASEWatchdogIntervalResetToZeroRestoresProductionInterval -- pins that +setPostPASEWatchdogIntervalForTesting:0 actually restores the production interval after a non-zero override. * test_ForceArmFailureFlag_ConcurrentSettersAndArmer_NoTSan -- TSan smoke test for the per-instance _forceNextArmFailureForTesting flag. * test_WatchdogCancel_IsSynchronousOnSelfFire -- pins the round-2 sync-cancel contract for the watchdog's own fire path; -_firePostPASEWatchdog must synchronously tear down _postPASEWatchdog at the top of the handler.
fd89f75 to
c5b2140
Compare
Summary
commissioning:paseSessionEstablishmentComplete:but then never callscommissionNodeWithID:orcancelCommissioningForNodeID:, theMTRCommissioningOperationsits inisWaitingAfterPASEEstablished = YESforever.MTRDeviceController_Concretekeeps the operation alive viacurrentInternalCommissioning, so every subsequent commissioning attempt hits the busy check instartWithControllerand fails withCHIP_ERROR_BUSY. User-visible symptom: Matter accessories refuse to commission until the daemon (or app) is restarted.dispatch_source_twatchdog armed whenisWaitingAfterPASEEstablishedis set toYES, cancelled on legitimate advance, stop, or terminal error. On expiry, callsstopCommissioning:forCommissioningID:and routesCHIP_ERROR_TIMEOUTthrough the standard_dispatchCommissioningErrorpath so client state can settle.setupCommissioningSessionWithPayload:newNodeID:error:: if a stale internally-created commissioning is still parked at the post-PASE waiting state with a different setup payload, treat the new call as an implicit cancel of the previous one rather than failing the new call with BUSY.Testing
MTRPairingTestscases exercising the watchdog arm/cancel/fire paths, late-fire guard, one-shot semantics, queue serialization, and the implicit-cancel and same-payload-no-cancel behaviors ofsetupCommissioningSessionWithPayload:.