Skip to content

Commit 6f71d8e

Browse files
committed
[MTRCommissioning] Add watchdog for client failure to advance past PASE
If a client implements commissioning:paseSessionEstablishmentComplete: but then drops the flow without ever calling commissionNodeWithID: or cancelCommissioningForNodeID:, the MTRCommissioningOperation sits in isWaitingAfterPASEEstablished = YES forever. MTRDeviceController_Concrete keeps the operation alive via currentInternalCommissioning, so every subsequent commissioning attempt hits the busy check in startWithController and fails with CHIP_ERROR_BUSY (0xDB). The user's symptom is "Matter accessories refuse to connect" indefinitely until the daemon (or app) is restarted. Add a 5-minute dispatch_source_t watchdog armed when we set isWaitingAfterPASEEstablished = YES and cancelled when: - the property is set back to NO (via the new custom setter that fires whenever commissionNodeWithID: succeeds on the controller), - the operation is stopped, - any terminal _dispatchCommissioningError path runs, - or the operation deallocates. On expiry, the watchdog stops the in-flight commissioning via stopCommissioning:forCommissioningID: (which routes through StopPairing in both the legacy isInternallyCreated:YES flow AND the public-API flow) and routes a CHIP_ERROR_TIMEOUT through the standard _dispatchCommissioningError path so client state can settle back to "no commissioning in flight". Five minutes is comfortably longer than the commissionee fail-safe (60-180 seconds) but short enough that the user is unblocked within a single sit-down troubleshooting session. Also add a defensive secondary patch in setupCommissioningSessionWithPayload:newNodeID:error: -- if a stale internally-created commissioning is still parked at the post-PASE waiting state with a *different* setup payload, treat the new call as an implicit cancel of the previous one. Without this, the new call would have failed with CHIP_ERROR_BUSY because the controller still considered the prior commissioning in flight. The implicit cancel propagates any StopPairing failure back to the caller rather than silently starting a new commissioning that is about to hit the same busy state. The implicit cancel needs the prior operation's commissioningID, so a small read-only commissioningID accessor is added to MTRCommissioningOperation_Internal.h. Threading: - The isWaitingAfterPASEEstablished setter writes the flag synchronously under an os_unfair_lock so a caller that immediately reads it back observes the updated value, then bounces only the watchdog teardown side effect onto _delegateQueue (which is the queue that owns the dispatch_source_t timer). - The watchdog event_handler consults isWaitingAfterPASEEstablished as a late-fire guard: if the client has legitimately advanced past the post-PASE waiting state but a timer block was already posted to the queue ahead of the cancel, the handler swallows the late fire instead of routing a spurious CHIP_ERROR_TIMEOUT into a now-active flow. - -stop runs stopCommissioning:forCommissioningID: BEFORE clearing isWaitingAfterPASEEstablished. A naive clear-then-stop ordering would let the watchdog event_handler observe (flag=NO, currentCommissioning=self) and suppress legitimate stop-driven cleanup; the controller-transition-first ordering ensures either the late-fire guard correctly suppresses (when the controller transition has already happened) or the fire path runs idempotently (since duplicate stopCommissioning: returns NO and the watchdog suppresses the timeout via the "replaced" check). - _dispatchCommissioningError clears isWaitingAfterPASEEstablished synchronously before bouncing the watchdog cancel, so a timer event already enqueued on _delegateQueue is suppressed by the late-fire guard. - The fire path uses stopCommissioning:forCommissioningID: rather than cancelCommissioningForNodeID: so StopPairing runs on the CHIP layer in both the isInternallyCreated:YES and isInternallyCreated:NO flows. In the fire path we distinguish two failure modes for the BOOL return: "replaced" (currentCommissioning != self -- a successor has taken the slot, suppress the timeout) versus "genuine StopPairing failure" (currentCommissioning == self but the C++ StopPairing returned an error, e.g. because the commissionee fail-safe (60-180s) has already expired before the 5-minute watchdog fires). In the genuine-failure case we MUST still escalate the timeout to the client -- otherwise the very wedge the watchdog exists to break is left in place silently. - If dispatch_source_create fails when arming the watchdog, the establishment-done path now routes a CHIP_ERROR_NO_MEMORY through the standard failure path rather than entering the unbounded wait state silently.
1 parent efc29a8 commit 6f71d8e

6 files changed

Lines changed: 4402 additions & 5 deletions

File tree

0 commit comments

Comments
 (0)