Commit 6f71d8e
committed
[MTRCommissioning] Add watchdog for client failure to advance past PASE
If a client implements commissioning:paseSessionEstablishmentComplete:
but then drops the flow without ever calling commissionNodeWithID: or
cancelCommissioningForNodeID:, the MTRCommissioningOperation sits in
isWaitingAfterPASEEstablished = YES forever. MTRDeviceController_Concrete
keeps the operation alive via currentInternalCommissioning, so every
subsequent commissioning attempt hits the busy check in startWithController
and fails with CHIP_ERROR_BUSY (0xDB). The user's symptom is "Matter
accessories refuse to connect" indefinitely until the daemon (or app) is
restarted.
Add a 5-minute dispatch_source_t watchdog armed when we set
isWaitingAfterPASEEstablished = YES and cancelled when:
- the property is set back to NO (via the new custom setter that fires
whenever commissionNodeWithID: succeeds on the controller),
- the operation is stopped,
- any terminal _dispatchCommissioningError path runs,
- or the operation deallocates.
On expiry, the watchdog stops the in-flight commissioning via
stopCommissioning:forCommissioningID: (which routes through StopPairing in
both the legacy isInternallyCreated:YES flow AND the public-API flow) and
routes a CHIP_ERROR_TIMEOUT through the standard _dispatchCommissioningError
path so client state can settle back to "no commissioning in flight". Five
minutes is comfortably longer than the commissionee fail-safe (60-180
seconds) but short enough that the user is unblocked within a single
sit-down troubleshooting session.
Also add a defensive secondary patch in
setupCommissioningSessionWithPayload:newNodeID:error: -- if a stale
internally-created commissioning is still parked at the post-PASE waiting
state with a *different* setup payload, treat the new call as an implicit
cancel of the previous one. Without this, the new call would have failed
with CHIP_ERROR_BUSY because the controller still considered the prior
commissioning in flight. The implicit cancel propagates any StopPairing
failure back to the caller rather than silently starting a new
commissioning that is about to hit the same busy state. The implicit
cancel needs the prior operation's commissioningID, so a small read-only
commissioningID accessor is added to MTRCommissioningOperation_Internal.h.
Threading:
- The isWaitingAfterPASEEstablished setter writes the flag synchronously
under an os_unfair_lock so a caller that immediately reads it back
observes the updated value, then bounces only the watchdog teardown
side effect onto _delegateQueue (which is the queue that owns the
dispatch_source_t timer).
- The watchdog event_handler consults isWaitingAfterPASEEstablished as
a late-fire guard: if the client has legitimately advanced past the
post-PASE waiting state but a timer block was already posted to the
queue ahead of the cancel, the handler swallows the late fire instead
of routing a spurious CHIP_ERROR_TIMEOUT into a now-active flow.
- -stop runs stopCommissioning:forCommissioningID: BEFORE clearing
isWaitingAfterPASEEstablished. A naive clear-then-stop ordering
would let the watchdog event_handler observe (flag=NO,
currentCommissioning=self) and suppress legitimate stop-driven
cleanup; the controller-transition-first ordering ensures either the
late-fire guard correctly suppresses (when the controller transition
has already happened) or the fire path runs idempotently (since
duplicate stopCommissioning: returns NO and the watchdog suppresses
the timeout via the "replaced" check).
- _dispatchCommissioningError clears isWaitingAfterPASEEstablished
synchronously before bouncing the watchdog cancel, so a timer event
already enqueued on _delegateQueue is suppressed by the late-fire
guard.
- The fire path uses stopCommissioning:forCommissioningID: rather than
cancelCommissioningForNodeID: so StopPairing runs on the CHIP layer
in both the isInternallyCreated:YES and isInternallyCreated:NO
flows. In the fire path we distinguish two failure modes for the
BOOL return: "replaced" (currentCommissioning != self -- a successor
has taken the slot, suppress the timeout) versus "genuine
StopPairing failure" (currentCommissioning == self but the C++
StopPairing returned an error, e.g. because the commissionee
fail-safe (60-180s) has already expired before the 5-minute
watchdog fires). In the genuine-failure case we MUST still
escalate the timeout to the client -- otherwise the very wedge the
watchdog exists to break is left in place silently.
- If dispatch_source_create fails when arming the watchdog, the
establishment-done path now routes a CHIP_ERROR_NO_MEMORY through
the standard failure path rather than entering the unbounded wait
state silently.1 parent efc29a8 commit 6f71d8e
6 files changed
Lines changed: 4402 additions & 5 deletions
File tree
- src/darwin/Framework
- CHIPTests
- CHIP
- Matter.xcodeproj
0 commit comments