You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
[MTRCommissioning] Add watchdog for client failure to advance past PASE
If a client implements commissioning:paseSessionEstablishmentComplete:
but then drops the flow without ever calling commissionNodeWithID: or
cancelCommissioningForNodeID:, the MTRCommissioningOperation sits in
isWaitingAfterPASEEstablished = YES forever. MTRDeviceController_Concrete
keeps the operation alive via currentInternalCommissioning, so every
subsequent commissioning attempt hits the busy check in startWithController
and fails with CHIP_ERROR_BUSY (0xDB). The user's symptom is "Matter
accessories refuse to connect" indefinitely until the daemon (or app) is
restarted.
Add a 5-minute dispatch_source_t watchdog armed when we set
isWaitingAfterPASEEstablished = YES and cancelled when:
- the property is set back to NO (via the new custom setter that fires
whenever commissionNodeWithID: succeeds on the controller),
- the operation is stopped,
- any terminal _dispatchCommissioningError path runs,
- or the operation deallocates.
On expiry, the watchdog stops the in-flight commissioning via
stopCommissioning:forCommissioningID: (which routes through StopPairing in
both the legacy isInternallyCreated:YES flow AND the public-API flow) and
routes a CHIP_ERROR_TIMEOUT through the standard _dispatchCommissioningError
path so client state can settle back to "no commissioning in flight". Five
minutes is comfortably longer than the commissionee fail-safe (60-180
seconds) but short enough that the user is unblocked within a single
sit-down troubleshooting session.
Also add a defensive secondary patch in
setupCommissioningSessionWithPayload:newNodeID:error: -- if a stale
internally-created commissioning is still parked at the post-PASE waiting
state with a *different* setup payload, treat the new call as an implicit
cancel of the previous one. Without this, the new call would have failed
with CHIP_ERROR_BUSY because the controller still considered the prior
commissioning in flight. The implicit cancel propagates any StopPairing
failure back to the caller rather than silently starting a new
commissioning that is about to hit the same busy state. The implicit
cancel needs the prior operation's commissioningID, so a small read-only
commissioningID accessor is added to MTRCommissioningOperation_Internal.h.
Threading:
- The isWaitingAfterPASEEstablished setter writes the flag synchronously
under an os_unfair_lock so a caller that immediately reads it back
observes the updated value, then bounces only the watchdog teardown
side effect onto _delegateQueue (which is the queue that owns the
dispatch_source_t timer).
- The watchdog event_handler consults isWaitingAfterPASEEstablished as
a late-fire guard: if the client has legitimately advanced past the
post-PASE waiting state but a timer block was already posted to the
queue ahead of the cancel, the handler swallows the late fire instead
of routing a spurious CHIP_ERROR_TIMEOUT into a now-active flow.
- -stop runs stopCommissioning:forCommissioningID: BEFORE clearing
isWaitingAfterPASEEstablished. A naive clear-then-stop ordering
would let the watchdog event_handler observe (flag=NO,
currentCommissioning=self) and suppress legitimate stop-driven
cleanup; the controller-transition-first ordering ensures either the
late-fire guard correctly suppresses (when the controller transition
has already happened) or the fire path runs idempotently (since
duplicate stopCommissioning: returns NO and the watchdog suppresses
the timeout via the "replaced" check).
- _dispatchCommissioningError clears isWaitingAfterPASEEstablished
synchronously before bouncing the watchdog cancel, so a timer event
already enqueued on _delegateQueue is suppressed by the late-fire
guard.
- The fire path uses stopCommissioning:forCommissioningID: rather than
cancelCommissioningForNodeID: so StopPairing runs on the CHIP layer
in both the isInternallyCreated:YES and isInternallyCreated:NO
flows. In the fire path we distinguish two failure modes for the
BOOL return: "replaced" (currentCommissioning != self -- a successor
has taken the slot, suppress the timeout) versus "genuine
StopPairing failure" (currentCommissioning == self but the C++
StopPairing returned an error, e.g. because the commissionee
fail-safe (60-180s) has already expired before the 5-minute
watchdog fires). In the genuine-failure case we MUST still
escalate the timeout to the client -- otherwise the very wedge the
watchdog exists to break is left in place silently.
- If dispatch_source_create fails when arming the watchdog, the
establishment-done path now routes a CHIP_ERROR_NO_MEMORY through
the standard failure path rather than entering the unbounded wait
state silently.
0 commit comments