[Messaging] Honor peer SAT in MRP retransmit backoff for ICDs (backport #72230) #72391

Open
mergify[bot] wants to merge 1 commit into v1.4.2-branch from mergify/bp/v1.4.2-branch/pr-72230

@mergify mergify Bot commented Jun 3, 2026

Problem

ReliableMessageMgr::CalculateNextRetransTime short-circuits to mActiveRetransTimeout (SAI) for the rest of the exchange as soon as ExchangeContext::HasReceivedAtLeastOneMessage() is true. The else branch already does the spec-correct thing — call GetMRPBaseTimeout(), which evaluates IsPeerActive() against the peer's Session Active Threshold — but it only runs before the first received message.

For a peer that is an Intermittently Connected Device (ICD) advertising something like SII=6000, SAI=1200, SAT=1000 (all in milliseconds), every retx scheduled after we receive the first message in an exchange uses an SAI-derived backoff (sub-second to ~3.5s after margin/jitter/sender boost) instead of the SII-derived backoff (multiple seconds) the device actually polls on. Every retransmit lands inside the peer's sleep window, the peer never observes them, and reliable delivery silently fails.
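The branch structure described above can be modeled in a few lines. This is a simplified sketch, not the SDK source: `PeerMrpConfig`, `BaseTimeout`, and `BaseTimeoutWithShortcut` are hypothetical names, and the activity check is assumed to compare the time since the last peer activity against SAT.

```cpp
#include <cassert>
#include <chrono>

using Ms = std::chrono::milliseconds;

// Hypothetical stand-in for the peer's advertised MRP parameters.
struct PeerMrpConfig
{
    Ms idleInterval;    // SII
    Ms activeInterval;  // SAI
    Ms activeThreshold; // SAT
};

// Assumption: the peer counts as Active while less than SAT has elapsed
// since we last heard from it.
constexpr bool IsPeerActive(Ms sinceLastPeerActivity, const PeerMrpConfig & cfg)
{
    return sinceLastPeerActivity < cfg.activeThreshold;
}

// Models the else branch: pick SAI or SII from the peer's current state.
constexpr Ms BaseTimeout(Ms sinceLastPeerActivity, const PeerMrpConfig & cfg)
{
    return IsPeerActive(sinceLastPeerActivity, cfg) ? cfg.activeInterval : cfg.idleInterval;
}

// Models the shortcut: once any message has been received on the exchange,
// SAI is used no matter how long ago the peer was last heard from.
constexpr Ms BaseTimeoutWithShortcut(bool hasReceivedAtLeastOneMessage, Ms sinceLastPeerActivity,
                                     const PeerMrpConfig & cfg)
{
    return hasReceivedAtLeastOneMessage ? cfg.activeInterval : BaseTimeout(sinceLastPeerActivity, cfg);
}
```

With the example values SII=6000, SAI=1200, SAT=1000 and 1500 ms since the last peer activity, the shortcut path yields the SAI value while the else-branch path yields the SII value.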

The most visible failure mode is CASE Sigma3 to a sleepy device:

  • Initiator sends Sigma1 → peer sends Sigma2 → initiator sends Sigma3.
  • Receiving Sigma2 flips HasReceivedAtLeastOneMessage() to true.
  • Initiator's Sigma3 retransmits are now spaced on SAI backoff.
  • Peer's SAT elapses ~1s after Sigma2; peer drops to its SII polling cadence.
  • All four retx fire inside sleep windows; peer keeps retransmitting the same Sigma2 because it never received our piggybacked ack on Sigma3 (nor any of our standalone acks, which are spaced the same way).
  • MRP exhausts retries, the CASE session times out in kSentSigma3, and the pairing flow surfaces a generic timeout error to the application.
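To see why the choice of base interval dominates the spacing in the sequence above, here is a rough sketch of the MRP backoff computation. The constants are assumptions recalled from the Matter Core spec's MRP parameters and should be checked against the spec before being relied on; `RetransBackoffMs` is a hypothetical helper, and the sender boost mentioned earlier is not modeled.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Assumed MRP backoff parameters (verify against the spec).
constexpr double kBackoffMargin = 1.1;
constexpr double kBackoffBase   = 1.6;
constexpr double kBackoffJitter = 0.25;
constexpr int kBackoffThreshold = 1;

// baseIntervalMs: SAI or SII, whichever was selected for this retx.
// sendCount: number of transmissions already made.
// jitter01: a random value in [0, 1], passed in so the result is deterministic.
double RetransBackoffMs(double baseIntervalMs, int sendCount, double jitter01)
{
    assert(jitter01 >= 0.0 && jitter01 <= 1.0);
    // Exponential growth only starts once sendCount exceeds the threshold.
    const int exponent = std::max(0, sendCount - kBackoffThreshold);
    return baseIntervalMs * kBackoffMargin * std::pow(kBackoffBase, exponent) * (1.0 + jitter01 * kBackoffJitter);
}
```

Because the base interval is a plain multiplier, choosing SAI where SII was required shrinks every retx interval by the same SII/SAI ratio.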

This reproduces against multiple sleepy Thread ICD designs from unrelated vendors, which points to a controller-side bug rather than a per-device firmware issue.

Fix

Remove the shortcut. Always use GetMRPBaseTimeout(). The Active vs Idle decision is made on every retx schedule call against the peer's current IsPeerActive() state, which is what spec §4.11.2.1 prescribes. No other call sites or public APIs are touched; the behavior change is observable only with ICD peers whose SAT is shorter than the first retx backoff.
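A minimal sketch of what per-retx re-evaluation means in practice, assuming the peer stays silent for the whole retransmit sequence. `BaseIntervalsPerRetx` and `PeerMrpConfig` are hypothetical names for illustration, and margin, jitter, and exponential growth are left out so that only the Active/Idle selection is visible.

```cpp
#include <cassert>
#include <chrono>
#include <vector>

using Ms = std::chrono::milliseconds;

// Hypothetical stand-in for the peer's advertised MRP parameters.
struct PeerMrpConfig
{
    Ms idleInterval;    // SII
    Ms activeInterval;  // SAI
    Ms activeThreshold; // SAT
};

// Returns the base interval selected at each retx scheduling point,
// re-checking the peer's Active/Idle state every time.
std::vector<Ms> BaseIntervalsPerRetx(const PeerMrpConfig & cfg, Ms sinceLastPeerActivity, int retxCount)
{
    assert(retxCount >= 0);
    std::vector<Ms> chosen;
    Ms elapsed = sinceLastPeerActivity;
    for (int i = 0; i < retxCount; ++i)
    {
        const bool peerActive = elapsed < cfg.activeThreshold;
        const Ms base         = peerActive ? cfg.activeInterval : cfg.idleInterval;
        chosen.push_back(base);
        // Time moves forward to the next scheduling point; the peer is
        // assumed to send nothing in between.
        elapsed += base;
    }
    return chosen;
}
```

With SII=6000, SAI=1200, SAT=1000 and scheduling starting right after the last message from the peer, only the first interval is SAI-based; every later one falls back to SII once SAT has elapsed.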

Testing

Three new tests in src/messaging/tests/TestReliableMessageProtocol.cpp, each modeled on the existing CheckIsPeerActiveNotInitiator pattern — drop one or more sends, let the retransmit succeed, observe the sender delegate, then assert the retransmit table drains cleanly:

  • CheckICDPeerRetxUsesIdleBackoffAfterSATExpiry: core regression. The peer is configured with SII=1500ms, SAI=100ms, SAT=50ms. The receiver gets a message (Active), then the test waits past SAT, sends a reliable response, and drops it once. The retx is then allowed to succeed; total time-to-delivery is asserted >= the SII-derived floor (~1.5s), which is well above any SAI-derived spacing (~100-200ms) that would indicate the bug.
  • CheckICDPeerRetxUsesActiveBackoffWithinSATWindow: guards against an over-broad fix. Same config but SAT=2000ms, so the peer stays Active for the test duration. Retx-and-delivery must complete in well under 1.2s (SAI-fast), not on SII spacing.
  • CheckPeerRetxUsesIdleBackoffWhenNoMessagesReceived: covers the pre-existing else-branch behavior under the simplified code path. New exchange, no prior receives → retx is spaced on the Idle interval, since IsPeerActive() returns false on a fresh session.

The existing CheckIsPeerActiveNotInitiator test continues to pass: its ActiveRetransTimeout=100ms scenario runs in sub-second time, the peer remains within its (default 4s) SAT window throughout, and GetMRPBaseTimeout() returns mActiveRetransTimeout — matching the prior shortcut's behavior.

Spec references

  • Matter Core spec §4.11.2.1 Retransmissions
  • Matter Core spec §4.12 Intermittently Connected Devices

Open review questions

  1. Approach — removing the HasReceivedAtLeastOneMessage() shortcut entirely, vs. layering an additional IsPeerActive() check on top of it. The shortcut existed for "we know the peer is alive" reasons; the spec language argues that doesn't override per-retx Active/Idle re-evaluation.
  2. Test design — wall-clock timing thresholds are tuned for stability on CI runners (kMrpTimingMargin = 50ms). Reviewers may want to suggest mock-clock alternatives if Pigweed test infra supports them in this directory.
  3. Spec interpretation — confirming that "Active/Idle decision is per-retx-schedule" is the spec-correct reading.
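On the mock-clock question, one possible shape for a deterministic variant of these checks is sketched below. `FakeClock` and `FakePeerSession` are hypothetical types written for this illustration, not SDK classes; whether the test infra in this directory offers an equivalent is exactly the open question.

```cpp
#include <cassert>
#include <chrono>

using Ms = std::chrono::milliseconds;

// Hypothetical manually advanced clock; time only moves when the test says so.
class FakeClock
{
public:
    Ms Now() const { return mNow; }
    void Advance(Ms delta)
    {
        assert(delta >= Ms(0));
        mNow += delta;
    }

private:
    Ms mNow{ 0 };
};

// Hypothetical model of a session's view of its peer.
class FakePeerSession
{
public:
    FakePeerSession(const FakeClock & clock, Ms idle, Ms active, Ms threshold) :
        mClock(clock), mIdle(idle), mActive(active), mThreshold(threshold)
    {}

    void MarkPeerActivity()
    {
        mHasActivity  = true;
        mLastActivity = mClock.Now();
    }

    // A fresh session with no recorded activity is treated as not Active.
    bool IsPeerActive() const { return mHasActivity && (mClock.Now() - mLastActivity) < mThreshold; }

    Ms BaseTimeout() const { return IsPeerActive() ? mActive : mIdle; }

private:
    const FakeClock & mClock;
    Ms mIdle;
    Ms mActive;
    Ms mThreshold;
    bool mHasActivity = false;
    Ms mLastActivity{ 0 };
};
```

A test built this way would advance the fake clock past SAT and compare the selected base interval directly, rather than measuring wall-clock delivery time against a margin.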

This is an automatic backport of pull request #72230 done by [Mergify](https://mergify.com).

ReliableMessageMgr::CalculateNextRetransTime short-circuited to
mActiveRetransTimeout (SAI) for the rest of an exchange as soon as
ExchangeContext::HasReceivedAtLeastOneMessage() became true. The else
branch already did the spec-correct thing — call GetMRPBaseTimeout(),
which evaluates IsPeerActive() against the peer's Session Active
Threshold — but it only ran before the first received message.

For a peer that is an Intermittently Connected Device (ICD) advertising
something like SII=6000, SAI=1200, SAT=1000, this meant every retx
scheduled after we received the first message in an exchange used an
SAI-derived backoff (sub-second to ~3.5s after margin/jitter/sender
boost) instead of the SII-derived backoff (multiple seconds) the device
actually polled on. Every retransmit landed inside the peer's sleep
window, the peer never observed them, and reliable delivery silently
failed.

The most visible failure mode is CASE Sigma3 to a sleepy device:

  * Initiator sends Sigma1 -> peer sends Sigma2 -> initiator sends
    Sigma3.
  * Receiving Sigma2 flips HasReceivedAtLeastOneMessage() to true.
  * Initiator's Sigma3 retransmits are spaced on SAI backoff.
  * Peer's SAT elapses ~1s after Sigma2; peer drops to its SII polling
    cadence.
  * All four retx fire inside sleep windows; peer keeps retransmitting
    the same Sigma2 because it never received our piggybacked ack on
    Sigma3 (nor any of our standalone acks, which are spaced the same
    way).
  * MRP exhausts retries, CASE session times out in kSentSigma3, and
    the pairing flow surfaces a generic timeout error to the
    application.

This reproduces against multiple sleepy Thread ICD designs from
unrelated vendors, which makes it clearly a controller-side bug rather
than per-device firmware.

Fix: remove the shortcut. Always use GetMRPBaseTimeout(). The Active vs
Idle decision is made on every retx schedule call against the peer's
current IsPeerActive() state — which is precisely what spec
section 4.11.2.1 prescribes. No other call sites or public APIs touched.

Tests added in src/messaging/tests/TestReliableMessageProtocol.cpp,
each modeled on the existing CheckIsPeerActiveNotInitiator pattern
(drop one or more sends, let the retransmit succeed, observe the
sender delegate, assert the retransmit table drains):

  * CheckICDPeerRetxUsesIdleBackoffAfterSATExpiry — core regression:
    after the peer's SAT elapses, retx is scheduled on the Idle (SII)
    interval even though the exchange has received a prior message.
  * CheckICDPeerRetxUsesActiveBackoffWithinSATWindow — guards against
    over-fixing: while the peer is still Active, retx remains on SAI.
  * CheckPeerRetxUsesIdleBackoffWhenNoMessagesReceived — covers the
    pre-existing else-branch behavior under the simplified code path.

The existing CheckIsPeerActiveNotInitiator test continues to pass: its
ActiveRetransTimeout=100ms scenario runs in sub-second time, the peer
remains within its (default) SAT window throughout, and
GetMRPBaseTimeout() returns mActiveRetransTimeout — matching the prior
shortcut's behavior.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
(cherry picked from commit a8708ca)

@mergify mergify Bot added the backport-v1.4.2-branch label (Backport PR targeting v1.4.2-branch, created by Mergify) on Jun 3, 2026
github-actions Bot commented Jun 3, 2026

PR #72391: Size comparison from 09d0453 to 7af991d

Full report (3 builds for cc32xx, stm32)

| platform | target | config | section | 09d0453 | 7af991d | change | % change |
|---|---|---|---|---|---|---|---|
| cc32xx | air-purifier | CC3235SF_LAUNCHXL | FLASH | 550146 | 550122 | -24 | -0.0 |
| | | | RAM | 205176 | 205176 | 0 | 0.0 |
| | lock | CC3235SF_LAUNCHXL | FLASH | 583386 | 583370 | -16 | -0.0 |
| | | | RAM | 205384 | 205384 | 0 | 0.0 |
| stm32 | light | STM32WB5MM-DK | FLASH | 466512 | 466488 | -24 | -0.0 |
| | | | RAM | 141376 | 141376 | 0 | 0.0 |

Labels

backport-v1.4.2-branch (Backport PR targeting v1.4.2-branch, created by Mergify), messaging
