[Messaging] Honor peer SAT in MRP retransmit backoff for ICDs (backport #72230) by mergify[bot] · Pull Request #72391 · project-chip/connectedhomeip

mergify · 2026-06-03T19:10:04Z

Problem

ReliableMessageMgr::CalculateNextRetransTime short-circuits to mActiveRetransTimeout (SAI) for the rest of the exchange as soon as ExchangeContext::HasReceivedAtLeastOneMessage() is true. The else branch already does the spec-correct thing — call GetMRPBaseTimeout(), which evaluates IsPeerActive() against the peer's Session Active Threshold — but it only runs before the first received message.

For a peer that is an Intermittently Connected Device (ICD) advertising something like SII=6000, SAI=1200, SAT=1000, this means every retx scheduled after we receive the first message in an exchange uses an SAI-derived backoff (sub-second to ~3.5s after margin/jitter/sender boost) instead of the SII-derived backoff (multiple seconds) the device actually polls on. Every retransmit lands inside the peer's sleep window, the peer never observes them, and reliable delivery silently fails.
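The gap between the two backoff schedules can be made concrete with a small calculation. The sketch below assumes the MRP backoff formula and constants as they are commonly stated for Matter (margin 1.1, base 1.6, threshold 1, jitter 0.25); those values and the `BackoffMs` helper are illustrative assumptions, and any SDK-specific sender boost is deliberately left out.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Assumed MRP backoff constants; verify against the spec revision in use.
constexpr double kBackoffMargin    = 1.1;
constexpr double kBackoffBase      = 1.6;
constexpr double kBackoffJitter    = 0.25;
constexpr int    kBackoffThreshold = 1;

// Hypothetical helper: backoff in milliseconds before the next retransmit.
//   baseIntervalMs - SAI or SII, depending on which one the sender picked
//   sendCount      - number of sends so far, counted from 0
//   jitterSample   - a sample in [0, 1]; 0 gives the floor, 1 the ceiling
double BackoffMs(double baseIntervalMs, int sendCount, double jitterSample)
{
    assert(baseIntervalMs >= 0.0);
    assert(jitterSample >= 0.0 && jitterSample <= 1.0);

    // Exponential growth only starts once sendCount exceeds the threshold.
    const int exponent = std::max(0, sendCount - kBackoffThreshold);

    return baseIntervalMs * kBackoffMargin * std::pow(kBackoffBase, exponent) *
        (1.0 + jitterSample * kBackoffJitter);
}
```

Under these assumptions, `BackoffMs(1200, 0, 0.0)` is about 1320 ms while `BackoffMs(6000, 0, 0.0)` is about 6600 ms, so a sender that wrongly uses SAI as its base retransmits several times faster than the SII cadence of an idle peer.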

The most visible failure mode is CASE Sigma3 to a sleepy device:

Initiator sends Sigma1 → peer sends Sigma2 → initiator sends Sigma3.
Receiving Sigma2 flips HasReceivedAtLeastOneMessage() to true.
Initiator's Sigma3 retransmits are now spaced on SAI backoff.
Peer's SAT elapses ~1s after Sigma2; peer drops to its SII polling cadence.
All four retx fire inside sleep windows; peer keeps retransmitting the same Sigma2 because it never received our piggybacked ack on Sigma3 (nor any of our standalone acks, which are spaced the same way).
MRP exhausts retries, the CASE session times out in kSentSigma3, and the pairing flow surfaces a generic timeout error to the application.

This reproduces against multiple sleepy Thread ICD designs from unrelated vendors, which makes it clearly a controller-side bug rather than per-device firmware.

Fix

Remove the shortcut and always use GetMRPBaseTimeout(). The Active vs Idle decision is then made on every retx schedule call against the peer's current IsPeerActive() state, which is what spec §4.11.2.1 prescribes. No other call sites or public APIs are touched; the behavior change is observable only for ICD peers whose SAT is shorter than the first-retx backoff.
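The difference between the old and new base-timeout selection can be sketched with a simplified model. This is not the SDK code: `PeerMrpConfig`, `SessionModel`, `OldBaseTimeout` and `NewBaseTimeout` are hypothetical names, and the session is reduced to a last-activity timestamp plus the peer's advertised SII/SAI/SAT.

```cpp
#include <cassert>
#include <chrono>

using Ms = std::chrono::milliseconds;

// Hypothetical stand-in for the peer's advertised MRP parameters.
struct PeerMrpConfig
{
    Ms idleInterval;    // SII
    Ms activeInterval;  // SAI
    Ms activeThreshold; // SAT
};

// Hypothetical, heavily simplified session model.
struct SessionModel
{
    PeerMrpConfig config;
    Ms lastPeerActivity; // time the last message from the peer was received

    // The peer counts as Active only while less than SAT has elapsed.
    bool IsPeerActive(Ms now) const
    {
        assert(now >= lastPeerActivity);
        return (now - lastPeerActivity) < config.activeThreshold;
    }

    Ms BaseTimeout(Ms now) const { return IsPeerActive(now) ? config.activeInterval : config.idleInterval; }
};

// Old behavior: once any message has been received on the exchange,
// always use SAI, regardless of how long ago the peer was last heard.
Ms OldBaseTimeout(const SessionModel & session, bool hasReceivedAtLeastOneMessage, Ms now)
{
    return hasReceivedAtLeastOneMessage ? session.config.activeInterval : session.BaseTimeout(now);
}

// New behavior: re-evaluate Active vs Idle on every scheduling call.
Ms NewBaseTimeout(const SessionModel & session, Ms now)
{
    return session.BaseTimeout(now);
}
```

With SII=6000, SAI=1200, SAT=1000 and the last peer message at t=0, the old selection still returns 1200 ms at t=1500 ms, whereas the new selection returns 6000 ms there and 1200 ms only while t is below 1000 ms.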

Testing

Three new tests in src/messaging/tests/TestReliableMessageProtocol.cpp, each modeled on the existing CheckIsPeerActiveNotInitiator pattern — drop one or more sends, let the retransmit succeed, observe the sender delegate, then assert the retransmit table drains cleanly:

CheckICDPeerRetxUsesIdleBackoffAfterSATExpiry — core regression. Peer is configured SII=1500ms, SAI=100ms, SAT=50ms. Receiver gets a message (Active), test waits past SAT, sends a reliable response, drops it once. The retx is then allowed to succeed; total time-to-delivery is asserted >= the SII-derived floor (~1.5s), which is well above any SAI-derived spacing (~100-200ms) that would indicate the bug.
CheckICDPeerRetxUsesActiveBackoffWithinSATWindow — guards against an over-broad fix. Same config but SAT=2000ms, so peer stays Active for the test duration. Retx-and-delivery must complete in well under 1.2s (SAI-fast), not on SII spacing.
CheckPeerRetxUsesIdleBackoffWhenNoMessagesReceived — covers the pre-existing else-branch behavior under the simplified code path. New exchange, no prior receives → retx spaced on Idle interval since IsPeerActive() returns false on a fresh session.

The existing CheckIsPeerActiveNotInitiator test continues to pass: its ActiveRetransTimeout=100ms scenario runs in sub-second time, the peer remains within its (default 4s) SAT window throughout, and GetMRPBaseTimeout() returns mActiveRetransTimeout — matching the prior shortcut's behavior.

Spec references

Matter Core spec §4.11.2.1 Retransmissions
Matter Core spec §4.12 Intermittently Connected Devices

Open review questions

Approach — removing the HasReceivedAtLeastOneMessage() shortcut entirely, vs. layering an additional IsPeerActive() check on top of it. The shortcut existed for "we know the peer is alive" reasons; the spec language argues that doesn't override per-retx Active/Idle re-evaluation.
Test design — wall-clock timing thresholds are tuned for stability on CI runners (kMrpTimingMargin = 50ms). Reviewers may want to suggest mock-clock alternatives if Pigweed test infra supports them in this directory.
Spec interpretation — confirming that "Active/Idle decision is per-retx-schedule" is the spec-correct reading.
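As one possible shape for the mock-clock alternative raised under test design, the sketch below injects a manually advanced clock so the SAT boundary is crossed deterministically rather than by sleeping. `FakeClock` and `BaseTimeoutAt` are hypothetical names for illustration only; whether the test infrastructure in this directory offers an equivalent is not established here.

```cpp
#include <cassert>
#include <chrono>

using Ms = std::chrono::milliseconds;

// Hypothetical injectable clock: tests advance it explicitly instead of sleeping.
class FakeClock
{
public:
    Ms Now() const { return mNow; }

    void Advance(Ms delta)
    {
        assert(delta >= Ms(0));
        mNow += delta;
    }

private:
    Ms mNow{ 0 };
};

// Simplified stand-in for the Active/Idle choice, reading time from the
// injected clock rather than from the system clock.
Ms BaseTimeoutAt(const FakeClock & clock, Ms lastPeerActivity, Ms sii, Ms sai, Ms sat)
{
    return (clock.Now() - lastPeerActivity) < sat ? sai : sii;
}
```

A test built this way would record the receive time, advance the clock past SAT, and assert that the selected base switches from SAI to SII, with no dependence on CI runner timing.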

This is an automatic backport of pull request #72230 done by [Mergify](https://mergify.com).

ReliableMessageMgr::CalculateNextRetransTime short-circuited to mActiveRetransTimeout (SAI) for the rest of an exchange as soon as ExchangeContext::HasReceivedAtLeastOneMessage() became true. The else branch already did the spec-correct thing (call GetMRPBaseTimeout(), which evaluates IsPeerActive() against the peer's Session Active Threshold), but it only ran before the first received message.

For a peer that is an Intermittently Connected Device (ICD) advertising something like SII=6000, SAI=1200, SAT=1000, this meant every retx scheduled after we received the first message in an exchange used an SAI-derived backoff (sub-second to ~3.5s after margin/jitter/sender boost) instead of the SII-derived backoff (multiple seconds) the device actually polled on. Every retransmit landed inside the peer's sleep window, the peer never observed them, and reliable delivery silently failed.

The most visible failure mode is CASE Sigma3 to a sleepy device:

* Initiator sends Sigma1 -> peer sends Sigma2 -> initiator sends Sigma3.
* Receiving Sigma2 flips HasReceivedAtLeastOneMessage() to true.
* Initiator's Sigma3 retransmits are spaced on SAI backoff.
* Peer's SAT elapses ~1s after Sigma2; peer drops to its SII polling cadence.
* All four retx fire inside sleep windows; peer keeps retransmitting the same Sigma2 because it never received our piggybacked ack on Sigma3 (nor any of our standalone acks, which are spaced the same way).
* MRP exhausts retries, CASE session times out in kSentSigma3, and the pairing flow surfaces a generic timeout error to the application.

This reproduces against multiple sleepy Thread ICD designs from unrelated vendors, which makes it clearly a controller-side bug rather than per-device firmware.

Fix: remove the shortcut. Always use GetMRPBaseTimeout(). The Active vs Idle decision is made on every retx schedule call against the peer's current IsPeerActive() state, which is precisely what spec section 4.11.2.1 prescribes. No other call sites or public APIs touched.

Tests added in src/messaging/tests/TestReliableMessageProtocol.cpp, each modeled on the existing CheckIsPeerActiveNotInitiator pattern (drop one or more sends, let the retransmit succeed, observe the sender delegate, assert the retransmit table drains):

* CheckICDPeerRetxUsesIdleBackoffAfterSATExpiry (core regression): after the peer's SAT elapses, retx is scheduled on the Idle (SII) interval even though the exchange has received a prior message.
* CheckICDPeerRetxUsesActiveBackoffWithinSATWindow (guards against over-fixing): while the peer is still Active, retx remains on SAI.
* CheckPeerRetxUsesIdleBackoffWhenNoMessagesReceived: covers the pre-existing else-branch behavior under the simplified code path.

The existing CheckIsPeerActiveNotInitiator test continues to pass: its ActiveRetransTimeout=100ms scenario runs in sub-second time, the peer remains within its (default) SAT window throughout, and GetMRPBaseTimeout() returns mActiveRetransTimeout, matching the prior shortcut's behavior.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
(cherry picked from commit a8708ca)

github-actions · 2026-06-03T20:17:13Z

PR #72391: Size comparison from 09d0453 to 7af991d

Full report (3 builds for cc32xx, stm32)

platform	target	config	section	`09d0453`	`7af991d`	change	% change
cc32xx	air-purifier	CC3235SF_LAUNCHXL	FLASH	550146	550122	-24	-0.0
			RAM	205176	205176	0	0.0
	lock	CC3235SF_LAUNCHXL	FLASH	583386	583370	-16	-0.0
			RAM	205384	205384	0	0.0
stm32	light	STM32WB5MM-DK	FLASH	466512	466488	-24	-0.0
			RAM	141376	141376	0	0.0

mergify Bot added the backport-v1.4.2-branch label (Backport PR targeting v1.4.2-branch, created by Mergify) Jun 3, 2026

github-actions Bot added the messaging label Jun 3, 2026

mergify[bot] wants to merge 1 commit into v1.4.2-branch from mergify/bp/v1.4.2-branch/pr-72230
