[Messaging] Honor peer SAT in MRP retransmit backoff for ICDs (backport #72230)#72391
Open
mergify[bot] wants to merge 1 commit into
Open
[Messaging] Honor peer SAT in MRP retransmit backoff for ICDs (backport #72230)#72391mergify[bot] wants to merge 1 commit into
mergify[bot] wants to merge 1 commit into
Conversation
ReliableMessageMgr::CalculateNextRetransTime short-circuited to
mActiveRetransTimeout (SAI) for the rest of an exchange as soon as
ExchangeContext::HasReceivedAtLeastOneMessage() became true. The else
branch already did the spec-correct thing — call GetMRPBaseTimeout(),
which evaluates IsPeerActive() against the peer's Session Active
Threshold — but it only ran before the first received message.
For a peer that is an Intermittently Connected Device (ICD) advertising
something like SII=6000, SAI=1200, SAT=1000, this meant every retx
scheduled after we received the first message in an exchange used an
SAI-derived backoff (sub-second to ~3.5s after margin/jitter/sender
boost) instead of the SII-derived backoff (multiple seconds) the device
actually polled on. Every retransmit landed inside the peer's sleep
window, the peer never observed them, and reliable delivery silently
failed.
The most visible failure mode is CASE Sigma3 to a sleepy device:
* Initiator sends Sigma1 -> peer sends Sigma2 -> initiator sends
Sigma3.
* Receiving Sigma2 flips HasReceivedAtLeastOneMessage() to true.
* Initiator's Sigma3 retransmits are spaced on SAI backoff.
* Peer's SAT elapses ~1s after Sigma2; peer drops to its SII polling
cadence.
* All four retx fire inside sleep windows; peer keeps retransmitting
the same Sigma2 because it never received our piggybacked ack on
Sigma3 (nor any of our standalone acks, which are spaced the same
way).
* MRP exhausts retries, CASE session times out in kSentSigma3, and
the pairing flow surfaces a generic timeout error to the
application.
This reproduces against multiple sleepy Thread ICD designs from
unrelated vendors, which makes it clearly a controller-side bug rather
than per-device firmware.
Fix: remove the shortcut. Always use GetMRPBaseTimeout(). The Active vs
Idle decision is made on every retx schedule call against the peer's
current IsPeerActive() state — which is precisely what spec
section 4.11.2.1 prescribes. No other call sites or public APIs touched.
Tests added in src/messaging/tests/TestReliableMessageProtocol.cpp,
each modeled on the existing CheckIsPeerActiveNotInitiator pattern
(drop one or more sends, let the retransmit succeed, observe the
sender delegate, assert the retransmit table drains):
* CheckICDPeerRetxUsesIdleBackoffAfterSATExpiry — core regression:
after the peer's SAT elapses, retx is scheduled on the Idle (SII)
interval even though the exchange has received a prior message.
* CheckICDPeerRetxUsesActiveBackoffWithinSATWindow — guards against
over-fixing: while the peer is still Active, retx remains on SAI.
* CheckPeerRetxUsesIdleBackoffWhenNoMessagesReceived — covers the
pre-existing else-branch behavior under the simplified code path.
The existing CheckIsPeerActiveNotInitiator test continues to pass: its
ActiveRetransTimeout=100ms scenario runs in sub-second time, the peer
remains within its (default) SAT window throughout, and
GetMRPBaseTimeout() returns mActiveRetransTimeout — matching the prior
shortcut's behavior.
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
(cherry picked from commit a8708ca)
|
PR #72391: Size comparison from 09d0453 to 7af991d Full report (3 builds for cc32xx, stm32)
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
ReliableMessageMgr::CalculateNextRetransTimeshort-circuits tomActiveRetransTimeout(SAI) for the rest of the exchange as soon asExchangeContext::HasReceivedAtLeastOneMessage()is true. The else branch already does the spec-correct thing — callGetMRPBaseTimeout(), which evaluatesIsPeerActive()against the peer's Session Active Threshold — but it only runs before the first received message.For a peer that is an Intermittently Connected Device (ICD) advertising something like
SII=6000, SAI=1200, SAT=1000, this means every retx scheduled after we receive the first message in an exchange uses an SAI-derived backoff (sub-second to ~3.5s after margin/jitter/sender boost) instead of the SII-derived backoff (multiple seconds) the device actually polls on. Every retransmit lands inside the peer's sleep window, the peer never observes them, and reliable delivery silently fails.The most visible failure mode is CASE Sigma3 to a sleepy device:
HasReceivedAtLeastOneMessage()to true.kSentSigma3, the pairing flow surfaces a generic timeout error to the application.This reproduces against multiple sleepy Thread ICD designs from unrelated vendors, which makes it clearly a controller-side bug rather than per-device firmware.
Fix
Remove the shortcut. Always use
GetMRPBaseTimeout(). The Active vs Idle decision is made on every retx schedule call against the peer's currentIsPeerActive()state — which is precisely what spec §4.11.2.1 prescribes. No other call sites or public APIs touched; the behavior change is observed only by ICD peers where SAT < first-retx backoff.Testing
Three new tests in
src/messaging/tests/TestReliableMessageProtocol.cpp, each modeled on the existingCheckIsPeerActiveNotInitiatorpattern — drop one or more sends, let the retransmit succeed, observe the sender delegate, then assert the retransmit table drains cleanly:CheckICDPeerRetxUsesIdleBackoffAfterSATExpiry— core regression. Peer is configuredSII=1500ms, SAI=100ms, SAT=50ms. Receiver gets a message (Active), test waits past SAT, sends a reliable response, drops it once. The retx is then allowed to succeed; total time-to-delivery is asserted >= the SII-derived floor (~1.5s), which is well above any SAI-derived spacing (~100-200ms) that would indicate the bug.CheckICDPeerRetxUsesActiveBackoffWithinSATWindow— guards against an over-broad fix. Same config butSAT=2000ms, so peer stays Active for the test duration. Retx-and-delivery must complete in well under 1.2s (SAI-fast), not on SII spacing.CheckPeerRetxUsesIdleBackoffWhenNoMessagesReceived— covers the pre-existing else-branch behavior under the simplified code path. New exchange, no prior receives → retx spaced on Idle interval sinceIsPeerActive()returns false on a fresh session.The existing
CheckIsPeerActiveNotInitiatortest continues to pass: itsActiveRetransTimeout=100msscenario runs in sub-second time, the peer remains within its (default 4s) SAT window throughout, andGetMRPBaseTimeout()returnsmActiveRetransTimeout— matching the prior shortcut's behavior.Spec references
Open review questions
HasReceivedAtLeastOneMessage()shortcut entirely, vs. layering an additionalIsPeerActive()check on top of it. The shortcut existed for "we know the peer is alive" reasons; the spec language argues that doesn't override per-retx Active/Idle re-evaluation.kMrpTimingMargin = 50ms). Reviewers may want to suggest mock-clock alternatives if Pigweed test infra supports them in this directory.This is an automatic backport of pull request #72230 done by [Mergify](https://mergify.com).