fix: substraction issue in transport#460
Conversation
There was a problem hiding this comment.
Code Review
This pull request updates the message deduplication logic in rs-matter to use wrapping subtraction for message counter differences, preventing potential panics in debug builds. The reviewer suggested a more idiomatic implementation of the wrapping arithmetic and recommended updating the forward-check logic to correctly handle 32-bit counter wrap-around by utilizing the signed difference.
|
PR #460: Size comparison from b23a6ab to c733487 Full report (8 builds for (core), dimmable-light, onoff-light, onoff-light-bt, speaker)
|
|
@skoky Please look at the suggestions from Gemini in its code review. Gemini sometimes suggests wrong things, but in either case, all comments must be addressed:
The above can also be done by whatever llm you are using, but must be done. |
|
Also, you must sign the CLA, or else I cannot merge your changes. |
Improving subst logic Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
|
PR #460: Size comparison from b23a6ab to f3d3261 Full report (8 builds for (core), dimmable-light, onoff-light, onoff-light-bt, speaker)
|
There was a problem hiding this comment.
Pull request overview
Fixes a debug-build overflow/panic in transport message-counter deduplication by changing how the signed counter delta is computed in RxCtrState::post_recv, making the subtraction safe across the full u32 range.
Changes:
- Compute the counter difference using
u32::wrapping_subbefore casting toi32, avoiding debug overflow panics in counter delta calculation.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
@skoky I'm closing in favor of #462:
|
|
@skoky One more thing and this thing is VERY important!: What controller/commissioner do you use? Why? Because hitting the integer overflow in debug - while it should not happen - means likely the controller/commissioner has a bug as well. The fact that you have such a big message counter (that it causes an I ran my concern with the AI, and here's the outcome: === My question: Look what another AI produced as a diagnostics of the original overflow-in-debug: "' What I'm really surprised by, is that it says Yes, they SHOULD be initialized with a random number, but MUST always be &-ed with 0x0fff_ffff, right? So to me getting a so-big msg_ctr or self.max_ctr sounds almost impossible? What do you think? Its answer: You're right and the diagnostic is incomplete. Let me lay out what's actually true. What the spec mandates (§4.6.1.1)All message counters SHALL be initialized with a random value using the Crypto_DRBG(len = 28) + 1 primitive. The random initializer ranges from 1 to 2²⁸. So 0x0fffffff isn't arbitrary — it's what the spec requires. A spec-compliant peer's initial counter is always in [1, 2²⁸]. The diagnostic saying "initialised to a random number" without the 28-bit qualifier is misleading. Why max_ctr on the receiver should also stay smallA fresh CASE/PASE session initializes RxCtrState::new(0). That matches spec §11422 / §11568 ("set the synchronized max_message_counter of the peer to 0"). So at session start: max_ctr = 0 on our side, and the peer's first message brings it to whatever the peer sent — which for a compliant peer is in [1, 2²⁸]. For the panic (msg_ctr as i32) - (max_ctr as i32) to overflow i32, we need one side ≥ 2³¹ and the other near zero — e.g. msg_ctr = 0x8000_0000, max_ctr = 1 gives i32::MIN - 1, panic. So under a strict spec-compliant scenario, you're right — this should be unreachable in commissioning. The diagnostic's narrative ("the new operational session started with a fresh random counter that happened to land in the problematic range") is implausible given a [1, 2²⁸] initial range. Either the diagnostic is wrong about the exact trigger, or the peer isn't fully spec-compliant. So what likely caused the original reportA guess, not a claim: it was probably a non-compliant peer (or a test/fuzz harness, or a packet replay with crafted counters during commissioning debugging). The diagnostic's framing of "the new session started with a fresh random counter that happened to land in the problematic range" is the AI rationalising what it saw — but a spec-compliant DRBG(28)+1 literally cannot land there. Worth being skeptical of that part of the diagnostic, while accepting that the panic itself is real and the fix is correct. My next question, a correctness of a wrong assumption on its side You are missing that we actually do not initialize the counter with 0 always, but with a true random value between 0 and 0x0fff_ffff. Look at Sessions::add body and how it is called. But I think even if we correctly initialize our sessions with a random value between 0 and 0x0fff_ffff, your analysis from above that we should not get an Answer: Right, I see — Sessions::add passes rand.next_u32() and Session::init masks it with MATTER_MSG_CTR_RANGE, so our local tx counter is uniform in [0, 2²⁸). (Off-by-one vs the spec's DRBG(28)+1 = [1, 2²⁸], but within tolerance.) And yes, my analysis holds. Here's why this changes nothing for the panic reachability: The i32 overflow in RxCtrState::post_recv depends only on the peer's counters, not ours: msg_ctr is whatever the peer transmitted. The constraint that matters is: does the peer keep its counter in [1, 2²⁸] initially and increment from there? If yes (spec-compliant), then within a single session max_ctr stays ≤ 2²⁸ + N (where N is the message count), and you'd need ~2³¹ - 2²⁸ ≈ 1.9 billion messages on one session before the overflow becomes reachable. Not happening during commissioning. So the conclusion is the same: with a spec-compliant peer, the original i32 arithmetic doesn't overflow in unicast RxCtrState during any realistic session lifetime. |
|
@ivmarkov I am using |
|
@ivmarkov It was fixed in the controller |
Please cnsider this small AI generated fix that has solved my problem with commissioning. Thanks.
Fix: Wrapping Subtraction in Message Counter Dedup
Problem
RxCtrState::post_recvinsrc/transport/dedup.rscomputed the signed differencebetween two message counters using plain
as i32casts followed by subtraction:Matter message counters are
u32values initialised to a random number. Whenmax_ctris larger thani32::MAX(2 147 483 647), casting it toi32producesa negative value. Subtracting that negative from a small positive
msg_ctrcanexceed
i32::MAX, which causes a panic in Rust debug builds(
attempt to subtract with overflow).In practice this was triggered during CASE (Sigma) session establishment right
after a successful commissioning sequence (ArmFailSafe ? CSR ? AddTrustedRootCert
? AddNOC), because the new operational session started with a fresh random counter
that happened to land in the problematic range relative to the stored
max_ctr.Fix
Changed the subtraction to use
wrapping_sub, which matches the intended modularcounter arithmetic and is safe for all
u32input values:This produces the correct signed distance between the two counters across the
entire
u32range without panicking in debug mode or producing undefined behaviourin release mode.