-
-
Notifications
You must be signed in to change notification settings - Fork 1.9k
fix: extend retransmission timeout for ROUTER_LATE broadcast window #9186
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: develop
Are you sure you want to change the base?
fix: extend retransmission timeout for ROUTER_LATE broadcast window #9186
Conversation
@mdecourcy, Welcome to Meshtastic!Thanks for opening your first pull request. We really appreciate it. We discuss work as a team in discord, please join us in the #firmware channel. Welcome to the team 😄 |
e972293 to
9a6bf44
Compare
9a6bf44 to
5ebf52d
Compare
|
I'm not in favor of this. It will result in very long waiting times before retransmissions for all packets even though ROUTER_LATE is not common. And more importantly, ROUTER_LATE will first use the normal CLIENT delay window, and only when someone else already rebroadcasts before this, it will defer to the late rebroadcast window. Hence, this situation will only happen when for some reason the original transmitter did not receive the first rebroadcast (while the ROUTER_LATE did). |
ROUTER_LATE window applies to all CLIENT_BASE since #8567 was merged, and until running this PR on my clients I have been getting a lot of MAX_RETRANSMIT errors. Looking at the code https://github.com/meshtastic/firmware/blob/develop/src/mesh/RadioInterface.cpp#L304-L322 shouldRebroadcastEarlyLikeRouter is only utilized by router. getTxDelayMsec is only used for local packets. Unless I'm missing something, the late window is always utilized for CLIENT_BASE, as intended in #8567, which I agree with. The only issue I see is all of my CLIENT_MUTES hitting that max retransmit as the wait time is too short for the ROUTER_LATE (and now CLIENT_BASE) window. |
|
Router Late + Client Base = 20% of our mesh (165 nodes out of 815). 3.8% of the mesh is Router Late (31 nodes) |
|
Okay, so it's actually for CLIENT_BASE nodes, which indeed now always use the late rebroadcast window. Then I tend to agree with it, although it's a rather impactful change. It means that for each packet you'll have to wait longer until you get the MAX_RETRANSMISSON error. |
|
This MR will work, but I'm concerned it will increase timeouts for all packets (not just ROUTER_LATE) based on getRetransmissionMsec, so everything starts to use the worst-case delay. The added delay is actually more at 2 * pow_of_2(CWmax) + 2 * CWmax Removing airTime->channelUtilizationPercent() does not adapt due to channel load, even for node roles that aren't ROUTER_LATE.. Might be unintentional? The fix for this gets a little hairy since it requires re-doing retransmission logic. |
|
I have clanker written code (mostly because trying to wrap my head around timeout logic is painful) at https://github.com/h3lix1/mesh-firmware-sunl/tree/fix/router-late-implicit-ack-timeout if you want to pull it in and call it your own. |
|
I think @h3lix1's approach is the less impactful one, but it does not solve the issue of unnecessary retransmissions. We could probably also do some more intelligent things like detecting whether we have other 0-hop neighbors than a |
After running h3lix branch on my wismesh tag, I still see the issue. I can try my approach, but more targeted as per your suggestion |
ROUTER_LATE nodes intentionally delay rebroadcasts to provide coverage. This delay can be up to ~7 seconds (using
getTxDelayMsecWeightedWorstwithpow_of_2(CWmax) * slotTimeMsec).The retransmission timeout was previously ~5-6 seconds, causing the sender to hit MAX_RETRANSMIT and report failure before ROUTER_LATE nodes could rebroadcast. This resulted in false negative acknowledgments where messages successfully propagated but the sender incorrectly reported delivery failure.
Extended the timeout calculation to account for worst-case ROUTER_LATE delay by changing from
pow_of_2((CWmax+CWmin)/2)topow_of_2(CWmax), adding approximately 4.5-11 seconds depending on modem preset.This prevents clients from implementing unnecessary retry logic that creates duplicate messages on the mesh.
Closes #9185
🤝 Attestations
[ x ] I have tested that my proposed changes behave as described.
[ x ] I have tested that my proposed changes do not cause any obvious regressions on the following devices:
Rak Wismesh Tag, Seeed Solar Node