Add packet replay feature #8049
base: master
Conversation
I haven't tested it yet but the concept looks great. I'm not sure how this interacts with roles right now, but it should probably be disabled for CLIENT_MUTE, CLIENT_HIDDEN, and... TRACKER, maybe?

It is supposed to degrade gracefully, with routers prioritised. Are you able to expand a bit on the rationale for disabling it entirely on the roles you listed?

This definitely looks like it works, and doesn't require a lot of processing power. Not to cause you heartburn and hair loss, but have you considered minisketch? https://bitcoinops.org/en/topics/minisketch/ Obviously your header has more information than just which packets are missing, but minisketch can re-create the full frame of missing IDs. Maybe also add the trickle algorithm. Anyway, just ideas. Whether the devices have enough CPU to do this? Maybe.

I am explicitly not wanting to send the full packet identifier (the (from, id) tuple), in order to minimise airtime. The full tuple is 8 bytes, vs just 2 bytes for the hash.

I'm unsure what that would achieve. Trickle seems completely inapplicable to what I'm trying to do here. Can you explain why specifically you think it's relevant, and what problem it would solve?

None of those roles are intended to handle other people's packets at all, so it seems reasonable that you wouldn't want them to suddenly start forwarding packets - especially CLIENT_HIDDEN!

Think of minisketch as a form of forward error correction. If you're given two lists of IDs, one of which is missing a few entries, you can compare a list of 1000 items and reconstruct the one (or two, or three, or n) missing items. The number of bytes sent as a "sketch" depends on how many items you're interested in being able to recover (i.e. if you want to be able to recover 5 32-bit IDs out of 100 items, it only needs 155 bits). It might be too complicated for the meshtastic use-case as it does require CPU (so maybe an esp32-only solution), but in situations where there are hundreds (or thousands) of packet IDs to compare, it takes the same message space as if there are 10 items to compare.

Check it out at https://github.com/bitcoin-core/minisketch - install the library. Compile the following... The demo basically creates a list of 32-bit hex values; delete a few, and with the use of the "sketch" it can re-create the values that are missing from the list. So in this case: create a list of 1000 values with a capacity (-c) of 3, allowing for any three items in the list to disappear or change. Now delete three from the list (items 2-5). Using the remaining list + sketch, I'm able to recreate the missing items - it found the missing 3 items in the list of 1000 with the sketch. The order of the list doesn't matter. It can also see if there are any on our side that differ from the remote side - for example, I changed 0x009bc928 to 0x009bc929 and it saw the difference.

With a 200-byte sketch, it should be possible to recover 50 32-bit IDs out of a list. You could still add your bitfield to this, making it 43 bytes, and recover 37 IDs with a few bits left over. Given IDs normally use (sender, id) as the keys, that will be about 25 (sender, id) pairs. The trickle algorithm is simply a way to determine how often to send out the 0-hop packet. Now if there are 25 missing messages, getting those across will be its own challenge.

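For anyone wanting to try the flow described above, here is a rough sketch using the minisketch C API (based on the library's README example; the 32-bit field size and capacity of 3 mirror the description above, and this is untested in a meshtastic context):

```cpp
#include <minisketch.h>

#include <cstdint>
#include <cstdio>

int main()
{
    // Each side builds a sketch over its own ID list: 32-bit elements, capacity 3.
    minisketch *sketch_a = minisketch_create(32, 0, 3);
    minisketch *sketch_b = minisketch_create(32, 0, 3);

    for (uint32_t id = 1; id <= 1000; id++) {
        minisketch_add_uint64(sketch_a, id);                       // side A has the full list
        if (id < 2 || id > 4) minisketch_add_uint64(sketch_b, id); // side B is missing 2, 3, 4
    }

    // Side A serialises its sketch and sends it over the air: 3 * 32 bits = 12 bytes.
    unsigned char buf[12];
    minisketch_serialize(sketch_a, buf);

    // Side B deserialises A's sketch, merges its own into it, and decodes the difference.
    minisketch *diff = minisketch_create(32, 0, 3);
    minisketch_deserialize(diff, buf);
    minisketch_merge(diff, sketch_b);

    uint64_t missing[3];
    ssize_t n = minisketch_decode(diff, 3, missing);
    for (ssize_t i = 0; i < n; i++)
        printf("differs: 0x%08x\n", (unsigned)missing[i]); // prints 2, 3 and 4 in some order

    minisketch_destroy(sketch_a);
    minisketch_destroy(sketch_b);
    minisketch_destroy(diff);
    return n >= 0 ? 0 : 1;
}
```
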
What happens if half the list is missing? My impression is that this would render the algorithm entirely useless, yes? Does the receiving node need to successfully receive data about e.g. 100 packets in order to reconstruct the missing ten? If so, then this would cause an unacceptable delay, vs being able to immediately request even if the history is unknown. Recovering 37 identifiers plus bitfields from 200 bytes is also considerably less efficient than what I'm already doing. Remember also that meshtastic's packet identifiers are 64 bits long, not 32. The 'from' field is part of the identifier. From what I've seen here, the sketch / trickle stuff looks like a neat system to solve a different problem, but still doesn't IMO seem terribly applicable to this one.

It doesn't have to wait for 100, it could be variable. It can also be variable how many can be recovered. I did mention (sender, id) as part of the key, which will be 64 bits - I guess my terminology of calling it "sender" caused some confusion. It loses a lot of efficiency if more than half are missing, but at least it will know exactly which ones are missing. No worries if you think it doesn't fit here - I was just providing a potential option with a new shiny toy I found the other day.

Yeah, I don't think it's really a good fit for this particular problem. Thanks though 🙂

This feature causes rebroadcasting nodes to periodically advertise a zero-hop
summary of packets that they have recently rebroadcast. Other nodes can use
this information to notice that they have missed packets, and request that the
advertising node retransmit them again.
This feature is currently enabled by default, pending implementation of the
necessary config elements to adjust it remotely. In the meantime, it can be
disabled entirely by defining MESHTASTIC_EXCLUDE_REPLAY to a non-zero value
when building. All tunables are currently statically defined in ReplayModule.h.
In order to minimise overhead on-air, this feature uses a non-protobuf payload:
| Offset (bits) | Size (bits) | Description                                                             |
|---------------|-------------|-------------------------------------------------------------------------|
| 0             | 2           | Advert or request type                                                  |
| 2             | 1           | This message advertises or requests high-priority packets only          |
| 3             | 1           | (adverts only) This is the first advert since the sender booted         |
| 4             | 1           | The sender is using an infrastructure rebroadcast role                  |
| 5             | 1           | (adverts only) This is an aggregate of specific prior advertisements    |
| 6             | 1           | (adverts only) This advertisement contains a list of throttled clients  |
| 7             | 1           | Reserved for future use                                                 |
| 8             | 5           | The base sequence number to which this advertisement or request refers  |
| 13            | 1           | Reserved for future use                                                 |
| 14            | 1           | Reserved for future use                                                 |
| 15            | 1           | Reserved for future use                                                 |
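To make the bit layout above concrete, here is a minimal packing sketch. The field names, the assumption that offset 0 is the least-significant bit, and the byte order are illustrative choices only - the actual ReplayModule encoding may differ:

```cpp
#include <cstdint>

// Illustrative header fields, mirroring the table above (names are assumptions).
struct ReplayHeader {
    uint8_t type;          // 2 bits: advert or request type
    bool highPriorityOnly; // 1 bit: high-priority packets only
    bool firstSinceBoot;   // 1 bit (adverts only): first advert since boot
    bool infraRole;        // 1 bit: sender uses an infrastructure rebroadcast role
    bool aggregate;        // 1 bit (adverts only): aggregate of prior adverts
    bool throttledList;    // 1 bit (adverts only): contains a throttled-client list
    uint8_t baseSequence;  // 5 bits: base sequence number
};

// Pack into the 2-byte on-air header, treating offset 0 as the least-significant bit.
static uint16_t packReplayHeader(const ReplayHeader &h)
{
    uint16_t v = 0;
    v |= (uint16_t)(h.type & 0x3) << 0;          // bits 0-1
    v |= (uint16_t)h.highPriorityOnly << 2;      // bit 2
    v |= (uint16_t)h.firstSinceBoot << 3;        // bit 3
    v |= (uint16_t)h.infraRole << 4;             // bit 4
    v |= (uint16_t)h.aggregate << 5;             // bit 5
    v |= (uint16_t)h.throttledList << 6;         // bit 6
    // bit 7 reserved
    v |= (uint16_t)(h.baseSequence & 0x1F) << 8; // bits 8-12
    // bits 13-15 reserved
    return v;
}
```
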
Advertisements consist of the standard 2-byte header, followed by:
 - A 2-byte bitmap, indicating which of the 16 possible cache ranges are present
 - For each range indicated in the bitmap:
    - A 2-byte bitmap, indicating which of the packets in this range are referenced
    - A 2-byte priority bitmap, indicating which packets in this range are high priority
    - For each included packet, a 2-byte hash: ((from ^ id) >> 16 & 0xFFFF) ^ ((from ^ id) & 0xFFFF)
 - If the 'aggregate' flag is set, a 2-byte bitmap indicating which other adverts are included in this one
 - If the 'throttled' flag is set, a list of truncated-to-one-byte node IDs that have been throttled
A typical advertisement will have a payload size of 8 bytes plus 2 bytes per included packet.
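For illustration, the per-packet hash from the list above can be computed as follows (assuming 32-bit `from` and `id` fields, which together form the 8-byte tuple mentioned earlier):

```cpp
#include <cstdint>

// Fold the (from, id) tuple down to the 2-byte hash described above.
static uint16_t replayPacketHash(uint32_t from, uint32_t id)
{
    uint32_t x = from ^ id;
    return (uint16_t)(((x >> 16) & 0xFFFF) ^ (x & 0xFFFF));
}
```
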
If packets are requested, but no longer cached, then the sender will send a
state advertisement indicating which of its advertised packets are no longer
available.
This consists of the standard 2-byte header, followed by:
 - A 2-byte bitmap, indicating which of the 16 possible cache ranges are present
 - For each range indicated in the bitmap:
    - A 2-byte bitmap, indicating which of the packets in this range are no longer available
Requests consist of the standard 2-byte header, followed by:
 - A 2-byte bitmap, indicating which of the 16 possible cache ranges are being requested
 - For each range indicated in the bitmap:
    - A 2-byte bitmap, indicating which of the packets in this range are being requested
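As a sketch of how a receiver might walk a request payload laid out this way (buffer indexing, byte order, and the handler callback are illustrative assumptions, not the actual ReplayModule code):

```cpp
#include <cstddef>
#include <cstdint>

// Calls handlePacket(range, slot) for every packet slot being requested.
// Returns false if the payload is truncated.
template <typename Handler>
static bool parseReplayRequest(const uint8_t *buf, size_t len, Handler handlePacket)
{
    size_t pos = 2; // skip the standard 2-byte header
    if (len < pos + 2)
        return false;
    uint16_t rangeBitmap = buf[pos] | (uint16_t)(buf[pos + 1] << 8); // byte order assumed
    pos += 2;
    for (int range = 0; range < 16; range++) {
        if (!(rangeBitmap & (1u << range)))
            continue;
        if (len < pos + 2)
            return false;
        uint16_t packetBitmap = buf[pos] | (uint16_t)(buf[pos + 1] << 8);
        pos += 2;
        for (int slot = 0; slot < 16; slot++) {
            if (packetBitmap & (1u << slot))
                handlePacket(range, slot);
        }
    }
    return true;
}
```
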
Latest rebase & stats update has introduced something that causes spontaneous reboots. Am currently investigating why, but if you are testing this, please bear in mind that this PR is currently unstable.

Am considering changing the approach to notably simplify the code, at the expense of slightly more overhead. A couple of people I've run this past have had a hard time understanding it, and for the sake of maintainability that makes me wonder if I've drawn the airtime optimisation vs complexity line in the wrong place. Will do a parallel PR that solves this same problem in a different way, and compare the result.

[Edit October 2nd: The parallel approach is better. Watch this space. Should have it substantially ready to go within the next couple of weeks.]

[October 16th progress update: Taking a bit longer than expected, but making steady progress. Hoping for a new PR by the end of next week. New caching approach already merged to master.]

[October 27th progress update: Following discussion with @NomDeTom, this thing has grown some extra legs - so the core PR is taking longer than anticipated. I have started pushing the supporting parts (cache, stats, unencrypted payloads) as separate PRs.]

When you say "the entire mesh", what exactly do you mean? My understanding is that a hypothetical, all-knowing, ideal routing algorithm should (for broadcast/channel messages to !ffffff):

For completeness, a hypothetical, all-knowing, ideal routing algorithm should (for direct messages to a specific node):

I mean pretty much as you summarised it. If a node B is reachable within the hop limit of node A, then a broadcast message from node A should be successfully delivered to node B, without undue overhead or inefficiency.

@erayd, is it OK if we re-target this to the develop branch? Master is going into a feature freeze while we get a beta out.

@fifieldt This PR is going to be closed without merge, because I've significantly changed the way this feature works - I'm intending to open a new PR for that incarnation of it. I would appreciate it if this PR could remain against master, because it keeps the diff clean without needing me to rebase it (which would require conflict resolution). It's essentially just a reference PR at this point, and I'm intending to close it once the new PR is ready.

Any role designed to receive messages should be able to respond to a replay advertisement message. This means that CLIENT_MUTE should be able to respond to one with a request, along with the other message-receiving roles. The only roles that should probably not do this are roles that are intentionally trying to maintain a no-broadcast policy, or roles that only broadcast but don't receive.

Roles able to respond:
- Client: Yes
- I don't know anything about TAK, but if it works the way I think it does - TAK: Yes

The other side is which roles should send out advertisements:
- Client: Yes
- As for the TAKs - TAK_TRACKER: No

This description sounds like a problem. The point of this feature is to help with paths that normally get suppressed. It should be all nodes involved in relaying messages that make these periodic advertisements, not just the one that did the actual broadcast. One of the primary uses is if there is a node that is cut off because its only connecting node gets its broadcast suppressed by another node every single time.

You could do it based on memory limitations and maximum bandwidth efficiency. Try to delay the advertisement as much as possible until you have a full message to send out - this will reduce the amount of header overhead being broadcast. There would be a time limit so it doesn't wait forever, perhaps 2 minutes, because some people will be having conversations that depend entirely on seeing that a message has been missed. It would also be sooner if memory is getting full.

I would use the existing chutil prioritization settings. The chutil level at which nodeinfo gets turned off would also turn off nodeinfo advertisements; follow this logic through for all types of data sent. If you want to completely turn off advertisements based on chutil at a role level, then I would turn off the roles with the best communication first. Turn off list. Earlier numbers turn off sooner.

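A minimal sketch of the delay-until-full idea suggested above (the constants and names here are illustrative assumptions, not part of this PR):

```cpp
#include <cstddef>
#include <cstdint>

static constexpr uint32_t ADVERT_MAX_DELAY_MS = 2 * 60 * 1000; // suggested 2-minute cap
static constexpr size_t ADVERT_FULL_COUNT = 25;                // assumed "full advertisement" size

// Decide whether a pending advertisement should be transmitted now.
static bool shouldSendAdvert(size_t pendingPackets, uint32_t pendingSinceMs, uint32_t nowMs, bool memoryLow)
{
    if (pendingPackets == 0)
        return false;
    if (pendingPackets >= ADVERT_FULL_COUNT)
        return true;                                         // advert is full - send immediately
    if (memoryLow)
        return true;                                         // cache pressure - flush early
    return (nowMs - pendingSinceMs) >= ADVERT_MAX_DELAY_MS;  // otherwise wait for the deadline
}
```
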
@Ryu945, I think you may have missed the "re" in "rebroadcast". It's not just the originating node, it's all nodes that received that packet and forwarded it on.

My point is that every node that heard the message should advertise the message it heard, not just the node that did the rebroadcast. This way, nodes can learn that a suppressed path is needed for message delivery. If only the rebroadcasting node did the advertisement, the chances of finding a node that didn't hear the message the first time are much lower. Most requests from advertisements are going to come from paths that never rebroadcast.

That simply doesn't scale. The next version of this PR does have some logic in it that allows wider advertising than just rebroadcasters, though. Bear in mind that there have been significant changes in the way that this feature works since this PR was created, so what you're commenting on here is no longer an accurate representation of how the feature works. Would suggest waiting until I push the updated PR before sinking too much time into analysing the logic.

This feature causes rebroadcasting nodes to periodically advertise a very lightweight, zero-hop summary of packets that they have recently rebroadcast. Other nodes can use this information to notice that they have missed packets, and request that the advertising node retransmit them again with single-packet granularity.
NOTE THAT THIS FEATURE IS NOT PRODUCTION-READY, and requires additional testing + integration with meshtastic's config system before being merged. It is a large chunk of new code, and while I have squished many of the included gremlins already, I have no doubt there are still some lurking in here waiting to give somebody a bad day.
For those thinking this looks like store & forward: it is. However, it is not intended as a general-purpose store & forward feature for clients, nor is it intended to replace the existing S&F feature. It is specifically intended for very lightweight, short-term caching in order to improve realtime transit reliability. It is not intended for "give me the last hour of messages later" type scenarios.
Purpose
The intention of this feature is to improve reliability between cooperating infrastructure sites, and largely eliminate packet loss due to temporary interference, collisions, obstructions etc. It should reduce the likelihood of a missed packet not being detected and subsequently rebroadcast to under 1%. This is implemented probabilistically in order to reduce the on-air payload.
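As a purely illustrative example of that target (the numbers here are assumed, not measured): if a node independently misses each of three successive advertisements covering the same packet with probability 0.2, the chance that it never learns the packet was missed is 0.2^3 = 0.008, i.e. under 1%.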
As much as is practical, I would like to see the current "delivered to mesh" feedback that users receive become synonymous with "delivered to the entire mesh" (or in the case of DMs, to the destination node). I am of the opinion that high site operators should endeavour to meet that standard as closely as possible.
Feedback Notes
This feature is currently enabled by default, pending implementation of the necessary config elements to adjust it remotely. In the meantime, it can be disabled entirely if desired by defining MESHTASTIC_EXCLUDE_REPLAY to a non-zero value when building. All tunables are currently statically defined in ReplayModule.h.
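For example, assuming a standard PlatformIO-based build, this would typically mean adding `-DMESHTASTIC_EXCLUDE_REPLAY=1` to the relevant environment's `build_flags` (the exact flag wiring is an assumption, not something documented in this PR).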
I am currently seeking testers for this feature, to help work the remaining bugs out before it is proposed for merge. I am also looking for feedback regarding which tunables should be exposed via meshtastic's config system, and what appropriate default values might be. To assist with testing, they are currently set quite aggressively.
For testing, you'll need a device that is rebroadcasting packets and a device that is listening to those packets, both running this firmware, and a decent level of traffic. You'll also need to build the firmware image for this - the protobuf changes mean no automatic build artifacts.
The tunables are currently set to fairly aggressive values to facilitate testing.
Particular points I am interested in feedback about:
In order to minimise overhead on-air, this feature uses a non-protobuf payload.
The header is as follows:

| Offset (bits) | Size (bits) | Description                                                             |
|---------------|-------------|-------------------------------------------------------------------------|
| 0             | 2           | Advert or request type                                                  |
| 2             | 1           | This message advertises or requests high-priority packets only          |
| 3             | 1           | (adverts only) This is the first advert since the sender booted         |
| 4             | 1           | The sender is using an infrastructure rebroadcast role                  |
| 5             | 1           | (adverts only) This is an aggregate of specific prior advertisements    |
| 6             | 1           | (adverts only) This advertisement contains a list of throttled clients  |
| 7             | 1           | Reserved for future use                                                 |
| 8             | 5           | The base sequence number to which this advertisement or request refers  |
| 13            | 1           | Reserved for future use                                                 |
| 14            | 1           | Reserved for future use                                                 |
| 15            | 1           | Reserved for future use                                                 |

Advertisements consist of the standard 2-byte header, followed by:
 - A 2-byte bitmap, indicating which of the 16 possible cache ranges are present
 - For each range indicated in the bitmap:
    - A 2-byte bitmap, indicating which of the packets in this range are referenced
    - A 2-byte priority bitmap, indicating which packets in this range are high priority
    - For each included packet, a 2-byte hash: ((from ^ id) >> 16 & 0xFFFF) ^ ((from ^ id) & 0xFFFF)
 - If the 'aggregate' flag is set, a 2-byte bitmap indicating which other adverts are included in this one
 - If the 'throttled' flag is set, a list of truncated-to-one-byte node IDs that have been throttled

A typical advertisement will have a payload size of 8 bytes plus 2 bytes per included packet.

If packets are requested, but no longer cached, then the sender will send a state advertisement indicating which of its advertised packets are no longer available. This consists of the standard 2-byte header, followed by:
 - A 2-byte bitmap, indicating which of the 16 possible cache ranges are present
 - For each range indicated in the bitmap:
    - A 2-byte bitmap, indicating which of the packets in this range are no longer available

Requests consist of the standard 2-byte header, followed by:
 - A 2-byte bitmap, indicating which of the 16 possible cache ranges are being requested
 - For each range indicated in the bitmap:
    - A 2-byte bitmap, indicating which of the packets in this range are being requested
The following protobuf changes are required in order to use this feature: