@erayd
Contributor

@erayd erayd commented Sep 20, 2025

This feature causes rebroadcasting nodes to periodically advertise a very lightweight, zero-hop summary of packets that they have recently rebroadcast. Other nodes can use this information to notice that they have missed packets, and request that the advertising node retransmit them, with single-packet granularity.

NOTE THAT THIS FEATURE IS NOT PRODUCTION-READY, and requires additional testing + integration with meshtastic's config system before being merged. It is a large chunk of new code, and while I have squished many of the included gremlins already, I have no doubt there are still some lurking in here waiting to give somebody a bad day.

For those thinking this looks like store & forward: it is. However, it is not intended as a general-purpose store & forward feature for clients, nor is it intended to replace the existing S&F feature. It is specifically intended for very lightweight, short-term caching in order to improve realtime transit reliability. It is not intended for "give me the last hour of messages later" type scenarios.

Purpose

The intention of this feature is to improve reliability between cooperating infrastructure sites, and largely eliminate packet loss due to temporary interference, collisions, obstructions etc. It should reduce the likelihood of a missed packet not being detected and subsequently rebroadcast to under 1%. This is implemented probabilistically in order to reduce the on-air payload.
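
The PR does not say how the "under 1%" figure is derived. As a rough back-of-envelope illustration only, the sketch below assumes that a missed packet goes unnoticed when its 2-byte hash happens to equal one of the hashes the receiver already holds, and that hashes are uniform and independent. Both assumptions and the function name are hypothetical, not taken from the firmware.

```c
#include <stdint.h>

/*
 * Probability that the 16-bit hash of a missed packet matches at least one of
 * `known` hashes the receiver already holds.
 * ASSUMPTION: hashes are uniform and independent (not stated in the PR).
 * Computed as 1 - (1 - 2^-16)^known with plain multiplication, so no libm.
 */
static double hash_collision_probability(unsigned known)
{
    const double miss_one = 1.0 - 1.0 / 65536.0; /* one stored hash differs */
    double miss_all = 1.0;
    for (unsigned i = 0; i < known; ++i)
        miss_all *= miss_one;
    return 1.0 - miss_all;
}
```

Under those assumptions, a receiver holding 256 hashes (16 ranges of 16 packets) would see a collision roughly 0.4% of the time, which is at least consistent with the stated target.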

As much as is practical, I would like to see the current "delivered to mesh" feedback that users receive become synonymous with "delivered to the entire mesh" (or in the case of DMs, to the destination node). I am of the opinion that high-site operators should endeavour to meet that standard as closely as possible.

Feedback Notes

This feature is currently enabled by default, pending implementation of the necessary config elements to adjust it remotely. In the meantime, it can be disabled entirely if desired by defining MESHTASTIC_EXCLUDE_REPLAY to a non-zero value when building. All tunables are currently statically defined in ReplayModule.h.

I am currently seeking testers for this feature, to help work the remaining bugs out before it is proposed for merge. I am also looking for feedback regarding which tunables should be exposed via meshtastic's config system, and what appropriate default values might be. To assist with testing, they are currently set quite aggressively.

For testing, you'll need a device that is rebroadcasting packets and a device that is listening to those packets, both running this firmware, and a decent level of traffic. You'll also need to build the firmware image for this - the protobuf changes mean no automatic build artifacts.

The tunables are currently set to fairly aggressive values to facilitate testing.

Particular points I am interested in feedback about:

  1. What is a sensible spacing between replayed packets, and how should this be scaled based on radio modulation (e.g. LONG_FAST vs SHORT_FAST) and current channel utilisation?
  2. What is an appropriate upper chutil threshold beyond which:
    1. ROUTER and ROUTER_LATE requests should not be fulfilled?
    2. Other role requests should not be fulfilled?
    3. Advertisements should not be sent at all, even for high-priority packets?
  3. Should non-infrastructure roles be allowed to request packet replay? I am of the opinion that this should be allowed, but should gracefully degrade with increasing chutil.
  4. What information should a future stats packet contain, and how frequently should it be sent? The point of the stats is to be able to remotely detect issues at a given rebroadcast site, or with particular requesters.
  5. The way memory management is handled, and where the balance between maximising the number of cached packets and being proactively nice should sit. Currently the cache is entirely on the heap, but this precludes sharing the TX & toPhone queues. Making it static would increase the maximum capacity, but lose the ability to be polite about heap space.

In order to minimise overhead on-air, this feature uses a non-protobuf payload.

The header is as follows:
[Screenshot: header bit layout, captured 2025-09-20 17:17:39]

Advertisements consist of the standard 2-byte header, followed by:

  • A 2-byte bitmap, indicating which of the 16 possible cache ranges are present
  • For each range indicated in the bitmap:
    • A 2-byte bitmap, indicating which of the packets in this range are referenced
    • A 2-byte priority bitmap, indicating which packets in this range are high priority
    • For each included packet, a 2-byte hash: ((from ^ id) >> 16 & 0xFFFF) ^ ((from ^ id) & 0xFFFF)
  • If the 'aggregate' flag is set, a 2-byte bitmap indicating which other adverts are included in this one
  • If the 'throttled' flag is set, a list of truncated-to-one-byte node IDs that have been throttled

A typical advertisement will have a payload size of 8 bytes plus 2 bytes per included packet.
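
To make the hash and the size estimate concrete, here is a minimal sketch of both. The hash follows the expression quoted above; the helper names (`replay_hash`, `popcount16`, `advert_size`) are hypothetical, and the size calculation assumes a plain advert without the 'aggregate' or 'throttled' extras.

```c
#include <stddef.h>
#include <stdint.h>

/* 2-byte packet hash from the PR description: fold (from ^ id) in half. */
static uint16_t replay_hash(uint32_t from, uint32_t id)
{
    uint32_t x = from ^ id;
    return (uint16_t)(((x >> 16) & 0xFFFF) ^ (x & 0xFFFF));
}

/* Count the set bits in a 16-bit bitmap. */
static unsigned popcount16(uint16_t v)
{
    unsigned n = 0;
    while (v) {
        v &= (uint16_t)(v - 1); /* clear the lowest set bit */
        ++n;
    }
    return n;
}

/*
 * Payload size of a plain advert (ASSUMPTION: no aggregate/throttled data):
 * 2-byte header + 2-byte range bitmap, then for each range a packet bitmap,
 * a priority bitmap, and one 2-byte hash per referenced packet.
 * packet_bitmaps[i] is the packet bitmap of the i-th range that is present.
 */
static size_t advert_size(const uint16_t *packet_bitmaps, size_t ranges)
{
    size_t size = 2 + 2;
    for (size_t i = 0; i < ranges; ++i)
        size += 2 + 2 + 2 * (size_t)popcount16(packet_bitmaps[i]);
    return size;
}
```

For a single range referencing three packets this gives 14 bytes, matching the "8 bytes plus 2 bytes per included packet" estimate.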

If packets are requested, but no longer cached, then the sender will send a state advertisement indicating which of its advertised packets are no longer available.

This consists of the standard 2-byte header, followed by:

  • A 2-byte bitmap, indicating which of the 16 possible cache ranges are present
  • For each range indicated in the bitmap:
    • A 2-byte bitmap, indicating which of the packets in this range are no longer available

Requests consist of the standard 2-byte header, followed by:

  • A 2-byte bitmap, indicating which of the 16 possible cache ranges are being requested
  • For each range indicated in the bitmap:
    • A 2-byte bitmap, indicating which of the packets in this range are being requested
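
The request layout above can be sketched as a small encoder. This is only an illustration: the byte order (little-endian), the mapping of bit r to cache range r, and the names `put_u16` and `build_request` are assumptions, and the 2-byte header is treated as an opaque value.

```c
#include <stddef.h>
#include <stdint.h>

/* Append a 16-bit value. ASSUMPTION: little-endian byte order on air. */
static size_t put_u16(uint8_t *out, size_t pos, uint16_t v)
{
    out[pos] = (uint8_t)(v & 0xFF);
    out[pos + 1] = (uint8_t)(v >> 8);
    return pos + 2;
}

/*
 * Build a request: header, a 16-bit range bitmap, then one 16-bit packet
 * bitmap for every range in which at least one packet is wanted.
 * wanted[r] is the wanted-packet bitmap for cache range r (0..15).
 * ASSUMPTION: bit r of the range bitmap corresponds to range r.
 * `out` needs room for up to 2 + 2 + 16 * 2 = 36 bytes.
 * Returns the number of bytes written.
 */
static size_t build_request(uint16_t header, const uint16_t wanted[16], uint8_t *out)
{
    uint16_t ranges = 0;
    for (int r = 0; r < 16; ++r)
        if (wanted[r])
            ranges |= (uint16_t)(1u << r);

    size_t pos = put_u16(out, 0, header); /* opaque 2-byte header */
    pos = put_u16(out, pos, ranges);
    for (int r = 0; r < 16; ++r)
        if (wanted[r])
            pos = put_u16(out, pos, wanted[r]);
    return pos;
}
```

Requesting packets from two ranges therefore costs 8 bytes in total, and a single missed packet costs 6.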

The following protobuf changes are required in order to use this feature:

diff --git a/meshtastic/mesh.options b/meshtastic/mesh.options
index 37c9341..0f382cf 100644
--- a/meshtastic/mesh.options
+++ b/meshtastic/mesh.options
@@ -89,4 +89,6 @@
 
 *ChunkedPayload.chunk_count int_size:16
 *ChunkedPayload.chunk_index int_size:16
-*ChunkedPayload.payload_chunk max_size:228
\ No newline at end of file
+*ChunkedPayload.payload_chunk max_size:228
+
+*ReplayStats.servers max_count:8
diff --git a/meshtastic/mesh.proto b/meshtastic/mesh.proto
index 38000b4..d5d20f3 100644
--- a/meshtastic/mesh.proto
+++ b/meshtastic/mesh.proto
@@ -1238,6 +1238,12 @@ message MeshPacket {
      */
     DEFAULT = 64;
 
+    /*
+     * Replay messages should be a higher priority than default, but lower than other
+     * pre-defined high priority levels to ensure graceful degradation of service under load.
+     */
+    REPLAY = 68;
+
     /*
      * If priority is unset but the message is marked as want_ack,
      * assume it is important and use a slightly higher priority
@@ -1489,6 +1495,11 @@ message MeshPacket {
    * Indicates which transport mechanism this packet arrived over
    */
   TransportMechanism transport_mechanism = 21;
+
+  /*
+   * If true, this packet is managed by the replay module, and therefore should not be released elsewhere
+   */
+  bool is_replay_cached = 22;
 }
 
 /*
@@ -2397,3 +2408,148 @@ message ChunkedPayloadResponse {
     resend_chunks resend_chunks = 4;
   }
 }
+
+/**
+ * Stats about a particular replay server
+ */
+message ReplayServerStats {
+  /**
+   * The server nodenum
+   */
+  uint32 id = 1;
+
+  /**
+   * Total number of adverts received from this server
+   */
+  uint32 adverts_received = 2;
+
+  /**
+   * Total number of requests sent to this server
+   */
+  uint32 requests_sent = 3;
+
+  /**
+   * Total number of packets that we missed from this server
+   */
+  uint32 packets_missed = 4;
+
+  /**
+   * Age of last advert
+   */
+  uint32 last_advert_secs = 5;
+
+  /**
+   * Whether the server is a router
+   */
+  bool is_router = 6;
+
+  /**
+   * Whether the server is currently advertising priority pressure
+   */
+  bool priority = 7;
+}
+
+/**
+ * Stats about the replay cache
+ */
+message ReplayStats {
+  /**
+   * Number of seconds covered by these stats
+   */
+  uint32 window_length_secs = 1;
+
+  /**
+   * Number of packets currently tracked
+   */
+  uint32 current_size = 2;
+
+  /**
+   * Number of packets currently in the cache
+   */
+  uint32 current_cached = 3;
+
+  /**
+   * Number of adverts sent
+   */
+  uint32 adverts_sent = 4;
+
+  /**
+   * Number of adverts received
+   */
+  uint32 adverts_received = 5;
+
+  /**
+   * Number of expiry adverts sent
+   */
+  uint32 expired_sent = 6;
+
+  /**
+   * Number of expiry adverts received
+   */
+  uint32 expired_received = 7;
+
+  /**
+   * Number of requests sent
+   */
+  uint32 requests_sent = 8;
+
+  /**
+   * Number of requests sent (packets)
+   */
+  uint32 requests_sent_packets = 9;
+
+  /**
+   * Number of requests sent (high-priority packets)
+   */
+  uint32 requests_sent_packets_prio = 10;
+
+  /**
+   * Number of requests received
+   */
+  uint32 requests_received = 11;
+
+  /**
+   * Number of replayed packets
+   */
+  uint32 packets_replayed = 12;
+
+  /**
+   * Number of replayed packets (high-priority)
+   */
+  uint32 packets_replayed_prio = 13;
+
+  /**
+   * Number of unique advertisers seen
+   */
+  uint32 unique_advertisers = 14;
+
+  /**
+   * Number of unique requestors seen
+   */
+  uint32 unique_requestors = 15;
+
+  /**
+   * Number of throttled requestors
+   */
+  uint32 throttled_requestors = 16;
+
+  /**
+   * Number of packets rebroadcast
+   */
+  uint32 packets_rebroadcast = 17;
+
+  /**
+   * Number of packets rebroadcast (high-priority)
+   */
+  uint32 packets_rebroadcast_prio = 18;
+
+  /**
+   * Number of packets we missed
+   */
+  uint32 packets_missed = 19;
+
+  /**
+   * Server info
+   */
+  repeated ReplayServerStats servers = 20;
+}
diff --git a/meshtastic/portnums.proto b/meshtastic/portnums.proto
index e388a6f..5b2c0a0 100644
--- a/meshtastic/portnums.proto
+++ b/meshtastic/portnums.proto
@@ -134,6 +134,11 @@ enum PortNum {
    */
   PAXCOUNTER_APP = 34;
 
+  /*
+   * Used for the on-demand packet replay feature
+   */
+  REPLAY_APP = 35;
+
   /*
    * Provides a hardware serial interface to send and receive from the Meshtastic network.
    * Connect to the RX/TX pins of a device with 38400 8N1. Packets received from the Meshtastic

@wlockwood

I haven't tested it yet but the concept looks great. I'm not sure how this interacts with roles right now, but it should probably be disabled for CLIENT_MUTE, CLIENT_HIDDEN, and... TRACKER, maybe?

@erayd
Contributor Author

erayd commented Sep 20, 2025

> I'm not sure how this interacts with roles right now, but it should probably be disabled for CLIENT_MUTE, CLIENT_HIDDEN, and... TRACKER, maybe?

It is supposed to degrade gracefully, with routers prioritised.

Are you able to expand a bit on the rationale for disabling it entirely on the roles you listed?

@h3lix1
Contributor

h3lix1 commented Sep 20, 2025

This definitely looks like it works, and doesn't require a lot of processing power.

Not to cause you heartburn and hair loss, but have you considered minisketch? https://bitcoinops.org/en/topics/minisketch/

Obviously your header has more information than what packets are missing, but minisketch can re-create the full frame of missing IDs. Maybe add the trickle algorithm... Anyway, just ideas.

Now if the devices have enough CPU to do this? Maybe.

@erayd
Contributor Author

erayd commented Sep 20, 2025

> Obviously your header has more information than what packets are missing, but minisketch can re-create the full frame of missing IDs.

I am explicitly not wanting to send the full packet identifier (the (from,id) tuple), in order to minimise airtime. The full tuple is 8 bytes, vs just 2 bytes for the hash.

> maybe add the trickle algorithm

I'm unsure what that would achieve. Trickle seems completely inapplicable to what I'm trying to do here. Can you explain why specifically you think it's relevant, and what problem it would solve?

@wlockwood

> > I'm not sure how this interacts with roles right now, but it should probably be disabled for CLIENT_MUTE, CLIENT_HIDDEN, and... TRACKER, maybe?
>
> It is supposed to degrade gracefully, with routers prioritised.
>
> Are you able to expand a bit on the rationale for disabling it entirely on the roles you listed?

None of those roles are intended to handle other people's packets at all, so it seems reasonable that you wouldn't want them to suddenly start forwarding packets - especially CLIENT_HIDDEN!

@erayd
Contributor Author

erayd commented Sep 21, 2025

> None of those roles are intended to handle other people's packets at all, so it seems reasonable that you wouldn't want them to suddenly start forwarding packets - especially CLIENT_HIDDEN!

  1. That is actually only true of CLIENT_MUTE. All other roles may forward traffic if configured to do so, including TRACKER and CLIENT_HIDDEN.
  2. Nodes do not send advertisements at all unless they are actively rebroadcasting traffic.
  3. Even if not rebroadcasting traffic, all roles (including CLIENT_MUTE) can benefit from the ability to notice missing packets and request that the advertiser send them again.

@h3lix1
Contributor

h3lix1 commented Sep 21, 2025

Think of minisketch as a form of forward error correction. Given two lists of IDs, one of which is missing a few entries, you can compare lists of 1000 items and reconstruct the one (or two, or three, or n) missing items. The size of the "sketch" that is sent depends on how many items you want to be able to recover (e.g. to recover 5 32-bit IDs out of 100 items, it only needs 160 bits).

It might be too complicated for the meshtastic use-case as it does require CPU (so maybe an esp32-only solution), but in situations where there are hundreds (or thousands) of packet IDs to compare, it takes the same message space as if there are 10 items to compare.

Check it out from https://github.com/bitcoin-core/minisketch and install the library.

Compile the following...

// Build (after installing libminisketch):
//   cc -std=c99 -O2 -o msdemo msdemo.c -lminisketch -lstdc++
//
// MIT License (c) 2025 Clive Blackledge
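// strdup() and ssize_t are POSIX, so request them explicitly under -std=c99.
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <sys/types.h>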
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <time.h>
#include <inttypes.h>
#include <errno.h>
#include <stdbool.h>

#include <minisketch.h>

#define DEFAULT_BITS 32
#define DEFAULT_IMPL 0

// --- util: hex encode/decode -------------------------------------------------
static void bytes_to_hex(const unsigned char *in, size_t len, char *out) {
    static const char hexdig[] = "0123456789abcdef";
    for (size_t i = 0; i < len; ++i) {
        out[2*i]   = hexdig[(in[i] >> 4) & 0xF];
        out[2*i+1] = hexdig[in[i] & 0xF];
    }
    out[2*len] = '\0';
}

static int hex_to_bytes(const char *hex, unsigned char *out, size_t outlen) {
    size_t n = strlen(hex);
    if (n != outlen * 2) return -1;
    for (size_t i = 0; i < outlen; ++i) {
        char c1 = hex[2*i], c2 = hex[2*i+1];
        int v1 = (c1 >= '0' && c1 <= '9') ? c1 - '0' :
                 (c1 >= 'a' && c1 <= 'f') ? c1 - 'a' + 10 :
                 (c1 >= 'A' && c1 <= 'F') ? c1 - 'A' + 10 : -1;
        int v2 = (c2 >= '0' && c2 <= '9') ? c2 - '0' :
                 (c2 >= 'a' && c2 <= 'f') ? c2 - 'a' + 10 :
                 (c2 >= 'A' && c2 <= 'F') ? c2 - 'A' + 10 : -1;
        if (v1 < 0 || v2 < 0) return -1;
        out[i] = (unsigned char)((v1 << 4) | v2);
    }
    return 0;
}

// --- util: random 32-bit, nonzero -------------------------------------------
static uint32_t rand32(void) {
    // portable-ish 32-bit RNG from rand(); good enough for demo
    uint32_t a = (uint32_t)rand();
    uint32_t b = (uint32_t)rand();
    return (a << 16) ^ (b & 0xFFFFU);
}

static int cmp_u32(const void *a, const void *b) {
    uint32_t ua = *(const uint32_t*)a, ub = *(const uint32_t*)b;
    return (ua < ub) ? -1 : (ua > ub);
}

// --- parse a single 0x######## or decimal line -------------------------------
static bool parse_u32(const char *s, uint32_t *out) {
    while (*s == ' ' || *s == '\t') ++s;
    if (*s == '\0' || *s == '#') return false; // blank or comment
    errno = 0;
    char *end = NULL;
    uint64_t v = 0;
    if (s[0] == '0' && (s[1] == 'x' || s[1] == 'X')) {
        v = strtoull(s + 2, &end, 16);
    } else {
        v = strtoull(s, &end, 10);
    }
    if (errno != 0 || end == s || v > 0xFFFFFFFFULL) return false;
    if (v == 0) return false; // minisketch doesn't allow 0 elements
    *out = (uint32_t)v;
    return true;
}

// --- help --------------------------------------------------------------------
static void usage(const char *prog) {
    fprintf(stderr,
        "Usage:\n"
        "  %s -n N -c CAP [-b BITS=32] [-i IMPL=0] [-s SEED]\n"
        "    Generate N random 32-bit items, build sketch with capacity CAP,\n"
        "    print list and sketch hex.\n"
        "\n"
        "  %s -r -c CAP [--sketch HEX] [-b BITS=32] [-i IMPL=0]\n"
        "    Recover: read our (possibly missing) list from stdin (one item/line),\n"
        "    ask for or accept the sketch hex, and print recovered items.\n"
        "\n", prog, prog);
}

// --- main --------------------------------------------------------------------
int main(int argc, char **argv) {
    int n = -1;                // number of items to generate (in gen mode)
    int cap = -1;              // sketch capacity (required)
    int bits = DEFAULT_BITS;   // element bit width (default 32)
    int impl = DEFAULT_IMPL;   // implementation (0=portable default)
    int seed = 0;              // RNG seed (optional)
    bool have_seed = false;
    bool recover_mode = false;
    const char *sketch_hex_arg = NULL;

    // very simple arg parsing
    for (int i = 1; i < argc; ++i) {
        if (!strcmp(argv[i], "-n") && i+1 < argc)       { n = atoi(argv[++i]); }
        else if (!strcmp(argv[i], "-c") && i+1 < argc)  { cap = atoi(argv[++i]); }
        else if (!strcmp(argv[i], "-b") && i+1 < argc)  { bits = atoi(argv[++i]); }
        else if (!strcmp(argv[i], "-i") && i+1 < argc)  { impl = atoi(argv[++i]); }
        else if (!strcmp(argv[i], "-s") && i+1 < argc)  { seed = atoi(argv[++i]); have_seed = true; }
        else if (!strcmp(argv[i], "-r"))                { recover_mode = true; }
        else if (!strcmp(argv[i], "--sketch") && i+1 < argc) { sketch_hex_arg = argv[++i]; }
        else { usage(argv[0]); return 1; }
    }

    if (cap <= 0) { usage(argv[0]); return 1; }
    if (bits <= 0 || bits > 64) { fprintf(stderr, "BITS must be 1..64\n"); return 1; }

    if (!recover_mode) {
        if (n <= 0) { usage(argv[0]); return 1; }

        if (!have_seed) seed = (int)time(NULL);
        srand((unsigned)seed);

        // Generate N unique, nonzero 32-bit values.
        uint32_t *vals = (uint32_t*)malloc((size_t)n * sizeof(uint32_t));
        if (!vals) { fprintf(stderr, "oom\n"); return 1; }
        int count = 0;
        while (count < n) {
            uint32_t v = 0;
            do { v = rand32(); } while (v == 0);
            // ensure uniqueness (O(n^2) is fine for small demo)
            bool dup = false;
            for (int j = 0; j < count; ++j) if (vals[j] == v) { dup = true; break; }
            if (!dup) vals[count++] = v;
        }
        qsort(vals, (size_t)n, sizeof(uint32_t), cmp_u32);

        // Build sketch over our set.
        minisketch *sk = minisketch_create((uint32_t)bits, (uint32_t)impl, (size_t)cap);
        if (!sk) { fprintf(stderr, "minisketch_create failed\n"); free(vals); return 1; }
        for (int i = 0; i < n; ++i) minisketch_add_uint64(sk, (uint64_t)vals[i]);

        // Serialize sketch to hex.
        size_t sersz = minisketch_serialized_size(sk);
        unsigned char *ser = (unsigned char*)malloc(sersz);
        if (!ser) { fprintf(stderr, "oom\n"); minisketch_destroy(sk); free(vals); return 1; }
        minisketch_serialize(sk, ser);
        char *hex = (char*)malloc(2*sersz + 1);
        if (!hex) { fprintf(stderr, "oom\n"); free(ser); minisketch_destroy(sk); free(vals); return 1; }
        bytes_to_hex(ser, sersz, hex);

        // Print results.
        printf("# minisketch demo (generate)\n");
        printf("bits=%d capacity=%d impl=%d seed=%d serialized_bytes=%zu\n",
               bits, cap, impl, seed, sersz);
        printf("sketch_hex=%s\n", hex);
        printf("items(%d):\n", n);
        for (int i = 0; i < n; ++i) printf("0x%08" PRIx32 "\n", vals[i]);

        // Also print a copy-pasteable recover command for convenience
        printf("\n# Recover example (paste your missing list to stdin):\n");
        printf("#   ./msdemo -r -c %d -b %d -i %d --sketch %s < missing.txt\n",
               cap, bits, impl, hex);

        free(hex); free(ser); minisketch_destroy(sk); free(vals);
        return 0;
    }

    // --- recover mode ---
    // Read (possibly missing) list from stdin
    size_t cap_items = 64, num_items = 0;
    uint32_t *mine = (uint32_t*)malloc(cap_items * sizeof(uint32_t));
    if (!mine) { fprintf(stderr, "oom\n"); return 1; }

    // If sketch not on CLI, ask once (reading from stdin).
    char *sketch_hex = NULL;
    if (!sketch_hex_arg) {
        fprintf(stderr, "Paste sketch hex, then press Enter. After that, paste your list:\n");
        char line[65536];
        if (!fgets(line, sizeof(line), stdin)) { fprintf(stderr, "no input\n"); free(mine); return 1; }
        size_t len = strlen(line);
        while (len && (line[len-1] == '\r' || line[len-1] == '\n')) line[--len] = '\0';
        sketch_hex = strdup(line);
    } else {
        sketch_hex = strdup(sketch_hex_arg);
    }
    if (!sketch_hex) { fprintf(stderr, "oom\n"); free(mine); return 1; }

    // Now read the items
    char buf[65536];
    while (fgets(buf, sizeof(buf), stdin)) {
        uint32_t v;
        if (!parse_u32(buf, &v)) continue;
        if (num_items == cap_items) {
            cap_items *= 2;
            uint32_t *tmp = (uint32_t*)realloc(mine, cap_items * sizeof(uint32_t));
            if (!tmp) { fprintf(stderr, "oom\n"); free(mine); free(sketch_hex); return 1; }
            mine = tmp;
        }
        // avoid duplicates in our set
        bool dup = false;
        for (size_t j = 0; j < num_items; ++j) if (mine[j] == v) { dup = true; break; }
        if (!dup) mine[num_items++] = v;
    }

    // Build our sketch and merge with the remote one
    minisketch *sk_local = minisketch_create((uint32_t)bits, (uint32_t)impl, (size_t)cap);
    if (!sk_local) { fprintf(stderr, "minisketch_create failed\n"); free(mine); free(sketch_hex); return 1; }
    for (size_t i = 0; i < num_items; ++i) minisketch_add_uint64(sk_local, (uint64_t)mine[i]);

    minisketch *sk_remote = minisketch_create((uint32_t)bits, (uint32_t)impl, (size_t)cap);
    if (!sk_remote) { fprintf(stderr, "minisketch_create failed\n"); free(mine); free(sketch_hex); minisketch_destroy(sk_local); return 1; }
    size_t sersz = minisketch_serialized_size(sk_remote);
    unsigned char *ser = (unsigned char*)malloc(sersz);
    if (!ser) { fprintf(stderr, "oom\n"); free(mine); free(sketch_hex); minisketch_destroy(sk_local); minisketch_destroy(sk_remote); return 1; }
    if (hex_to_bytes(sketch_hex, ser, sersz) != 0) {
        fprintf(stderr, "Invalid sketch hex or wrong -b/-c (expected %zu bytes -> %zu hex chars)\n", sersz, 2*sersz);
        free(mine); free(sketch_hex); minisketch_destroy(sk_local); minisketch_destroy(sk_remote); free(ser);
        return 1;
    }
    free(sketch_hex);

    minisketch_deserialize(sk_remote, ser);
    free(ser);

    // XOR: sketch_local := sketch_local (+) sketch_remote
    minisketch_merge(sk_local, sk_remote);
    minisketch_destroy(sk_remote);

    // Decode up to 'cap' differences
    uint64_t *diffs = (uint64_t*)calloc((size_t)cap, sizeof(uint64_t));
    if (!diffs) { fprintf(stderr, "oom\n"); free(mine); minisketch_destroy(sk_local); return 1; }
    ssize_t ndiff = minisketch_decode(sk_local, (size_t)cap, diffs);
    minisketch_destroy(sk_local);

    if (ndiff < 0) {
        fprintf(stderr, "Decode failed: actual difference > capacity (-c %d). Re-run with larger -c.\n", cap);
        free(diffs); free(mine);
        return 2;
    }

    // Classify: items not in our set are "missing_from_us"
    // (If there are extras on our side, they'll also appear in diffs.)
    printf("# minisketch demo (recover)\n");
    printf("bits=%d capacity=%d impl=%d decoded_differences=%zd\n", bits, cap, impl, ndiff);

    // Build a quick membership check
    qsort(mine, num_items, sizeof(uint32_t), cmp_u32);

    printf("differences:\n");
    for (ssize_t i = 0; i < ndiff; ++i) {
        uint32_t v = (uint32_t)diffs[i];
        printf("  0x%08" PRIx32 "\n", v);
    }

    printf("missing_from_us:\n");
    for (ssize_t i = 0; i < ndiff; ++i) {
        uint32_t v = (uint32_t)diffs[i];
        // binary search in 'mine'
        size_t lo = 0, hi = num_items;
        bool found = false;
        while (lo < hi) {
            size_t mid = (lo + hi) / 2;
            if (mine[mid] == v) { found = true; break; }
            if (mine[mid] < v) lo = mid + 1; else hi = mid;
        }
        if (!found) printf("  0x%08" PRIx32 "\n", v);
    }

    free(diffs); free(mine);
    return 0;
}

The demo basically creates a list of 32-bit hex values. Delete a few, and with the use of the "sketch" it can re-create the values that are missing from the list.

So in this case: create a list of 1000 values with -c 3, allowing any three items in the list to disappear or change.

(base) cbb@cbbs-Mac-Studio minisketch % ./msdemo -n 1000 -c 3 > A.txt
(base) cbb@cbbs-Mac-Studio minisketch % head -10 A.txt
# minisketch demo (generate)
bits=32 capacity=3 impl=0 seed=1758421698 serialized_bytes=12
sketch_hex=ddcd710d43aa58e76db6f004
items(1000):
0x001d8995
0x009bc928
0x00a412d8
0x00d20f65
0x014d95fa
0x029c708d
... many more items here...

Now delete three from the list (items 2-4).

(base) cbb@cbbs-Mac-Studio minisketch % head -10 A.txt
# minisketch demo (generate)
bits=32 capacity=3 impl=0 seed=1758421698 serialized_bytes=12
sketch_hex=ddcd710d43aa58e76db6f004
items(1000):
0x001d8995
0x014d95fa
0x029c708d
0x02dfa0e0
0x034a2e80
0x03854ac5

Now using the remaining list + sketch, I'm able to recreate the missing items. It found the missing 3 items in the list of 1000 with the sketch.

(base) cbb@cbbs-Mac-Studio minisketch %  ./msdemo -r -c 3 -b 32 -i 0 --sketch ddcd710d43aa58e76db6f004 < A.txt
# minisketch demo (recover)
bits=32 capacity=3 impl=0 decoded_differences=3
differences:
  0x00d20f65
  0x009bc928
  0x00a412d8
missing_from_us:
  0x00d20f65
  0x009bc928
  0x00a412d8

The order of the list doesn't matter. It can also tell whether there are any items on our side that differ from the remote side. For example, I added 0x009bc928 back but changed it to 0x009bc929, and it saw the difference.

(base) cbb@cbbs-Mac-Studio minisketch %  ./msdemo -r -c 3 -b 32 -i 0 --sketch ddcd710d43aa58e76db6f004 < A.txt
# minisketch demo (recover)
bits=32 capacity=3 impl=0 decoded_differences=3
differences:
  0x00d20f65
  0x00a412d8
  0x009bc929
missing_from_us:
  0x00d20f65
  0x00a412d8

With a 200-byte sketch, it should be possible to recover 50 32-bit IDs out of a list. You could still add your bitfield to this: give it 43 bytes, recover 37 IDs, and have a few bits left over. Given that IDs normally use (sender, id) as the key, that would be about 25 (sender, id) pairs.
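
The sizing arithmetic in this comment can be checked with a couple of lines. This sketch assumes a serialized sketch occupies bits × capacity bits rounded up to whole bytes, which agrees with the `serialized_bytes=12` the demo printed for `-b 32 -c 3`; the function names are hypothetical and do not come from libminisketch.

```c
#include <stddef.h>

/* Serialized sketch size in bytes: ceil(bits * capacity / 8). */
static size_t sketch_bytes(unsigned bits, size_t capacity)
{
    return ((size_t)bits * capacity + 7) / 8;
}

/* Largest capacity (number of recoverable differences) within a byte budget. */
static size_t capacity_for_budget(unsigned bits, size_t budget_bytes)
{
    return (budget_bytes * 8) / bits;
}
```

With a 200-byte budget this yields 50 recoverable 32-bit IDs, or 25 recoverable 64-bit (sender, id) keys.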

The trickle algorithm is simply a way to determine how often to send out the 0-hop packet.

Now if there are 25 missing messages, getting those across will be its own challenge.

@erayd
Contributor Author

erayd commented Sep 21, 2025

What happens if half the list is missing? My impression is that this would render the algorithm entirely useless, yes?

Does the receiving node need to successfully receive data about e.g. 100 packets in order to reconstruct the missing ten? If so, then this would cause an unacceptable delay, vs being able to immediately request even if the history is unknown.

Recovering 37 identifiers plus bitfields from 200 bytes is also considerably less efficient than what I'm already doing. Remember also that meshtastic's packet identifiers are 64 bits long, not 32. The 'from' field is part of the identifier.

From what I've seen here, the sketch / trickle stuff looks like a neat system to solve a different problem, but still doesn't IMO seem terribly applicable to this one.

@h3lix1
Contributor

h3lix1 commented Sep 21, 2025

It doesn't have to wait for 100, it could be variable. It can also be variable how many can be recovered.

I did mention (sender, id) as part of the key, which will be 64 bits. I guess my terminology was off in calling it "sender" instead of "from".

It loses a lot of efficiency if more than half are missing, but at least it will know exactly which ones are missing.

No worries if you think it doesn't fit here - I was just providing a potential option with new shiny toy I found the other day.

@erayd
Contributor Author

erayd commented Sep 21, 2025

Yeah, I don't think it's really a good fit for this particular problem. Thanks though 🙂

This feature causes rebroadcasting nodes to periodically advertise a zero-hop
summary of packets that they have recently rebroadcast. Other nodes can use
this information to notice that they have missed packets, and request that the
advertising node retransmit them.

This feature is currently enabled by default, pending implementation of the
necessary config elements to adjust it remotely. In the meantime, it can be
disabled entirely by defining MESHTASTIC_EXCLUDE_REPLAY to a non-zero value
when building. All tunables are currently statically defined in ReplayModule.h.

In order to minimise overhead on-air, this feature uses a non-protobuf payload:

| Bit offset | Size (bits) | Description                                                            |
|------------|-------------|------------------------------------------------------------------------|
| 0          | 2           | Advert or request type                                                 |
| 2          | 1           | This message advertises or requests high-priority packets only         |
| 3          | 1           | (adverts only) This is the first advert since the sender booted        |
| 4          | 1           | The sender is using an infrastructure rebroadcast role                 |
| 5          | 1           | (adverts only) This is an aggregate of specific prior advertisements   |
| 6          | 1           | (adverts only) This advertisement contains a list of throttled clients |
| 7          | 1           | Reserved for future use                                                |
| 8          | 5           | The base sequence number to which this advertisement or request refers |
| 13         | 1           | Reserved for future use                                                |
| 14         | 1           | Reserved for future use                                                |
| 15         | 1           | Reserved for future use                                                |
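
The field sizes in the table sum to 16, so they can be read as bit positions within the 2-byte header. A minimal sketch of packing and unpacking under that reading follows. It assumes offset 0 is the least-significant bit, which the table does not state, and the struct and function names are hypothetical rather than taken from ReplayModule.h. The values of the type field are not specified here, so it is passed through as a raw 2-bit number.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical in-memory view of the 16-bit replay header. */
struct replay_header {
    uint8_t type;          /* bits 0-1: advert or request type */
    bool priority_only;    /* bit 2 */
    bool first_since_boot; /* bit 3 (adverts only) */
    bool infrastructure;   /* bit 4 */
    bool aggregate;        /* bit 5 (adverts only) */
    bool throttled;        /* bit 6 (adverts only) */
    uint8_t sequence;      /* bits 8-12: base sequence number */
};

/* ASSUMPTION: offset 0 in the table is the least-significant bit. */
static uint16_t pack_header(const struct replay_header *h)
{
    uint16_t v = 0;
    v |= (uint16_t)(h->type & 0x3);
    v |= (uint16_t)(h->priority_only << 2);
    v |= (uint16_t)(h->first_since_boot << 3);
    v |= (uint16_t)(h->infrastructure << 4);
    v |= (uint16_t)(h->aggregate << 5);
    v |= (uint16_t)(h->throttled << 6);
    v |= (uint16_t)((h->sequence & 0x1F) << 8);
    return v; /* bits 7 and 13-15 stay zero (reserved) */
}

static struct replay_header unpack_header(uint16_t v)
{
    struct replay_header h;
    h.type = (uint8_t)(v & 0x3);
    h.priority_only = (v >> 2) & 1;
    h.first_since_boot = (v >> 3) & 1;
    h.infrastructure = (v >> 4) & 1;
    h.aggregate = (v >> 5) & 1;
    h.throttled = (v >> 6) & 1;
    h.sequence = (uint8_t)((v >> 8) & 0x1F);
    return h;
}
```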

Advertisements consist of the standard 2-byte header, followed by:

 - A 2-byte bitmap, indicating which of the 16 possible cache ranges are present
 - For each range indicated in the bitmap:
    - A 2-byte bitmap, indicating which of the packets in this range are referenced
    - A 2-byte priority bitmap, indicating which packets in this range are high priority
    - For each included packet, a 2-byte hash: ((from ^ id) >> 16 & 0xFFFF) ^ ((from ^ id) & 0xFFFF)
 - If the 'aggregate' flag is set, a 2-byte bitmap indicating which other adverts are included in this one
 - If the 'throttled' flag is set, a list of truncated-to-one-byte node IDs that have been throttled

A typical advertisement will have a payload size of 8 bytes plus 2 bytes per included packet.

If packets are requested, but no longer cached, then the sender will send a
state advertisement indicating which of its advertised packets are no longer
available.

This consists of the standard 2-byte header, followed by:
 - A 2-byte bitmap, indicating which of the 16 possible cache ranges are present
 - For each range indicated in the bitmap:
    - A 2-byte bitmap, indicating which of the packets in this range are no longer available

Requests consist of the standard 2-byte header, followed by:
 - A 2-byte bitmap, indicating which of the 16 possible cache ranges are being requested
 - For each range indicated in the bitmap:
    - A 2-byte bitmap, indicating which of the packets in this range are being requested
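
A request body could be serialised along these lines. This is a sketch under assumptions, not the PR's code: `encodeRequest` and `putU16` are hypothetical names, the 16-bit fields are assumed to be little-endian, and the per-range bitmaps are assumed to follow in ascending range order.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// Appends a 16-bit value, assuming little-endian byte order on the wire.
inline void putU16(std::vector<uint8_t> &out, uint16_t v)
{
    out.push_back(static_cast<uint8_t>(v & 0xFF));
    out.push_back(static_cast<uint8_t>(v >> 8));
}

// wanted[r] is the bitmap of packets requested from cache range r;
// a zero entry means nothing is requested from that range.
inline std::vector<uint8_t> encodeRequest(uint16_t header, const std::array<uint16_t, 16> &wanted)
{
    uint16_t rangeBitmap = 0;
    for (std::size_t r = 0; r < wanted.size(); ++r) {
        if (wanted[r] != 0)
            rangeBitmap |= static_cast<uint16_t>(1u << r);
    }

    std::vector<uint8_t> out;
    putU16(out, header);      // standard 2-byte header
    putU16(out, rangeBitmap); // which ranges are being requested
    for (uint16_t bitmap : wanted) {
        if (bitmap != 0)
            putU16(out, bitmap); // packets requested within that range
    }
    return out;
}
```

Requesting packets from two ranges therefore costs 2 + 2 + 2 × 2 = 8 bytes. The "no longer available" state advertisement described above has the same shape, so the same approach would apply to it.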
@erayd
Contributor Author

erayd commented Sep 23, 2025

Latest rebase & stats update has introduced something that causes spontaneous reboots. Am investigating why currently, but if you are testing this, please bear in mind that this PR is currently unstable.

@erayd
Contributor Author

erayd commented Sep 25, 2025

Am considering changing the approach to notably simplify the code, at the expense of slightly more overhead. A couple of people I've run this past have had a hard time understanding it, and for the sake of maintainability that makes me wonder if I've drawn the airtime optimisation vs complexity line in the wrong place.

Will do a parallel PR that solves this same problem in a different way, and compare the result.

[Edit October 2nd: The parallel approach is better. Watch this space. Should have it substantially ready to go within the next couple of weeks.]

[October 16th progress update: Taking a bit longer than expected, but making steady progress. Hoping for a new PR by the end of next week. New caching approach already merged to master.]

[October 27th progress update: Following discussion with @NomDeTom, this thing has grown some extra legs - so the core PR is taking longer than anticipated. I have started pushing the supporting parts (cache, stats, unencrypted payloads) as separate PRs.]

@korbinianbauer
Contributor

> As much as is practical, I would like to see the current "delivered to mesh" feedback that users receive become synonymous with "delivered to the entire mesh" [...]

When you say "the entire mesh", what exactly do you mean?

My understanding is that a hypothetical, all-knowing, ideal routing algorithm should (for broadcast/channel messages to !ffffffff):

  1. Reach all nodes that (assuming the optimal route) could be reached within the given hop-limit
  2. Reach no nodes that cannot be reached within the given hop-limit
  3. Do so with as little airtime as possible
  4. Do so as quickly as possible

For completeness, a hypothetical, all-knowing, ideal routing algorithm should (for direct messages to a specific node):

  1. Reach that node
  2. Do so with as little airtime as possible
  3. Do so as quickly as possible

@erayd
Contributor Author

erayd commented Oct 18, 2025

> When you say "the entire mesh", what exactly do you mean?

I mean pretty much as you summarised it. If a node B is reachable within the hop limit of node A, then a broadcast message from node A should be successfully delivered to node B, without undue overhead or inefficiency.

@fifieldt
Member

@erayd , is it OK if we re-target this to the develop branch? Master is going into a feature freeze while we get a beta out.

@erayd
Contributor Author

erayd commented Oct 22, 2025

> @erayd , is it OK if we re-target this to the develop branch? Master is going into a feature freeze while we get a beta out.

@fifieldt This PR is going to be closed without merge, because I've significantly changed the way this feature works. I'm intending to open a new PR for that incarnation of it, hopefully next week. Given the feature freeze here, I'll open that one against develop.

I would appreciate it if this PR could remain against master, because it keeps the diff clean without needing me to rebase it (which will require conflict resolution). It's essentially just a reference PR at this point, and I'm intending to close it once the new PR is ready.

@Ryu945

Ryu945 commented Oct 29, 2025

> > All of those roles are not intended to handle other people's packets at all, so it seems reasonable that you wouldn't want them to suddenly start moving forwarding packets - especially CLIENT_HIDDEN!
>
> 1. That is actually only true of CLIENT_MUTE. All other roles may forward traffic if configured to do so, including TRACKER and CLIENT_HIDDEN.
>
> 2. Nodes do not send advertisements at all unless they are actively rebroadcasting traffic.
>
> 3. Even if not rebroadcasting traffic, all roles (including CLIENT_MUTE) can benefit from the ability to notice missing packets and request that the advertiser send them again.

Any role designed to receive messages should be able to respond to a replay advertisement. This means that Client_Mute should be able to respond to one with a request, along with the other message-receiving roles. The only roles that should probably not do this are those that are intentionally trying to maintain a no-broadcast policy, or those that only broadcast and don't receive.

Able to respond roles:

Client: yes
Client_Mute: yes
Client_Hidden: Maybe No (Trying to be hidden or low power)
Client_Base: yes
Tracker: No (It doesn't care to receive messages)
Lost and Found: No (It only cares about its lost and found features)
Sensors: Maybe yes (It can be a relay but it should be possible to turn off)
Router: yes
Router_Late: yes

I don't know anything about Tak but if it works the way I think it does:

Tak: Yes
Tak_Tracker: No

The other side is who should send out advertisements.

Client: Yes
Client_Mute: No (It doesn't get involved with message forwarding. Notice how Request for this role was a yes)
Client_Hidden: Maybe No
Client_Base: Yes
Tracker: No (Doesn't forward messages)
Lost and Found: No (It only cares about lost and found)
Sensor: Maybe (It can be a relay for another node but it should be able to be turned off. Maybe on by default)
Router: Yes
Router_Late: Yes

As for the TAKs:

TAK_TRACKER: No
TAK: Yes

@Ryu945

Ryu945 commented Oct 29, 2025

> This feature causes rebroadcasting nodes to periodically advertise a very lightweight, zero-hop summary of packets that they have recently rebroadcast. Other nodes can use this information to notice that they have missed packets, and request that the advertising node retransmit them again with single-packet granularity.

This description sounds like a problem. The point of this feature is to help with paths that normally get suppressed. It should be all nodes involved in relaying messages that make these periodic advertisements. Not just the one that did the actual broadcast. One of the primary uses is if there is a node cut off because its only connecting node gets its broadcast suppressed by another node every single time.

> What is a sensible spacing between replayed packets, and how should this be scaled based on radio modulation

You could do it based on memory limitations and maximum bandwidth efficiency. Try to delay the advertisement as long as possible, until you have a full message to send out. This will reduce the amount of header overhead being broadcast. There would be a time limit so it doesn't wait forever, perhaps 2 minutes, because some people will be having conversations that depend entirely on a node noticing that it has missed a message. The advertisement would also go out sooner if memory is getting full.

> What is an appropriate upper chutil threshold beyond which:

I would use the existing prioritization chutil settings. The chutil level at which nodeinfo is turned off should also turn off advertisements for nodeinfo packets. Follow this logic through for all types of data sent.

If you want to completely turn off advertisements based on chutil at a role level, then I would turn off the roles with the best communication first.

Turn off list. Earlier numbers turn off sooner.

  1. Router (People are most likely going to hear it anyway, so you might as well turn this off first. Save bandwidth for a role that is not heard as well.)

  2. Router_Late

  3. Every other Role (This turns off advertisements altogether.)

@wlockwood

> > This feature causes rebroadcasting nodes to periodically advertise a very lightweight, zero-hop summary of packets that they have recently rebroadcast. Other nodes can use this information to notice that they have missed packets, and request that the advertising node retransmit them again with single-packet granularity.
>
> This description sounds like a problem. The point of this feature is to help with paths that normally get suppressed. It should be all nodes involved in relaying messages that make these periodic advertisements. Not just the one that did the actual broadcast. One of the primary uses is if there is a node cut off because its only connecting node gets its broadcast suppressed by another node every single time.

@Ryu945, I think you may have missed the "re" in "rebroadcast". It's not just the originating node, it's all nodes that received that packet and forwarded it on.

@Ryu945

Ryu945 commented Oct 30, 2025

> > > This feature causes rebroadcasting nodes to periodically advertise a very lightweight, zero-hop summary of packets that they have recently rebroadcast. Other nodes can use this information to notice that they have missed packets, and request that the advertising node retransmit them again with single-packet granularity.
> >
> > This description sounds like a problem. The point of this feature is to help with paths that normally get suppressed. It should be all nodes involved in relaying messages that make these periodic advertisements. Not just the one that did the actual broadcast. One of the primary uses is if there is a node cut off because its only connecting node gets its broadcast suppressed by another node every single time.
>
> @Ryu945, I think you may have missed the "re" in "rebroadcast". It's not just the originating node, it's all nodes that received that packet and forwarded it on.

My point is that every node that heard the message should advertise the message it heard, not just the node that did the rebroadcast. This way, nodes can learn that a suppressed path is needed for message delivery. If only the rebroadcasting node did the advertisement, the chances of finding a node that didn't hear the message the first time are much lower. Most requests triggered by advertisements are going to come from paths that never rebroadcast.

@erayd
Contributor Author

erayd commented Oct 30, 2025

> My point is that every node that heard the message should advertise the message it heard. Not just the node that did the rebroadcast. This way, nodes can learn that a suppressed path is needed for message delivery.

That simply doesn't scale. The next version of this PR does have some logic in it that allows wider advertising than just rebroadcasters, though.

Bear in mind that there have been significant changes in the way that this feature works since this PR was created, so what you're commenting on here is no longer an accurate representation of how the feature works. Would suggest waiting until I push the updated PR before sinking too much time into analysing the logic.
