This document specifies the design and protocol for synchronizing Write-Ahead Logs (WAL) using Rateless Invertible Bloom Lookup Tables (RIBLT).
The system enables efficient, two-way reconciliation of append-only logs between distributed nodes. It minimizes network traffic by only transmitting information proportional to the number of differences between the logs.
A log entry is a single line in a text file with the format: [LSN]:[DATA].
- LSN (Log Sequence Number): A monotonically increasing
u64. - DATA: A UTF-8 string payload.
The identity of a log entry within the RIBLT set is defined by a LogSymbol:
- lsn: The
u64sequence number. - hash: A
u64hash of the entry's data payload.
Important
Identity Hashing: For the RIBLT mapping function, the identity hash must consume both the lsn and the hash fields. This ensures that a change in data for an existing LSN is detected as a distinct item.
To scale to large logs, the system uses a two-layer reconciliation strategy:
- The log is partitioned into fixed-size Buckets (e.g., 10 entries per bucket).
- Each bucket is represented by a single
LogSymbol:lsn: TheStartLSNof the bucket.hash: An XOR-sum of the hashes of all entries in that bucket.
- Nodes first reconcile the set of all bucket checkpoints.
- For every bucket identified as mismatching in Layer 1, nodes perform a granular reconciliation of the individual
LogSymbolswithin that specific bucket range.
- Handshake: Initiator connects to peer via TCP.
- Checkpoint Exchange:
- Initiator sends a stream of RIBLT symbols for its checkpoints.
- Receiver subtracts its own checkpoints and decodes.
- Receiver sends back its own checkpoint symbols for two-way sync.
- Bucket Range Identification: Both sides identify which start LSNs had mismatched hashes.
- Granular Exchange: For each mismatched range, nodes exchange symbols representing specific entries.
- Data Fetching: After identifying the exact LSNs missing on the peer, the full data payloads are requested and transmitted.
- Immutability: Log entries are never modified once written. Discovery of a difference for an existing LSN is treated as a log branch/conflict.
- Rateless Streaming: The number of symbols sent is adaptive. If decoding fails, the protocol allows requesting more symbols to resolve higher entropy differences.
- Idempotency: Appending the same LSN multiple times is ignored if the hash matches.