Specification: RIBLT WAL Synchronization

This document specifies the design and protocol for synchronizing Write-Ahead Logs (WAL) using Rateless Invertible Bloom Lookup Tables (RIBLT).

1. Overview

The system enables efficient, two-way reconciliation of append-only logs between distributed nodes. It minimizes network traffic by transmitting an amount of data proportional to the number of differences between the logs rather than to their total size.

2. Data Model

Log Entry

A log entry is a single line in a text file with the format: [LSN]:[DATA].

  • LSN (Log Sequence Number): A monotonically increasing u64.
  • DATA: A UTF-8 string payload.
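A minimal parsing sketch for the entry format above. It assumes the brackets in [LSN]:[DATA] are placeholders rather than literal characters, that the LSN is written in decimal, and that the line is split at the first colon so DATA may itself contain colons. The names LogEntry, parse_entry, and format_entry are hypothetical.

```python
from typing import NamedTuple

U64_MAX = 2**64 - 1


class LogEntry(NamedTuple):
    lsn: int   # must fit in an unsigned 64-bit integer
    data: str  # UTF-8 payload, kept verbatim


def parse_entry(line: str) -> LogEntry:
    """Parse one `LSN:DATA` line (assumed: decimal LSN, split at first colon)."""
    lsn_text, sep, data = line.rstrip("\n").partition(":")
    if not sep:
        raise ValueError(f"missing ':' separator: {line!r}")
    if not (lsn_text.isascii() and lsn_text.isdigit()):
        raise ValueError(f"LSN is not a decimal integer: {lsn_text!r}")
    lsn = int(lsn_text)
    if lsn > U64_MAX:
        raise ValueError(f"LSN does not fit in u64: {lsn}")
    return LogEntry(lsn, data)


def format_entry(entry: LogEntry) -> str:
    """Inverse of parse_entry, without the trailing newline."""
    return f"{entry.lsn}:{entry.data}"
```

Splitting at the first colon keeps payloads such as "key:value" intact; a different escaping rule would need a different parser.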

Log Symbol (RIBLT Symbol)

The identity of a log entry within the RIBLT set is defined by a LogSymbol:

  • lsn: The u64 sequence number.
  • hash: A u64 hash of the entry's data payload.

Important

Identity Hashing: For the RIBLT mapping function, the identity hash must consume both the lsn and the hash fields. This ensures that a change in data for an existing LSN is detected as a distinct item.
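The identity rule can be sketched as follows. The specification does not name a hash function, so BLAKE2b truncated to 64 bits is an assumption made here for illustration, as are the helper names hash64, from_entry, and identity_hash.

```python
import hashlib
import struct
from dataclasses import dataclass


def hash64(data: bytes) -> int:
    """64-bit hash (assumed: BLAKE2b with an 8-byte digest)."""
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


@dataclass(frozen=True)
class LogSymbol:
    lsn: int   # u64 sequence number
    hash: int  # u64 hash of the entry's data payload

    @classmethod
    def from_entry(cls, lsn: int, data: str) -> "LogSymbol":
        return cls(lsn, hash64(data.encode("utf-8")))

    def identity_hash(self) -> int:
        """Hash over BOTH fields, so the same LSN with different data differs."""
        packed = struct.pack(">QQ", self.lsn, self.hash)  # two big-endian u64
        return hash64(packed)
```

Two symbols with the same lsn but different payloads get different identity hashes, which is what lets the set reconciliation report a changed entry as a distinct item.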

3. Hierarchical Reconciliation

To scale to large logs, the system uses a two-layer reconciliation strategy:

Layer 1: Checkpoints (Buckets)

  • The log is partitioned into fixed-size Buckets (e.g., 10 entries per bucket).
  • Each bucket is represented by a single LogSymbol:
    • lsn: The StartLSN of the bucket.
    • hash: An XOR-sum of the hashes of all entries in that bucket.
  • Nodes first reconcile the set of all bucket checkpoints.
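A minimal sketch of checkpoint construction. It assumes buckets are aligned to LSN ranges (start = lsn - lsn % bucket_size) so that both nodes compute the same boundaries even when one of them is missing entries; with dense LSNs this matches "10 entries per bucket". The alignment rule and the function names are assumptions, not part of the specification.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

BUCKET_SIZE = 10  # example value from the text


def bucket_start(lsn: int, bucket_size: int = BUCKET_SIZE) -> int:
    """StartLSN of the bucket containing lsn (assumed: LSN-aligned buckets)."""
    return lsn - lsn % bucket_size


def checkpoints(
    symbols: Iterable[Tuple[int, int]], bucket_size: int = BUCKET_SIZE
) -> Dict[int, int]:
    """Map each bucket's StartLSN to the XOR of its entries' hashes."""
    acc: Dict[int, int] = defaultdict(int)
    for lsn, entry_hash in symbols:
        acc[bucket_start(lsn, bucket_size)] ^= entry_hash
    return dict(acc)
```

For example, checkpoints([(0, 0b0011), (1, 0b0101), (10, 0xFF)]) yields {0: 0b0110, 10: 0xFF}. Because XOR is commutative and associative, the result does not depend on the order in which entries are folded in.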

Layer 2: Granular Entries

  • For every bucket identified as mismatching in Layer 1, nodes perform a granular reconciliation of the individual LogSymbols within that specific bucket range.
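The range selection for Layer 2 can be sketched as below. A plain set difference stands in for the RIBLT exchange here, purely to show which symbols take part in each round; in the real protocol neither node holds the peer's symbols directly. The function names and the LSN-aligned bucket range are assumptions.

```python
from typing import Iterable, Set, Tuple

Symbol = Tuple[int, int]  # (lsn, hash)


def symbols_in_bucket(
    symbols: Iterable[Symbol], start: int, bucket_size: int = 10
) -> Set[Symbol]:
    """Entry symbols whose LSN lies in [start, start + bucket_size)."""
    return {(lsn, h) for lsn, h in symbols if start <= lsn < start + bucket_size}


def granular_diff(
    local: Set[Symbol],
    remote: Set[Symbol],
    mismatched_starts: Iterable[int],
    bucket_size: int = 10,
) -> Tuple[Set[Symbol], Set[Symbol]]:
    """Per-bucket difference; set subtraction is a stand-in for RIBLT decoding."""
    only_local: Set[Symbol] = set()
    only_remote: Set[Symbol] = set()
    for start in sorted(mismatched_starts):
        mine = symbols_in_bucket(local, start, bucket_size)
        theirs = symbols_in_bucket(remote, start, bucket_size)
        only_local |= mine - theirs
        only_remote |= theirs - mine
    return only_local, only_remote
```

The LSNs in only_remote are the ones a node would then request payloads for; buckets that matched in Layer 1 are never scanned.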

4. Protocol Flow

  1. Handshake: Initiator connects to peer via TCP.
  2. Checkpoint Exchange:
    • Initiator sends a stream of RIBLT symbols for its checkpoints.
    • Receiver subtracts its own checkpoint symbols from the received stream and decodes the difference.
    • Receiver sends back its own checkpoint symbols for two-way sync.
  3. Bucket Range Identification: Both sides identify the buckets, by start LSN, whose checkpoints differ.
  4. Granular Exchange: For each mismatched range, nodes exchange symbols representing specific entries.
  5. Data Fetching: After identifying the exact LSNs missing on the peer, the full data payloads are requested and transmitted.
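To illustrate "subtract and decode" in steps 2 and 3, here is a simplified, fixed-size IBLT-style sketch. It is not the rateless construction: rebuilding the sketch with more cells stands in for requesting more symbols, and the hash function (BLAKE2b), the constant K, and all names are assumptions made for this example.

```python
import hashlib
from typing import Iterable, Optional, Set, Tuple

Symbol = Tuple[int, int]  # (lsn, hash), both u64
K = 3                     # sub-tables per sketch (assumed value)
MASK64 = 2**64 - 1


def _h(key: int, salt: int) -> int:
    """Salted 64-bit hash of a symbol packed into 128 bits."""
    raw = bytes([salt]) + key.to_bytes(16, "big")
    return int.from_bytes(hashlib.blake2b(raw, digest_size=8).digest(), "big")


class Sketch:
    def __init__(self, cells_per_table: int) -> None:
        self.n = cells_per_table
        self.count = [0] * (K * self.n)
        self.key_xor = [0] * (K * self.n)
        self.check_xor = [0] * (K * self.n)

    def _apply(self, key: int, sign: int) -> None:
        check = _h(key, K)  # salt K is reserved for the checksum
        for t in range(K):  # exactly one cell in each sub-table
            i = t * self.n + _h(key, t) % self.n
            self.count[i] += sign
            self.key_xor[i] ^= key
            self.check_xor[i] ^= check

    @classmethod
    def of(cls, symbols: Iterable[Symbol], cells_per_table: int) -> "Sketch":
        sketch = cls(cells_per_table)
        for lsn, h in symbols:
            sketch._apply((lsn << 64) | h, +1)
        return sketch

    def subtract(self, own: "Sketch") -> "Sketch":
        """Cell-wise difference; symbols present on both sides cancel out."""
        out = Sketch(self.n)
        for i in range(K * self.n):
            out.count[i] = self.count[i] - own.count[i]
            out.key_xor[i] = self.key_xor[i] ^ own.key_xor[i]
            out.check_xor[i] = self.check_xor[i] ^ own.check_xor[i]
        return out

    def decode(self) -> Optional[Tuple[Set[Symbol], Set[Symbol]]]:
        """Peel pure cells; returns (sender_only, own_only), or None if stuck."""
        sender_only: Set[Symbol] = set()
        own_only: Set[Symbol] = set()
        progress = True
        while progress:
            progress = False
            for i in range(K * self.n):
                sign, key = self.count[i], self.key_xor[i]
                if sign in (1, -1) and _h(key, K) == self.check_xor[i]:
                    side = sender_only if sign == 1 else own_only
                    side.add((key >> 64, key & MASK64))
                    self._apply(key, -sign)  # remove the recovered symbol
                    progress = True
        if any(self.count) or any(self.key_xor) or any(self.check_xor):
            return None
        return sender_only, own_only


def reconcile(sender: Set[Symbol], own: Set[Symbol], max_cells: int = 1024):
    """Receiver's view: grow the sketch until the difference decodes."""
    n = 4
    while n <= max_cells:
        result = Sketch.of(sender, n).subtract(Sketch.of(own, n)).decode()
        if result is not None:
            return result
        n *= 2  # stand-in for "request more symbols"
    raise RuntimeError("difference too large for max_cells")


def mismatched_starts(sender_only: Set[Symbol], own_only: Set[Symbol]) -> Set[int]:
    """Step 3: bucket start LSNs that appear in the decoded difference."""
    return {lsn for lsn, _ in sender_only | own_only}
```

When the inputs are checkpoint symbols, the start LSNs returned by mismatched_starts are the ranges handed to the granular exchange in step 4. A bucket whose hash differs shows up on both sides of the difference, while a bucket that exists on only one node shows up on one side.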

5. Distributed System Conventions

  • Immutability: Log entries are never modified once written. A different hash for an existing LSN is treated as a log branch/conflict.
  • Rateless Streaming: The number of symbols sent is adaptive. If decoding fails, the protocol allows requesting more symbols until decoding succeeds; a larger set of differences needs more symbols.
  • Idempotency: Appending an entry whose LSN already exists is a no-op if the hash matches.
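The immutability and idempotency rules combine into a three-way decision when applying a fetched entry. The sketch below assumes an in-memory index from LSN to payload hash; AppendResult and apply_entry are hypothetical names, and how a conflict is surfaced or resolved is left open by the specification.

```python
from enum import Enum
from typing import Dict


class AppendResult(Enum):
    APPENDED = "appended"    # new LSN, entry recorded
    DUPLICATE = "duplicate"  # same LSN and same hash, ignored
    CONFLICT = "conflict"    # same LSN, different hash: log branch


def apply_entry(index: Dict[int, int], lsn: int, entry_hash: int) -> AppendResult:
    """Apply one entry to an LSN -> hash index without ever overwriting."""
    existing = index.get(lsn)
    if existing is None:
        index[lsn] = entry_hash
        return AppendResult.APPENDED
    if existing == entry_hash:
        return AppendResult.DUPLICATE
    return AppendResult.CONFLICT  # existing entry is left untouched
```

Replaying the same entry any number of times leaves the index unchanged, which makes retries after a dropped connection safe.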