docs: add comprehensive binary encoding format documentation #898
Add detailed documentation for Loro's binary encoding format including:
- Overall binary structure with header format
- FastSnapshot and FastUpdates body formats
- KV Store (SSTable) encoding details
- OpLog and State encoding schemas
- Change Block structure and layout
- Value encoding types and tags
- Compression techniques (LEB128, RLE, Delta, LZ4)
Mark incomplete sections with checkboxes for future expansion. This enables developers to implement Loro-compatible encoders/decoders in other programming languages.
Expand encoding.md with comprehensive documentation for all encoding formats:
- SSTable Block Chunk format (Normal + Large blocks)
- Key-Value chunk encoding with prefix compression
- VersionVector and Frontiers encoding (postcard format)
- ContainerID encoding (Root vs Normal containers)
- ContainerWrapper encoding structure
- ContainerArena columnar encoding
- PositionArena with prefix compression
- Complete Value encoding for all 17 value types
- serde_columnar format with strategies
- Compression techniques:
  - LEB128 (unsigned and signed/zigzag)
  - BoolRle, AnyRle, DeltaRle, DeltaOfDelta
  - LZ4 block compression
All implementation checklist items are now marked complete. This enables developers to implement Loro-compatible encoders/decoders.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: cf25167ef9
- Fix Block Key endianness: ID.to_bytes() uses big-endian, not little-endian
- Clarify SSTable magic bytes format with explicit byte array notation
- Fix VersionVector postcard encoding: use 'varint' instead of fixed sizes
- Correct ContainerArena field name to key_idx_or_counter
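To make the endianness fix concrete, here is a minimal sketch of a big-endian ID serializer. The 12-byte layout (8-byte peer followed by 4-byte counter) and the name `idToBytes` are assumptions for illustration; only the big-endian byte order comes from the commit message.

```javascript
// Hypothetical helper. Assumed layout: u64 peer (8 bytes) then i32 counter (4 bytes).
// Only the big-endian byte order is taken from the commit message above.
function idToBytes(peer, counter) {
  const buf = new ArrayBuffer(12);
  const view = new DataView(buf);
  // The third argument `false` selects big-endian in the DataView API.
  view.setBigUint64(0, BigInt.asUintN(64, BigInt(peer)), false);
  view.setInt32(8, counter, false);
  return new Uint8Array(buf);
}

console.log(Array.from(idToBytes(1n, 2)));
```

With big-endian output, comparing two keys byte by byte gives the same order as comparing the numbers themselves (for non-negative counters), which is what a sorted key-value store needs.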
The documentation incorrectly stated that signed integers use zigzag encoding before LEB128. The actual implementation uses standard signed LEB128 (SLEB128) with two's complement representation.
- Clarify difference between unsigned and signed LEB128
- Add note that postcard (VersionVector, Frontiers) uses zigzag internally
- Update checklist to reflect SLEB128 usage
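For reference, a minimal sketch of standard SLEB128 for values in the 32-bit signed range. The function names are illustrative and this is not the code from encoding.md; it only shows the two's complement behaviour the commit describes.

```javascript
// Sketch of standard signed LEB128, valid for 32-bit signed integers only.
function encodeSleb128(value) {
  const out = [];
  let more = true;
  while (more) {
    let byte = value & 0x7f;
    value >>= 7; // arithmetic shift keeps the sign
    const signBit = (byte & 0x40) !== 0;
    if ((value === 0 && !signBit) || (value === -1 && signBit)) {
      more = false; // remaining bits are pure sign extension
    } else {
      byte |= 0x80; // continuation bit
    }
    out.push(byte);
  }
  return Uint8Array.from(out);
}

function decodeSleb128(bytes, offset = 0) {
  let result = 0;
  let shift = 0;
  let pos = offset;
  let byte;
  do {
    byte = bytes[pos++];
    result |= (byte & 0x7f) << shift;
    shift += 7;
  } while (byte & 0x80);
  // Sign-extend when the last group has its sign bit (0x40) set.
  if (shift < 32 && (byte & 0x40)) result |= -1 << shift;
  return { value: result, nextOffset: pos };
}
```

Here -1 encodes to the single byte 0x7F, whereas a zigzag varint would encode -1 as 0x01, so the two schemes are not interchangeable on the wire.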
Add specific file:line references for every encoding format section:
- Block Key format (ID.to_bytes)
- State KV Schema (FRONTIERS_KEY)
- Operations encoding (EncodedOp struct)
- Key Strings encoding
- Delete Start IDs encoding
- Value encoding details (read/write functions)
- LoroValue, MarkStart, TreeMove, ListMove, ListSet
- LEB128 signed/unsigned usage
- BoolRle, AnyRle, DeltaRle, DeltaOfDelta
- LZ4 compression
- Endianness exceptions with sources
- Checksums with sources
- Constants with sources
This makes it easier for reviewers to verify correctness of each claim.
For implementers in other languages (where postcard/serde_columnar crates are not available), add detailed wire format specifications:
Postcard format:
- Unsigned varint encoding with examples
- Zigzag encoding for signed integers with decode formula
- Complete type encoding table (bool, integers, floats, Option, Vec, etc.)
- Important note about postcard f64 (LE) vs Loro value F64 (BE)
serde_columnar format:
- Overall columnar structure with row count and column lengths
- BoolRle strategy with alternating true/false run counts
- Rle strategy for arbitrary types
- DeltaRle strategy with first value + delta runs
- Concrete example using ContainerArena encoding
Sources:
- https://postcard.jamesmunns.com/wire-format.html
- https://github.com/loro-dev/columnar
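The postcard primitives listed above can be sketched roughly as follows. The helper names are made up for illustration, and the varint reader assumes values below 2^53 so that plain JS numbers stay exact.

```javascript
// Sketch of postcard-style primitives. Names are illustrative only.
function readVarint(bytes, offset = 0) {
  let result = 0;
  let shift = 0;
  let pos = offset;
  let byte;
  do {
    byte = bytes[pos++];
    result += (byte & 0x7f) * 2 ** shift; // arithmetic form avoids 32-bit overflow of <<
    shift += 7;
  } while (byte & 0x80);
  return { value: result, nextOffset: pos };
}

// Zigzag maps 0, 1, 2, 3, ... back to 0, -1, 1, -2, ...
function zigzagDecode(n) {
  return n % 2 === 0 ? n / 2 : -(n + 1) / 2;
}

function readSignedVarint(bytes, offset = 0) {
  const { value, nextOffset } = readVarint(bytes, offset);
  return { value: zigzagDecode(value), nextOffset };
}

// postcard stores f64 little-endian; per the note above, Loro's own F64 value is big-endian.
function readF64(bytes, offset, littleEndian) {
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  return view.getFloat64(offset, littleEndian);
}
```

Passing the byte order as an explicit argument to `readF64` keeps the LE/BE distinction visible at every call site, which is where the two formats are easiest to mix up.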
For developers implementing Loro encoding in JavaScript without any external dependencies, add:
LEB128 algorithms:
- Complete JavaScript encoding/decoding functions for unsigned LEB128
- Complete JavaScript encoding/decoding functions for signed LEB128 (SLEB128)
- Multiple examples showing byte output for various values
- Clear note differentiating SLEB128 from zigzag encoding
MarkStart info byte:
- Document bit layout (ALIVE, EXPAND_BEFORE, EXPAND_AFTER flags)
- Common patterns for bold, link, comment styles
- Source reference to TextStyleInfoFlag
BoolRle improvements:
- Clarify that encoding always starts with true count (can be 0)
- Add multiple examples including edge cases
AnyRle improvements:
- Specify that values use LEB128 encoding
- Different handling for u32/usize (unsigned) vs i32 (signed)
- Concrete byte-level example
- BoolRle: starts with FALSE count, not TRUE count (encoder starts in "false" state). Fixed all examples.
- AnyRle/Rle: format is [signed_length, value], not [value, length]. Length uses zigzag encoding and can be negative for literal runs.
- DeltaOfDelta: uses sophisticated bit-packed format, not LEB128. Added complete prefix code table and decoding algorithm.
- DeltaRle: referenced the DeltaRle Column Strategy section.
- Added source references pointing to serde_columnar-0.3.14/src/strategy/rle.rs
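A rough sketch of the two corrected layouts may help reviewers check the examples. It assumes run counts and values are unsigned varints and that a negative length `-n` means "the next `n` values are literal"; those details and all function names are assumptions, not taken from the serde_columnar source.

```javascript
// Assumption: counts, lengths and values are varints; helper names are illustrative.
function readVarint(bytes, offset) {
  let result = 0, shift = 0, pos = offset, byte;
  do {
    byte = bytes[pos++];
    result += (byte & 0x7f) * 2 ** shift;
    shift += 7;
  } while (byte & 0x80);
  return { value: result, nextOffset: pos };
}

const zigzagDecode = (n) => (n % 2 === 0 ? n / 2 : -(n + 1) / 2);

// BoolRle: alternating run counts, and the first count is for FALSE (it may be 0).
function decodeBoolRle(bytes) {
  const out = [];
  let current = false;
  let pos = 0;
  while (pos < bytes.length) {
    const { value: count, nextOffset } = readVarint(bytes, pos);
    for (let i = 0; i < count; i++) out.push(current);
    current = !current;
    pos = nextOffset;
  }
  return out;
}

// Rle: [signed_length, value...]. A positive length repeats one value,
// a negative length introduces that many literal values (assumed semantics).
function decodeRleU32(bytes) {
  const out = [];
  let pos = 0;
  while (pos < bytes.length) {
    const len = readVarint(bytes, pos);
    const runLength = zigzagDecode(len.value);
    pos = len.nextOffset;
    if (runLength > 0) {
      const v = readVarint(bytes, pos);
      pos = v.nextOffset;
      for (let i = 0; i < runLength; i++) out.push(v.value);
    } else {
      for (let i = 0; i < -runLength; i++) {
        const v = readVarint(bytes, pos);
        pos = v.nextOffset;
        out.push(v.value);
      }
    }
  }
  return out;
}
```

For example, under these assumptions the BoolRle bytes `[0, 2, 1]` decode to `true, true, false`, because the leading zero is an empty run of `false`.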
Add detailed specifications for components critical to pure JavaScript implementations:
- encoding-xxhash32.md: Complete xxHash32 algorithm with JS code and test vectors
- encoding-lz4.md: LZ4 Frame format specification with decompression algorithm
- encoding-container-states.md: Container state snapshot formats for Map, List, Text (Richtext), Tree, and MovableList containers
Update encoding.md with:
- Links to supplementary documentation in table of contents
- Cross-references in relevant sections (checksum, state encoding, compression)
- New "Supplementary Documentation" section
- Extended implementation checklist with container state items
Address two critical documentation gaps:
1. ContainerType postcard serde mapping:
- Document that postcard serialization uses a DIFFERENT historical mapping
than ContainerID.to_bytes() (e.g., Text=0 vs Text=2)
- This affects Option<ContainerID> decoding in ContainerWrapper.parent
- Added comparison table in both encoding.md and encoding-container-states.md
2. Op prop field semantics:
- Document how the `prop` column value is computed per container/op type
- List/Text: position for Insert/Delete, start for StyleStart
- MovableList: position for Insert/Delete, target for Move
- Map: key index into keys arena
- Tree: always 0
These were blocking issues for implementing a complete decoder.
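The `prop` rules above can be summarized as a small lookup. This is only a restatement of the list in this commit message; the string labels such as `'Insert'` and `'StyleStart'` and the function name are illustrative, not identifiers from the Loro source.

```javascript
// Restates the prop semantics listed above. All labels are illustrative strings.
function describeProp(containerType, opKind) {
  switch (containerType) {
    case 'List':
    case 'Text':
      if (opKind === 'Insert' || opKind === 'Delete') return 'position';
      if (opKind === 'StyleStart') return 'start';
      break;
    case 'MovableList':
      if (opKind === 'Insert' || opKind === 'Delete') return 'position';
      if (opKind === 'Move') return 'target';
      break;
    case 'Map':
      return 'key index into keys arena';
    case 'Tree':
      return 'always 0';
  }
  return 'not covered by the commit message';
}

console.log(describeProp('MovableList', 'Move'));
```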
Review notes (goal: a dependency-free, pure-JS encoder/decoder can be written from this doc alone):
With these fixes/clarifications, it’ll be much easier to confidently claim: “a pure JS implementer, with zero deps, can implement encoder/decoder from |
Address review notes for pure-JS encoder/decoder implementation:
1. Checksum coverage fix:
   - Checksum covers bytes[20..] (encode_mode + body), NOT just body
   - Added explicit warning and formula clarification
2. JS LEB128 BigInt safety:
   - Added BigInt-safe versions of ULEB128 and SLEB128 for u64/i64
   - Added WARNING about 32-bit limitations of standard JS bitwise ops
   - Fixed return value naming: added nextOffset alongside bytesRead
LZ4 Frame format was already correctly specified in encoding-lz4.md.
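As an illustration of the BigInt-safe variants, a decoder pair might look like the sketch below. It is not the code added to encoding.md; the function names are assumptions, and only the `bytesRead`/`nextOffset` return fields follow the naming mentioned above.

```javascript
// Sketch of BigInt-based LEB128 decoders for u64/i64. Names are illustrative.
function decodeUleb128Big(bytes, offset = 0) {
  let result = 0n;
  let shift = 0n;
  let pos = offset;
  let byte;
  do {
    if (pos >= bytes.length) throw new RangeError('truncated LEB128');
    byte = bytes[pos++];
    result |= BigInt(byte & 0x7f) << shift;
    shift += 7n;
  } while (byte & 0x80);
  return { value: result, bytesRead: pos - offset, nextOffset: pos };
}

function decodeSleb128Big(bytes, offset = 0) {
  let result = 0n;
  let shift = 0n;
  let pos = offset;
  let byte;
  do {
    if (pos >= bytes.length) throw new RangeError('truncated LEB128');
    byte = bytes[pos++];
    result |= BigInt(byte & 0x7f) << shift;
    shift += 7n;
  } while (byte & 0x80);
  // Two's complement sign extension: subtract 2^shift when the sign bit is set.
  if (byte & 0x40) result -= 1n << shift;
  return { value: result, bytesRead: pos - offset, nextOffset: pos };
}

// Why plain numbers are unsafe here: JS shift counts wrap modulo 32.
console.log(1 << 35); // 8, not 34359738368
```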
Address detailed code review findings:
1. SSTable KV chunk: First entry stores only value (key from BlockMeta)
2. Delete Start IDs: Fixed types to usize/i32/isize with DeltaRle
3. ContainerArena: Uses postcard Vec (row-wise), NOT columnar encoding
   - Added explicit warning about the #[columnar] annotations being unused
4. DeltaOfDelta valid bits: Can be 0 (no bitstream) or 8 (full byte)
5. serde_columnar format: First varint is column count, not row count
   - Row count is inferred from decoded column data
6. Checksum appendix: Added "Data Checksummed" column, emphasized bytes[20..]
7. Constants section: Fixed non-existent constant names
   - Removed DEFAULT_SSTABLE_BLOCK_SIZE (runtime param)
   - Corrected SSTABLE_MAGIC to MAGIC_BYTES
   - Fixed source line numbers
Document lazy loading and incremental parsing capabilities:
1. SSTable block-level lazy loading
   - BlockMeta index for O(log n) key lookup
   - Block cache with LRU eviction
2. Random access via BlockMeta index
   - Binary search to locate blocks
   - Load only required blocks
3. Shallow snapshot for incremental sync
   - Export from specific frontier
   - Threshold-based state inclusion (256 ops)
4. Container-level lazy loading
   - Each container as separate KV entry
   - Load child containers on demand
5. Implementation recommendations
   - Block caching strategies
   - Deferred decompression
   - Parallel block loading
   - Incremental parsing
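The block lookup described in items 1 and 2 could be sketched as below. The `firstKey` field, the `loadBlock` callback, and the unbounded `Map` cache (instead of LRU) are simplifying assumptions for illustration, not the structure used by Loro.

```javascript
// Lexicographic comparison of two byte arrays.
function compareBytes(a, b) {
  const n = Math.min(a.length, b.length);
  for (let i = 0; i < n; i++) {
    if (a[i] !== b[i]) return a[i] - b[i];
  }
  return a.length - b.length;
}

// Binary search for the last block whose first key is <= the wanted key.
// `metas` is assumed sorted by `firstKey` (a hypothetical field name).
function findBlockIndex(metas, key) {
  let lo = 0;
  let hi = metas.length - 1;
  let found = -1;
  while (lo <= hi) {
    const mid = (lo + hi) >>> 1;
    if (compareBytes(metas[mid].firstKey, key) <= 0) {
      found = mid;
      lo = mid + 1;
    } else {
      hi = mid - 1;
    }
  }
  return found;
}

// Loads (and decompresses) a block only the first time a key inside it is requested.
function makeLazyBlockReader(metas, loadBlock) {
  const cache = new Map(); // a real implementation would likely bound this (e.g. LRU)
  return (key) => {
    const idx = findBlockIndex(metas, key);
    if (idx < 0) return undefined;
    if (!cache.has(idx)) cache.set(idx, loadBlock(idx));
    return cache.get(idx);
  };
}
```

A reader built this way touches only the blocks that lookups actually hit, which is the point of keeping the BlockMeta index separate from the block data.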
Fix xxHash32 document checksum example to hash bytes[20..] (encode mode + body).
Clarify SSTable block meta checksum excludes the initial block count.
Co-authored-by: lody <[email protected]>
Add "(value)" suffix to table column headers to make it clearer that the columns show the numeric values used in each encoding context.
Review (code-checked)
docs/encoding.md
docs/encoding-container-states.md
Issue: LoroValue postcard table is incorrect
The "LoroValue Encoding (in postcard)" table should match the custom binary serde implementation in
Correct discriminants for postcard (binary serde):
Related: ContainerID here is postcard serde, not
docs/encoding-lz4.md
docs/encoding-xxhash32.md
The previous table had incorrect discriminants that confused the internal LoroValueKind encoding with the postcard serde representation. Fixed to match the actual binary serde implementation in value.rs:714-739:
- 0: Null
- 1: Bool (with 0x00/0x01 payload, not split True/False)
- 2: Double (was 3)
- 3: I64 (was 4, note: variant name is historically "I32")
- 4: String (was 6)
- 5: List (was 7)
- 6: Map (was 8)
- 7: Container (was 9)
- 8: Binary (was 5)
Added source reference and note about the I32 variant name quirk.
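For decoder authors, the corrected discriminants can be captured as a lookup table. The sketch below only decodes the tag plus the Null and Bool payloads; payload decoding for the other kinds is deliberately left out, and the names `LORO_VALUE_POSTCARD_KINDS` and `peekLoroValue` are illustrative.

```javascript
// Index = postcard discriminant, per the corrected table in this commit message.
const LORO_VALUE_POSTCARD_KINDS = [
  'Null', 'Bool', 'Double', 'I64', 'String', 'List', 'Map', 'Container', 'Binary',
];

// Hypothetical helper: reads the discriminant and, for Null/Bool, the value.
function peekLoroValue(bytes, offset = 0) {
  const tag = bytes[offset]; // tags 0..8 fit in a single varint byte
  const kind = LORO_VALUE_POSTCARD_KINDS[tag];
  if (kind === undefined) throw new RangeError(`unknown LoroValue discriminant ${tag}`);
  if (kind === 'Null') return { kind, value: null, nextOffset: offset + 1 };
  if (kind === 'Bool') {
    // Bool is one tag with a 0x00/0x01 payload, not separate True/False tags.
    return { kind, value: bytes[offset + 1] === 0x01, nextOffset: offset + 2 };
  }
  return { kind, nextOffset: offset + 1 }; // payload decoding omitted in this sketch
}
```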
- Fix MovableListState invisible_list_item semantics: items follow the visible item (AFTER), not before. Updated description and decoding logic to match source code (line 1509 shows incrementing previous record's counter)
- Fix MapState JS example: properly advance past peer-count varint before reading peer table, and correctly slice remaining bytes for per-key metadata decoding
- Add Unicode awareness note to RichtextState: span.len is Unicode scalar count, not UTF-16 code units; String.slice() will fail for non-BMP characters
- Add counter case (type 5) to Complete Decoding Example switch
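The Unicode note is easy to get wrong in JavaScript, so here is a small sketch of slicing by code points rather than UTF-16 code units. The helper name is illustrative; `Array.from` on a string iterates by code point, which matches scalar counts for well-formed text.

```javascript
// Slice `len` code points starting at code point index `start`.
function sliceByScalars(text, start, len) {
  return Array.from(text).slice(start, start + len).join('');
}

const text = 'a\u{1F600}b'; // the emoji is outside the BMP: 1 scalar, 2 UTF-16 units
console.log(sliceByScalars(text, 1, 1) === '\u{1F600}'); // true
console.log(text.slice(1, 2) === '\u{1F600}');           // false: only half a surrogate pair
```

For long texts a decoder would probably walk the string once and track the code-unit offset instead of materializing the array for every span.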
ContainerArena IS columnar-encoded, not row-wise postcard Vec:
- EncodedContainer has #[columnar(vec, ser, de, iterable)] attribute
- serde_columnar::to_vec() applies columnar encoding to such types
- Columns: is_root (BoolRle), kind (Rle), peer_idx (Rle), key_idx_or_counter (DeltaRle)
Updated section title, format diagram, and cross-reference note.
ContainerArena::encode() calls serde_columnar::to_vec(&self.containers) which serializes the raw Vec directly, not via ColumnarVec wrapper. The columnar strategies (BoolRle, Rle, DeltaRle) annotated on EncodedContainer are therefore NOT applied - it's plain postcard row-wise encoding. This corrects a previous error where the documentation claimed ContainerArena used columnar encoding with strategies. Added a note explaining why the #[columnar] attributes don't result in columnar encoding in this case.
Correct the expected hash values for the 0x4F524F4C seed:
- Empty input: 0xDC3BF95A (was 0x30CFEAB0)
- Single byte [0x00]: 0xDAD9F666 (was 0x71B1D100)
- "loro" [0x6C,0x6F,0x72,0x6F]: 0x74D321EA (was 0x9B07EF77)
- 16 bytes [0x00-0x0F]: 0x2EDAB25F (was 0xE5AA0AB4)
Validated against xxhashjs reference implementation.
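For anyone who wants to re-check these vectors without a dependency, a compact sketch of the standard xxHash32 algorithm follows. It is not the code from encoding-xxhash32.md; `LORO_SEED` takes its value from this commit message, while `documentChecksum` is a hypothetical helper that applies the bytes[20..] coverage rule mentioned in the earlier commits.

```javascript
const P1 = 0x9e3779b1, P2 = 0x85ebca77, P3 = 0xc2b2ae3d, P4 = 0x27d4eb2f, P5 = 0x165667b1;
const rotl = (x, r) => (x << r) | (x >>> (32 - r));
const round = (acc, lane) => Math.imul(rotl((acc + Math.imul(lane, P2)) | 0, 13), P1);
const readU32LE = (b, i) => b[i] | (b[i + 1] << 8) | (b[i + 2] << 16) | (b[i + 3] << 24);

function xxhash32(bytes, seed = 0) {
  const len = bytes.length;
  let i = 0;
  let h;
  if (len >= 16) {
    // Four accumulators consume the input in 16-byte stripes.
    let v1 = (seed + P1 + P2) | 0, v2 = (seed + P2) | 0, v3 = seed | 0, v4 = (seed - P1) | 0;
    for (; i + 16 <= len; i += 16) {
      v1 = round(v1, readU32LE(bytes, i));
      v2 = round(v2, readU32LE(bytes, i + 4));
      v3 = round(v3, readU32LE(bytes, i + 8));
      v4 = round(v4, readU32LE(bytes, i + 12));
    }
    h = (rotl(v1, 1) + rotl(v2, 7) + rotl(v3, 12) + rotl(v4, 18)) | 0;
  } else {
    h = (seed + P5) | 0;
  }
  h = (h + len) | 0;
  for (; i + 4 <= len; i += 4) {
    h = Math.imul(rotl((h + Math.imul(readU32LE(bytes, i), P3)) | 0, 17), P4);
  }
  for (; i < len; i++) {
    h = Math.imul(rotl((h + Math.imul(bytes[i], P5)) | 0, 11), P1);
  }
  // Final avalanche.
  h ^= h >>> 15; h = Math.imul(h, P2);
  h ^= h >>> 13; h = Math.imul(h, P3);
  h ^= h >>> 16;
  return h >>> 0;
}

const LORO_SEED = 0x4f524f4c; // seed value quoted in the commit message above

// Hypothetical helper: hashes bytes[20..] (encode mode + body), per the earlier fix.
function documentChecksum(docBytes) {
  return xxhash32(docBytes.subarray(20), LORO_SEED);
}
```

Running `xxhash32(new Uint8Array(0), LORO_SEED)` and the other three inputs against the values listed above is a quick way to confirm an independent implementation agrees with the documented vectors.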
The EncodedListIds documentation incorrectly stated the first varint is "Number of elements (N)". With serde_columnar columnar encoding, it's actually the number of columns (3), and row count is inferred from the column data during decoding.
1. EncodedMark is postcard Vec (row-wise), not columnar
   - The marks field in EncodedText lacks #[columnar(class = "vec")]
   - Only spans has columnar encoding via that attribute
2. ContainerArena key_idx_or_counter uses zigzag varint, not SLEB128
   - Row-wise postcard encoding uses zigzag for signed integers
1. Value tag 7 (ContainerType): clarify it's a container reference (index into ContainerArena), not a "type marker". The actual container type is stored in the ContainerArena entry. Source: value.rs:357,422 shows ContainerIdx(usize) encoding.
2. serde_columnar outer format: add missing column-count prefix and fix "LEB128" to "postcard varint". Without the count, decoders would misread the first varint and get out of sync.
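To show why the column-count prefix matters, here is a sketch of splitting the outer serde_columnar envelope. It assumes each column follows as a varint byte length plus that many bytes; that per-column framing and the function names are assumptions based on the descriptions in this PR, not verified against the crate.

```javascript
// Unsigned postcard varint reader (values assumed below 2^53).
function readVarint(bytes, offset) {
  let result = 0, shift = 0, pos = offset, byte;
  do {
    byte = bytes[pos++];
    result += (byte & 0x7f) * 2 ** shift;
    shift += 7;
  } while (byte & 0x80);
  return { value: result, nextOffset: pos };
}

// Assumed framing: [column_count][len_0][bytes_0][len_1][bytes_1]...
function splitColumns(bytes, offset = 0) {
  const head = readVarint(bytes, offset);
  let pos = head.nextOffset;
  const columns = [];
  for (let c = 0; c < head.value; c++) {
    const len = readVarint(bytes, pos);
    const end = len.nextOffset + len.value;
    if (end > bytes.length) throw new RangeError('column exceeds input');
    columns.push(bytes.subarray(len.nextOffset, end));
    pos = end;
  }
  return { columns, nextOffset: pos };
}
```

The row count never appears in this envelope; under these assumptions it falls out of decoding each column with its strategy.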