docs: add comprehensive binary encoding format documentation #898

zxch3n · 2026-01-11T03:09:28Z

Add detailed documentation for Loro's binary encoding format including:

Overall binary structure with header format
FastSnapshot and FastUpdates body formats
KV Store (SSTable) encoding details
OpLog and State encoding schemas
Change Block structure and layout
Value encoding types and tags
Compression techniques (LEB128, RLE, Delta, LZ4)

Mark incomplete sections with checkboxes for future expansion.
This enables developers to implement Loro-compatible encoders/decoders
in other programming languages.


Expand encoding.md with comprehensive documentation for all encoding formats:

- SSTable Block Chunk format (Normal + Large blocks)
- Key-Value chunk encoding with prefix compression
- VersionVector and Frontiers encoding (postcard format)
- ContainerID encoding (Root vs Normal containers)
- ContainerWrapper encoding structure
- ContainerArena columnar encoding
- PositionArena with prefix compression
- Complete Value encoding for all 17 value types
- serde_columnar format with strategies
- Compression techniques:
  - LEB128 (unsigned and signed/zigzag)
  - BoolRle, AnyRle, DeltaRle, DeltaOfDelta
  - LZ4 block compression

All implementation checklist items are now marked complete. This enables developers to implement Loro-compatible encoders/decoders.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: cf25167ef9


docs/encoding.md

github-actions · 2026-01-11T03:13:59Z

WASM Size Report

Original size: 3186.38 KB
Gzipped size: 1006.76 KB
Brotli size: 699.37 KB

- Fix Block Key endianness: ID.to_bytes() uses big-endian, not little-endian
- Clarify SSTable magic bytes format with explicit byte array notation
- Fix VersionVector postcard encoding: use 'varint' instead of fixed sizes
- Correct ContainerArena field name to key_idx_or_counter
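The big-endian block key from the first fix above can be sketched as follows. This is a minimal illustration, not the library's code: the helper names are hypothetical, and the field widths (PeerID as u64, Counter as i32, 12 bytes in total) are an assumption; only the big-endian byte order comes from the fix itself.

```javascript
// Hypothetical helpers. Assumed layout: PeerID (u64, big-endian) followed by
// Counter (i32, big-endian), 12 bytes in total.
function encodeBlockKey(peer, counter) {
  const buf = new ArrayBuffer(12);
  const view = new DataView(buf);
  view.setBigUint64(0, BigInt.asUintN(64, BigInt(peer)), false); // false = big-endian
  view.setInt32(8, counter, false);
  return new Uint8Array(buf);
}

function decodeBlockKey(bytes) {
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  return { peer: view.getBigUint64(0, false), counter: view.getInt32(8, false) };
}
```

With big-endian fields, comparing two keys byte by byte gives the same order as comparing (peer, counter) numerically for non-negative counters, which is what a sorted KV store needs.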

The documentation incorrectly stated that signed integers use zigzag encoding before LEB128. The actual implementation uses standard signed LEB128 (SLEB128) with two's complement representation.

- Clarify difference between unsigned and signed LEB128
- Add note that postcard (VersionVector, Frontiers) uses zigzag internally
- Update checklist to reflect SLEB128 usage

Add specific file:line references for every encoding format section:

- Block Key format (ID.to_bytes)
- State KV Schema (FRONTIERS_KEY)
- Operations encoding (EncodedOp struct)
- Key Strings encoding
- Delete Start IDs encoding
- Value encoding details (read/write functions)
- LoroValue, MarkStart, TreeMove, ListMove, ListSet
- LEB128 signed/unsigned usage
- BoolRle, AnyRle, DeltaRle, DeltaOfDelta
- LZ4 compression
- Endianness exceptions with sources
- Checksums with sources
- Constants with sources

This makes it easier for reviewers to verify correctness of each claim.

For implementers in other languages (where postcard/serde_columnar crates are not available), add detailed wire format specifications:

Postcard format:
- Unsigned varint encoding with examples
- Zigzag encoding for signed integers with decode formula
- Complete type encoding table (bool, integers, floats, Option, Vec, etc.)
- Important note about postcard f64 (LE) vs Loro value F64 (BE)

serde_columnar format:
- Overall columnar structure with row count and column lengths
- BoolRle strategy with alternating true/false run counts
- Rle strategy for arbitrary types
- DeltaRle strategy with first value + delta runs
- Concrete example using ContainerArena encoding

Sources:
- https://postcard.jamesmunns.com/wire-format.html
- https://github.com/loro-dev/columnar
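The postcard varint and zigzag rules listed above can be sketched in a few lines. This is a minimal BigInt-based illustration with hypothetical function names, assuming 64-bit integers; it is not taken from the documentation being added.

```javascript
// Unsigned varint: 7 payload bits per byte, least-significant group first,
// high bit set on every byte except the last.
function encodeVarint(value) {
  let v = BigInt.asUintN(64, BigInt(value));
  const out = [];
  do {
    let byte = Number(v & 0x7fn);
    v >>= 7n;
    if (v !== 0n) byte |= 0x80;
    out.push(byte);
  } while (v !== 0n);
  return Uint8Array.from(out);
}

function decodeVarint(bytes, offset = 0) {
  let value = 0n;
  let shift = 0n;
  let pos = offset;
  for (;;) {
    if (pos >= bytes.length) throw new RangeError("truncated varint");
    const byte = bytes[pos++];
    value |= BigInt(byte & 0x7f) << shift;
    if ((byte & 0x80) === 0) return { value, nextOffset: pos };
    shift += 7n;
  }
}

// Zigzag maps signed to unsigned so small magnitudes stay short:
// 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ...
const zigzagEncode = (n) => BigInt.asUintN(64, (BigInt(n) << 1n) ^ (BigInt(n) >> 63n));
const zigzagDecode = (u) => (u >> 1n) ^ -(u & 1n);
```

A signed postcard integer is then `encodeVarint(zigzagEncode(n))` on the way out and `zigzagDecode(decodeVarint(bytes).value)` on the way in.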

For developers implementing Loro encoding in JavaScript without any external dependencies, add:

LEB128 algorithms:
- Complete JavaScript encoding/decoding functions for unsigned LEB128
- Complete JavaScript encoding/decoding functions for signed LEB128 (SLEB128)
- Multiple examples showing byte output for various values
- Clear note differentiating SLEB128 from zigzag encoding

MarkStart info byte:
- Document bit layout (ALIVE, EXPAND_BEFORE, EXPAND_AFTER flags)
- Common patterns for bold, link, comment styles
- Source reference to TextStyleInfoFlag

BoolRle improvements:
- Clarify that encoding always starts with true count (can be 0)
- Add multiple examples including edge cases

AnyRle improvements:
- Specify that values use LEB128 encoding
- Different handling for u32/usize (unsigned) vs i32 (signed)
- Concrete byte-level example

- BoolRle: starts with FALSE count, not TRUE count (encoder starts in "false" state). Fixed all examples.
- AnyRle/Rle: format is [signed_length, value], not [value, length]. Length uses zigzag encoding and can be negative for literal runs.
- DeltaOfDelta: uses sophisticated bit-packed format, not LEB128. Added complete prefix code table and decoding algorithm.
- DeltaRle: referenced the DeltaRle Column Strategy section.
- Added source references pointing to serde_columnar-0.3.14/src/strategy/rle.rs
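The corrected BoolRle rule (runs alternate, and the first run counts FALSE values) can be illustrated with a small decoder. This is a sketch under two assumptions that the commit message does not spell out: run lengths are unsigned varints, and the decoder is handed exactly the byte slice of one column. The function name is hypothetical.

```javascript
// Decode a BoolRle column: alternating run lengths, starting with a FALSE run
// (which may be 0 when the data begins with true).
function decodeBoolRle(bytes) {
  const out = [];
  let pos = 0;
  let current = false;
  while (pos < bytes.length) {
    // Read one unsigned varint run length (exact for values below 2^53).
    let count = 0;
    let shift = 0;
    let byte;
    do {
      if (pos >= bytes.length) throw new RangeError("truncated run length");
      byte = bytes[pos++];
      count += (byte & 0x7f) * 2 ** shift;
      shift += 7;
    } while (byte & 0x80);
    for (let i = 0; i < count; i++) out.push(current);
    current = !current; // the next run describes the opposite value
  }
  return out;
}
```

For example, the bytes `[0, 2, 1]` read as zero falses, two trues, one false.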

Add detailed specifications for components critical to pure JavaScript implementations:

- encoding-xxhash32.md: Complete xxHash32 algorithm with JS code and test vectors
- encoding-lz4.md: LZ4 Frame format specification with decompression algorithm
- encoding-container-states.md: Container state snapshot formats for Map, List, Text (Richtext), Tree, and MovableList containers

Update encoding.md with:

- Links to supplementary documentation in table of contents
- Cross-references in relevant sections (checksum, state encoding, compression)
- New "Supplementary Documentation" section
- Extended implementation checklist with container state items

Address two critical documentation gaps:

1. ContainerType postcard serde mapping:
   - Document that postcard serialization uses a DIFFERENT historical mapping than ContainerID.to_bytes() (e.g., Text=0 vs Text=2)
   - This affects Option<ContainerID> decoding in ContainerWrapper.parent
   - Added comparison table in both encoding.md and encoding-container-states.md
2. Op prop field semantics:
   - Document how the `prop` column value is computed per container/op type
   - List/Text: position for Insert/Delete, start for StyleStart
   - MovableList: position for Insert/Delete, target for Move
   - Map: key index into keys arena
   - Tree: always 0

These were blocking issues for implementing a complete decoder.

lodyai · 2026-01-11T04:52:46Z

Review notes (goal: a dependency-free, pure-JS encoder/decoder can be written from this doc alone):

Header checksum coverage mismatch

In docs/encoding.md the checksum is described as xxHash32(body_bytes) / “xxHash32 of body” (see the header section around docs/encoding.md:51-76).
In code, the checksum is computed over bytes[20..] (i.e. encode_mode_be_u16 + body_bytes), not just the body starting at byte 22: crates/loro-internal/src/encoding.rs uses let checksum_body = &ans[20..]; and then xxh32(checksum_body, seed).
A spec-following encoder that hashes only the body will fail checksum validation.
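The range difference can be made concrete with a small verifier. This is a minimal sketch: `verifyDocumentChecksum` is a hypothetical name, the hash function is passed in rather than implemented, the seed 0x4F524F4C is "LORO" read as a little-endian u32, and reading the stored checksum as little-endian is an assumption.

```javascript
const LORO_SEED = 0x4f524f4c; // "LORO" interpreted as a little-endian u32

// `xxh32` is any function (bytes, seed) => u32, supplied by the caller.
function verifyDocumentChecksum(bytes, xxh32) {
  if (bytes.length < 22) throw new RangeError("shorter than the 22-byte header");
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  const stored = view.getUint32(16, true); // assumption: stored little-endian
  // Hash from byte 20, so the 2-byte encode mode is covered along with the
  // body. Hashing from byte 22 (body only) would not match.
  const computed = xxh32(bytes.subarray(20), LORO_SEED) >>> 0;
  return stored === computed;
}
```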

“LZ4” needs an exact wire-format definition

The spec mentions “LZ4” for SSTable blocks (e.g. docs/encoding.md:218-249 and the “LZ4 Compression” section).
The implementation uses lz4_flex::frame::{FrameEncoder, FrameDecoder} (crates/kv-store/src/compress.rs), i.e. LZ4 frame bytes, not raw LZ4 block format.
For “no dependencies” implementers, the doc should explicitly say “LZ4 frame format” (and ideally link the exact framing spec), otherwise there’s no way to implement decompression correctly from this document alone.
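As a small illustration of why the distinction matters: an LZ4 frame begins with the magic number 0x184D2204 stored little-endian (bytes 04 22 4D 18), which raw block data does not carry. A minimal check follows. The helper name is hypothetical, and it assumes the caller has already located where the compressed payload starts.

```javascript
const LZ4_FRAME_MAGIC = 0x184d2204;

// Returns true when `bytes` starts with the LZ4 frame magic number.
function looksLikeLz4Frame(bytes) {
  if (bytes.length < 4) return false;
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  return view.getUint32(0, true) === LZ4_FRAME_MAGIC; // little-endian on the wire
}
```

A decoder written against the raw block format would try to interpret these four bytes as a token and literal data, which is why the framing has to be named explicitly.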

JS LEB128 snippets are not BigInt-safe and offset semantics are confusing

The JS examples for ULEB128/SLEB128 (docs/encoding.md:772-850) use 32-bit bitwise ops (>>>, <<) which silently truncate >32-bit integers.
This format requires >32-bit correctness in multiple places (e.g. postcard varints for u64 PeerID; plus i64 values in the custom value encoding).
Also the decode helpers return { bytesRead: offset }, which reads like a byte-count but is actually the next offset. Consider renaming / returning both bytesRead and nextOffset and providing BigInt-safe versions (or at least clearly stating the 32-bit limitation).
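A BigInt-based sketch of the signed case, returning both `bytesRead` and `nextOffset` as suggested above. The function names are hypothetical and this is one possible shape, not the code in the doc.

```javascript
// Signed LEB128 (two's complement, no zigzag), safe for the full i64 range.
function encodeSleb128(value) {
  let v = BigInt(value);
  const out = [];
  for (;;) {
    const byte = Number(v & 0x7fn);
    v >>= 7n; // BigInt >> is an arithmetic (sign-preserving) shift
    const signBit = (byte & 0x40) !== 0;
    const done = (v === 0n && !signBit) || (v === -1n && signBit);
    out.push(done ? byte : byte | 0x80);
    if (done) return Uint8Array.from(out);
  }
}

function decodeSleb128(bytes, offset = 0) {
  let value = 0n;
  let shift = 0n;
  let pos = offset;
  let byte;
  do {
    if (pos >= bytes.length) throw new RangeError("truncated SLEB128");
    byte = bytes[pos++];
    value |= BigInt(byte & 0x7f) << shift;
    shift += 7n;
  } while (byte & 0x80);
  // Sign-extend when bit 6 of the final byte is set.
  if (byte & 0x40) value -= 1n << shift;
  return { value, bytesRead: pos - offset, nextOffset: pos };
}
```

Returning both fields keeps the byte count and the absolute position distinct, so a caller reading from a non-zero offset cannot confuse them.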

With these fixes/clarifications, it’ll be much easier to confidently claim: “a pure JS implementer, with zero deps, can implement encoder/decoder from docs/encoding.md alone.”

Address review notes for pure-JS encoder/decoder implementation:

1. Checksum coverage fix:
   - Checksum covers bytes[20..] (encode_mode + body), NOT just body
   - Added explicit warning and formula clarification
2. JS LEB128 BigInt safety:
   - Added BigInt-safe versions of ULEB128 and SLEB128 for u64/i64
   - Added WARNING about 32-bit limitations of standard JS bitwise ops
   - Fixed return value naming: added nextOffset alongside bytesRead

LZ4 Frame format was already correctly specified in encoding-lz4.md.

Address detailed code review findings:

1. SSTable KV chunk: First entry stores only value (key from BlockMeta)
2. Delete Start IDs: Fixed types to usize/i32/isize with DeltaRle
3. ContainerArena: Uses postcard Vec (row-wise), NOT columnar encoding
   - Added explicit warning about the #[columnar] annotations being unused
4. DeltaOfDelta valid bits: Can be 0 (no bitstream) or 8 (full byte)
5. serde_columnar format: First varint is column count, not row count
   - Row count is inferred from decoded column data
6. Checksum appendix: Added "Data Checksummed" column, emphasized bytes[20..]
7. Constants section: Fixed non-existent constant names
   - Removed DEFAULT_SSTABLE_BLOCK_SIZE (runtime param)
   - Corrected SSTABLE_MAGIC to MAGIC_BYTES
   - Fixed source line numbers

Document lazy loading and incremental parsing capabilities:

1. SSTable block-level lazy loading
   - BlockMeta index for O(log n) key lookup
   - Block cache with LRU eviction
2. Random access via BlockMeta index
   - Binary search to locate blocks
   - Load only required blocks
3. Shallow snapshot for incremental sync
   - Export from specific frontier
   - Threshold-based state inclusion (256 ops)
4. Container-level lazy loading
   - Each container as separate KV entry
   - Load child containers on demand
5. Implementation recommendations
   - Block caching strategies
   - Deferred decompression
   - Parallel block loading
   - Incremental parsing

Fix xxHash32 document checksum example to hash bytes[20..] (encode mode + body).

Clarify SSTable block meta checksum excludes the initial block count.

Co-authored-by: lody <[email protected]>

Add "(value)" suffix to table column headers to make it clearer that the columns show the numeric values used in each encoding context.

lodyai · 2026-01-11T06:59:49Z

Review (code-checked)

docs/encoding.md

Export Modes: matches ExportMode -> EncodeMode usage in crates/loro-internal/src/encoding.rs.
Header + checksum: matches magic "loro", 22-byte header, xxh32(bytes[20..], seed="LORO"), checksum stored at header bytes[16..20] in crates/loro-internal/src/encoding.rs.
FastSnapshot: matches [u32_le len][oplog_kv][u32_le len][state_kv or "E"][u32_le len][gc_kv] in crates/loro-internal/src/encoding/fast_snapshot.rs.
FastUpdates: matches repeated LEB128(len) + block_bytes in crates/loro-internal/src/encoding/fast_snapshot.rs / crates/loro-internal/src/oplog/change_store.rs.
KV store (SSTable): matches "LORO" magic, schema byte, block meta layout+checksum, block checksum placement, and compression flag bits in crates/kv-store/src/sstable.rs / crates/kv-store/src/block.rs.
OpLog KV schema + block key format: matches b"vv"/b"fr"/b"sv"/b"sf" and ID.to_bytes() (PeerID+Counter big-endian) in crates/loro-internal/src/oplog/change_store.rs and crates/loro-common/src/lib.rs.
State encoding: matches ContainerID binary layout and ContainerWrapper layout (including the postcard historical ContainerType mapping caveat for parent) in crates/loro-common/src/lib.rs and crates/loro-internal/src/state/container_store/container_wrapper.rs.
Change block encoding: structure matches crates/loro-internal/src/oplog/change_store/block_encode.rs and header/meta matches crates/loro-internal/src/oplog/change_store/block_meta_encode.rs.
Value encoding: tag mapping matches crates/loro-internal/src/encoding/value.rs, and prop semantics match crates/loro-internal/src/encoding/outdated_encode_reordered.rs.
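The FastSnapshot and FastUpdates framings confirmed above are simple enough to sketch. The function names are hypothetical; treating a one-byte `"E"` state section as "no state" is an assumption about its meaning, and the LEB128 length reader is only exact below 2^53.

```javascript
// FastSnapshot body: [u32_le len][oplog_kv][u32_le len][state_kv or "E"][u32_le len][gc_kv]
function splitFastSnapshot(body) {
  const view = new DataView(body.buffer, body.byteOffset, body.byteLength);
  let pos = 0;
  const next = (name) => {
    if (pos + 4 > body.length) throw new RangeError(`missing length for ${name}`);
    const len = view.getUint32(pos, true); // little-endian u32
    pos += 4;
    if (pos + len > body.length) throw new RangeError(`${name} is truncated`);
    const part = body.subarray(pos, pos + len);
    pos += len;
    return part;
  };
  const oplog = next("oplog_kv");
  const state = next("state_kv");
  const gc = next("gc_kv");
  // Assumption: a single 0x45 ("E") byte marks an empty state section.
  const stateIsEmpty = state.length === 1 && state[0] === 0x45;
  return { oplog, state, gc, stateIsEmpty };
}

// FastUpdates body: repeated LEB128(len) + block_bytes until the input ends.
function splitFastUpdates(body) {
  const blocks = [];
  let pos = 0;
  while (pos < body.length) {
    let len = 0;
    let shift = 0;
    let byte;
    do {
      if (pos >= body.length) throw new RangeError("truncated block length");
      byte = body[pos++];
      len += (byte & 0x7f) * 2 ** shift;
      shift += 7;
    } while (byte & 0x80);
    if (pos + len > body.length) throw new RangeError("truncated block");
    blocks.push(body.subarray(pos, pos + len));
    pos += len;
  }
  return blocks;
}
```

Both functions return views into the input (`subarray`), so no bytes are copied until a section is actually decoded.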

docs/encoding-container-states.md

ContainerWrapper format + dual ContainerType mapping warning: matches crates/loro-internal/src/state/container_store/container_wrapper.rs and crates/loro-common/src/lib.rs.
Map/List/Richtext/Tree/MovableList/Counter snapshot formats match the encode_snapshot_fast / decode_snapshot_fast implementations in:
- crates/loro-internal/src/state/map_state.rs
- crates/loro-internal/src/state/list_state.rs
- crates/loro-internal/src/state/richtext_state.rs
- crates/loro-internal/src/state/tree_state.rs
- crates/loro-internal/src/state/movable_list_state.rs
- crates/loro-internal/src/state/counter_state.rs

Issue: LoroValue postcard table is incorrect

The "LoroValue Encoding (in postcard)" table should match the custom binary serde implementation in crates/loro-common/src/value.rs.

Correct discriminants for postcard (binary serde):

0: Null
1: Bool (payload: bool 0x00/0x01)
2: Double (payload: f64, little-endian)
3: I64 (payload: i64, zigzag varint; variant name is historically "I32")
4: String
5: List
6: Map
7: Container (payload: ContainerID via postcard serde; ContainerType uses postcard historical mapping)
8: Binary

Related: ContainerID here is postcard serde, not ContainerID.to_bytes().
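The discriminant list above translates into a decoder along these lines. This is a sketch covering only the scalar variants (List, Map and Container are left out); the function names are hypothetical, and it assumes String and Binary use postcard's usual varint length prefix followed by the raw bytes.

```javascript
// Read one postcard unsigned varint as a BigInt; returns [value, nextOffset].
function readVarint(bytes, pos) {
  let value = 0n;
  let shift = 0n;
  for (;;) {
    if (pos >= bytes.length) throw new RangeError("truncated varint");
    const byte = bytes[pos++];
    value |= BigInt(byte & 0x7f) << shift;
    if ((byte & 0x80) === 0) return [value, pos];
    shift += 7n;
  }
}

function decodeScalarLoroValue(bytes, offset = 0) {
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  const [tag, pos] = readVarint(bytes, offset);
  switch (Number(tag)) {
    case 0: // Null
      return { value: null, nextOffset: pos };
    case 1: // Bool: one byte, 0x00 or 0x01
      return { value: bytes[pos] === 0x01, nextOffset: pos + 1 };
    case 2: // Double: f64, little-endian in postcard
      return { value: view.getFloat64(pos, true), nextOffset: pos + 8 };
    case 3: { // I64: zigzag varint
      const [raw, next] = readVarint(bytes, pos);
      return { value: (raw >> 1n) ^ -(raw & 1n), nextOffset: next };
    }
    case 4: // String
    case 8: { // Binary
      const [len, start] = readVarint(bytes, pos);
      const end = start + Number(len);
      if (end > bytes.length) throw new RangeError("truncated payload");
      const payload = bytes.subarray(start, end);
      const value = Number(tag) === 4 ? new TextDecoder().decode(payload) : payload;
      return { value, nextOffset: end };
    }
    default:
      throw new Error(`variant ${tag} is not handled in this sketch`);
  }
}
```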

docs/encoding-lz4.md

Matches Loro usage of LZ4 frame via lz4_flex::frame in crates/kv-store/src/compress.rs.

docs/encoding-xxhash32.md

Seed 0x4F524F4C and checksum coverage match crates/loro-internal/src/encoding.rs and crates/kv-store/src/sstable.rs / crates/kv-store/src/block.rs.

The previous table had incorrect discriminants that confused the internal LoroValueKind encoding with the postcard serde representation. Fixed to match the actual binary serde implementation in value.rs:714-739:

- 0: Null
- 1: Bool (with 0x00/0x01 payload, not split True/False)
- 2: Double (was 3)
- 3: I64 (was 4, note: variant name is historically "I32")
- 4: String (was 6)
- 5: List (was 7)
- 6: Map (was 8)
- 7: Container (was 9)
- 8: Binary (was 5)

Added source reference and note about the I32 variant name quirk.

- Fix MovableListState invisible_list_item semantics: items follow the visible item (AFTER), not before. Updated description and decoding logic to match source code (line 1509 shows incrementing previous record's counter)
- Fix MapState JS example: properly advance past peer-count varint before reading peer table, and correctly slice remaining bytes for per-key metadata decoding
- Add Unicode awareness note to RichtextState: span.len is Unicode scalar count, not UTF-16 code units; String.slice() will fail for non-BMP characters
- Add counter case (type 5) to Complete Decoding Example switch
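The Unicode note above is easy to trip over in JavaScript, so here is a minimal sketch of splitting text by scalar counts instead of UTF-16 code units. The function name is hypothetical, and the input is assumed to be a well-formed string (no lone surrogates).

```javascript
// Split `text` into consecutive pieces whose lengths are given in Unicode
// scalar values. Array.from iterates a string by code point, so a non-BMP
// character such as an emoji counts as one element, not two.
function splitByScalarLengths(text, lengths) {
  const scalars = Array.from(text);
  const parts = [];
  let pos = 0;
  for (const len of lengths) {
    if (pos + len > scalars.length) throw new RangeError("span exceeds text length");
    parts.push(scalars.slice(pos, pos + len).join(""));
    pos += len;
  }
  return parts;
}
```

By contrast, `"a😀b".slice(1, 2)` returns a lone surrogate, because `String.prototype.slice` indexes UTF-16 code units.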

ContainerArena IS columnar-encoded, not row-wise postcard Vec:

- EncodedContainer has #[columnar(vec, ser, de, iterable)] attribute
- serde_columnar::to_vec() applies columnar encoding to such types
- Columns: is_root (BoolRle), kind (Rle), peer_idx (Rle), key_idx_or_counter (DeltaRle)

Updated section title, format diagram, and cross-reference note.

ContainerArena::encode() calls serde_columnar::to_vec(&self.containers) which serializes the raw Vec directly, not via ColumnarVec wrapper. The columnar strategies (BoolRle, Rle, DeltaRle) annotated on EncodedContainer are therefore NOT applied - it's plain postcard row-wise encoding. This corrects a previous error where the documentation claimed ContainerArena used columnar encoding with strategies. Added a note explaining why the #[columnar] attributes don't result in columnar encoding in this case.

Correct the expected hash values for the 0x4F524F4C seed:

- Empty input: 0xDC3BF95A (was 0x30CFEAB0)
- Single byte [0x00]: 0xDAD9F666 (was 0x71B1D100)
- "loro" [0x6C,0x6F,0x72,0x6F]: 0x74D321EA (was 0x9B07EF77)
- 16 bytes [0x00-0x0F]: 0x2EDAB25F (was 0xE5AA0AB4)

Validated against xxhashjs reference implementation.
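For readers who want to re-check vectors like these without a dependency, a compact xxHash32 can be sketched as below. This follows the published xxHash32 algorithm and is not the code from encoding-xxhash32.md; the function name and structure are illustrative. The vectors above would be checked by calling it with the seed 0x4F524F4C.

```javascript
const P1 = 0x9e3779b1, P2 = 0x85ebca77, P3 = 0xc2b2ae3d, P4 = 0x27d4eb2f, P5 = 0x165667b1;
const rotl = (x, r) => (x << r) | (x >>> (32 - r));
// One accumulator round over a 32-bit little-endian lane.
const round = (acc, lane) => Math.imul(rotl((acc + Math.imul(lane, P2)) | 0, 13), P1);

function xxh32(bytes, seed = 0) {
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  const len = bytes.length;
  let p = 0;
  let h;
  if (len >= 16) {
    // Four accumulators consume the input in 16-byte stripes.
    let v1 = (seed + P1 + P2) | 0, v2 = (seed + P2) | 0, v3 = seed | 0, v4 = (seed - P1) | 0;
    for (; p + 16 <= len; p += 16) {
      v1 = round(v1, view.getUint32(p, true));
      v2 = round(v2, view.getUint32(p + 4, true));
      v3 = round(v3, view.getUint32(p + 8, true));
      v4 = round(v4, view.getUint32(p + 12, true));
    }
    h = (rotl(v1, 1) + rotl(v2, 7) + rotl(v3, 12) + rotl(v4, 18)) | 0;
  } else {
    h = (seed + P5) | 0;
  }
  h = (h + len) | 0;
  for (; p + 4 <= len; p += 4) { // remaining 4-byte lanes
    h = Math.imul(rotl((h + Math.imul(view.getUint32(p, true), P3)) | 0, 17), P4);
  }
  for (; p < len; p++) { // remaining single bytes
    h = Math.imul(rotl((h + Math.imul(bytes[p], P5)) | 0, 11), P1);
  }
  // Final avalanche.
  h ^= h >>> 15; h = Math.imul(h, P2);
  h ^= h >>> 13; h = Math.imul(h, P3);
  h ^= h >>> 16;
  return h >>> 0; // as an unsigned 32-bit number
}
```

`Math.imul` and `| 0` keep every intermediate value in 32-bit range, which avoids the precision loss that plain `*` would cause on large products.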

The EncodedListIds documentation incorrectly stated the first varint is "Number of elements (N)". With serde_columnar columnar encoding, it's actually the number of columns (3), and row count is inferred from the column data during decoding.

1. EncodedMark is postcard Vec (row-wise), not columnar
   - The marks field in EncodedText lacks #[columnar(class = "vec")]
   - Only spans has columnar encoding via that attribute
2. ContainerArena key_idx_or_counter uses zigzag varint, not SLEB128
   - Row-wise postcard encoding uses zigzag for signed integers

1. Value tag 7 (ContainerType): clarify it's a container reference (index into ContainerArena), not a "type marker". The actual container type is stored in the ContainerArena entry. Source: value.rs:357,422 shows ContainerIdx(usize) encoding.
2. serde_columnar outer format: add missing column-count prefix and fix "LEB128" to "postcard varint". Without the count, decoders would misread the first varint and get out of sync.

claude added 2 commits January 11, 2026 02:46

chatgpt-codex-connector bot reviewed Jan 11, 2026


claude added 8 commits January 11, 2026 03:15

claude added 3 commits January 11, 2026 04:57

lodyai bot mentioned this pull request Jan 11, 2026

docs: clarify checksum ranges in encoding spec #900

Merged

Codex CLI and others added 2 commits January 11, 2026 13:52

docs: clarify checksum ranges in encoding spec

17b0dff


docs: clarify ContainerType mapping table headers

025a2c5


claude added 3 commits January 11, 2026 07:39

lodyai bot mentioned this pull request Jan 11, 2026

docs: fix encoding spec correctness (follow-up to #898) #901

Closed

claude added 5 commits January 11, 2026 09:57

lodyai bot mentioned this pull request Jan 13, 2026

docs: fix xxHash32 LORO-seed decimal comments #902

Merged

docs: fix encoding spec inaccuracies

9800b77

zxch3n merged commit 662a71b into main Jan 14, 2026
1 check passed

3 participants
