@zxch3n zxch3n commented Jan 11, 2026

Add detailed documentation for Loro's binary encoding format including:

  • Overall binary structure with header format
  • FastSnapshot and FastUpdates body formats
  • KV Store (SSTable) encoding details
  • OpLog and State encoding schemas
  • Change Block structure and layout
  • Value encoding types and tags
  • Compression techniques (LEB128, RLE, Delta, LZ4)

Mark incomplete sections with checkboxes for future expansion.
This enables developers to implement Loro-compatible encoders/decoders
in other programming languages.

Expand encoding.md with comprehensive documentation for all encoding formats:

- SSTable Block Chunk format (Normal + Large blocks)
- Key-Value chunk encoding with prefix compression
- VersionVector and Frontiers encoding (postcard format)
- ContainerID encoding (Root vs Normal containers)
- ContainerWrapper encoding structure
- ContainerArena columnar encoding
- PositionArena with prefix compression
- Complete Value encoding for all 17 value types
- serde_columnar format with strategies
- Compression techniques:
  - LEB128 (unsigned and signed/zigzag)
  - BoolRle, AnyRle, DeltaRle, DeltaOfDelta
  - LZ4 block compression

All implementation checklist items are now marked complete.
This enables developers to implement Loro-compatible encoders/decoders.

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: cf25167ef9

@github-actions

WASM Size Report

  • Original size: 3186.38 KB
  • Gzipped size: 1006.76 KB
  • Brotli size: 699.37 KB

- Fix Block Key endianness: ID.to_bytes() uses big-endian, not little-endian
- Clarify SSTable magic bytes format with explicit byte array notation
- Fix VersionVector postcard encoding: use 'varint' instead of fixed sizes
- Correct ContainerArena field name to key_idx_or_counter
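For readers implementing the Block Key fix above, here is a minimal sketch of a big-endian ID key. It assumes the key is a u64 PeerID followed by an i32 Counter (12 bytes in total); the thread only confirms "PeerID+Counter big-endian", so the field widths and the helper names `encodeBlockKey` / `decodeBlockKey` are assumptions for illustration.

```javascript
// Hypothetical helpers, not Loro's API.
// Assumed layout: [peer: u64 big-endian][counter: i32 big-endian] = 12 bytes.
function encodeBlockKey(peer, counter) {
  const buf = new ArrayBuffer(12);
  const view = new DataView(buf);
  view.setBigUint64(0, BigInt(peer), false); // false = big-endian
  view.setInt32(8, counter, false);
  return new Uint8Array(buf);
}

function decodeBlockKey(bytes) {
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  return {
    peer: view.getBigUint64(0, false), // BigInt, because u64 exceeds Number precision
    counter: view.getInt32(8, false),
  };
}

const key = encodeBlockKey(1n, 258);
console.log(Array.from(key)); // [0,0,0,0,0,0,0,1, 0,0,1,2]
```

With big-endian fields, a bytewise comparison of two keys from the same peer orders them by counter (for non-negative counters), which is what a sorted key-value store needs.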
The documentation incorrectly stated that signed integers use zigzag
encoding before LEB128. The actual implementation uses standard signed
LEB128 (SLEB128) with two's complement representation.

- Clarify difference between unsigned and signed LEB128
- Add note that postcard (VersionVector, Frontiers) uses zigzag internally
- Update checklist to reflect SLEB128 usage
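To make the SLEB128 versus zigzag distinction concrete, the sketch below encodes the same 32-bit integer both ways. The function names are made up for this example, and both functions are limited to 32-bit inputs because they use JavaScript bitwise operators.

```javascript
// Signed LEB128 (SLEB128): two's complement, 7 bits per byte, sign-extended.
function encodeSleb128(value) {
  const out = [];
  while (true) {
    const byte = value & 0x7f;
    value >>= 7; // arithmetic shift keeps the sign
    const signBit = byte & 0x40;
    // Stop once the remaining bits are pure sign extension.
    if ((value === 0 && !signBit) || (value === -1 && signBit)) {
      out.push(byte);
      return out;
    }
    out.push(byte | 0x80);
  }
}

// Zigzag followed by an unsigned varint (the scheme postcard uses for signed integers).
function encodeZigzagVarint(value) {
  let z = ((value << 1) ^ (value >> 31)) >>> 0; // -1 -> 1, 1 -> 2, -2 -> 3, ...
  const out = [];
  while (z >= 0x80) {
    out.push((z & 0x7f) | 0x80);
    z >>>= 7;
  }
  out.push(z);
  return out;
}

console.log(encodeSleb128(-1), encodeZigzagVarint(-1)); // [127] vs [1]
console.log(encodeSleb128(64), encodeZigzagVarint(64)); // [192, 0] vs [128, 1]
```

The two encodings disagree even for small values, so a decoder has to know which one a given field uses.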
Add specific file:line references for every encoding format section:
- Block Key format (ID.to_bytes)
- State KV Schema (FRONTIERS_KEY)
- Operations encoding (EncodedOp struct)
- Key Strings encoding
- Delete Start IDs encoding
- Value encoding details (read/write functions)
- LoroValue, MarkStart, TreeMove, ListMove, ListSet
- LEB128 signed/unsigned usage
- BoolRle, AnyRle, DeltaRle, DeltaOfDelta
- LZ4 compression
- Endianness exceptions with sources
- Checksums with sources
- Constants with sources

This makes it easier for reviewers to verify correctness of each claim.

For implementers in other languages (where postcard/serde_columnar crates
are not available), add detailed wire format specifications:

Postcard format:
- Unsigned varint encoding with examples
- Zigzag encoding for signed integers with decode formula
- Complete type encoding table (bool, integers, floats, Option, Vec, etc.)
- Important note about postcard f64 (LE) vs Loro value F64 (BE)
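The f64 note above is easy to get wrong, so here is a small illustration of the two byte orders. `f64Bytes` is a helper invented for this example; it only shows what "LE" and "BE" mean for the same number.

```javascript
// Serialize a double in either byte order using DataView.
function f64Bytes(value, littleEndian) {
  const view = new DataView(new ArrayBuffer(8));
  view.setFloat64(0, value, littleEndian);
  return Array.from(new Uint8Array(view.buffer));
}

// 1.5 is 0x3FF8000000000000 in IEEE 754.
const postcardStyle = f64Bytes(1.5, true);   // little-endian: [0,0,0,0,0,0,0xF8,0x3F]
const loroValueStyle = f64Bytes(1.5, false); // big-endian:    [0x3F,0xF8,0,0,0,0,0,0]
console.log(postcardStyle, loroValueStyle);
```

A decoder that reads an F64 value with the wrong flag still gets a number back, just the wrong one, so this mistake does not fail loudly.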

serde_columnar format:
- Overall columnar structure with row count and column lengths
- BoolRle strategy with alternating true/false run counts
- Rle strategy for arbitrary types
- DeltaRle strategy with first value + delta runs
- Concrete example using ContainerArena encoding

Sources:
- https://postcard.jamesmunns.com/wire-format.html
- https://github.com/loro-dev/columnar

For developers implementing Loro encoding in JavaScript without any
external dependencies, add:

LEB128 algorithms:
- Complete JavaScript encoding/decoding functions for unsigned LEB128
- Complete JavaScript encoding/decoding functions for signed LEB128 (SLEB128)
- Multiple examples showing byte output for various values
- Clear note differentiating SLEB128 from zigzag encoding

MarkStart info byte:
- Document bit layout (ALIVE, EXPAND_BEFORE, EXPAND_AFTER flags)
- Common patterns for bold, link, comment styles
- Source reference to TextStyleInfoFlag

BoolRle improvements:
- Clarify that encoding always starts with true count (can be 0)
- Add multiple examples including edge cases

AnyRle improvements:
- Specify that values use LEB128 encoding
- Different handling for u32/usize (unsigned) vs i32 (signed)
- Concrete byte-level example
- BoolRle: starts with FALSE count, not TRUE count (encoder starts
  in "false" state). Fixed all examples.
- AnyRle/Rle: format is [signed_length, value], not [value, length].
  Length uses zigzag encoding and can be negative for literal runs.
- DeltaOfDelta: uses sophisticated bit-packed format, not LEB128.
  Added complete prefix code table and decoding algorithm.
- DeltaRle: referenced the DeltaRle Column Strategy section.
- Added source references pointing to serde_columnar-0.3.14/src/strategy/rle.rs
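As an illustration of the BoolRle correction above, the sketch below expands run counts that start with the FALSE count. It assumes the counts have already been read from the byte stream as unsigned varints; `expandBoolRle` and `toBoolRleRuns` are names chosen for this example, not serde_columnar functions.

```javascript
// Expand alternating run counts into booleans. The first count is for `false`.
function expandBoolRle(runCounts) {
  const out = [];
  let current = false; // the encoder starts in the "false" state
  for (const count of runCounts) {
    for (let i = 0; i < count; i++) out.push(current);
    current = !current;
  }
  return out;
}

// Inverse direction: a leading `true` forces a zero-length false run.
function toBoolRleRuns(values) {
  const runs = [];
  let current = false;
  let count = 0;
  for (const v of values) {
    if (v === current) {
      count++;
    } else {
      runs.push(count);
      current = v;
      count = 1;
    }
  }
  if (count > 0) runs.push(count);
  return runs;
}

console.log(toBoolRleRuns([true, true, false])); // [0, 2, 1]
console.log(expandBoolRle([0, 2, 1]));           // [true, true, false]
```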
Add detailed specifications for components critical to pure JavaScript implementations:

- encoding-xxhash32.md: Complete xxHash32 algorithm with JS code and test vectors
- encoding-lz4.md: LZ4 Frame format specification with decompression algorithm
- encoding-container-states.md: Container state snapshot formats for Map, List,
  Text (Richtext), Tree, and MovableList containers

Update encoding.md with:
- Links to supplementary documentation in table of contents
- Cross-references in relevant sections (checksum, state encoding, compression)
- New "Supplementary Documentation" section
- Extended implementation checklist with container state items

Address two critical documentation gaps:

1. ContainerType postcard serde mapping:
   - Document that postcard serialization uses a DIFFERENT historical mapping
     than ContainerID.to_bytes() (e.g., Text=0 vs Text=2)
   - This affects Option<ContainerID> decoding in ContainerWrapper.parent
   - Added comparison table in both encoding.md and encoding-container-states.md

2. Op prop field semantics:
   - Document how the `prop` column value is computed per container/op type
   - List/Text: position for Insert/Delete, start for StyleStart
   - MovableList: position for Insert/Delete, target for Move
   - Map: key index into keys arena
   - Tree: always 0

These were blocking issues for implementing a complete decoder.
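The `prop` rules listed above can be written as a small lookup. This is only a sketch: the op object shape (`kind`, `pos`, `start`, `target`, `keyIdx`) and the function name `propFor` are hypothetical, and any container/op combination not named in the list is rejected rather than guessed.

```javascript
// Hypothetical op shape: { kind, pos, start, target, keyIdx }.
function propFor(containerType, op) {
  switch (containerType) {
    case "List":
    case "Text":
      if (op.kind === "Insert" || op.kind === "Delete") return op.pos;
      if (op.kind === "StyleStart") return op.start;
      break;
    case "MovableList":
      if (op.kind === "Insert" || op.kind === "Delete") return op.pos;
      if (op.kind === "Move") return op.target;
      break;
    case "Map":
      return op.keyIdx; // index into the keys arena
    case "Tree":
      return 0; // always 0
  }
  throw new Error(`prop not covered by this sketch: ${containerType}/${op.kind}`);
}

console.log(propFor("Text", { kind: "Insert", pos: 7 }));          // 7
console.log(propFor("MovableList", { kind: "Move", target: 2 }));  // 2
```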
@lodyai

lodyai bot commented Jan 11, 2026

Review notes (goal: a dependency-free, pure-JS encoder/decoder can be written from this doc alone):

  1. Header checksum coverage mismatch
  • In docs/encoding.md the checksum is described as xxHash32(body_bytes) / “xxHash32 of body” (see the header section around docs/encoding.md:51-76).
  • In code, the checksum is computed over bytes[20..] (i.e. encode_mode_be_u16 + body_bytes), not just the body starting at byte 22: crates/loro-internal/src/encoding.rs uses let checksum_body = &ans[20..]; and then xxh32(checksum_body, seed).
  • A spec-following encoder that hashes only the body will fail checksum validation.
  2. “LZ4” needs an exact wire-format definition
  • The spec mentions “LZ4” for SSTable blocks (e.g. docs/encoding.md:218-249 and the “LZ4 Compression” section).
  • The implementation uses lz4_flex::frame::{FrameEncoder, FrameDecoder} (crates/kv-store/src/compress.rs), i.e. LZ4 frame bytes, not raw LZ4 block format.
  • For “no dependencies” implementers, the doc should explicitly say “LZ4 frame format” (and ideally link the exact framing spec), otherwise there’s no way to implement decompression correctly from this document alone.
  3. JS LEB128 snippets are not BigInt-safe and offset semantics are confusing
  • The JS examples for ULEB128/SLEB128 (docs/encoding.md:772-850) use 32-bit bitwise ops (>>>, <<) which silently truncate >32-bit integers.
  • This format requires >32-bit correctness in multiple places (e.g. postcard varints for u64 PeerID; plus i64 values in the custom value encoding).
  • Also the decode helpers return { bytesRead: offset }, which reads like a byte-count but is actually the next offset. Consider renaming / returning both bytesRead and nextOffset and providing BigInt-safe versions (or at least clearly stating the 32-bit limitation).

With these fixes/clarifications, it’ll be much easier to confidently claim: “a pure JS implementer, with zero deps, can implement encoder/decoder from docs/encoding.md alone.”
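For point 3, one possible shape of a BigInt-safe unsigned LEB128 codec is sketched below. It is not the code from docs/encoding.md; the function names are chosen for this example. The decoder returns both `bytesRead` and `nextOffset` so the two cannot be confused.

```javascript
// Decode an unsigned LEB128 value of any width (u64 included) as a BigInt.
function decodeUleb128BigInt(bytes, offset = 0) {
  let result = 0n;
  let shift = 0n;
  let pos = offset;
  while (true) {
    if (pos >= bytes.length) throw new RangeError("truncated LEB128");
    const byte = bytes[pos++];
    result |= BigInt(byte & 0x7f) << shift; // BigInt shift, no 32-bit truncation
    if ((byte & 0x80) === 0) break;
    shift += 7n;
  }
  return { value: result, bytesRead: pos - offset, nextOffset: pos };
}

function encodeUleb128BigInt(value) {
  let v = BigInt(value);
  if (v < 0n) throw new RangeError("unsigned values only");
  const out = [];
  do {
    let byte = Number(v & 0x7fn);
    v >>= 7n;
    if (v !== 0n) byte |= 0x80; // continuation bit
    out.push(byte);
  } while (v !== 0n);
  return Uint8Array.from(out);
}

// Why plain Number bitwise ops are unsafe here: shift counts wrap modulo 32.
console.log(1 << 35); // 8, not 34359738368

const max = 2n ** 64n - 1n;
console.log(decodeUleb128BigInt(encodeUleb128BigInt(max)).value === max); // true
```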

Address review notes for pure-JS encoder/decoder implementation:

1. Checksum coverage fix:
   - Checksum covers bytes[20..] (encode_mode + body), NOT just body
   - Added explicit warning and formula clarification

2. JS LEB128 BigInt safety:
   - Added BigInt-safe versions of ULEB128 and SLEB128 for u64/i64
   - Added WARNING about 32-bit limitations of standard JS bitwise ops
   - Fixed return value naming: added nextOffset alongside bytesRead

LZ4 Frame format was already correctly specified in encoding-lz4.md.
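A sketch of the checksum coverage described in item 1, using only the offsets quoted in this thread (magic "loro", a 22-byte header, checksum stored at bytes[16..20], big-endian u16 encode mode at bytes[20..22]). `splitLoroDocument` is a name made up for this example, bytes 4..16 are left uninterpreted, and the stored checksum is returned as raw bytes because its byte order is not restated here.

```javascript
// Split an exported document into the pieces relevant to checksum validation.
function splitLoroDocument(bytes) {
  if (bytes.length < 22) throw new RangeError("shorter than the 22-byte header");
  const magic = String.fromCharCode(...bytes.subarray(0, 4));
  if (magic !== "loro") throw new Error("bad magic bytes");
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  return {
    storedChecksum: bytes.subarray(16, 20), // raw bytes, not interpreted here
    encodeMode: view.getUint16(20, false),  // big-endian u16
    checksummed: bytes.subarray(20),        // hash THIS range: encode mode + body
    body: bytes.subarray(22),               // hashing only this would fail validation
  };
}
```

The point of returning `checksummed` and `body` separately is that the two ranges differ by exactly the two encode-mode bytes, which is the mismatch the review flagged.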
Address detailed code review findings:

1. SSTable KV chunk: First entry stores only value (key from BlockMeta)
2. Delete Start IDs: Fixed types to usize/i32/isize with DeltaRle
3. ContainerArena: Uses postcard Vec (row-wise), NOT columnar encoding
   - Added explicit warning about the #[columnar] annotations being unused
4. DeltaOfDelta valid bits: Can be 0 (no bitstream) or 8 (full byte)
5. serde_columnar format: First varint is column count, not row count
   - Row count is inferred from decoded column data
6. Checksum appendix: Added "Data Checksummed" column, emphasized bytes[20..]
7. Constants section: Fixed non-existent constant names
   - Removed DEFAULT_SSTABLE_BLOCK_SIZE (runtime param)
   - Corrected SSTABLE_MAGIC to MAGIC_BYTES
   - Fixed source line numbers

Document lazy loading and incremental parsing capabilities:

1. SSTable block-level lazy loading
   - BlockMeta index for O(log n) key lookup
   - Block cache with LRU eviction

2. Random access via BlockMeta index
   - Binary search to locate blocks
   - Load only required blocks

3. Shallow snapshot for incremental sync
   - Export from specific frontier
   - Threshold-based state inclusion (256 ops)

4. Container-level lazy loading
   - Each container as separate KV entry
   - Load child containers on demand

5. Implementation recommendations
   - Block caching strategies
   - Deferred decompression
   - Parallel block loading
   - Incremental parsing
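The "binary search to locate blocks" step can be sketched as follows. It assumes each BlockMeta entry exposes the first key of its block as a `firstKey` byte array and that entries are sorted by it; the thread does not list BlockMeta's fields, so that property name and the helper names are assumptions.

```javascript
// Lexicographic comparison of two byte arrays.
function compareBytes(a, b) {
  const n = Math.min(a.length, b.length);
  for (let i = 0; i < n; i++) {
    if (a[i] !== b[i]) return a[i] - b[i];
  }
  return a.length - b.length;
}

// Index of the last block whose firstKey <= key, or -1 if the key sorts
// before every block. Only that one block then needs to be loaded.
function findBlockIndex(metas, key) {
  let lo = 0;
  let hi = metas.length; // search window is [lo, hi)
  while (lo < hi) {
    const mid = (lo + hi) >>> 1;
    if (compareBytes(metas[mid].firstKey, key) <= 0) lo = mid + 1;
    else hi = mid;
  }
  return lo - 1;
}

const metas = [[1], [5], [9]].map((k) => ({ firstKey: Uint8Array.from(k) }));
console.log(findBlockIndex(metas, Uint8Array.of(6))); // 1
```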
Codex CLI and others added 2 commits January 11, 2026 13:52
Fix xxHash32 document checksum example to hash bytes[20..] (encode mode + body).

Clarify SSTable block meta checksum excludes the initial block count.

Co-authored-by: lody <[email protected]>

Add "(value)" suffix to table column headers to make it clearer that
the columns show the numeric values used in each encoding context.
@lodyai

lodyai bot commented Jan 11, 2026

Review (code-checked)

docs/encoding.md

  • Export Modes: matches ExportMode -> EncodeMode usage in crates/loro-internal/src/encoding.rs.
  • Header + checksum: matches magic "loro", 22-byte header, xxh32(bytes[20..], seed="LORO"), checksum stored at header bytes[16..20] in crates/loro-internal/src/encoding.rs.
  • FastSnapshot: matches [u32_le len][oplog_kv][u32_le len][state_kv or "E"][u32_le len][gc_kv] in crates/loro-internal/src/encoding/fast_snapshot.rs.
  • FastUpdates: matches repeated LEB128(len) + block_bytes in crates/loro-internal/src/encoding/fast_snapshot.rs / crates/loro-internal/src/oplog/change_store.rs.
  • KV store (SSTable): matches "LORO" magic, schema byte, block meta layout+checksum, block checksum placement, and compression flag bits in crates/kv-store/src/sstable.rs / crates/kv-store/src/block.rs.
  • OpLog KV schema + block key format: matches b"vv"/b"fr"/b"sv"/b"sf" and ID.to_bytes() (PeerID+Counter big-endian) in crates/loro-internal/src/oplog/change_store.rs and crates/loro-common/src/lib.rs.
  • State encoding: matches ContainerID binary layout and ContainerWrapper layout (including the postcard historical ContainerType mapping caveat for parent) in crates/loro-common/src/lib.rs and crates/loro-internal/src/state/container_store/container_wrapper.rs.
  • Change block encoding: structure matches crates/loro-internal/src/oplog/change_store/block_encode.rs and header/meta matches crates/loro-internal/src/oplog/change_store/block_meta_encode.rs.
  • Value encoding: tag mapping matches crates/loro-internal/src/encoding/value.rs, and prop semantics match crates/loro-internal/src/encoding/outdated_encode_reordered.rs.
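As an illustration of the FastSnapshot layout checked above, the sketch below splits a body into its three length-prefixed sections. `splitFastSnapshotBody` is a hypothetical helper, and treating a one-byte "E" state section as "no state" is an assumption about what the marker means.

```javascript
// Layout from the review: [u32_le len][oplog_kv][u32_le len][state_kv or "E"][u32_le len][gc_kv]
function splitFastSnapshotBody(body) {
  const view = new DataView(body.buffer, body.byteOffset, body.byteLength);
  let offset = 0;
  const readSection = () => {
    if (offset + 4 > body.length) throw new RangeError("truncated length prefix");
    const len = view.getUint32(offset, true); // true = little-endian
    offset += 4;
    if (offset + len > body.length) throw new RangeError("truncated section");
    const section = body.subarray(offset, offset + len);
    offset += len;
    return section;
  };
  const oplog = readSection();
  const stateRaw = readSection();
  const gc = readSection();
  // Assumption: a single byte "E" (0x45) in the state slot marks an empty state.
  const isEmptyMarker = stateRaw.length === 1 && stateRaw[0] === 0x45;
  return { oplog, state: isEmptyMarker ? null : stateRaw, gc };
}
```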

docs/encoding-container-states.md

  • ContainerWrapper format + dual ContainerType mapping warning: matches crates/loro-internal/src/state/container_store/container_wrapper.rs and crates/loro-common/src/lib.rs.
  • Map/List/Richtext/Tree/MovableList/Counter snapshot formats match the encode_snapshot_fast / decode_snapshot_fast implementations in:
    • crates/loro-internal/src/state/map_state.rs
    • crates/loro-internal/src/state/list_state.rs
    • crates/loro-internal/src/state/richtext_state.rs
    • crates/loro-internal/src/state/tree_state.rs
    • crates/loro-internal/src/state/movable_list_state.rs
    • crates/loro-internal/src/state/counter_state.rs

Issue: LoroValue postcard table is incorrect

The "LoroValue Encoding (in postcard)" table should match the custom binary serde implementation in crates/loro-common/src/value.rs.

Correct discriminants for postcard (binary serde):

  • 0: Null
  • 1: Bool (payload: bool 0x00/0x01)
  • 2: Double (payload: f64, little-endian)
  • 3: I64 (payload: i64, zigzag varint; variant name is historically "I32")
  • 4: String
  • 5: List
  • 6: Map
  • 7: Container (payload: ContainerID via postcard serde; ContainerType uses postcard historical mapping)
  • 8: Binary

Related: ContainerID here is postcard serde, not ContainerID.to_bytes().
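The corrected discriminants can be captured in a lookup table. The sketch below decodes only the three simplest variants (Null, Bool, Double) and leaves the rest out; `LORO_VALUE_POSTCARD_TAGS` and `decodeSimpleLoroValue` are names invented for this example.

```javascript
// Index = postcard discriminant, as listed in the review above.
const LORO_VALUE_POSTCARD_TAGS = [
  "Null", "Bool", "Double", "I64", "String", "List", "Map", "Container", "Binary",
];

function decodeSimpleLoroValue(bytes, offset = 0) {
  const tag = bytes[offset]; // discriminants 0..8 fit in a single varint byte
  const kind = LORO_VALUE_POSTCARD_TAGS[tag];
  if (kind === undefined) throw new Error(`unknown discriminant ${tag}`);
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  switch (kind) {
    case "Null":
      return { kind, value: null, nextOffset: offset + 1 };
    case "Bool":
      return { kind, value: bytes[offset + 1] === 0x01, nextOffset: offset + 2 };
    case "Double": // postcard f64 is little-endian
      return { kind, value: view.getFloat64(offset + 1, true), nextOffset: offset + 9 };
    default:
      throw new Error(`${kind} is not covered by this sketch`);
  }
}

console.log(decodeSimpleLoroValue(Uint8Array.of(1, 1))); // Bool true, nextOffset 2
```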

docs/encoding-lz4.md

  • Matches Loro usage of LZ4 frame via lz4_flex::frame in crates/kv-store/src/compress.rs.

docs/encoding-xxhash32.md

  • Seed 0x4F524F4C and checksum coverage match crates/loro-internal/src/encoding.rs and crates/kv-store/src/sstable.rs / crates/kv-store/src/block.rs.

The previous table had incorrect discriminants that confused the internal
LoroValueKind encoding with the postcard serde representation. Fixed to
match the actual binary serde implementation in value.rs:714-739:

- 0: Null
- 1: Bool (with 0x00/0x01 payload, not split True/False)
- 2: Double (was 3)
- 3: I64 (was 4, note: variant name is historically "I32")
- 4: String (was 6)
- 5: List (was 7)
- 6: Map (was 8)
- 7: Container (was 9)
- 8: Binary (was 5)

Added source reference and note about the I32 variant name quirk.

- Fix MovableListState invisible_list_item semantics: items follow the
  visible item (AFTER), not before. Updated description and decoding
  logic to match source code (line 1509 shows incrementing previous
  record's counter)
- Fix MapState JS example: properly advance past peer-count varint
  before reading peer table, and correctly slice remaining bytes
  for per-key metadata decoding
- Add Unicode awareness note to RichtextState: span.len is Unicode
  scalar count, not UTF-16 code units; String.slice() will fail
  for non-BMP characters
- Add counter case (type 5) to Complete Decoding Example switch
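The Unicode note is worth a concrete example. JavaScript strings are indexed in UTF-16 code units, so a length counted in Unicode scalar values has to be applied to code points instead. The helper name `sliceByScalars` is made up for this sketch, and it assumes well-formed input without lone surrogates.

```javascript
// Slice by code points; Array.from iterates a string code point by code point.
function sliceByScalars(text, start, len) {
  return Array.from(text).slice(start, start + len).join("");
}

const sample = "a\u{1F600}b"; // "a", an emoji outside the BMP, "b"
console.log(sample.length);                 // 4 UTF-16 code units
console.log(Array.from(sample).length);     // 3 code points
console.log(sliceByScalars(sample, 1, 1));  // the emoji, intact
console.log(sample.slice(1, 2) === "\uD83D"); // true: String.slice cut the surrogate pair
```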
ContainerArena IS columnar-encoded, not row-wise postcard Vec:
- EncodedContainer has #[columnar(vec, ser, de, iterable)] attribute
- serde_columnar::to_vec() applies columnar encoding to such types
- Columns: is_root (BoolRle), kind (Rle), peer_idx (Rle),
  key_idx_or_counter (DeltaRle)

Updated section title, format diagram, and cross-reference note.

ContainerArena::encode() calls serde_columnar::to_vec(&self.containers)
which serializes the raw Vec directly, not via ColumnarVec wrapper. The
columnar strategies (BoolRle, Rle, DeltaRle) annotated on EncodedContainer
are therefore NOT applied - it's plain postcard row-wise encoding.

This corrects a previous error where the documentation claimed ContainerArena
used columnar encoding with strategies. Added a note explaining why the
#[columnar] attributes don't result in columnar encoding in this case.

Correct the expected hash values for the 0x4F524F4C seed:
- Empty input: 0xDC3BF95A (was 0x30CFEAB0)
- Single byte [0x00]: 0xDAD9F666 (was 0x71B1D100)
- "loro" [0x6C,0x6F,0x72,0x6F]: 0x74D321EA (was 0x9B07EF77)
- 16 bytes [0x00-0x0F]: 0x2EDAB25F (was 0xE5AA0AB4)

Validated against xxhashjs reference implementation.
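For anyone who wants to check these vectors without a dependency, here is a compact xxHash32 sketch in plain JavaScript. It follows the public xxHash32 algorithm and is not the code from encoding-xxhash32.md; the helper names are chosen for this example.

```javascript
const PRIME32_1 = 0x9e3779b1, PRIME32_2 = 0x85ebca77, PRIME32_3 = 0xc2b2ae3d;
const PRIME32_4 = 0x27d4eb2f, PRIME32_5 = 0x165667b1;

const rotl = (x, r) => ((x << r) | (x >>> (32 - r))) >>> 0;
const readU32LE = (b, i) =>
  (b[i] | (b[i + 1] << 8) | (b[i + 2] << 16) | (b[i + 3] << 24)) >>> 0;
// One accumulator round; Math.imul gives exact 32-bit wrapping multiplication.
const round = (acc, lane) =>
  Math.imul(rotl((acc + Math.imul(lane, PRIME32_2)) >>> 0, 13), PRIME32_1) >>> 0;

function xxHash32(bytes, seed) {
  const len = bytes.length;
  let i = 0;
  let h;
  if (len >= 16) {
    let v1 = (seed + PRIME32_1 + PRIME32_2) >>> 0;
    let v2 = (seed + PRIME32_2) >>> 0;
    let v3 = seed >>> 0;
    let v4 = (seed - PRIME32_1) >>> 0;
    for (; i + 16 <= len; i += 16) { // 16-byte stripes, four lanes each
      v1 = round(v1, readU32LE(bytes, i));
      v2 = round(v2, readU32LE(bytes, i + 4));
      v3 = round(v3, readU32LE(bytes, i + 8));
      v4 = round(v4, readU32LE(bytes, i + 12));
    }
    h = (rotl(v1, 1) + rotl(v2, 7) + rotl(v3, 12) + rotl(v4, 18)) >>> 0;
  } else {
    h = (seed + PRIME32_5) >>> 0;
  }
  h = (h + len) >>> 0;
  for (; i + 4 <= len; i += 4) { // remaining 4-byte words
    h = (h + Math.imul(readU32LE(bytes, i), PRIME32_3)) >>> 0;
    h = Math.imul(rotl(h, 17), PRIME32_4) >>> 0;
  }
  for (; i < len; i++) { // remaining single bytes
    h = (h + Math.imul(bytes[i], PRIME32_5)) >>> 0;
    h = Math.imul(rotl(h, 11), PRIME32_1) >>> 0;
  }
  // Final avalanche.
  h ^= h >>> 15; h = Math.imul(h, PRIME32_2);
  h ^= h >>> 13; h = Math.imul(h, PRIME32_3);
  h ^= h >>> 16;
  return h >>> 0;
}

const SEED = 0x4f524f4c;
console.log(xxHash32(new Uint8Array(0), SEED).toString(16)); // dc3bf95a
console.log(xxHash32(Uint8Array.of(0x00), SEED).toString(16)); // dad9f666
```

The two printed values match the corrected vectors for the empty input and the single byte [0x00]; the "loro" and 16-byte vectors above can be checked the same way.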
The EncodedListIds documentation incorrectly stated the first varint is
"Number of elements (N)". With serde_columnar columnar encoding, it's
actually the number of columns (3), and row count is inferred from
the column data during decoding.

1. EncodedMark is postcard Vec (row-wise), not columnar
   - The marks field in EncodedText lacks #[columnar(class = "vec")]
   - Only spans has columnar encoding via that attribute

2. ContainerArena key_idx_or_counter uses zigzag varint, not SLEB128
   - Row-wise postcard encoding uses zigzag for signed integers

1. Value tag 7 (ContainerType): clarify it's a container reference
   (index into ContainerArena), not a "type marker". The actual
   container type is stored in the ContainerArena entry.
   Source: value.rs:357,422 shows ContainerIdx(usize) encoding.

2. serde_columnar outer format: add missing column-count prefix and
   fix "LEB128" to "postcard varint". Without the count, decoders
   would misread the first varint and get out of sync.
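To show why the column-count prefix matters, here is a rough sketch of splitting a serde_columnar payload into raw columns. It assumes the outer layout is a postcard varint column count followed, for each column, by a varint byte length and that many bytes; only the column-count prefix is confirmed in this thread, so the per-column framing and both helper names are assumptions.

```javascript
// Unsigned varint reader using Number arithmetic (fine for lengths below 2^53).
function readVarint(bytes, offset) {
  let value = 0;
  let shift = 0;
  let pos = offset;
  while (true) {
    if (pos >= bytes.length) throw new RangeError("truncated varint");
    const byte = bytes[pos++];
    value += (byte & 0x7f) * 2 ** shift; // avoids 32-bit bitwise truncation
    if ((byte & 0x80) === 0) return { value, nextOffset: pos };
    shift += 7;
  }
}

// Assumed framing: varint(columnCount), then per column varint(byteLen) + bytes.
function splitColumns(bytes) {
  const head = readVarint(bytes, 0);
  let pos = head.nextOffset;
  const columns = [];
  for (let i = 0; i < head.value; i++) {
    const len = readVarint(bytes, pos);
    const end = len.nextOffset + len.value;
    if (end > bytes.length) throw new RangeError("truncated column");
    columns.push(bytes.subarray(len.nextOffset, end));
    pos = end;
  }
  return { columns, nextOffset: pos };
}

const demo = Uint8Array.of(3, 1, 0xaa, 0, 2, 0xbb, 0xcc);
console.log(splitColumns(demo).columns.map((c) => c.length)); // [1, 0, 2]
```

The row count never appears in this outer framing; it falls out of decoding each column with its strategy.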
@zxch3n zxch3n merged commit 662a71b into main Jan 14, 2026
1 check passed