RFC: MVCC Storage Layout for Immutable Segments

Status: Accepted
Authors: Tonbo storage team
Created: 2025-10-28
Updated: 2025-11
Area: Storage engine, MVCC, SSTable

Summary

Define the on-disk layout for MVCC metadata in immutable segments and SSTables. Data files embed _commit_ts as an appended column, while tombstones are stored in a separate key-only delete sidecar.

Motivation

Keep user schema columns cleanly separated from MVCC system columns
Avoid materializing tombstoned rows as null-filled payloads in the data file
Enable efficient column pruning—scans that don't need visibility checks can skip _commit_ts
Support delete-heavy workloads without bloating data files

Goals

Immutable segments and SST files store _commit_ts alongside user data for upserts
Tombstones live in a dedicated key-only sidecar, not in the data file
WAL payloads remain Arrow-native and replayable
Compaction, recovery, and GC operate over append-only objects via Fusio
Minimal read amplification for scans and range lookups

Non-Goals

Changing timestamp assignment strategy (remains monotonic per commit)
Introducing new transaction semantics or cross-table coordination
Typed/compile-time ingestion pathways (future work per RFC 0001)

Design

Storage Layout

Each SSTable consists of up to two files:

/sst/L{level}/{id}.parquet          # data + _commit_ts column
/sst/L{level}/{id}.delete.parquet   # key-only delete sidecar (when tombstones exist)

Data File

The data Parquet contains user schema columns with _commit_ts: UInt64 appended as the last column. This enables:

Column pruning to skip _commit_ts when visibility filtering is not needed
Single-file I/O for upsert-only segments
Alignment with WAL frame layout

Delete Sidecar

Schema: <primary-key columns>, _commit_ts: UInt64 (non-null)

Only emitted when the segment contains tombstones
Key-only format avoids storing null value columns for deleted rows
Enables efficient tombstone lookup during scans without polluting the data file

WAL Integration

Upsert frames carry batches with _commit_ts column
Delete frames carry key-only batches for tombstones
Replay reconstructs both upsert data and tombstone metadata

Read Path

Load data Parquet, extracting _commit_ts for visibility checks
Load delete sidecar if present
Range scans consult both for MVCC filtering
Callers never see _commit_ts in query results—projection excludes it

Compaction

Merge data files with latest-wins semantics based on _commit_ts
Merge delete sidecars alongside data
Tombstones may be pruned when commit_ts <= tombstone_watermark and no live versions exist

GC

Manifest-driven GC treats data and delete files atomically—both are retained or removed together based on version visibility.

Alternatives Considered

Three-file layout (data + mvcc sidecar + delete sidecar): More files increase I/O; Parquet column pruning already handles skipping _commit_ts.
Embed tombstones in data file as null rows: Wastes space, forces nullable schemas, complicates scans.
Store MVCC in Parquet metadata: Lacks per-row fidelity, complicates streaming reads.
Tombstones in WAL only: Makes SST reconstruction expensive and breaks crash recovery.

Comparison with Other Systems

Apache Iceberg

Iceberg has three delete mechanisms:

Type	How it works	Pros	Cons
Position deletes	file_path + row position	Fast read (exact location)	Requires knowing physical position; deprecated in v3
Equality deletes	Column values that identify rows	Write-friendly (no position needed)	Query penalty—must scan and filter
Deletion vectors (v3)	Bitmap per data file	Compact, fast read	Requires file-level tracking

Tonbo's key-only delete sidecar is similar to Iceberg's equality deletes—identifying rows by key values rather than physical position. Iceberg's experience shows equality deletes accumulate and hurt read performance until merged, which informed our decision to track tombstone watermarks for pruning.

RocksDB

RocksDB takes a different approach:

Point tombstones: Stored inline with data in SST files (key + sequence number)
Range tombstones: Dedicated meta-block within each SST file
Compaction: Tombstones drop only at bottom level when no snapshot references them
Trigger heuristics: Compaction triggered when tombstone ratio exceeds 50%

RocksDB's inline storage avoids extra files but complicates the data format. Their range tombstone support is something Tonbo currently lacks.

Design Trade-offs

Tonbo's current design optimizes for:

Simplicity: Separate file is easier to reason about than inline storage
Write path: No need to know physical row positions (unlike position deletes)
Schema purity: User data files remain uncontaminated by tombstone markers

Known limitations for future consideration:

File proliferation: Every segment with tombstones creates an extra file; deletion vectors (bitmaps) would be more compact
No range deletes: Point-delete-only; range tombstone support may be needed for bulk delete workloads
Watermark semantics: Tombstone retention must coordinate with snapshot/reader registry to avoid premature pruning

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

RFC: MVCC Storage Layout for Immutable Segments

Summary

Motivation

Goals

Non-Goals

Design

Storage Layout

Data File

Delete Sidecar

WAL Integration

Read Path

Compaction

GC

Alternatives Considered

Comparison with Other Systems

Apache Iceberg

RocksDB

Design Trade-offs

FilesExpand file tree

0006-mvcc-sidecar.md

Latest commit

History

0006-mvcc-sidecar.md

File metadata and controls

RFC: MVCC Storage Layout for Immutable Segments

Summary

Motivation

Goals

Non-Goals

Design

Storage Layout

Data File

Delete Sidecar

WAL Integration

Read Path

Compaction

GC

Alternatives Considered

Comparison with Other Systems

Apache Iceberg

RocksDB

Design Trade-offs