- Status: Accepted
- Authors: Tonbo storage team
- Created: 2025-10-28
- Updated: 2025-11
- Area: Storage engine, MVCC, SSTable
Define the on-disk layout for MVCC metadata in immutable segments and SSTables. Data files embed _commit_ts as an appended column, while tombstones are stored in a separate key-only delete sidecar.
- Keep user schema columns cleanly separated from MVCC system columns
- Avoid materializing tombstoned rows as null-filled payloads in the data file
- Enable efficient column pruning: scans that don't need visibility checks can skip _commit_ts
- Support delete-heavy workloads without bloating data files
- Immutable segments and SST files store _commit_ts alongside user data for upserts
- Tombstones live in a dedicated key-only sidecar, not in the data file
- WAL payloads remain Arrow-native and replayable
- Compaction, recovery, and GC operate over append-only objects via Fusio
- Minimal read amplification for scans and range lookups
- Changing timestamp assignment strategy (remains monotonic per commit)
- Introducing new transaction semantics or cross-table coordination
- Typed/compile-time ingestion pathways (future work per RFC 0001)
Each SSTable consists of up to two files:

```
/sst/L{level}/{id}.parquet        # data + _commit_ts column
/sst/L{level}/{id}.delete.parquet # key-only delete sidecar (when tombstones exist)
```
The data Parquet contains user schema columns with _commit_ts: UInt64 appended as the last column. This enables:
- Column pruning to skip _commit_ts when visibility filtering is not needed
- Single-file I/O for upsert-only segments
- Alignment with WAL frame layout
Schema: `<primary-key columns>, _commit_ts: UInt64 (non-null)`
- Only emitted when the segment contains tombstones
- Key-only format avoids storing null value columns for deleted rows
- Enables efficient tombstone lookup during scans without polluting the data file
- Upsert frames carry batches with a _commit_ts column
- Delete frames carry key-only batches for tombstones
- Replay reconstructs both upsert data and tombstone metadata
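The replay split can be sketched as follows. The frame shapes here are hypothetical single-row stand-ins (real frames carry Arrow batches); the point is that upserts and deletes land in separate structures, mirroring the data file / delete sidecar split:

```rust
use std::collections::BTreeMap;

// Hypothetical WAL frame shapes; real frames carry Arrow-native batches.
enum Frame {
    Upsert { key: String, value: i64, commit_ts: u64 },
    Delete { key: String, commit_ts: u64 },
}

// Recovered segment state: upsert data plus key-only tombstone metadata.
#[derive(Default)]
struct Recovered {
    upserts: BTreeMap<String, (i64, u64)>, // key -> (value, _commit_ts)
    tombstones: Vec<(String, u64)>,        // key-only tombstones
}

fn replay(frames: Vec<Frame>) -> Recovered {
    let mut r = Recovered::default();
    for f in frames {
        match f {
            Frame::Upsert { key, value, commit_ts } => {
                r.upserts.insert(key, (value, commit_ts));
            }
            Frame::Delete { key, commit_ts } => {
                r.tombstones.push((key, commit_ts));
            }
        }
    }
    r
}

fn main() {
    let r = replay(vec![
        Frame::Upsert { key: "k1".into(), value: 1, commit_ts: 10 },
        Frame::Delete { key: "k1".into(), commit_ts: 12 },
    ]);
    assert_eq!(r.upserts["k1"], (1, 10));
    assert_eq!(r.tombstones, vec![("k1".to_string(), 12)]);
    println!("replayed");
}
```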
- Load the data Parquet, extracting _commit_ts for visibility checks
- Load the delete sidecar if present
- Range scans consult both for MVCC filtering
- Callers never see _commit_ts in query results; projection excludes it
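A minimal sketch of that visibility filter, assuming single-column rows and a prebuilt key -> tombstone-timestamp map (all names here are illustrative, not Tonbo's API): a row is visible at `read_ts` if it committed at or before `read_ts` and no tombstone for its key is both visible at `read_ts` and at least as new as the row.

```rust
use std::collections::HashMap;

// Illustrative row from the data Parquet; field names are assumptions.
struct Row { key: String, value: i64, commit_ts: u64 }

// MVCC-filter a scan at `read_ts`. The projection returns (key, value) only,
// so _commit_ts never reaches the caller.
fn scan(rows: &[Row], tombs: &HashMap<String, u64>, read_ts: u64) -> Vec<(String, i64)> {
    rows.iter()
        .filter(|r| r.commit_ts <= read_ts)
        .filter(|r| match tombs.get(&r.key) {
            Some(&del_ts) => !(del_ts <= read_ts && del_ts >= r.commit_ts),
            None => true,
        })
        .map(|r| (r.key.clone(), r.value))
        .collect()
}

fn main() {
    let rows = vec![
        Row { key: "a".into(), value: 1, commit_ts: 5 },
        Row { key: "b".into(), value: 2, commit_ts: 8 },
    ];
    let mut tombs = HashMap::new();
    tombs.insert("a".to_string(), 7); // "a" deleted at ts 7
    // At read_ts = 10 the tombstone hides "a"; only "b" is visible.
    assert_eq!(scan(&rows, &tombs, 10), vec![("b".to_string(), 2)]);
    // At read_ts = 6 the delete is not yet visible, so "a" still appears.
    assert_eq!(scan(&rows, &tombs, 6), vec![("a".to_string(), 1)]);
    println!("ok");
}
```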
- Merge data files with latest-wins semantics based on _commit_ts
- Merge delete sidecars alongside data
- Tombstones may be pruned when _commit_ts <= tombstone_watermark and no live versions exist
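The merge and pruning rules above can be sketched as below, again assuming single-row entries rather than Arrow batches (the entry shape and function names are illustrative):

```rust
use std::collections::BTreeMap;

#[derive(Clone, Debug, PartialEq)]
struct Entry { key: String, value: i64, commit_ts: u64 }

// Latest-wins merge on _commit_ts: for each key, keep the entry with the
// highest commit timestamp across all input files.
fn merge(inputs: Vec<Vec<Entry>>) -> Vec<Entry> {
    let mut newest: BTreeMap<String, Entry> = BTreeMap::new();
    for e in inputs.into_iter().flatten() {
        let keep = match newest.get(&e.key) {
            Some(cur) => cur.commit_ts < e.commit_ts,
            None => true,
        };
        if keep {
            newest.insert(e.key.clone(), e);
        }
    }
    newest.into_values().collect()
}

// A tombstone is dropped only when it is at or below the watermark AND the
// merged output holds no live version of its key; otherwise it is retained.
fn prune_tombstones(
    tombs: Vec<(String, u64)>,
    merged: &[Entry],
    watermark: u64,
) -> Vec<(String, u64)> {
    tombs
        .into_iter()
        .filter(|(key, ts)| *ts > watermark || merged.iter().any(|e| &e.key == key))
        .collect()
}

fn main() {
    let merged = merge(vec![
        vec![Entry { key: "a".into(), value: 1, commit_ts: 3 }],
        vec![Entry { key: "a".into(), value: 2, commit_ts: 9 }],
    ]);
    assert_eq!(merged, vec![Entry { key: "a".into(), value: 2, commit_ts: 9 }]);
    // "b"'s tombstone is below the watermark with no live version: pruned.
    let kept = prune_tombstones(vec![("b".into(), 4)], &merged, 5);
    assert!(kept.is_empty());
    println!("ok");
}
```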
Manifest-driven GC treats data and delete files atomically—both are retained or removed together based on version visibility.
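A sketch of that pairing, with illustrative names (the real GC walks manifest versions via Fusio, not an in-memory list): a segment's data file and its optional delete sidecar are removed together, and only when no live version references the segment.

```rust
use std::collections::HashSet;

// Illustrative manifest entry: a segment's data file plus optional sidecar.
struct SegmentFiles {
    id: u64,
    data: String,           // {id}.parquet
    delete: Option<String>, // {id}.delete.parquet, if tombstones exist
}

// Return the object paths to remove: every file of every segment that no
// live manifest version still references. The sidecar always travels with
// its data file, so the pair is retained or removed atomically.
fn gc_candidates(segments: &[SegmentFiles], live_ids: &HashSet<u64>) -> Vec<String> {
    let mut out = Vec::new();
    for s in segments {
        if !live_ids.contains(&s.id) {
            out.push(s.data.clone());
            if let Some(d) = &s.delete {
                out.push(d.clone()); // sidecar goes with its data file
            }
        }
    }
    out
}

fn main() {
    let segments = vec![
        SegmentFiles {
            id: 1,
            data: "/sst/L0/1.parquet".into(),
            delete: Some("/sst/L0/1.delete.parquet".into()),
        },
        SegmentFiles { id: 2, data: "/sst/L0/2.parquet".into(), delete: None },
    ];
    let live: HashSet<u64> = [2].into_iter().collect();
    let doomed = gc_candidates(&segments, &live);
    assert_eq!(
        doomed,
        vec!["/sst/L0/1.parquet".to_string(), "/sst/L0/1.delete.parquet".to_string()]
    );
    println!("ok");
}
```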
- Three-file layout (data + MVCC sidecar + delete sidecar): more files increase I/O, and Parquet column pruning already handles skipping _commit_ts.
- Embed tombstones in the data file as null rows: wastes space, forces nullable schemas, and complicates scans.
- Store MVCC in Parquet metadata: lacks per-row fidelity and complicates streaming reads.
- Tombstones in WAL only: makes SST reconstruction expensive and breaks crash recovery.
Iceberg has three delete mechanisms:
| Type | How it works | Pros | Cons |
|---|---|---|---|
| Position deletes | file_path + row position | Fast read (exact location) | Requires knowing physical position; deprecated in v3 |
| Equality deletes | Column values that identify rows | Write-friendly (no position needed) | Query penalty—must scan and filter |
| Deletion vectors (v3) | Bitmap per data file | Compact, fast read | Requires file-level tracking |
Tonbo's key-only delete sidecar is similar to Iceberg's equality deletes—identifying rows by key values rather than physical position. Iceberg's experience shows equality deletes accumulate and hurt read performance until merged, which informed our decision to track tombstone watermarks for pruning.
RocksDB takes a different approach:
- Point tombstones: Stored inline with data in SST files (key + sequence number)
- Range tombstones: Dedicated meta-block within each SST file
- Compaction: Tombstones drop only at bottom level when no snapshot references them
- Trigger heuristics: Compaction triggered when tombstone ratio exceeds 50%
RocksDB's inline storage avoids extra files but complicates the data format. Their range tombstone support is something Tonbo currently lacks.
Tonbo's current design optimizes for:
- Simplicity: Separate file is easier to reason about than inline storage
- Write path: No need to know physical row positions (unlike position deletes)
- Schema purity: User data files remain uncontaminated by tombstone markers
Known limitations for future consideration:
- File proliferation: Every segment with tombstones creates an extra file; deletion vectors (bitmaps) would be more compact
- No range deletes: Point-delete-only; range tombstone support may be needed for bulk delete workloads
- Watermark semantics: Tombstone retention must coordinate with snapshot/reader registry to avoid premature pruning