This implementation uses SQLite as the durable WAL and as the local metadata store. The goal is to:
- minimize custom file formats
- keep memory bounded under load
- simplify crash recovery (SQLite transactions)
This document specifies the intended schema and the invariants it must uphold.
Set these at startup (configurable):
- PRAGMA journal_mode = WAL;
- PRAGMA synchronous = FULL; (safe default; allow NORMAL for benchmarks)
- PRAGMA foreign_keys = ON;
- PRAGMA temp_store = MEMORY; (optional; benchmark)
- PRAGMA busy_timeout = 5000; (avoid immediate SQLITE_BUSY)
Bound memory usage:
- PRAGMA cache_size = -NNNN; (a negative value is a size in KiB rather than a page count; choose based on RAM budget)
- consider PRAGMA mmap_size = ...; only if you understand the memory tradeoff
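A minimal sketch of applying the startup settings above, using Python's standard sqlite3 module purely for illustration. The function name open_db, its parameters, and the default cache budget are hypothetical and not taken from the implementation.

```python
import sqlite3


def open_db(path: str, cache_kib: int = 65536, synchronous: str = "FULL") -> sqlite3.Connection:
    """Open a connection and apply the startup PRAGMAs listed above.

    cache_kib is an illustrative RAM budget in KiB, not a recommendation.
    """
    if synchronous not in ("FULL", "NORMAL"):
        raise ValueError("synchronous must be FULL or NORMAL")
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA journal_mode = WAL")
    conn.execute(f"PRAGMA synchronous = {synchronous}")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.execute("PRAGMA temp_store = MEMORY")
    conn.execute("PRAGMA busy_timeout = 5000")
    # A negative cache_size is interpreted as KiB instead of pages.
    conn.execute(f"PRAGMA cache_size = -{int(cache_kib)}")
    return conn
```

Most of these PRAGMAs are per-connection settings, so a helper like this would run for every connection opened, not only once per database file.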
Tracks migrations.
Columns:
version INTEGER NOT NULL
Invariant:
- exactly one row
One row per stream (up to 1,000,000 streams).
Columns (suggested):
- stream TEXT PRIMARY KEY
- created_at_ms INTEGER NOT NULL
- updated_at_ms INTEGER NOT NULL
Offsets / progress:
- next_offset INTEGER NOT NULL
  Next offset to assign on append. Must be monotonic.
- sealed_through INTEGER NOT NULL
  Highest offset included in locally built segments (may not be uploaded yet).
- uploaded_through INTEGER NOT NULL
  Highest offset made visible/durable in R2 via manifest upload.
- uploaded_segment_count INTEGER NOT NULL
  Count of contiguous uploaded segments (prefix), used to build manifests without scanning all segments.
WAL backlog counters (to avoid expensive SUM(length(payload)) scans):
- pending_rows INTEGER NOT NULL
- pending_bytes INTEGER NOT NULL
Segmenting hints:
- last_segment_cut_ms INTEGER NOT NULL
- segment_in_progress INTEGER NOT NULL (0/1)
Additional columns present in the current implementation:
- Stream protocol/config state: content_type, profile, stream_seq, closed, closed_producer_id, closed_producer_epoch, closed_producer_seq, ttl_seconds
- Retention/flags: expires_at_ms, stream_flags
- WAL accounting: logical_size_bytes, wal_rows, wal_bytes
Indexes:
- CREATE INDEX streams_pending_bytes_idx ON streams(pending_bytes);
- CREATE INDEX streams_last_cut_idx ON streams(last_segment_cut_ms);
- Optional composite index for candidate selection: (segment_in_progress, pending_bytes, last_segment_cut_ms)
Invariants:
- uploaded_through <= sealed_through < next_offset for non-empty streams; new empty streams start with sealed_through=-1, uploaded_through=-1, and next_offset=0
- 0 <= uploaded_segment_count <= segment_count (see stream_segment_meta)
- profile IS NULL or profile='generic' is treated as a generic stream; current stream creation stores generic
- pending_bytes and pending_rows reflect unsealed WAL rows with offset > sealed_through; sealing a segment decrements these counters
- logical_size_bytes is the logical payload-byte size exposed by /_details; it is updated on append, restored from manifests for published history, and can be repaired asynchronously after bootstrap if missing.
- segment_in_progress must be 0/1.
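The core columns and some of the invariants above can be sketched as DDL. This is an illustration under stated assumptions, not the schema in src/db/schema.ts: the CHECK constraints and most DEFAULT values are additions for the example (the source states the invariants but not how they are enforced), and the additional implementation columns are omitted.

```python
import sqlite3

# Hypothetical DDL: CHECK constraints and defaults are assumptions for illustration.
STREAMS_DDL = """
CREATE TABLE streams (
  stream                 TEXT PRIMARY KEY,
  created_at_ms          INTEGER NOT NULL,
  updated_at_ms          INTEGER NOT NULL,
  next_offset            INTEGER NOT NULL DEFAULT 0,
  sealed_through         INTEGER NOT NULL DEFAULT -1,
  uploaded_through       INTEGER NOT NULL DEFAULT -1,
  uploaded_segment_count INTEGER NOT NULL DEFAULT 0,
  pending_rows           INTEGER NOT NULL DEFAULT 0,
  pending_bytes          INTEGER NOT NULL DEFAULT 0,
  last_segment_cut_ms    INTEGER NOT NULL DEFAULT 0,
  segment_in_progress    INTEGER NOT NULL DEFAULT 0
    CHECK (segment_in_progress IN (0, 1)),
  -- uploaded_through <= sealed_through < next_offset (also holds for -1, -1, 0)
  CHECK (uploaded_through <= sealed_through AND sealed_through < next_offset),
  CHECK (uploaded_segment_count >= 0 AND pending_rows >= 0 AND pending_bytes >= 0)
);
CREATE INDEX streams_pending_bytes_idx ON streams(pending_bytes);
CREATE INDEX streams_last_cut_idx ON streams(last_segment_cut_ms);
"""


def create_streams_table(conn: sqlite3.Connection) -> None:
    """Create the sketch table and its two documented indexes."""
    conn.executescript(STREAMS_DDL)
```

With constraints like these, an update that moves sealed_through past next_offset fails with sqlite3.IntegrityError instead of silently corrupting progress state. Whether that belongs in the schema or only in tests is a design choice this document does not settle.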
The durable write-ahead log is a single table.
Columns (suggested):
- id INTEGER PRIMARY KEY (rowid; insertion order)
- stream TEXT NOT NULL
- offset INTEGER NOT NULL
- ts_ms INTEGER NOT NULL (ingest time)
- payload BLOB NOT NULL
- payload_len INTEGER NOT NULL (denormalization for fast sums)
Optional columns (only if needed by protocol/indexing):
- routing_key BLOB NULL
- content_type TEXT NULL
- flags INTEGER NOT NULL DEFAULT 0
Indexes:
- CREATE UNIQUE INDEX wal_stream_offset_uniq ON wal(stream, offset);
- Optional for time-based ops: CREATE INDEX wal_ts_idx ON wal(ts_ms);
Invariants:
- for each stream, offset is unique and strictly increasing by protocol rules.
- rows exist for offsets in [uploaded_through, next_offset) unless GC has occurred.
Notes:
- Do not rely on SQLite rowid as the protocol offset. Store the protocol offset explicitly.
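To illustrate why the offset is stored explicitly, the sketch below builds the suggested WAL table and interleaves two streams. The helper name demo_wal and the sample rows are hypothetical. The rowid (id) reflects global insertion order, while offset is per stream and is protected by the unique index.

```python
import sqlite3

WAL_DDL = """
CREATE TABLE wal (
  id          INTEGER PRIMARY KEY,   -- rowid; global insertion order
  stream      TEXT    NOT NULL,
  offset      INTEGER NOT NULL,      -- protocol offset, per stream
  ts_ms       INTEGER NOT NULL,
  payload     BLOB    NOT NULL,
  payload_len INTEGER NOT NULL
);
CREATE UNIQUE INDEX wal_stream_offset_uniq ON wal(stream, offset);
"""


def demo_wal() -> sqlite3.Connection:
    """Return an in-memory DB with interleaved appends to streams 'a' and 'b'."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(WAL_DDL)
    rows = [("a", 0, b"x"), ("b", 0, b"yy"), ("a", 1, b"zzz")]
    with conn:  # single transaction for the three inserts
        conn.executemany(
            "INSERT INTO wal(stream, offset, ts_ms, payload, payload_len) "
            "VALUES (?, ?, 0, ?, ?)",
            [(s, off, p, len(p)) for s, off, p in rows],
        )
    return conn
```

In this example stream a holds offsets 0 and 1 under ids 1 and 3, so the two sequences already diverge after three rows. A second insert of (a, 1) is rejected by wal_stream_offset_uniq.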
Tracks locally built segments and upload state.
Columns:
- segment_id TEXT PRIMARY KEY (stable identifier; matches object key naming rules)
- stream TEXT NOT NULL
- start_offset INTEGER NOT NULL
- end_offset INTEGER NOT NULL
- size_bytes INTEGER NOT NULL
- local_path TEXT NOT NULL
- created_at_ms INTEGER NOT NULL
- uploaded_at_ms INTEGER NULL
- r2_etag TEXT NULL
Indexes:
- CREATE INDEX segments_stream_start_idx ON segments(stream, start_offset);
- CREATE INDEX segments_pending_upload_idx ON segments(uploaded_at_ms);
Invariants:
- start_offset < end_offset
- segments for a stream must not overlap
- a segment may exist locally without being uploaded; visibility is governed by manifest.
Compact, append‑only per‑segment arrays used to build manifests without scanning the
entire segments table. Derived state; can be rebuilt from segments.
Columns:
- stream TEXT PRIMARY KEY
- segment_count INTEGER NOT NULL
- segment_offsets BLOB NOT NULL (u64le end_offset+1 array; length = 8*segment_count)
- segment_blocks BLOB NOT NULL (u32le block_count array; length = 4*segment_count)
- segment_last_ts BLOB NOT NULL (u64le append_ns array; length = 8*segment_count)
Invariants:
- arrays are append‑only (no rewrites on seal)
- lengths match segment_count
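The packed-array layout can be sketched with Python's struct module. The function names and the dict representation of a row are hypothetical; only the element encodings (u64le end_offset+1, u32le block_count, u64le append_ns) come from the column descriptions above.

```python
import struct


def empty_meta() -> dict:
    """A row of the meta table for a stream with no segments yet."""
    return {"segment_count": 0, "segment_offsets": b"",
            "segment_blocks": b"", "segment_last_ts": b""}


def append_segment_meta(meta: dict, end_offset: int, block_count: int, append_ns: int) -> dict:
    """Append one segment's entries; existing bytes are never rewritten."""
    return {
        "segment_count": meta["segment_count"] + 1,
        "segment_offsets": meta["segment_offsets"] + struct.pack("<Q", end_offset + 1),
        "segment_blocks": meta["segment_blocks"] + struct.pack("<I", block_count),
        "segment_last_ts": meta["segment_last_ts"] + struct.pack("<Q", append_ns),
    }


def lengths_match(meta: dict) -> bool:
    """Check the invariant: each array length is element size * segment_count."""
    n = meta["segment_count"]
    return (len(meta["segment_offsets"]) == 8 * n
            and len(meta["segment_blocks"]) == 4 * n
            and len(meta["segment_last_ts"]) == 8 * n)


def decode_offsets(blob: bytes) -> list:
    """Decode the u64le array of end_offset+1 values."""
    return [v[0] for v in struct.iter_unpack("<Q", blob)]
```

Because each entry has a fixed width, the entry for segment i sits at byte offset 8*i (or 4*i for block counts). That fixed layout is what lets a manifest builder read a prefix of segments without touching the segments table.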
Tracks current manifest generation per stream and upload state.
Columns:
- stream TEXT PRIMARY KEY
- generation INTEGER NOT NULL
- uploaded_generation INTEGER NOT NULL
- last_uploaded_at_ms INTEGER NULL
- last_uploaded_etag TEXT NULL
- last_uploaded_size_bytes INTEGER NULL
Invariants:
- uploaded_generation <= generation
- manifest upload is the “commit point” that advances uploaded_through
Local cache of per‑stream index state. Rebuildable from manifest.
Columns:
- stream TEXT PRIMARY KEY
- index_secret BLOB NOT NULL (16 bytes; SipHash key)
- indexed_through INTEGER NOT NULL (highest segment index exclusive)
- updated_at_ms INTEGER NOT NULL
Invariants:
indexed_through <= segment_count
Local catalog of active index runs. Rebuildable from manifest.
Columns:
- run_id TEXT PRIMARY KEY
- stream TEXT NOT NULL
- level INTEGER NOT NULL
- start_segment INTEGER NOT NULL
- end_segment INTEGER NOT NULL
- object_key TEXT NOT NULL
- size_bytes INTEGER NOT NULL
- filter_len INTEGER NOT NULL
- record_count INTEGER NOT NULL
- retired_gen INTEGER NULL
- retired_at_ms INTEGER NULL
Indexes:
CREATE INDEX index_runs_stream_idx ON index_runs(stream, level, start_segment);
Local cache of the internal exact-match secondary index family. Rebuildable from manifest.
Columns:
- stream TEXT NOT NULL
- index_name TEXT NOT NULL
- index_secret BLOB NOT NULL
- indexed_through INTEGER NOT NULL
- updated_at_ms INTEGER NOT NULL
Primary key:
(stream, index_name)
Local catalog of the internal exact-match secondary index family runs. Rebuildable from manifest.
Columns:
- run_id TEXT PRIMARY KEY
- stream TEXT NOT NULL
- index_name TEXT NOT NULL
- level INTEGER NOT NULL
- start_segment INTEGER NOT NULL
- end_segment INTEGER NOT NULL
- object_key TEXT NOT NULL
- size_bytes INTEGER NOT NULL
- filter_len INTEGER NOT NULL
- record_count INTEGER NOT NULL
- retired_gen INTEGER NULL
- retired_at_ms INTEGER NULL
Indexes:
CREATE INDEX secondary_index_runs_stream_idx ON secondary_index_runs(stream, index_name, level, start_segment);
Current implementation table (see src/db/schema.ts):
- stream TEXT PRIMARY KEY
- schema_json TEXT NOT NULL
- updated_at_ms INTEGER NOT NULL
- uploaded_size_bytes INTEGER NOT NULL
schema_json stores the serialized per-stream schema registry JSON (schema versions,
lenses, routingKey config, and schema-owned search declarations).
Per-stream desired bundled companion plan. Rebuildable from manifest.
Columns:
- stream TEXT PRIMARY KEY
- generation INTEGER NOT NULL
- plan_hash TEXT NOT NULL
- plan_json TEXT NOT NULL
- updated_at_ms INTEGER NOT NULL
Notes:
- this is the durable local record of which bundled companion generation the stream currently wants
- plan_json stores the enabled families plus plan-relative field, rollup, interval, and measure ordinals used by the current PSCIX2 companion format
- schema or profile changes that affect bundled sections increment the desired generation
Local catalog of current uploaded bundled .cix companion objects.
Rebuildable from manifest.
Columns:
- stream TEXT NOT NULL
- segment_index INTEGER NOT NULL
- object_key TEXT NOT NULL
- plan_generation INTEGER NOT NULL
- sections_json TEXT NOT NULL
- section_sizes_json TEXT NOT NULL
- size_bytes INTEGER NOT NULL
- primary_timestamp_min_ms INTEGER NULL
- primary_timestamp_max_ms INTEGER NULL
- updated_at_ms INTEGER NOT NULL
Primary key:
(stream, segment_index)
Indexes:
CREATE INDEX search_segment_companions_stream_plan_idx ON search_segment_companions(stream, plan_generation, segment_index);
Notes:
- this is a local object catalog, not a row-level search projection
- each row points at one immutable bundled PSCIX2 companion object
- sections_json records which bundled sections are present, such as col, fts, agg, and mblk
- section_sizes_json records the byte size of each binary bundled section payload that is present
- primary_timestamp_min_ms / primary_timestamp_max_ms store the bundled companion's covered bounds for the stream's primary timestamp field when that field is available; aggregate queries use these local values to skip non-overlapping published segments without fetching any companion object
- companions are published under streams/<hash>/segments/...cix
Node-local per-stream object-store request accounting used by /_details.
Columns:
- stream_hash TEXT NOT NULL
- artifact TEXT NOT NULL
- op TEXT NOT NULL
- count INTEGER NOT NULL
- bytes INTEGER NOT NULL
- updated_at_ms INTEGER NOT NULL
Primary key:
(stream_hash, artifact, op)
Indexes:
CREATE INDEX objectstore_request_counts_stream_idx ON objectstore_request_counts(stream_hash, updated_at_ms);
Notes:
- this table is local operational accounting, not durable published stream state
- counters are node-local and reflect requests observed through the current object-store wrapper
- request counts are exposed through GET /v1/stream/{name}/_details
Stores non-generic profile configuration.
Columns:
- stream TEXT PRIMARY KEY
- profile_json TEXT NOT NULL
- updated_at_ms INTEGER NOT NULL
Notes:
- streams.profile stores the profile kind for cheap listing/filtering
- stream_profiles.profile_json stores the full JSON config for profiles such as evlog and state-protocol
- missing stored profile metadata means the stream is treated as generic
Rebuildable helper state for touch-enabled state-protocol streams.
Columns:
- stream TEXT PRIMARY KEY
- processed_through INTEGER NOT NULL
- updated_at_ms INTEGER NOT NULL
Notes:
- tracks how far the background state-protocol touch worker has processed the base stream
- rows are created only for touch-enabled state-protocol streams
- the table is rebuildable from stream metadata plus the stream contents
- it is not mirrored to object storage as exact state; bootstrap/restart reseeds it locally
Local idempotence and gap-detection state for producer-aware appends.
Columns:
- stream TEXT NOT NULL
- producer_id TEXT NOT NULL
- epoch INTEGER NOT NULL
- last_seq INTEGER NOT NULL
- updated_at_ms INTEGER NOT NULL
Notes:
- used for Producer-Id / Producer-Epoch / Producer-Seq admission checks
- local-only SQLite state; not mirrored to object storage
- reset by --bootstrap-from-r2
Runtime template registry for touch-enabled state-protocol streams.
Columns:
- stream TEXT NOT NULL
- template_id TEXT NOT NULL
- entity TEXT NOT NULL
- fields_json TEXT NOT NULL
- encodings_json TEXT NOT NULL
- state TEXT NOT NULL
- created_at_ms INTEGER NOT NULL
- last_seen_at_ms INTEGER NOT NULL
- inactivity_ttl_ms INTEGER NOT NULL
- active_from_source_offset INTEGER NOT NULL
- retired_at_ms INTEGER NULL
- retired_reason TEXT NULL
Notes:
- runtime helper state for fine-grained live invalidation
- not part of the stream's durable published history
- not mirrored to object storage
- rebuilt or relearned by runtime traffic after restart/bootstrap
You may delete WAL rows for a stream with offset < uploaded_through only after:
- the corresponding segments are uploaded, AND
- the manifest generation that references them is uploaded successfully.
Implementation pattern:
- in one SQLite transaction:
  - mark segment uploaded
  - update manifest state (uploaded_generation, last etag)
  - advance uploaded_through
  - DELETE FROM wal WHERE stream=? AND offset < ?;
- use PRAGMA wal_checkpoint(TRUNCATE) periodically (configurable)
- avoid aggressive VACUUM on large DBs in the hot path
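The transactional part of this pattern might look like the sketch below. It rests on several assumptions: the manifest-state table is called manifests here (its real name is not given above), the schema is trimmed to the columns the example touches, and commit_upload with its parameters is a hypothetical helper. The caller passes the new uploaded_through explicitly so the sketch does not have to guess how segment bounds map to it.

```python
import sqlite3

# Trimmed, hypothetical schema: only the columns this example needs.
SCHEMA = """
CREATE TABLE streams (stream TEXT PRIMARY KEY, next_offset INTEGER NOT NULL,
  sealed_through INTEGER NOT NULL, uploaded_through INTEGER NOT NULL,
  uploaded_segment_count INTEGER NOT NULL DEFAULT 0);
CREATE TABLE wal (id INTEGER PRIMARY KEY, stream TEXT NOT NULL,
  offset INTEGER NOT NULL, payload BLOB NOT NULL);
CREATE UNIQUE INDEX wal_stream_offset_uniq ON wal(stream, offset);
CREATE TABLE segments (segment_id TEXT PRIMARY KEY, stream TEXT NOT NULL,
  uploaded_at_ms INTEGER NULL, r2_etag TEXT NULL);
CREATE TABLE manifests (stream TEXT PRIMARY KEY, generation INTEGER NOT NULL,
  uploaded_generation INTEGER NOT NULL, last_uploaded_etag TEXT NULL);
"""


def commit_upload(conn, stream, segment_id, segment_etag, generation,
                  manifest_etag, uploaded_through, now_ms):
    """Record a finished upload and GC the WAL in one transaction.

    Returns the number of WAL rows deleted. Any exception rolls back every
    statement, so uploaded_through never advances without the matching state.
    """
    with conn:  # commits on success, rolls back on exception
        conn.execute(
            "UPDATE segments SET uploaded_at_ms = ?, r2_etag = ? "
            "WHERE segment_id = ? AND stream = ?",
            (now_ms, segment_etag, segment_id, stream))
        conn.execute(
            "UPDATE manifests SET uploaded_generation = ?, last_uploaded_etag = ? "
            "WHERE stream = ? AND generation >= ?",
            (generation, manifest_etag, stream, generation))
        cur = conn.execute(
            "UPDATE streams SET uploaded_through = ?, "
            "uploaded_segment_count = uploaded_segment_count + 1 "
            "WHERE stream = ? AND uploaded_through < ? AND ? <= sealed_through",
            (uploaded_through, stream, uploaded_through, uploaded_through))
        if cur.rowcount != 1:
            raise ValueError("uploaded_through must advance and stay <= sealed_through")
        deleted = conn.execute(
            "DELETE FROM wal WHERE stream = ? AND offset < ?",
            (stream, uploaded_through)).rowcount
    return deleted
```

The guard on the streams update keeps uploaded_through <= sealed_through and makes a replayed call fail loudly. A real implementation might prefer to make replays idempotent instead.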
Never do:
- SELECT stream FROM wal GROUP BY stream HAVING SUM(payload_len) > ...; (too expensive)
Do instead:
- maintain streams.pending_bytes / pending_rows counters at append time
- query the streams table for candidates by indexed columns
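A candidate query over the counters could look like the following sketch. The size and age thresholds, the function name segment_candidates, and the index name streams_candidates_idx are hypothetical. The index columns follow the optional composite index suggested for the streams table.

```python
import sqlite3

# Trimmed, hypothetical schema with the optional composite index.
SCHEMA = """
CREATE TABLE streams (
  stream              TEXT PRIMARY KEY,
  pending_rows        INTEGER NOT NULL,
  pending_bytes       INTEGER NOT NULL,
  last_segment_cut_ms INTEGER NOT NULL,
  segment_in_progress INTEGER NOT NULL
);
CREATE INDEX streams_candidates_idx
  ON streams(segment_in_progress, pending_bytes, last_segment_cut_ms);
"""


def segment_candidates(conn, min_bytes, max_age_ms, now_ms, limit=16):
    """Pick streams worth segmenting: enough backlog, or backlog that is old.

    Reads only the streams table; the wal table is never scanned.
    """
    cutoff_ms = now_ms - max_age_ms
    rows = conn.execute(
        """
        SELECT stream FROM streams
        WHERE segment_in_progress = 0
          AND pending_rows > 0
          AND (pending_bytes >= ? OR last_segment_cut_ms <= ?)
        ORDER BY pending_bytes DESC, stream
        LIMIT ?
        """,
        (min_bytes, cutoff_ms, limit),
    )
    return [r[0] for r in rows]
```

The LIMIT bounds the work per scheduling pass regardless of how many streams exist, which matters at the stated scale of up to 1,000,000 streams.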
Within one transaction:
- create stream row if missing
- reserve offsets (advance next_offset)
- insert WAL rows
- update pending counters
- commit
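The append steps above can be sketched as one transaction. This assumes a single writer per database, a trimmed schema, and a hypothetical helper named append; producer checks and the optional WAL columns are left out.

```python
import sqlite3

# Trimmed, hypothetical schema: only the columns the append path touches.
SCHEMA = """
CREATE TABLE streams (
  stream        TEXT PRIMARY KEY,
  created_at_ms INTEGER NOT NULL,
  updated_at_ms INTEGER NOT NULL,
  next_offset   INTEGER NOT NULL DEFAULT 0,
  pending_rows  INTEGER NOT NULL DEFAULT 0,
  pending_bytes INTEGER NOT NULL DEFAULT 0
);
CREATE TABLE wal (
  id INTEGER PRIMARY KEY, stream TEXT NOT NULL, offset INTEGER NOT NULL,
  ts_ms INTEGER NOT NULL, payload BLOB NOT NULL, payload_len INTEGER NOT NULL
);
CREATE UNIQUE INDEX wal_stream_offset_uniq ON wal(stream, offset);
"""


def append(conn, stream, payloads, now_ms):
    """Append payloads to a stream and return the first assigned offset."""
    with conn:  # one transaction: commit on success, roll back on error
        # 1. create stream row if missing
        conn.execute(
            "INSERT OR IGNORE INTO streams(stream, created_at_ms, updated_at_ms) "
            "VALUES (?, ?, ?)", (stream, now_ms, now_ms))
        # 2. reserve offsets
        (first,) = conn.execute(
            "SELECT next_offset FROM streams WHERE stream = ?", (stream,)).fetchone()
        # 3. insert WAL rows with explicit protocol offsets
        conn.executemany(
            "INSERT INTO wal(stream, offset, ts_ms, payload, payload_len) "
            "VALUES (?, ?, ?, ?, ?)",
            [(stream, first + i, now_ms, p, len(p)) for i, p in enumerate(payloads)])
        # 4. advance next_offset and update pending counters
        conn.execute(
            "UPDATE streams SET next_offset = ?, pending_rows = pending_rows + ?, "
            "pending_bytes = pending_bytes + ?, updated_at_ms = ? WHERE stream = ?",
            (first + len(payloads), len(payloads),
             sum(len(p) for p in payloads), now_ms, stream))
    return first
```

Because the counters change in the same transaction as the WAL inserts, a crash leaves either both or neither applied. That is the property that makes the counters safe to use in place of SUM scans.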
After building segment file:
- insert segment row
- advance sealed_through (or other marker)
- decrement pending counters
- clear segment_in_progress
- commit
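A sketch of the seal step follows. It assumes end_offset is inclusive, so sealed_through becomes end_offset and the segment starts at the previous sealed_through + 1; the source does not state this outright. The helper name seal_segment and the trimmed schema are hypothetical.

```python
import sqlite3

# Trimmed, hypothetical schema: only the columns the seal path touches.
SCHEMA = """
CREATE TABLE streams (stream TEXT PRIMARY KEY, next_offset INTEGER NOT NULL,
  sealed_through INTEGER NOT NULL, pending_rows INTEGER NOT NULL,
  pending_bytes INTEGER NOT NULL, last_segment_cut_ms INTEGER NOT NULL,
  segment_in_progress INTEGER NOT NULL);
CREATE TABLE wal (stream TEXT NOT NULL, offset INTEGER NOT NULL,
  payload_len INTEGER NOT NULL);
CREATE UNIQUE INDEX wal_stream_offset_uniq ON wal(stream, offset);
CREATE TABLE segments (segment_id TEXT PRIMARY KEY, stream TEXT NOT NULL,
  start_offset INTEGER NOT NULL, end_offset INTEGER NOT NULL,
  size_bytes INTEGER NOT NULL, local_path TEXT NOT NULL,
  created_at_ms INTEGER NOT NULL, uploaded_at_ms INTEGER NULL, r2_etag TEXT NULL);
"""


def seal_segment(conn, stream, segment_id, end_offset, size_bytes, local_path, now_ms):
    """Record a built segment and move its rows out of the pending counters.

    Assumes end_offset is inclusive. Returns (rows, bytes) removed from pending.
    """
    with conn:
        (sealed,) = conn.execute(
            "SELECT sealed_through FROM streams WHERE stream = ?", (stream,)).fetchone()
        # Bounded range scan over this segment's rows only, via (stream, offset).
        rows, nbytes = conn.execute(
            "SELECT COUNT(*), COALESCE(SUM(payload_len), 0) FROM wal "
            "WHERE stream = ? AND offset > ? AND offset <= ?",
            (stream, sealed, end_offset)).fetchone()
        if rows == 0:
            raise ValueError("nothing to seal")
        conn.execute(
            "INSERT INTO segments(segment_id, stream, start_offset, end_offset, "
            "size_bytes, local_path, created_at_ms) VALUES (?, ?, ?, ?, ?, ?, ?)",
            (segment_id, stream, sealed + 1, end_offset, size_bytes, local_path, now_ms))
        conn.execute(
            "UPDATE streams SET sealed_through = ?, pending_rows = pending_rows - ?, "
            "pending_bytes = pending_bytes - ?, segment_in_progress = 0, "
            "last_segment_cut_ms = ? WHERE stream = ?",
            (end_offset, rows, nbytes, now_ms, stream))
    return rows, nbytes
```

The segment file is built before this transaction runs. If the process crashes in between, the file exists without a segments row and sealed_through is unchanged, so the segment can simply be rebuilt.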
After upload success:
- mark segment uploaded
- update manifest state
- advance uploaded_through
- delete WAL rows below uploaded_through
- commit
Add a test module that:
- seeds random operations (append, segment, upload, crash simulation)
- checks invariants after each step
- ensures recovery logic restores invariant satisfaction
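One possible shape for such a test is sketched below. It models a single stream with a trimmed schema, and a crash is simulated by raising just before commit so SQLite rolls the transaction back. The names step, check_invariants, and run, the operation mix, and the crash probability are all hypothetical. Recovery here is only SQLite's rollback; a fuller test would also reopen the database file and exercise the implementation's own recovery path.

```python
import random
import sqlite3

SCHEMA = """
CREATE TABLE streams (stream TEXT PRIMARY KEY, next_offset INTEGER NOT NULL,
  sealed_through INTEGER NOT NULL, uploaded_through INTEGER NOT NULL,
  pending_rows INTEGER NOT NULL, pending_bytes INTEGER NOT NULL);
CREATE TABLE wal (stream TEXT NOT NULL, offset INTEGER NOT NULL,
  payload_len INTEGER NOT NULL);
CREATE UNIQUE INDEX wal_stream_offset_uniq ON wal(stream, offset);
"""


class Crash(Exception):
    """Simulated process crash raised before commit."""


def step(conn, rng, crash):
    """Run one random operation in a transaction; optionally crash before commit."""
    with conn:
        nxt, sealed, uploaded = conn.execute(
            "SELECT next_offset, sealed_through, uploaded_through FROM streams").fetchone()
        op = rng.choice(["append", "seal", "upload"])
        if op == "append":
            sizes = [rng.randint(1, 64) for _ in range(rng.randint(1, 5))]
            conn.executemany("INSERT INTO wal VALUES ('s', ?, ?)",
                             [(nxt + i, sz) for i, sz in enumerate(sizes)])
            conn.execute(
                "UPDATE streams SET next_offset = next_offset + ?, "
                "pending_rows = pending_rows + ?, pending_bytes = pending_bytes + ?",
                (len(sizes), len(sizes), sum(sizes)))
        elif op == "seal" and sealed < nxt - 1:
            new = rng.randint(sealed + 1, nxt - 1)
            rows, nbytes = conn.execute(
                "SELECT COUNT(*), COALESCE(SUM(payload_len), 0) FROM wal "
                "WHERE offset > ? AND offset <= ?", (sealed, new)).fetchone()
            conn.execute(
                "UPDATE streams SET sealed_through = ?, pending_rows = pending_rows - ?, "
                "pending_bytes = pending_bytes - ?", (new, rows, nbytes))
        elif op == "upload" and uploaded < sealed:
            new = rng.randint(uploaded + 1, sealed)
            conn.execute("UPDATE streams SET uploaded_through = ?", (new,))
            conn.execute("DELETE FROM wal WHERE stream = 's' AND offset < ?", (new,))
        if crash:
            raise Crash()  # leaving the with-block via exception rolls back


def check_invariants(conn):
    nxt, sealed, uploaded, prows, pbytes = conn.execute(
        "SELECT next_offset, sealed_through, uploaded_through, "
        "pending_rows, pending_bytes FROM streams").fetchone()
    assert uploaded <= sealed < nxt
    actual = conn.execute(
        "SELECT COUNT(*), COALESCE(SUM(payload_len), 0) FROM wal WHERE offset > ?",
        (sealed,)).fetchone()
    assert (prows, pbytes) == actual
    offsets = [r[0] for r in conn.execute("SELECT offset FROM wal ORDER BY offset")]
    assert offsets == list(range(max(uploaded, 0), nxt))


def run(seed, steps=200):
    """Apply random operations and crashes; return how many crashes were simulated."""
    rng = random.Random(seed)
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    with conn:
        conn.execute("INSERT INTO streams VALUES ('s', 0, -1, -1, 0, 0)")
    crashes = 0
    for _ in range(steps):
        try:
            step(conn, rng, crash=rng.random() < 0.2)
        except Crash:
            crashes += 1
        check_invariants(conn)
    return crashes
```

Seeding the generator makes any failing sequence reproducible: rerunning run with the same seed replays the same operations and crash points.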