CDB64 Root TX Index: Automated Export, Serving, and Compaction
Problem
Currently, CDB64 root TX index files must be manually generated and distributed. Operators who want to share their indexed data with other nodes have no automated way to:
- Export - Periodically generate CDB64 files from the local SQLite database
- Serve - Make CDB64 files available for other nodes to download
- Compact - Merge multiple smaller CDB64 files into larger ones to reduce query overhead
This limits the ability to create a distributed network of index sharing between AR.IO nodes.
Proposed Solution
1. HTTP Endpoint for CDB64 Distribution
Expose the existing data/cdb64-root-tx-index/ directory via HTTP, similar to the experimental datasets endpoint:
GET /local/cdb64-root-tx-index/ # Directory listing
GET /local/cdb64-root-tx-index/*.cdb64 # File download
Implementation:
- Add route using serve-index + express.static (same pattern as datasets)
- Reuse the existing directory that Cdb64RootTxIndex already watches
- Add configuration: ENABLE_CDB64_ENDPOINT=true/false (default: false)
- Include appropriate cache headers
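A minimal wiring sketch (the registration helper name and cache settings are illustrative; the route path and env flag are the ones proposed above):
import express from 'express';
import serveIndex from 'serve-index';

const CDB64_DIR = 'data/cdb64-root-tx-index';

// Hypothetical registration helper, not the final implementation.
export function registerCdb64Routes(app: express.Express): void {
  if (process.env.ENABLE_CDB64_ENDPOINT !== 'true') return;

  app.use(
    '/local/cdb64-root-tx-index',
    // Serve .cdb64 files directly; exported files never change after the
    // atomic rename, so long-lived caching is safe.
    express.static(CDB64_DIR, { maxAge: '1d', immutable: true }),
    // Fall through to a directory listing when no file matches.
    serveIndex(CDB64_DIR, { icons: false }),
  );
}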
Benefits:
- Files served are the same files loaded for lookups
- File watching already handles hot-reload when files change
- Simple static file serving, no custom logic needed
2. Automated CDB64 Export Worker
A background worker that periodically exports the root TX index from SQLite to CDB64:
Configuration:
CDB64_EXPORT_ENABLED=true/false # Default: false
CDB64_EXPORT_INTERVAL_BLOCKS=10000 # Export every N blocks
CDB64_EXPORT_INTERVAL_MS=86400000 # OR export on time interval (daily)
CDB64_EXPORT_MIN_RECORDS=1000 # Minimum new records before export
Behavior:
- Export to data/cdb64-root-tx-index/root-tx-index-{blockHeight}.cdb64
- Use atomic write (temp file + rename) for crash safety
- Log export duration and record count
- Trigger compaction check after export (if enabled)
Implementation:
- Leverage existing tools/lib/export-sqlite-to-cdb64.ts logic
- Create new worker in src/workers/cdb64-export-worker.ts
- Query current block height for filename versioning
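A sketch of a single export pass, assuming hypothetical helpers (getCurrentBlockHeight, countNewRecords, and exportSqliteToCdb64 standing in for the tools/lib/export-sqlite-to-cdb64.ts logic):
import fs from 'node:fs/promises';
import path from 'node:path';

// Assumed helpers, declared for illustration only.
declare function getCurrentBlockHeight(): Promise<number>;
declare function countNewRecords(): Promise<number>;
declare function exportSqliteToCdb64(outPath: string): Promise<number>;

const OUTPUT_DIR = 'data/cdb64-root-tx-index';
const MIN_RECORDS = Number(process.env.CDB64_EXPORT_MIN_RECORDS ?? 1000);

async function runExport(): Promise<void> {
  // Skip the pass if too few new records have accumulated
  if ((await countNewRecords()) < MIN_RECORDS) return;

  const height = await getCurrentBlockHeight();
  const finalPath = path.join(OUTPUT_DIR, `root-tx-index-${height}.cdb64`);
  const tempPath = `${finalPath}.tmp`;

  const start = Date.now();
  const records = await exportSqliteToCdb64(tempPath);
  // Atomic rename: the file watcher only ever sees complete files
  await fs.rename(tempPath, finalPath);
  console.log(`exported ${records} records to ${finalPath} in ${Date.now() - start}ms`);
}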
3. CDB64 Compaction (Merge)
Merge multiple smaller CDB64 files into fewer larger files to reduce lookup overhead.
Why compaction matters:
- Cdb64RootTxIndex searches files in alphabetical order; the first match wins
- With many small files, lookups may need to check multiple files
- A single large file provides O(1) lookup with no file iteration
Configuration:
CDB64_COMPACT_ENABLED=true/false # Default: false
CDB64_COMPACT_THRESHOLD_FILES=5 # Compact when >N files exist
CDB64_COMPACT_STRATEGY=auto/memory/external # Default: auto
CDB64_COMPACT_MEMORY_THRESHOLD=10000000 # Switch to external sort above N keys
Two merge strategies:
Strategy A: In-Memory Deduplication
Best for datasets <10M keys (~600MB memory for dedup set)
Read all files → Track seen keys in Set → Write deduplicated output
- Simple implementation
- Memory usage: O(unique keys) for dedup tracking + O(records) for writer
- Suitable for most gateway use cases
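A minimal sketch of Strategy A, assuming reader/writer shapes close to those described in this issue (entries() and finalize() are mentioned below; openCdb64, add, and close are assumptions):
// Assumed minimal interfaces; entries() and finalize() appear in this
// issue, the rest is hypothetical.
interface Cdb64ReaderLike {
  entries(): AsyncGenerator<{ key: Buffer; value: Buffer }>;
  close(): Promise<void>;
}
interface Cdb64WriterLike {
  add(key: Buffer, value: Buffer): Promise<void>;
  finalize(): Promise<void>;
}
declare function openCdb64(path: string): Promise<Cdb64ReaderLike>;

async function mergeInMemory(inputPaths: string[], writer: Cdb64WriterLike): Promise<void> {
  const seen = new Set<string>();
  // Alphabetical order preserves the first-match-wins lookup semantics
  for (const p of [...inputPaths].sort()) {
    const reader = await openCdb64(p);
    for await (const { key, value } of reader.entries()) {
      const hex = key.toString('hex');
      if (seen.has(hex)) continue; // An earlier file already provided this key
      seen.add(hex);
      await writer.add(key, value); // Values stay opaque; no decoding needed
    }
    await reader.close();
  }
  await writer.finalize();
}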
Strategy B: External Sort
For massive datasets (100M+ keys)
1. Extract: Stream each file → sorted temp chunks
2. Merge: K-way merge sorted chunks with dedup
3. Write: Stream to new CDB64
- Memory efficient: only needs buffers for K file handles
- Handles arbitrary scale
- More complex implementation
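A sketch of the merge phase of Strategy B; chunk iterators are assumed to yield entries in ascending key order, with chunk order encoding source-file precedence so that ties still resolve first-match-wins:
type Entry = { key: Buffer; value: Buffer };

// K-way merge with dedup over sorted chunk iterators. On equal keys the
// lowest chunk index wins, so chunks must be ordered by source precedence.
async function* kWayMergeDedup(chunks: AsyncIterator<Entry>[]): AsyncGenerator<Entry> {
  // Load the head entry of every chunk
  const heads: (Entry | null)[] = [];
  for (const c of chunks) {
    const r = await c.next();
    heads.push(r.done ? null : r.value);
  }

  let lastKey: Buffer | null = null;
  while (heads.some((h) => h !== null)) {
    // Find the chunk whose head has the smallest key
    let minIdx = -1;
    for (let i = 0; i < heads.length; i++) {
      const h = heads[i];
      if (h === null) continue;
      if (minIdx === -1 || Buffer.compare(h.key, heads[minIdx]!.key) < 0) minIdx = i;
    }

    const entry = heads[minIdx]!;
    if (lastKey === null || !entry.key.equals(lastKey)) {
      yield entry; // First occurrence of this key wins
      lastKey = entry.key;
    }
    const r = await chunks[minIdx].next();
    heads[minIdx] = r.done ? null : r.value;
  }
}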
Auto strategy selection:
- Estimate key count from file sizes (~90 bytes/record average)
- Use in-memory below threshold, external sort above
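A sketch of the selection heuristic under the assumptions above (~90 bytes per record, threshold from CDB64_COMPACT_MEMORY_THRESHOLD):
import fs from 'node:fs/promises';

const BYTES_PER_RECORD = 90; // Rough average from this proposal

async function chooseStrategy(paths: string[]): Promise<'memory' | 'external'> {
  let totalBytes = 0;
  for (const p of paths) {
    totalBytes += (await fs.stat(p)).size;
  }
  const estimatedKeys = totalBytes / BYTES_PER_RECORD;
  const threshold = Number(process.env.CDB64_COMPACT_MEMORY_THRESHOLD ?? 10_000_000);
  return estimatedKeys < threshold ? 'memory' : 'external';
}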
Naming convention:
Before compaction:
root-tx-index-1450000.cdb64
root-tx-index-1460000.cdb64
root-tx-index-1470000.cdb64
After compaction:
root-tx-index-1450000-1470000.cdb64 # Merged file
Compaction behavior:
- Only delete source files after successful merge + verification
- Verify merged file is readable before cleanup
- Log compaction stats (files merged, keys deduplicated, size reduction)
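A sketch of the verify-then-cleanup rule, assuming the hypothetical openCdb64 reader from the Strategy A sketch; the spot-check count is arbitrary:
import fs from 'node:fs/promises';

declare function openCdb64(path: string): Promise<{
  entries(): AsyncGenerator<{ key: Buffer; value: Buffer }>;
  close(): Promise<void>;
}>;

async function verifyAndCleanup(mergedPath: string, sources: string[]): Promise<void> {
  // Verify the merged file is readable by streaming a sample of entries
  const reader = await openCdb64(mergedPath);
  let sampled = 0;
  for await (const _ of reader.entries()) {
    if (++sampled >= 1000) break; // Spot-check rather than a full scan
  }
  await reader.close();
  if (sampled === 0) throw new Error(`merged file ${mergedPath} has no entries`);

  // Source files are only deleted after verification succeeds
  for (const src of sources) {
    await fs.unlink(src);
  }
  console.log(`compacted ${sources.length} files into ${mergedPath}`);
}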
Technical Details
CDB64 Iteration
The Cdb64Reader.entries() async generator yields key-value pairs in write order (not sorted):
for await (const { key, value } of reader.entries()) {
// Streams through file sequentially
}
This is memory-efficient but means external sort requires actual sorting, not just streaming.
Memory Constraints
Two memory considerations for large merges:
1. Deduplication Set: ~60 bytes per key in Set
   - 1M keys = ~60MB
   - 100M keys = ~6GB
2. Cdb64Writer internals: ~32 bytes per record during finalize()
   - Holds a { hash: bigint, position: bigint } array
   - 100M records = ~3.2GB
Multi-File Lookup Behavior
Current Cdb64RootTxIndex behavior with multiple files:
- Files searched in alphabetical order
- First match wins (no merging of values)
- Compaction preserves this by processing files in alphabetical order
Value Format
MessagePack encoded with compact keys:
{ r: Buffer } // Simple: just root TX ID
{ r: Buffer, i: number, d: number } // Complete: with offsets
Merge treats values as opaque bytes - no interpretation needed.
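An illustration of the two value shapes using the @msgpack/msgpack package (the gateway's actual encoder may differ; the 32-byte ID and offsets below are placeholders):
import { encode, decode } from '@msgpack/msgpack';

const rootTxId = Buffer.alloc(32); // Placeholder 32-byte root TX ID

const simple = encode({ r: rootTxId });                     // Just the root TX ID
const complete = encode({ r: rootTxId, i: 1024, d: 2048 }); // With offsets

// Decoding only happens at lookup time; compaction copies values verbatim.
const parsed = decode(complete) as { r: Uint8Array; i?: number; d?: number };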
Implementation Plan
Phase 1: HTTP Endpoint
- Add route in src/routes/ for /local/cdb64-root-tx-index
- Add ENABLE_CDB64_ENDPOINT config option
- Add to app.ts route registration
- Document in docs/envs.md
Phase 2: Compaction Tool
- Create src/lib/cdb64-merge.ts with merge logic
- Implement in-memory strategy
- Implement external sort strategy
- Add auto-selection based on estimated size
- Create CLI tool tools/compact-cdb64-root-tx-index
- Add tests for merge correctness and deduplication
Phase 3: Export Worker
- Create src/workers/cdb64-export-worker.ts
- Integrate with block import events or timer
- Add configuration options
- Wire into src/system.ts
Phase 4: Compaction Worker (Optional)
- Add automatic compaction triggering after exports
- Configurable thresholds
- Background execution to avoid blocking
Alternatives Considered
- Use datasets endpoint: Initially considered serving via /local/datasets/cdb64/, but serving the existing watched directory directly is simpler and ensures consistency between served and loaded files.
- Delete old files instead of merging: Simpler but loses historical data. Merging preserves all indexed data while reducing file count.
- Bloom filter for dedup: Lower memory than a Set, but not exact: false positives could occasionally misclassify new keys as duplicates. Exact dedup is preferred for index integrity.
Related
- Existing export tool: tools/export-sqlite-to-cdb64
- CDB64 format docs: docs/cdb64-format.md
- CDB64 tools docs: docs/cdb64-tools.md
- Root TX index: src/discovery/cdb64-root-tx-index.ts