
CDB64 Root TX Index: Automated Export, Serving, and Compaction #570


Problem

Currently, CDB64 root TX index files must be manually generated and distributed. Operators who want to share their indexed data with other nodes have no automated way to:

  1. Export - Periodically generate CDB64 files from the local SQLite database
  2. Serve - Make CDB64 files available for other nodes to download
  3. Compact - Merge multiple smaller CDB64 files into larger ones to reduce query overhead

This limits the ability to create a distributed network of index sharing between AR.IO nodes.

Proposed Solution

1. HTTP Endpoint for CDB64 Distribution

Expose the existing data/cdb64-root-tx-index/ directory via HTTP, similar to the experimental datasets endpoint:

GET /local/cdb64-root-tx-index/              # Directory listing
GET /local/cdb64-root-tx-index/*.cdb64       # File download

Implementation:

  • Add route using serve-index + express.static (same pattern as datasets)
  • Reuse the existing directory that Cdb64RootTxIndex already watches
  • Add configuration: ENABLE_CDB64_ENDPOINT=true/false (default: false)
  • Include appropriate cache headers
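
A minimal sketch of the route wiring, assuming the same express.static + serve-index pattern as the datasets endpoint; the router export name and the src/config.ts wiring are illustrative, not the actual implementation:

import express from 'express';
import serveIndex from 'serve-index';

// Illustrative names; the real flag and path would come from src/config.ts.
const CDB64_DIR = 'data/cdb64-root-tx-index';
const ENABLE_CDB64_ENDPOINT = process.env.ENABLE_CDB64_ENDPOINT === 'true';

export const cdb64Router = express.Router();

if (ENABLE_CDB64_ENDPOINT) {
  cdb64Router.use(
    '/local/cdb64-root-tx-index',
    // Serve the .cdb64 files with cache headers; express.static falls through
    // to the next handler when the request is for the directory itself.
    express.static(CDB64_DIR, { maxAge: '1h' }),
    // Plain directory listing so other nodes can discover available files.
    serveIndex(CDB64_DIR, { icons: false }),
  );
}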

Benefits:

  • Files served are the same files loaded for lookups
  • File watching already handles hot-reload when files change
  • Simple static file serving, no custom logic needed

2. Automated CDB64 Export Worker

A background worker that periodically exports the root TX index from SQLite to CDB64:

Configuration:

CDB64_EXPORT_ENABLED=true/false           # Default: false
CDB64_EXPORT_INTERVAL_BLOCKS=10000        # Export every N blocks
CDB64_EXPORT_INTERVAL_MS=86400000         # OR export on time interval (daily)
CDB64_EXPORT_MIN_RECORDS=1000             # Minimum new records before export

Behavior:

  • Export to data/cdb64-root-tx-index/root-tx-index-{blockHeight}.cdb64
  • Use atomic write (temp file + rename) for crash safety
  • Log export duration and record count
  • Trigger compaction check after export (if enabled)

Implementation:

  • Leverage existing tools/lib/export-sqlite-to-cdb64.ts logic
  • Create new worker in src/workers/cdb64-export-worker.ts
  • Query current block height for filename versioning
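
A rough sketch of the export step itself, assuming a helper adapted from the existing tools/lib/export-sqlite-to-cdb64.ts logic; the helper signature and log wording below are assumptions:

import fs from 'node:fs/promises';
import path from 'node:path';

// Assumed shape of a helper adapted from tools/lib/export-sqlite-to-cdb64.ts;
// it writes the full root TX index to outputPath and returns the record count.
declare function exportRootTxIndexToCdb64(outputPath: string): Promise<number>;

const OUTPUT_DIR = 'data/cdb64-root-tx-index';

export async function runExport(currentBlockHeight: number): Promise<void> {
  const finalPath = path.join(
    OUTPUT_DIR,
    `root-tx-index-${currentBlockHeight}.cdb64`,
  );
  // Write to a temp name first so a crash mid-export never leaves a partial
  // .cdb64 where the directory watcher could load it.
  const tempPath = `${finalPath}.tmp`;

  const start = Date.now();
  const recordCount = await exportRootTxIndexToCdb64(tempPath);

  // rename() is atomic on the same filesystem, completing the export in one step.
  await fs.rename(tempPath, finalPath);

  console.log(
    `CDB64 export: ${recordCount} records in ${Date.now() - start}ms -> ${finalPath}`,
  );
}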

3. CDB64 Compaction (Merge)

Merge multiple smaller CDB64 files into fewer larger files to reduce lookup overhead.

Why compaction matters:

  • Cdb64RootTxIndex searches files in alphabetical order, first match wins
  • With many small files, a lookup may have to check several files before finding a match
  • A single large file provides O(1) lookup with no file iteration

Configuration:

CDB64_COMPACT_ENABLED=true/false          # Default: false
CDB64_COMPACT_THRESHOLD_FILES=5           # Compact when >N files exist
CDB64_COMPACT_STRATEGY=auto/memory/external  # Default: auto
CDB64_COMPACT_MEMORY_THRESHOLD=10000000   # Switch to external sort above N keys

Two merge strategies:

Strategy A: In-Memory Deduplication

Best for datasets <10M keys (~600MB memory for dedup set)

Read all files → Track seen keys in Set → Write deduplicated output
  • Simple implementation
  • Memory usage: O(unique keys) for dedup tracking + O(records) for writer
  • Suitable for most gateway use cases
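
A sketch of Strategy A under assumed reader/writer shapes (the entries() iterator matches the Cdb64Reader behavior documented in Technical Details below; the add()/finalize() writer methods are assumptions):

// Assumed minimal shapes for the existing CDB64 reader and writer classes.
interface Cdb64Entry { key: Buffer; value: Buffer }
interface Cdb64ReaderLike { entries(): AsyncGenerator<Cdb64Entry> }
interface Cdb64WriterLike {
  add(key: Buffer, value: Buffer): Promise<void>;
  finalize(): Promise<void>;
}

// readers must be ordered alphabetically by filename so that, like lookups,
// the earliest file wins when the same key appears in several files.
async function mergeInMemory(
  readers: Cdb64ReaderLike[],
  writer: Cdb64WriterLike,
): Promise<void> {
  const seen = new Set<string>(); // roughly 60 bytes of overhead per key

  for (const reader of readers) {
    for await (const { key, value } of reader.entries()) {
      const keyHex = key.toString('hex');
      if (seen.has(keyHex)) continue; // duplicate from a later file; skip it
      seen.add(keyHex);
      await writer.add(key, value); // values are copied as opaque bytes
    }
  }

  await writer.finalize();
}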

Strategy B: External Sort

For massive datasets (100M+ keys)

1. Extract: Stream each file → sorted temp chunks
2. Merge: K-way merge sorted chunks with dedup
3. Write: Stream to new CDB64
  • Memory efficient: only needs buffers for K file handles
  • Handles arbitrary scale
  • More complex implementation
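
A compressed sketch of step 2 (the k-way merge with dedup); the chunk-sorting step is elided, and each temp chunk is assumed to replay its entries in ascending key order. Ties between chunks would need to be broken by source-file priority to preserve first-match-wins semantics; that bookkeeping is omitted here:

interface SortedEntry { key: Buffer; value: Buffer }

// Each sorted temp chunk replays its entries in ascending key order.
type SortedChunk = AsyncIterator<SortedEntry>;

async function kWayMergeDedup(
  chunks: SortedChunk[],
  emit: (entry: SortedEntry) => Promise<void>,
): Promise<void> {
  // One buffered head per chunk, so memory stays O(k) regardless of key count.
  const heads: (SortedEntry | null)[] = [];
  for (const chunk of chunks) {
    const first = await chunk.next();
    heads.push(first.done ? null : first.value);
  }

  let lastEmittedKey: Buffer | null = null;
  for (;;) {
    // Linear scan for the smallest live head; a heap would give O(log k) instead.
    let minIdx = -1;
    for (let i = 0; i < heads.length; i++) {
      const head = heads[i];
      if (
        head !== null &&
        (minIdx === -1 || Buffer.compare(head.key, heads[minIdx]!.key) < 0)
      ) {
        minIdx = i;
      }
    }
    if (minIdx === -1) break; // every chunk is exhausted

    const entry = heads[minIdx]!;
    // Duplicates sort adjacently, so comparing against the last emitted key
    // is enough to deduplicate across all chunks.
    if (lastEmittedKey === null || !entry.key.equals(lastEmittedKey)) {
      await emit(entry);
      lastEmittedKey = entry.key;
    }

    const next = await chunks[minIdx].next();
    heads[minIdx] = next.done ? null : next.value;
  }
}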

Auto strategy selection:

  • Estimate key count from file sizes (~90 bytes/record average)
  • Use in-memory below threshold, external sort above
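
A sketch of that estimate, using the ~90 bytes/record figure above; the default threshold mirrors CDB64_COMPACT_MEMORY_THRESHOLD:

import fs from 'node:fs/promises';

const APPROX_BYTES_PER_RECORD = 90; // rough average from the estimate above

async function chooseStrategy(
  filePaths: string[],
  memoryThresholdKeys = 10_000_000, // CDB64_COMPACT_MEMORY_THRESHOLD
): Promise<'memory' | 'external'> {
  let totalBytes = 0;
  for (const filePath of filePaths) {
    totalBytes += (await fs.stat(filePath)).size;
  }
  const estimatedKeys = Math.floor(totalBytes / APPROX_BYTES_PER_RECORD);
  return estimatedKeys <= memoryThresholdKeys ? 'memory' : 'external';
}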

Naming convention:

Before compaction:
  root-tx-index-1450000.cdb64
  root-tx-index-1460000.cdb64
  root-tx-index-1470000.cdb64

After compaction:
  root-tx-index-1450000-1470000.cdb64  # Merged file
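
A sketch of deriving the merged filename from its inputs, following the single-height and ranged patterns shown above:

// Parses root-tx-index-<start>.cdb64 or root-tx-index-<start>-<end>.cdb64.
const NAME_RE = /^root-tx-index-(\d+)(?:-(\d+))?\.cdb64$/;

function mergedFileName(inputNames: string[]): string {
  let min = Infinity;
  let max = -Infinity;
  for (const name of inputNames) {
    const m = NAME_RE.exec(name);
    if (m === null) continue; // ignore files that don't follow the convention
    const start = Number(m[1]);
    const end = m[2] !== undefined ? Number(m[2]) : start;
    min = Math.min(min, start);
    max = Math.max(max, end);
  }
  return `root-tx-index-${min}-${max}.cdb64`;
}

// mergedFileName(['root-tx-index-1450000.cdb64', 'root-tx-index-1470000.cdb64'])
//   => 'root-tx-index-1450000-1470000.cdb64'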

Compaction behavior:

  • Only delete source files after successful merge + verification
  • Verify merged file is readable before cleanup
  • Log compaction stats (files merged, keys deduplicated, size reduction)
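
A sketch of the verify-then-cleanup step, under an assumed reader shape; the get() method name and the idea of spot-checking sampled keys are assumptions:

import fs from 'node:fs/promises';

interface Cdb64ReaderLike {
  get(key: Buffer): Promise<Buffer | undefined>;
}

// Only remove the inputs once the merged file opens and returns values for a
// handful of keys sampled from the source files.
async function verifyAndCleanup(
  openReader: (path: string) => Promise<Cdb64ReaderLike>,
  mergedPath: string,
  sourcePaths: string[],
  sampleKeys: Buffer[],
): Promise<void> {
  const merged = await openReader(mergedPath);
  for (const key of sampleKeys) {
    if ((await merged.get(key)) === undefined) {
      throw new Error('merged file missing sampled key; keeping source files');
    }
  }
  // Verification passed; the source files are now redundant.
  await Promise.all(sourcePaths.map((p) => fs.unlink(p)));
}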

Technical Details

CDB64 Iteration

The Cdb64Reader.entries() async generator yields key-value pairs in write order (not sorted):

for await (const { key, value } of reader.entries()) {
  // Streams through file sequentially
}

This is memory-efficient but means external sort requires actual sorting, not just streaming.

Memory Constraints

Two memory considerations for large merges:

  1. Deduplication Set: ~60 bytes per key held in the Set
    • 1M keys = ~60MB
    • 100M keys = ~6GB
  2. Cdb64Writer internals: ~32 bytes per record during finalize()
    • Holds a { hash: bigint, position: bigint } array
    • 100M records = ~3.2GB

Multi-File Lookup Behavior

Current Cdb64RootTxIndex behavior with multiple files:

  • Files searched in alphabetical order
  • First match wins (no merging of values)
  • Compaction preserves this by processing files in alphabetical order
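
A sketch of that lookup order under an assumed reader shape (the get() method name is an assumption):

interface Cdb64ReaderLike {
  get(key: Buffer): Promise<Buffer | undefined>;
}

// Readers keyed by filename; the sort makes lookup order match the
// alphabetical order that compaction must also preserve.
async function lookupRootTx(
  readersByFileName: Map<string, Cdb64ReaderLike>,
  key: Buffer,
): Promise<Buffer | undefined> {
  const fileNames = [...readersByFileName.keys()].sort();
  for (const fileName of fileNames) {
    const value = await readersByFileName.get(fileName)!.get(key);
    if (value !== undefined) return value; // first match wins
  }
  return undefined;
}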

Value Format

MessagePack encoded with compact keys:

{ r: Buffer }                           // Simple: just root TX ID
{ r: Buffer, i: number, d: number }     // Complete: with offsets
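
A small round-trip example of these value shapes, using the @msgpack/msgpack package as a stand-in for whichever MessagePack library the codebase actually uses; the offset numbers are illustrative:

import { encode, decode } from '@msgpack/msgpack';

type SimpleValue = { r: Uint8Array };                         // root TX ID only
type CompleteValue = { r: Uint8Array; i: number; d: number }; // with offsets

const rootTxId = Buffer.alloc(32, 1); // placeholder 32-byte TX ID
const value: CompleteValue = { r: rootTxId, i: 1024, d: 512 };
const encoded: Uint8Array = encode(value);

// Lookups decode the value; merge/compaction never needs to and can copy
// the encoded bytes verbatim.
const roundTripped = decode(encoded) as CompleteValue;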

Merge treats values as opaque bytes; no interpretation is needed.

Implementation Plan

Phase 1: HTTP Endpoint

  • Add route in src/routes/ for /local/cdb64-root-tx-index
  • Add ENABLE_CDB64_ENDPOINT config option
  • Add to app.ts route registration
  • Document in docs/envs.md

Phase 2: Compaction Tool

  • Create src/lib/cdb64-merge.ts with merge logic
  • Implement in-memory strategy
  • Implement external sort strategy
  • Add auto-selection based on estimated size
  • Create CLI tool tools/compact-cdb64-root-tx-index
  • Add tests for merge correctness and deduplication

Phase 3: Export Worker

  • Create src/workers/cdb64-export-worker.ts
  • Integrate with block import events or timer
  • Add configuration options
  • Wire into src/system.ts

Phase 4: Compaction Worker (Optional)

  • Add automatic compaction triggering after exports
  • Configurable thresholds
  • Background execution to avoid blocking

Alternatives Considered

  1. Use datasets endpoint: Initially considered serving via /local/datasets/cdb64/, but serving the existing watched directory directly is simpler and ensures consistency between served and loaded files.

  2. Delete old files instead of merging: Simpler but loses historical data. Merging preserves all indexed data while reducing file count.

  3. Bloom filter for dedup: Lower memory than a Set, but false positives would occasionally cause unique keys to be skipped. Exact dedup is preferred for index integrity.

Related

  • Existing export tool: tools/export-sqlite-to-cdb64
  • CDB64 format docs: docs/cdb64-format.md
  • CDB64 tools docs: docs/cdb64-tools.md
  • Root TX index: src/discovery/cdb64-root-tx-index.ts
