Support lazy reading of CDB64 root TX indexes from remote sources #569

@djwhitt

Description

Summary

Add support for reading CDB64 root TX index files lazily from remote sources, enabling gateways to use distributed index files without requiring local storage. This extends the existing Cdb64RootTxIndex to support local files, Arweave TX IDs, and arbitrary HTTP endpoints as index sources.

Background

The current Cdb64RootTxIndex implementation (src/discovery/cdb64-root-tx-index.ts) provides O(1) lookups of data item ID → root TX ID mappings from pre-built CDB64 files stored locally. This works well but requires:

  • Downloading entire index files before use
  • Local storage for potentially large index files
  • Manual distribution/syncing of index files

Since the existing ContiguousDataSource interface already supports range-based fetching via the region?: Region parameter, and HTTP servers commonly support Range headers, we can fetch only the bytes needed for each lookup from CDB64 files stored remotely.
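For illustration, fetching a single 16-byte hash table slot this way might look like the following. This is a hypothetical sketch, not existing code; it works against any server that honors Range headers:

```typescript
// Sketch: read one 16-byte CDB64 hash table slot via an HTTP Range request
// instead of downloading the entire index file.
function rangeHeader(offset: number, length: number): string {
  return `bytes=${offset}-${offset + length - 1}`;
}

async function fetchSlot(url: string, offset: number): Promise<Buffer> {
  const res = await fetch(url, {
    headers: { Range: rangeHeader(offset, 16) },
  });
  if (res.status !== 206) {
    // The server ignored the Range header (200) or rejected the range (416).
    throw new Error(`expected 206 Partial Content, got ${res.status}`);
  }
  return Buffer.from(await res.arrayBuffer());
}
```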

Requirements

Must Have

  • ByteRangeSource abstraction: Interface for random-access byte reads that can be backed by local files, Arweave, or HTTP endpoints
  • FileByteRangeSource: Implementation using fs.FileHandle for local files (minimal overhead wrapper)
  • ContiguousDataByteRangeSource: Implementation using ContiguousDataSource.getData() with region support for Arweave
  • HttpByteRangeSource: Implementation using HTTP Range requests for arbitrary URLs (S3, CDN, dedicated servers)
  • Refactored Cdb64Reader: Use ByteRangeSource instead of direct file handle access
  • Mixed source support: Allow configuring local files, Arweave TX IDs, and HTTP URLs as index sources
  • Caching for remote sources: Cache the 4 KB header permanently and use an LRU cache for hash table regions to minimize network round trips

Should Have

  • Configurable source order: Local files first (faster), then remote sources
  • Graceful degradation: If a remote source is unavailable, continue with other sources
  • Metrics: Track cache hit rates and fetch latencies for remote sources
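The source-ordering requirement could be expressed as a simple stable partition over source configs. The config type and helper below are hypothetical sketches, not existing code:

```typescript
// Hypothetical source config union and ordering helper: local files are
// consulted before remote sources, per the requirement above.
type IndexSourceConfig =
  | { kind: 'file'; path: string }
  | { kind: 'arweave'; txId: string }
  | { kind: 'http'; url: string };

function orderSources(configs: IndexSourceConfig[]): IndexSourceConfig[] {
  // Stable partition: preserves relative order within each group.
  return [
    ...configs.filter((c) => c.kind === 'file'),
    ...configs.filter((c) => c.kind !== 'file'),
  ];
}
```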

Won't Have (for now)

  • Automatic discovery of index TX IDs (requires separate manifest/registry)
  • Write support for remote indexes (read-only)
  • Chunk-based fetching for Arweave (use HTTP Range requests via gateways)

Technical Design

ByteRangeSource Interface

interface ByteRangeSource {
  /** Read bytes at offset */
  read(offset: number, length: number): Promise<Buffer>;
  /** Total size if known (for validation) */
  getSize?(): Promise<number>;
  /** Cleanup resources */
  close?(): Promise<void>;
}

Implementations

// Local file - wraps a FileHandle from node:fs/promises
class FileByteRangeSource implements ByteRangeSource {
  constructor(private fileHandle: FileHandle) {}

  async read(offset: number, length: number): Promise<Buffer> {
    const buffer = Buffer.alloc(length);
    const { bytesRead } = await this.fileHandle.read(buffer, 0, length, offset);
    if (bytesRead !== length) {
      throw new Error(`short read: ${bytesRead}/${length} bytes at ${offset}`);
    }
    return buffer;
  }
}

// Arweave - uses existing ContiguousDataSource with region support
class ContiguousDataByteRangeSource implements ByteRangeSource {
  async read(offset: number, length: number): Promise<Buffer> {
    const result = await this.dataSource.getData({
      id: this.txId,
      region: { offset, size: length },
    });
    return streamToBuffer(result.stream);
  }
}

// HTTP - uses Range headers for arbitrary URLs (S3, CDN, etc.)
class HttpByteRangeSource implements ByteRangeSource {
  constructor(
    private httpClient: AxiosInstance, // axios-style client
    private url: string,
  ) {}

  async read(offset: number, length: number): Promise<Buffer> {
    const response = await this.httpClient.get(this.url, {
      headers: {
        Range: `bytes=${offset}-${offset + length - 1}`,
      },
      responseType: 'arraybuffer',
    });
    return Buffer.from(response.data);
  }
}

// Caching wrapper - critical for remote source performance
class CachingByteRangeSource implements ByteRangeSource {
  // Cache header permanently, LRU for hash table regions
}
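The stub above might be fleshed out along the following lines. The 4 KB block granularity, the plain-Map LRU, and the 256-block cap are illustrative assumptions, not decisions made in this issue:

```typescript
interface ByteRangeSource {
  read(offset: number, length: number): Promise<Buffer>;
}

const HEADER_SIZE = 4096; // CDB64 header, cached permanently
const BLOCK_SIZE = 4096; // cache granularity for hash table regions (assumed)

class CachingByteRangeSource implements ByteRangeSource {
  private header?: Buffer; // fetched once, then served from memory
  private blocks = new Map<number, Buffer>(); // insertion order doubles as LRU order

  constructor(
    private inner: ByteRangeSource,
    private maxBlocks = 256,
  ) {}

  async read(offset: number, length: number): Promise<Buffer> {
    // Reads entirely within the header hit the permanent cache.
    if (offset + length <= HEADER_SIZE) {
      this.header ??= await this.inner.read(0, HEADER_SIZE);
      return this.header.subarray(offset, offset + length);
    }
    const blockIdx = Math.floor(offset / BLOCK_SIZE);
    // Reads crossing a block boundary bypass the cache for simplicity.
    if (Math.floor((offset + length - 1) / BLOCK_SIZE) !== blockIdx) {
      return this.inner.read(offset, length);
    }
    let block = this.blocks.get(blockIdx);
    if (block !== undefined) {
      this.blocks.delete(blockIdx); // refresh LRU position
    } else {
      block = await this.inner.read(blockIdx * BLOCK_SIZE, BLOCK_SIZE);
      if (this.blocks.size >= this.maxBlocks) {
        // Evict the least recently used block (first key in the Map).
        this.blocks.delete(this.blocks.keys().next().value!);
      }
    }
    this.blocks.set(blockIdx, block);
    const start = offset - blockIdx * BLOCK_SIZE;
    return block.subarray(start, start + length);
  }
}
```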

CDB64 Lookup Access Pattern

Each lookup requires reading:

  1. Header (4096 bytes) - table pointers, cached permanently
  2. Hash table slots (16 bytes each) - linear probing, 1-N reads
  3. Record (16-byte header + 32-byte key + ~50-byte value) - verification + data

With caching, typical lookups would be:

  • Local file: Same as today (negligible abstraction overhead)
  • Remote (warm cache): 1-2 network requests for hash table + record
  • Remote (cold): 2-3 network requests (header + hash table + record)
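To make the read counts above concrete, the lookup flow against a ByteRangeSource can be sketched as follows. The field layout (256 header entries of 16-byte {tablePos, slotCount} pairs, 16-byte {hash, recordPos} slots, 16-byte record length headers) is an illustrative reading of the sizes listed above; the actual parsing in src/lib/cdb64.ts is authoritative:

```typescript
interface ByteRangeSource {
  read(offset: number, length: number): Promise<Buffer>;
}

// Illustrative CDB64-style lookup; each src.read() is one potential
// network round trip when the source is remote and uncached.
async function lookup(
  src: ByteRangeSource,
  key: Buffer,
  hash: bigint,
): Promise<Buffer | undefined> {
  const table = Number(hash & 0xffn);
  const entry = await src.read(table * 16, 16); // header read (cacheable)
  const tablePos = entry.readBigUInt64LE(0);
  const slotCount = entry.readBigUInt64LE(8);
  if (slotCount === 0n) return undefined;
  let slot = (hash >> 8n) % slotCount;
  for (let i = 0n; i < slotCount; i++) {
    const s = await src.read(Number(tablePos + slot * 16n), 16); // probe read
    const slotHash = s.readBigUInt64LE(0);
    const recordPos = s.readBigUInt64LE(8);
    if (recordPos === 0n) return undefined; // empty slot: key is absent
    if (slotHash === hash) {
      const rh = await src.read(Number(recordPos), 16); // record length header
      const keyLen = Number(rh.readBigUInt64LE(0));
      const valLen = Number(rh.readBigUInt64LE(8));
      const rec = await src.read(Number(recordPos) + 16, keyLen + valLen);
      if (rec.subarray(0, keyLen).equals(key)) return rec.subarray(keyLen);
    }
    slot = (slot + 1n) % slotCount; // linear probing
  }
  return undefined;
}
```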

Configuration

# Existing - local files
CDB64_ROOT_TX_INDEX_PATH=/path/to/indexes/

# New - Arweave TX IDs (comma-separated, fetched via ContiguousDataSource)
CDB64_ROOT_TX_INDEX_TX_IDS=TxId123,TxId456

# New - HTTP URLs (comma-separated, supports S3, CDN, dedicated servers)
CDB64_ROOT_TX_INDEX_URLS=https://indexes.example.com/root.cdb,https://s3.amazonaws.com/bucket/index.cdb
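On the code side, the new variables might be parsed like this. The variable names come from the proposal above; the parsing helper and its placement are assumptions, not existing code:

```typescript
// Hypothetical parsing for the new comma-separated variables; in practice
// this would live in src/config.ts alongside the existing config handling.
function parseList(value: string | undefined): string[] {
  return (value ?? '')
    .split(',')
    .map((s) => s.trim())
    .filter((s) => s.length > 0);
}

const CDB64_ROOT_TX_INDEX_TX_IDS = parseList(
  process.env.CDB64_ROOT_TX_INDEX_TX_IDS,
);
const CDB64_ROOT_TX_INDEX_URLS = parseList(
  process.env.CDB64_ROOT_TX_INDEX_URLS,
);
```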

Files to Modify

  • src/lib/cdb64.ts - Refactor Cdb64Reader to use ByteRangeSource
  • src/lib/byte-range-source.ts - New file with interface and implementations
  • src/discovery/cdb64-root-tx-index.ts - Support mixed local/Arweave/HTTP sources
  • src/config.ts - Add CDB64_ROOT_TX_INDEX_TX_IDS and CDB64_ROOT_TX_INDEX_URLS configs
  • src/system.ts - Wire up ContiguousDataSource for Arweave-backed indexes

Testing

  • Unit tests for ByteRangeSource implementations
  • Unit tests for refactored Cdb64Reader with mock ByteRangeSource
  • Integration tests with actual CDB64 files via all source types
  • Performance comparison: local vs remote (with/without cache)

Performance Considerations

  • Local files: Negligible overhead from abstraction (one extra function call)
  • Remote sources: Network latency dominates; caching is critical
    • Header cache: Eliminates 1 round trip per lookup
    • Hash table region cache: Reduces probing costs
    • Consider prefetching common hash table regions on initialization

Future Enhancements

  • Index manifest TX that lists all index TX IDs for automatic discovery
  • Composite indexes spanning multiple TXs with routing hints
  • Background warming of remote index caches
