Summary
Add support for reading CDB64 root TX index files lazily from remote sources, enabling gateways to use distributed index files without requiring local storage. This extends the existing `Cdb64RootTxIndex` to support local files, Arweave TX IDs, and arbitrary HTTP endpoints as index sources.
Background
The current `Cdb64RootTxIndex` implementation (`src/discovery/cdb64-root-tx-index.ts`) provides O(1) lookups of data item ID → root TX ID mappings from pre-built CDB64 files stored locally. This works well but requires:
- Downloading entire index files before use
- Local storage for potentially large index files
- Manual distribution/syncing of index files
Since the existing `ContiguousDataSource` interface already supports range-based fetching via its `region?: Region` parameter, and HTTP servers commonly support `Range` headers, we can fetch only the bytes needed for each lookup from CDB64 files stored remotely.
Requirements
Must Have
- ByteRangeSource abstraction: Interface for random-access byte reads that can be backed by local files, Arweave, or HTTP endpoints
- FileByteRangeSource: Implementation using `fs.FileHandle` for local files (minimal overhead wrapper)
- ContiguousDataByteRangeSource: Implementation using `ContiguousDataSource.getData()` with region support for Arweave
- HttpByteRangeSource: Implementation using HTTP Range requests for arbitrary URLs (S3, CDN, dedicated servers)
- Refactored Cdb64Reader: Use `ByteRangeSource` instead of direct file handle access
- Mixed source support: Allow configuring local files, Arweave TX IDs, and HTTP URLs as index sources
- Caching for remote sources: Cache the header (4 KB) permanently and use an LRU cache for hash table regions to minimize network round trips
Should Have
- Configurable source order: Local files first (faster), then remote sources
- Graceful degradation: If a remote source is unavailable, continue with the remaining sources (see the sketch after this list)
- Metrics: Track cache hit rates and fetch latencies for remote sources
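A possible shape for the source-ordering and graceful-degradation behavior; the `Cdb64Reader.get()` method name and the logger are assumptions for illustration, not existing APIs:

```ts
// Hypothetical: try index sources in configured order (local files first),
// skipping sources that error so one unavailable remote doesn't fail lookups
declare const log: { warn: (message: string, meta?: unknown) => void }; // stand-in logger

async function lookupRootTx(
  readers: Cdb64Reader[], // ordered: local readers first, then remote
  dataItemId: Buffer,
): Promise<Buffer | undefined> {
  for (const reader of readers) {
    try {
      const rootTxId = await reader.get(dataItemId); // method name assumed
      if (rootTxId !== undefined) return rootTxId;
    } catch (error) {
      log.warn('CDB64 index source unavailable, trying next', { error });
    }
  }
  return undefined;
}
```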
Won't Have (for now)
- Automatic discovery of index TX IDs (requires separate manifest/registry)
- Write support for remote indexes (read-only)
- Chunk-based fetching for Arweave (use HTTP Range requests via gateways)
Technical Design
ByteRangeSource Interface
```ts
interface ByteRangeSource {
  /** Read bytes at offset */
  read(offset: number, length: number): Promise<Buffer>;

  /** Total size if known (for validation) */
  getSize?(): Promise<number>;

  /** Cleanup resources */
  close?(): Promise<void>;
}
```
Implementations
```ts
import { FileHandle } from 'node:fs/promises';
import { AxiosInstance } from 'axios';
// ContiguousDataSource and streamToBuffer come from the existing codebase

// Local file - wraps fs.FileHandle
class FileByteRangeSource implements ByteRangeSource {
  constructor(private fileHandle: FileHandle) {}

  async read(offset: number, length: number): Promise<Buffer> {
    const buffer = Buffer.alloc(length);
    await this.fileHandle.read(buffer, 0, length, offset);
    return buffer;
  }
}

// Arweave - uses existing ContiguousDataSource with region support
class ContiguousDataByteRangeSource implements ByteRangeSource {
  constructor(
    private dataSource: ContiguousDataSource,
    private txId: string,
  ) {}

  async read(offset: number, length: number): Promise<Buffer> {
    const result = await this.dataSource.getData({
      id: this.txId,
      region: { offset, size: length },
    });
    return streamToBuffer(result.stream);
  }
}

// HTTP - uses Range headers for arbitrary URLs (S3, CDN, etc.)
class HttpByteRangeSource implements ByteRangeSource {
  constructor(
    private httpClient: AxiosInstance,
    private url: string,
  ) {}

  async read(offset: number, length: number): Promise<Buffer> {
    const response = await this.httpClient.get(this.url, {
      headers: {
        Range: `bytes=${offset}-${offset + length - 1}`,
      },
      responseType: 'arraybuffer',
    });
    return Buffer.from(response.data);
  }
}

// Caching wrapper - critical for remote source performance
// (a possible shape is sketched below)
class CachingByteRangeSource implements ByteRangeSource {
  // Cache header permanently, LRU for hash table regions
}
```
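A possible shape for that caching wrapper, using the `lru-cache` package with aligned block fetches. The 64 KiB block size, 128-block capacity, and class internals here are assumptions for illustration, not settled design:

```ts
import { LRUCache } from 'lru-cache';

const HEADER_SIZE = 4096;
const BLOCK_SIZE = 64 * 1024; // aligned fetch unit to amortize round trips (assumed)

class CachingByteRangeSource implements ByteRangeSource {
  private header?: Buffer; // cached permanently after the first read
  private blocks = new LRUCache<number, Buffer>({ max: 128 });

  constructor(private inner: ByteRangeSource) {}

  async read(offset: number, length: number): Promise<Buffer> {
    // Header reads are served from the permanent header cache
    if (offset + length <= HEADER_SIZE) {
      this.header ??= await this.inner.read(0, HEADER_SIZE);
      return this.header.subarray(offset, offset + length);
    }
    // Other reads are assembled from aligned, LRU-cached blocks; a real
    // implementation would also clamp the final block to the file size
    const end = offset + length;
    const parts: Buffer[] = [];
    let cursor = offset;
    while (cursor < end) {
      const blockIndex = Math.floor(cursor / BLOCK_SIZE);
      let block = this.blocks.get(blockIndex);
      if (block === undefined) {
        block = await this.inner.read(blockIndex * BLOCK_SIZE, BLOCK_SIZE);
        this.blocks.set(blockIndex, block);
      }
      const start = cursor - blockIndex * BLOCK_SIZE;
      const take = Math.min(end - cursor, BLOCK_SIZE - start);
      parts.push(block.subarray(start, start + take));
      cursor += take;
    }
    return parts.length === 1 ? parts[0] : Buffer.concat(parts);
  }
}
```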
CDB64 Lookup Access Pattern
Each lookup requires reading:
- Header (4096 bytes) - table pointers, cached permanently
- Hash table slots (16 bytes each) - linear probing, 1-N reads
- Record (16-byte header + 32-byte key + ~50-byte value) - verification + data
With caching, typical lookups would be:
- Local file: Same as today (negligible abstraction overhead)
- Remote (warm cache): 1-2 network requests for hash table + record
- Remote (cold): 2-3 network requests (header + hash table + record)
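To make the read pattern concrete, here is a hypothetical lookup against a `ByteRangeSource`, assuming a classic CDB layout widened to 64-bit offsets (256 table pointers in the 4096-byte header, 16-byte slots of hash + record offset); `cdbHash` stands in for whatever hash the actual format uses:

```ts
declare function cdbHash(key: Buffer): bigint; // placeholder for the format's real hash

async function lookup(
  source: ByteRangeSource,
  key: Buffer,
): Promise<Buffer | undefined> {
  const h = cdbHash(key);
  // 1. Header read (4096 bytes) - served from the permanent cache after the first call
  const header = await source.read(0, 4096);
  const table = Number(h % 256n);
  const tableOffset = header.readBigUInt64LE(table * 16);
  const slotCount = header.readBigUInt64LE(table * 16 + 8);
  if (slotCount === 0n) return undefined;
  // 2. Linear probing over 16-byte slots (hash, record offset)
  let slot = (h >> 8n) % slotCount;
  for (let probes = 0n; probes < slotCount; probes++) {
    const slotBuf = await source.read(Number(tableOffset + slot * 16n), 16);
    const slotHash = slotBuf.readBigUInt64LE(0);
    const recordOffset = slotBuf.readBigUInt64LE(8);
    if (recordOffset === 0n) return undefined; // empty slot: key is absent
    if (slotHash === h) {
      // 3. Record read: 16-byte header (key/value lengths) + key + value
      const recHeader = await source.read(Number(recordOffset), 16);
      const keyLen = Number(recHeader.readBigUInt64LE(0));
      const valLen = Number(recHeader.readBigUInt64LE(8));
      const record = await source.read(Number(recordOffset) + 16, keyLen + valLen);
      if (record.subarray(0, keyLen).equals(key)) {
        return record.subarray(keyLen); // verified match
      }
    }
    slot = (slot + 1n) % slotCount;
  }
  return undefined;
}
```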
Configuration
```sh
# Existing - local files
CDB64_ROOT_TX_INDEX_PATH=/path/to/indexes/

# New - Arweave TX IDs (comma-separated, fetched via ContiguousDataSource)
CDB64_ROOT_TX_INDEX_TX_IDS=TxId123,TxId456

# New - HTTP URLs (comma-separated, supports S3, CDN, dedicated servers)
CDB64_ROOT_TX_INDEX_URLS=https://indexes.example.com/root.cdb,https://s3.amazonaws.com/bucket/index.cdb
```
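A minimal sketch of how these could be parsed in src/config.ts; the `parseList` helper and its trimming/filtering behavior are assumptions:

```ts
// Hypothetical parsing for the new comma-separated env vars
const parseList = (value: string | undefined): string[] =>
  (value ?? '')
    .split(',')
    .map((item) => item.trim())
    .filter((item) => item.length > 0);

export const CDB64_ROOT_TX_INDEX_TX_IDS = parseList(
  process.env.CDB64_ROOT_TX_INDEX_TX_IDS,
);
export const CDB64_ROOT_TX_INDEX_URLS = parseList(
  process.env.CDB64_ROOT_TX_INDEX_URLS,
);
```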
Files to Modify
- `src/lib/cdb64.ts` - Refactor `Cdb64Reader` to use `ByteRangeSource`
- `src/lib/byte-range-source.ts` - New file with interface and implementations
- `src/discovery/cdb64-root-tx-index.ts` - Support mixed local/Arweave/HTTP sources
- `src/config.ts` - Add `CDB64_ROOT_TX_INDEX_TX_IDS` and `CDB64_ROOT_TX_INDEX_URLS` configs
- `src/system.ts` - Wire up `ContiguousDataSource` for Arweave-backed indexes
Testing
- Unit tests for `ByteRangeSource` implementations
- Unit tests for refactored `Cdb64Reader` with a mock `ByteRangeSource`
- Integration tests with actual CDB64 files via all source types
- Performance comparison: local vs. remote (with/without cache)
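An in-memory `ByteRangeSource` would make the reader easy to unit test without files or a network; a minimal sketch (the class name and the `Cdb64Reader` constructor shape are hypothetical):

```ts
// Hypothetical in-memory source for unit tests
class BufferByteRangeSource implements ByteRangeSource {
  constructor(private data: Buffer) {}

  async read(offset: number, length: number): Promise<Buffer> {
    return this.data.subarray(offset, offset + length);
  }

  async getSize(): Promise<number> {
    return this.data.length;
  }
}

// Usage: build a CDB64 file in memory, then run the reader against it
// const reader = new Cdb64Reader(new BufferByteRangeSource(cdbBytes));
```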
Performance Considerations
- Local files: Negligible overhead from abstraction (one extra function call)
- Remote sources: Network latency dominates; caching is critical
- Header cache: Eliminates 1 round trip per lookup
- Hash table region cache: Reduces probing costs
- Consider prefetching common hash table regions on initialization
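For the prefetching idea above, initialization could simply pull the header (and any hot regions) through the caching wrapper; a tiny hedged sketch:

```ts
// Hypothetical warm-up: a header read through the caching wrapper populates
// the permanent header cache, so the first real lookup skips one round trip
async function warmIndex(source: CachingByteRangeSource): Promise<void> {
  await source.read(0, 4096);
}
```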
Future Enhancements
- Index manifest TX that lists all index TX IDs for automatic discovery
- Composite indexes spanning multiple TXs with routing hints
- Background warming of remote index caches