Description
Problem Statement
The current root transaction index relies primarily on local SQLite databases (bundles.db and data.db) to resolve data item IDs to their root transaction IDs. However, these databases contain only the mappings the node has indexed locally, not root TX mappings for the entire history of the Arweave network.
When a data item ID is not found in the local database, the system falls back to external APIs (Gateways, GraphQL, Turbo) to discover the root transaction. While this works, it has several limitations:
- Rate limiting: All external sources have rate limits to prevent abuse
- Availability dependency: Requires external services to be available
- Network latency: Each lookup requires network round-trips
- Incomplete coverage: External APIs may not have all historical data either
- Cascading failures: Circuit breakers prevent overload but also limit lookup success
This creates a gap for nodes that want to serve historical data without indexing the entire network history themselves.
Current Behavior
The root TX lookup follows this fallback chain (configurable via ROOT_TX_LOOKUP_ORDER):
- Local SQLite DB (src/database/standalone-sqlite.ts) - Fastest but incomplete
- Gateways (src/discovery/gateways-root-tx-index.ts) - HEAD requests to /raw/{id}
- GraphQL (src/discovery/graphql-root-tx-index.ts) - Traverses bundle parent chain
- Turbo (src/discovery/turbo-root-tx-index.ts) - Queries Turbo offsets API
Relevant code: src/discovery/composite-root-tx-index.ts
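The fallback behavior described above can be sketched roughly as follows. This is a minimal illustration, not the code in composite-root-tx-index.ts: the RootTxLookup interface, its getRootTxId method, and the lookupRootTxId helper are hypothetical names, and the real DataItemRootIndex interface may return a richer result.

```typescript
// Hypothetical, simplified lookup interface (an assumption for illustration;
// the real DataItemRootIndex in src/types.d.ts may look different).
interface RootTxLookup {
  name: string;
  getRootTxId(dataItemId: string): Promise<string | undefined>;
}

// Try each source in the configured order and return the first hit.
async function lookupRootTxId(
  sources: RootTxLookup[],
  dataItemId: string,
): Promise<string | undefined> {
  for (const source of sources) {
    try {
      const rootTxId = await source.getRootTxId(dataItemId);
      if (rootTxId !== undefined) {
        return rootTxId;
      }
    } catch (error) {
      // A failing source (rate limit, outage, open circuit breaker) is
      // skipped so the next source in the chain still gets a chance.
    }
  }
  return undefined; // no source knew the data item
}
```

Every miss in an earlier source costs one more round-trip to a later one, which is why a fast local tier ahead of the external APIs matters.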
Proposed Solution
Introduce a CDB64-based historical root TX index that can be distributed as a static file containing historical data item ID → root transaction ID mappings.
CDB64 (Constant Database 64-bit) is a fast, read-only key-value store format:
- Optimized for extremely fast lookups (a successful lookup normally takes about 2 disk accesses)
- Immutable format perfect for historical data that doesn't change
- 64-bit offsets support files >4GB (can store millions of mappings)
- Single-file format makes distribution simple
- Reference: https://docs.rs/cdb64/latest/cdb64/
The CDB64 index would act as a middle tier in the lookup chain:
Local SQLite DB → CDB64 Historical Index → External APIs
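As a rough sketch of how the middle tier could be selected through ROOT_TX_LOOKUP_ORDER, the helper below parses a comma-separated order string. Only the cdb option comes from this proposal. The other source names (db, gateways, graphql, turbo) and the parseLookupOrder function are placeholders chosen for illustration and may not match the values the node actually accepts.

```typescript
// Placeholder source names; only 'cdb' is taken from this proposal.
const KNOWN_SOURCES = ['db', 'cdb', 'gateways', 'graphql', 'turbo'] as const;
type SourceName = (typeof KNOWN_SOURCES)[number];

// Parse a comma-separated lookup order, rejecting unknown names and
// dropping duplicates while preserving the first occurrence.
function parseLookupOrder(raw: string): SourceName[] {
  const order: SourceName[] = [];
  for (const part of raw.split(',')) {
    const name = part.trim().toLowerCase();
    if (name === '') {
      continue; // tolerate trailing commas and stray whitespace
    }
    const known = KNOWN_SOURCES.find((source) => source === name);
    if (known === undefined) {
      throw new Error(`unknown root TX lookup source: ${name}`);
    }
    if (order.indexOf(known) === -1) {
      order.push(known);
    }
  }
  return order;
}
```

With these placeholder names, an order string such as db,cdb,gateways would place the CDB64 index between the local database and the first external API.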
Benefits
- Distributable historical data: Operators can download a pre-built CDB64 file instead of re-indexing all historical data
- Fast lookups: CDB64 provides O(1) lookups with minimal disk I/O
- Reduced API dependency: Fewer fallbacks to rate-limited external services
- Offline capability: Works without network access to external APIs
- Lower barrier to entry: New nodes can serve historical data immediately
- Scalable: 64-bit format can handle the entire Arweave history
- Simple updates: CDB64 files can be periodically regenerated and redistributed
Scope
This issue focuses on the minimal implementation needed to read and generate CDB64 files.
In Scope:
- Pure TypeScript CDB64 reader implementation
- Reading CDB64 files from local filesystem paths
- Cdb64RootTxIndex implementing the DataItemRootIndex interface
- Integration with composite root TX index chain via ROOT_TX_LOOKUP_ORDER
- Configuration via environment variables for CDB64 file path(s)
- Minimal CDB64 generation tooling that reads from:
  - Legacy Arweave.net exports
  - Turbo Parquet exports
Out of Scope (future work):
- Distribution mechanism for CDB64 files
- Lazy/streaming reads directly from Arweave
- Bloom filters for CDB64 files (may be added later as an optimization to reduce disk I/O for missing keys)
Why TypeScript?
The implementation will be pure TypeScript (no Rust bindings or FFI) to enable a future stream-based implementation that can read CDB64 data directly from Arweave contiguous data sources. A TypeScript implementation provides the foundation for this by allowing us to work with arbitrary byte streams rather than just filesystem handles.
Implementation Approach
- CDB64 Reader: Implement a TypeScript CDB64 file reader based on the CDB format specification with 64-bit offset support
- CDB64 Writer: Implement a TypeScript CDB64 file writer for generating index files
- Root TX Index: Create Cdb64RootTxIndex class implementing DataItemRootIndex interface
- Integration: Add cdb as an option in ROOT_TX_LOOKUP_ORDER
- Configuration: Add CDB64_ROOT_TX_INDEX_PATH environment variable for specifying the CDB64 file location
- Generation Tooling: Create CLI tool to generate CDB64 files from legacy Arweave.net and Turbo Parquet exports
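To make the reader and writer steps more concrete, here is a minimal in-memory sketch. It assumes CDB64 uses the classic CDB layout with every 32-bit field widened to a little-endian 64-bit field (a 256-entry header, then records, then hash tables); the actual on-disk layout should be checked against the reference crate. The function names buildCdb64 and cdb64Get are hypothetical, and 64-bit fields are handled as plain numbers, which is only safe below 2^53.

```typescript
const HEADER_SIZE = 256 * 16; // 256 pointers of (table offset u64, slot count u64)
const TWO_32 = 0x100000000;

// Classic CDB (DJB) hash, kept to 32 bits.
function cdbHash(key: Uint8Array): number {
  let h = 5381;
  for (let i = 0; i < key.length; i++) h = (((h << 5) + h) ^ key[i]) >>> 0;
  return h;
}

function readU64(view: DataView, pos: number): number {
  return view.getUint32(pos, true) + view.getUint32(pos + 4, true) * TWO_32;
}

function writeU64(view: DataView, pos: number, value: number): void {
  view.setUint32(pos, value % TWO_32, true);
  view.setUint32(pos + 4, Math.floor(value / TWO_32), true);
}

// Writer: lay out records, then one open-addressed hash table per low hash byte.
function buildCdb64(entries: Array<[Uint8Array, Uint8Array]>): Uint8Array {
  const buckets: Array<Array<{ hash: number; pos: number }>> = [];
  for (let i = 0; i < 256; i++) buckets.push([]);
  let size = HEADER_SIZE;
  for (const [key, value] of entries) {
    const hash = cdbHash(key);
    buckets[hash & 255].push({ hash, pos: size });
    size += 16 + key.length + value.length;
  }
  const recordsEnd = size;
  for (const bucket of buckets) size += bucket.length * 2 * 16; // tables are half full
  const out = new Uint8Array(size);
  const view = new DataView(out.buffer);
  let pos = HEADER_SIZE;
  for (const [key, value] of entries) {
    writeU64(view, pos, key.length);
    writeU64(view, pos + 8, value.length);
    out.set(key, pos + 16);
    out.set(value, pos + 16 + key.length);
    pos += 16 + key.length + value.length;
  }
  let tablePos = recordsEnd;
  buckets.forEach((bucket, table) => {
    const slots = bucket.length * 2;
    writeU64(view, table * 16, tablePos);
    writeU64(view, table * 16 + 8, slots);
    for (const { hash, pos: recordPos } of bucket) {
      let slot = (hash >>> 8) % slots;
      // Linear probing: a record position of 0 marks an empty slot.
      while (readU64(view, tablePos + slot * 16 + 8) !== 0) slot = (slot + 1) % slots;
      writeU64(view, tablePos + slot * 16, hash);
      writeU64(view, tablePos + slot * 16 + 8, recordPos);
    }
    tablePos += slots * 16;
  });
  return out;
}

// Reader: pick the table, probe slots, and confirm the full key before returning.
function cdb64Get(file: Uint8Array, key: Uint8Array): Uint8Array | undefined {
  const view = new DataView(file.buffer, file.byteOffset, file.byteLength);
  const hash = cdbHash(key);
  const tablePos = readU64(view, (hash & 255) * 16);
  const slots = readU64(view, (hash & 255) * 16 + 8);
  if (slots === 0) return undefined;
  let slot = (hash >>> 8) % slots;
  for (let probes = 0; probes < slots; probes++) {
    const slotPos = tablePos + slot * 16;
    const recordPos = readU64(view, slotPos + 8);
    if (recordPos === 0) return undefined; // empty slot ends the probe sequence
    if (readU64(view, slotPos) === hash) {
      const keyLen = readU64(view, recordPos);
      const valueLen = readU64(view, recordPos + 8);
      const keyStart = recordPos + 16;
      const stored = file.subarray(keyStart, keyStart + keyLen);
      if (keyLen === key.length && stored.every((byte, i) => byte === key[i])) {
        return file.subarray(keyStart + keyLen, keyStart + keyLen + valueLen);
      }
    }
    slot = (slot + 1) % slots;
  }
  return undefined;
}
```

Because the reader only needs random access to byte ranges, the same logic could later be pointed at a file handle or a remote byte-range source instead of an in-memory array.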
Value Format
CDB64 stores key-value pairs where:
- Keys: Data item IDs (32-byte binary, decoded from base64url)
- Values: MessagePack-encoded objects containing root transaction information
MessagePack Configuration
Values are serialized using msgpackr with the same configuration used throughout the codebase (src/lib/encoding.ts):
```typescript
import { Packr } from 'msgpackr';

const packr = new Packr({
  useRecords: false, // compatible with other MessagePack implementations
  variableMapSize: true, // sacrifice speed for space efficiency
});
```
Value Formats
The implementation supports two value formats to accommodate different data sources:
- Simple format (root ID only):
  ```typescript
  {
    rootTxId: Buffer // 32-byte root transaction ID
  }
  ```
  Use case: Legacy Arweave.net exports that only provide the root transaction ID mapping.
- Complete format (with offsets):
  ```typescript
  {
    rootTxId: Buffer // 32-byte root transaction ID
    rootDataItemOffset: number // byte offset of data item header in root TX data
    rootDataOffset: number // byte offset of data payload in root TX data
  }
  ```
  Use case: Turbo Parquet exports that include offset information for direct data retrieval.
These offsets match the HTTP headers returned by the gateway:
- X-AR-IO-Root-Data-Item-Offset: Enables direct byte-range requests to data item headers
- X-AR-IO-Root-Data-Offset: Enables direct byte-range requests to data payloads
The Cdb64RootTxIndex implementation will deserialize values and return them in the DataItemRootIndex interface format, converting binary IDs to base64url strings. When complete offset information is available, the composite root TX index can use it directly without needing to fall back to other sources.
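The ID conversion on both sides of the index (base64url string to 32-byte key, and 32-byte rootTxId back to a base64url string) might look like the sketch below. It is dependency-free for illustration; in Node.js the same conversion is available through Buffer with the 'base64url' encoding. The StoredValue and LookupResult shapes and the toLookupResult helper are assumptions modeled on the two value formats above, not the actual DataItemRootIndex result type.

```typescript
const ALPHABET =
  'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_';

// Encode bytes as unpadded base64url (32 bytes become 43 characters).
function toBase64Url(bytes: Uint8Array): string {
  let out = '';
  for (let i = 0; i < bytes.length; i += 3) {
    const b1 = i + 1 < bytes.length ? bytes[i + 1] : 0;
    const b2 = i + 2 < bytes.length ? bytes[i + 2] : 0;
    const chunk = (bytes[i] << 16) | (b1 << 8) | b2;
    // 1 remaining byte yields 2 characters, 2 bytes yield 3, 3 or more yield 4.
    const chars = Math.min(4, bytes.length - i + 1);
    for (let j = 0; j < chars; j++) {
      out += ALPHABET[(chunk >> (18 - 6 * j)) & 63];
    }
  }
  return out;
}

// Decode unpadded base64url into bytes (a 43-character ID becomes 32 bytes).
function fromBase64Url(text: string): Uint8Array {
  const out = new Uint8Array(Math.floor((text.length * 6) / 8));
  let acc = 0;
  let bits = 0;
  let n = 0;
  for (let i = 0; i < text.length; i++) {
    const value = ALPHABET.indexOf(text[i]);
    if (value < 0) throw new Error(`invalid base64url character: ${text[i]}`);
    acc = ((acc << 6) | value) & 0xffffff;
    bits += 6;
    if (bits >= 8) {
      bits -= 8;
      out[n++] = (acc >> bits) & 0xff;
    }
  }
  return out;
}

// Hypothetical shapes mirroring the simple and complete value formats.
interface StoredValue {
  rootTxId: Uint8Array;
  rootDataItemOffset?: number;
  rootDataOffset?: number;
}
interface LookupResult {
  rootTxId: string;
  rootDataItemOffset?: number;
  rootDataOffset?: number;
}

// Convert a decoded value into a result with a base64url root TX ID.
function toLookupResult(value: StoredValue): LookupResult {
  if (value.rootTxId.length !== 32) throw new Error('rootTxId must be 32 bytes');
  const result: LookupResult = { rootTxId: toBase64Url(value.rootTxId) };
  // Offsets exist only in the complete format; copy them when both are present.
  if (value.rootDataItemOffset !== undefined && value.rootDataOffset !== undefined) {
    result.rootDataItemOffset = value.rootDataItemOffset;
    result.rootDataOffset = value.rootDataOffset;
  }
  return result;
}
```

A caller can check whether rootDataOffset is present on the result to decide if a further lookup for offsets is still needed.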
Design Decisions
Why use standard CDB64 hashing instead of leveraging SHA-256 keys directly?
Since data item IDs are already SHA-256 hashes (uniformly distributed), we considered simplifying the format by using the key bytes directly as hash table indices instead of applying CDB's DJB hash function. However, we decided to keep the standard CDB64 format:
- Negligible overhead: DJB's hash (h = ((h << 5) + h) ^ c per byte) adds only nanoseconds for a 32-byte key, completely dwarfed by disk I/O and MessagePack deserialization.
- Future flexibility: Using standard CDB64 allows the same format to be used with different key types in the future (e.g., CIDs for IPFS compatibility) without format changes.
- Proven format: CDB is battle-tested and well-documented. A custom format would require additional documentation and testing.
- No maintenance burden: Standard tooling and references apply directly.
The 32-bit hash output means keys are stored in full alongside values, with the hash only narrowing down which slot(s) to check. This is standard CDB behavior and works well regardless of key distribution.
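For reference, the hash and slot selection discussed here can be sketched as follows. This follows the classic CDB specification (starting value 5381, table chosen by the low 8 bits, starting slot by the remaining bits modulo the table length) and assumes CDB64 keeps the same scheme; the locate helper is a hypothetical name.

```typescript
// DJB hash as used by classic CDB, truncated to an unsigned 32-bit value.
function cdbHash(key: Uint8Array): number {
  let h = 5381;
  for (let i = 0; i < key.length; i++) {
    // (h << 5) + h is h * 33; XOR in the next key byte, keep 32 bits.
    h = (((h << 5) + h) ^ key[i]) >>> 0;
  }
  return h;
}

// Hypothetical helper: which of the 256 tables, and which starting slot
// inside a table with slotCount slots, a key maps to.
function locate(
  key: Uint8Array,
  slotCount: number,
): { hash: number; table: number; slot: number } {
  const hash = cdbHash(key);
  return {
    hash,
    table: hash & 255,
    slot: slotCount === 0 ? -1 : (hash >>> 8) % slotCount,
  };
}
```

Two different keys can share a hash, so a reader still compares the stored key bytes against the requested key before returning a value.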
Requirements
Must Have:
- TypeScript CDB64 reader that can perform key lookups
- TypeScript CDB64 writer that can generate index files
- MessagePack serialization for values using existing msgpackr configuration
- Support for both simple (root ID only) and complete (with offsets) value formats
- Cdb64RootTxIndex implementing DataItemRootIndex interface
- Configuration via CDB64_ROOT_TX_INDEX_PATH environment variable
- Integration with ROOT_TX_LOOKUP_ORDER (add cdb option)
- CLI tool to generate CDB64 from legacy Arweave.net exports
- CLI tool to generate CDB64 from Turbo Parquet exports
- Unit tests for CDB64 reader and writer
- Integration tests with composite root TX index
Should Have:
- Support for multiple CDB64 files (e.g., sharded by time period)
- Graceful handling of missing/corrupt CDB64 files
- Metrics for CDB64 lookup hits/misses
References
- Root TX Index interface: src/types.d.ts (search for DataItemRootIndex)
- HTTP headers: src/constants.ts (search for rootDataItemOffset, rootDataOffset)
- MessagePack encoding: src/lib/encoding.ts
- Composite pattern: src/discovery/composite-root-tx-index.ts
- Database implementation: src/database/standalone-sqlite.ts:3574
- CDB64 Rust crate (reference implementation): https://github.com/ever0de/cdb64-rs
- CDB format spec: https://cr.yp.to/cdb.html
This issue proposes adding CDB64 support without removing or modifying existing lookup methods. The implementation can be introduced incrementally and enabled via configuration.