
Implement CDB64-based historical root TX index for distributed data item mappings #543

@djwhitt

Description

Problem Statement

The current root transaction index relies primarily on local SQLite databases (bundles.db and data.db) to resolve data item IDs to their root transaction IDs. However, these databases contain only the data that has been indexed locally by the node, not root TX mappings for the entire Arweave network history.

When a data item ID is not found in the local database, the system falls back to external APIs (Gateways, GraphQL, Turbo) to discover the root transaction. While this works, it has several limitations:

  • Rate limiting: All external sources have rate limits to prevent abuse
  • Availability dependency: Requires external services to be available
  • Network latency: Each lookup requires network round-trips
  • Incomplete coverage: External APIs may not have all historical data either
  • Cascading failures: Circuit breakers prevent overload but also limit lookup success

This creates a gap for nodes that want to serve historical data without indexing the entire network history themselves.

Current Behavior

The root TX lookup follows this fallback chain (configurable via ROOT_TX_LOOKUP_ORDER):

  1. Local SQLite DB (src/database/standalone-sqlite.ts) - Fastest but incomplete
  2. Gateways (src/discovery/gateways-root-tx-index.ts) - HEAD requests to /raw/{id}
  3. GraphQL (src/discovery/graphql-root-tx-index.ts) - Traverses bundle parent chain
  4. Turbo (src/discovery/turbo-root-tx-index.ts) - Queries Turbo offsets API

Relevant code: src/discovery/composite-root-tx-index.ts
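
For reference, the composite index tries each configured source in order and returns the first hit, roughly as sketched below (a simplified illustration; the actual interface and method names live in src/types.d.ts and src/discovery/composite-root-tx-index.ts):

// Simplified sketch of the composite fallback pattern. The interface shape and
// method names here are illustrative, not the exact ones in the codebase.
interface RootTxLookupResult {
  rootTxId: string; // base64url-encoded root transaction ID
}

interface RootTxIndex {
  getRootTxId(dataItemId: string): Promise<RootTxLookupResult | undefined>;
}

class CompositeRootTxIndex implements RootTxIndex {
  constructor(private readonly sources: RootTxIndex[]) {}

  async getRootTxId(
    dataItemId: string,
  ): Promise<RootTxLookupResult | undefined> {
    for (const source of this.sources) {
      try {
        const result = await source.getRootTxId(dataItemId);
        if (result !== undefined) {
          return result; // first source that resolves the ID wins
        }
      } catch {
        // Swallow per-source failures (rate limits, timeouts, open circuit
        // breakers) and continue down the fallback chain.
      }
    }
    return undefined;
  }
}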

Proposed Solution

Introduce a CDB64-based historical root TX index that can be distributed as a static file containing historical data item ID → root transaction ID mappings.

CDB64 (Constant Database 64-bit) is a fast, read-only key-value store format:

  • Optimized for extremely fast lookups (average 2 disk accesses)
  • Immutable format perfect for historical data that doesn't change
  • 64-bit offsets support files >4GB (can store millions of mappings)
  • Single-file format makes distribution simple
  • Reference: https://docs.rs/cdb64/latest/cdb64/
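
The "average 2 disk accesses" claim comes from this structure: after a small fixed header that selects one of 256 hash tables, a lookup typically touches one hash-table slot region and one record. The sketch below outlines such a lookup in TypeScript; the field widths, ordering, and endianness are assumptions modeled on classic CDB widened to 64-bit offsets and would need to be confirmed against the cdb64 reference implementation above.

import { open } from 'node:fs/promises';

// Standard CDB hash: h starts at 5381; for each byte, h = ((h << 5) + h) ^ byte,
// kept within 32 bits.
function cdbHash(key: Uint8Array): number {
  let h = 5381;
  for (const byte of key) {
    h = (((h << 5) + h) ^ byte) >>> 0;
  }
  return h;
}

// Assumed layout (to confirm against cdb64-rs): a 4096-byte header of 256
// { tableOffset: u64 LE, slotCount: u64 LE } entries, hash tables made of
// { hash: u64 LE, recordOffset: u64 LE } slots, and records stored as
// { keyLength: u64 LE, valueLength: u64 LE, key bytes, value bytes }.
export async function cdb64Get(
  path: string,
  key: Uint8Array,
): Promise<Buffer | undefined> {
  const fh = await open(path, 'r');
  try {
    const h = cdbHash(key);

    // Read the header entry that selects this key's hash table.
    const header = Buffer.alloc(16);
    await fh.read(header, 0, 16, (h & 0xff) * 16);
    const tableOffset = header.readBigUInt64LE(0);
    const slotCount = header.readBigUInt64LE(8);
    if (slotCount === 0n) return undefined;

    // Probe hash-table slots linearly, starting at (h >> 8) % slotCount.
    let slot = BigInt(h >>> 8) % slotCount;
    for (let probes = 0n; probes < slotCount; probes++) {
      const slotBuf = Buffer.alloc(16);
      await fh.read(slotBuf, 0, 16, Number(tableOffset + slot * 16n));
      const slotHash = slotBuf.readBigUInt64LE(0);
      const recordOffset = slotBuf.readBigUInt64LE(8);
      if (recordOffset === 0n) return undefined; // empty slot: key not present

      if (Number(slotHash) === h) {
        // Hash match: read the record and confirm the full key matches.
        const recordHeader = Buffer.alloc(16);
        await fh.read(recordHeader, 0, 16, Number(recordOffset));
        const keyLength = Number(recordHeader.readBigUInt64LE(0));
        const valueLength = Number(recordHeader.readBigUInt64LE(8));
        const record = Buffer.alloc(keyLength + valueLength);
        await fh.read(record, 0, record.length, Number(recordOffset) + 16);
        if (record.subarray(0, keyLength).equals(Buffer.from(key))) {
          return record.subarray(keyLength);
        }
      }
      slot = (slot + 1n) % slotCount;
    }
    return undefined;
  } finally {
    await fh.close();
  }
}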

The CDB64 index would act as a middle tier in the lookup chain:

Local SQLite DB → CDB64 Historical Index → External APIs

Benefits

  1. Distributable historical data: Operators can download a pre-built CDB64 file instead of re-indexing all historical data
  2. Fast lookups: CDB64 provides O(1) lookups with minimal disk I/O
  3. Reduced API dependency: Fewer fallbacks to rate-limited external services
  4. Offline capability: Works without network access to external APIs
  5. Lower barrier to entry: New nodes can serve historical data immediately
  6. Scalable: 64-bit format can handle the entire Arweave history
  7. Simple updates: CDB64 files can be periodically regenerated and redistributed

Scope

This issue focuses on the minimal implementation needed to read and generate CDB64 files.

In Scope:

  • Pure TypeScript CDB64 reader implementation
  • Reading CDB64 files from local filesystem paths
  • Cdb64RootTxIndex implementing the DataItemRootIndex interface
  • Integration with composite root TX index chain via ROOT_TX_LOOKUP_ORDER
  • Configuration via environment variables for CDB64 file path(s)
  • Minimal CDB64 generation tooling that reads from:
    • Legacy Arweave.net exports
    • Turbo Parquet exports

Out of Scope (future work):

  • Distribution mechanism for CDB64 files
  • Lazy/streaming reads directly from Arweave
  • Bloom filters for CDB64 files (may be added later as an optimization to reduce disk I/O for missing keys)

Why TypeScript?

The implementation will be pure TypeScript (no Rust bindings or FFI) to enable a future stream-based implementation that can read CDB64 data directly from Arweave contiguous data sources. A TypeScript implementation provides the foundation for this by allowing us to work with arbitrary byte streams rather than just filesystem handles.
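
Concretely, the reader can be written against a small random-access abstraction instead of a filesystem handle, so a remote, range-request-backed implementation can be swapped in later. The interface below is purely illustrative, not an existing type in the codebase:

// Illustrative abstraction a CDB64 reader could target. A local implementation
// would wrap a FileHandle; a future one could issue range reads against
// Arweave contiguous data sources.
interface RandomAccessSource {
  // Read exactly `length` bytes starting at byte `offset`.
  readAt(offset: bigint, length: number): Promise<Buffer>;
  close(): Promise<void>;
}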

Implementation Approach

  1. CDB64 Reader: Implement a TypeScript CDB64 file reader based on the CDB format specification with 64-bit offset support
  2. CDB64 Writer: Implement a TypeScript CDB64 file writer for generating index files
  3. Root TX Index: Create Cdb64RootTxIndex class implementing DataItemRootIndex interface
  4. Integration: Add cdb as an option in ROOT_TX_LOOKUP_ORDER
  5. Configuration: Add CDB64_ROOT_TX_INDEX_PATH environment variable for specifying the CDB64 file location
  6. Generation Tooling: Create CLI tool to generate CDB64 files from legacy Arweave.net and Turbo Parquet exports
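
As a rough sketch of the generation tooling in step 6: stream (data item ID, root TX ID, offsets) rows out of an export and feed them to the writer from step 2. The Cdb64Writer interface and the export row shape below are hypothetical placeholders, and values are packed with the msgpackr configuration described under Value Format below:

import { Packr } from 'msgpackr';

// Hypothetical writer interface (step 2 above); not existing code.
interface Cdb64Writer {
  add(key: Buffer, value: Buffer): Promise<void>;
  finalize(): Promise<void>; // writes the hash tables and header
}

// One row from a legacy Arweave.net or Turbo Parquet export (shape assumed).
interface ExportRow {
  dataItemId: string; // base64url data item ID
  rootTxId: string; // base64url root transaction ID
  rootDataItemOffset?: number; // only present in Turbo Parquet exports
  rootDataOffset?: number; // only present in Turbo Parquet exports
}

// Same msgpackr configuration as src/lib/encoding.ts.
const packr = new Packr({ useRecords: false, variableMapSize: true });

export async function generateCdb64Index(
  rows: AsyncIterable<ExportRow>,
  writer: Cdb64Writer,
): Promise<void> {
  for await (const row of rows) {
    // Use the complete format when offsets are available, else the simple one.
    const value =
      row.rootDataItemOffset !== undefined && row.rootDataOffset !== undefined
        ? {
            rootTxId: Buffer.from(row.rootTxId, 'base64url'),
            rootDataItemOffset: row.rootDataItemOffset,
            rootDataOffset: row.rootDataOffset,
          }
        : { rootTxId: Buffer.from(row.rootTxId, 'base64url') };
    await writer.add(Buffer.from(row.dataItemId, 'base64url'), packr.pack(value));
  }
  await writer.finalize();
}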

Value Format

CDB64 stores key-value pairs where:

  • Keys: Data item IDs (32-byte binary, decoded from base64url)
  • Values: MessagePack-encoded objects containing root transaction information

MessagePack Configuration

Values are serialized using msgpackr with the same configuration used throughout the codebase (src/lib/encoding.ts):

import { Packr } from 'msgpackr';

const packr = new Packr({
  useRecords: false,    // compatible with other MessagePack implementations
  variableMapSize: true // sacrifice speed for space efficiency
});
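
For example, a complete-format value round-trips through this configuration as follows (placeholder bytes for illustration):

// Round-trip of a complete-format value using the configuration above.
const rootTxId = Buffer.alloc(32, 1); // placeholder 32-byte root transaction ID
const encoded = packr.pack({
  rootTxId,
  rootDataItemOffset: 1024,
  rootDataOffset: 4096,
});
const decoded = packr.unpack(encoded);
// decoded.rootTxId comes back as bytes (a Buffer/Uint8Array view) and the
// offsets as plain numbers; with useRecords disabled the value is encoded as a
// plain MessagePack map that other implementations can read.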

Value Formats

The implementation supports two value formats to accommodate different data sources:

  1. Simple format (root ID only):

    {
      rootTxId: Buffer  // 32-byte root transaction ID
    }

    Use case: Legacy Arweave.net exports that only provide the root transaction ID mapping.

  2. Complete format (with offsets):

    {
      rootTxId: Buffer           // 32-byte root transaction ID
      rootDataItemOffset: number // byte offset of data item header in root TX data
      rootDataOffset: number     // byte offset of data payload in root TX data
    }

    Use case: Turbo Parquet exports that include offset information for direct data retrieval.

These offsets match the HTTP headers returned by the gateway:

  • X-AR-IO-Root-Data-Item-Offset: Enables direct byte-range requests to data item headers
  • X-AR-IO-Root-Data-Offset: Enables direct byte-range requests to data payloads
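
For illustration, a client holding a complete-format value could in principle fetch a payload directly with a range request along these lines. This sketch assumes the gateway's /raw/{id} endpoint honors Range headers and that the payload size is known from elsewhere; neither is specified by this issue:

// Sketch: fetch a data item's payload from its root transaction using the
// rootDataOffset from a complete-format value. Assumes /raw/{id} accepts
// Range requests (not verified here).
async function fetchDataItemPayload(
  gatewayUrl: string,
  rootTxId: string, // base64url
  rootDataOffset: number,
  payloadSize: number, // must be known from another source
): Promise<ArrayBuffer> {
  const response = await fetch(`${gatewayUrl}/raw/${rootTxId}`, {
    headers: {
      Range: `bytes=${rootDataOffset}-${rootDataOffset + payloadSize - 1}`,
    },
  });
  if (response.status !== 206) {
    throw new Error(`expected a partial response, got ${response.status}`);
  }
  return response.arrayBuffer();
}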

The Cdb64RootTxIndex implementation will deserialize values and return them in the DataItemRootIndex interface format, converting binary IDs to base64url strings. When complete offset information is available, the composite root TX index can use it directly without needing to fall back to other sources.
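
A minimal sketch of that conversion, assuming a reader lookup function like the one sketched earlier; the result shape here is illustrative, with the real contract defined by DataItemRootIndex in src/types.d.ts:

import { Packr } from 'msgpackr';

const packr = new Packr({ useRecords: false, variableMapSize: true });

// Reader dependency: returns the packed value bytes for a binary key, if present.
type Cdb64Lookup = (key: Buffer) => Promise<Buffer | undefined>;

// Illustrative result shape; the real one is defined by DataItemRootIndex.
interface RootTxResult {
  rootTxId: string; // base64url
  rootDataItemOffset?: number;
  rootDataOffset?: number;
}

export class Cdb64RootTxIndex {
  constructor(private readonly lookup: Cdb64Lookup) {}

  async getRootTxId(dataItemId: string): Promise<RootTxResult | undefined> {
    // Keys are stored as 32-byte binary IDs, so decode the base64url input.
    const key = Buffer.from(dataItemId, 'base64url');
    const packed = await this.lookup(key);
    if (packed === undefined) return undefined;

    const value = packr.unpack(packed);
    return {
      // Convert the stored binary root TX ID back to a base64url string.
      rootTxId: Buffer.from(value.rootTxId).toString('base64url'),
      rootDataItemOffset: value.rootDataItemOffset,
      rootDataOffset: value.rootDataOffset,
    };
  }
}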

Design Decisions

Why use standard CDB64 hashing instead of leveraging SHA-256 keys directly?

Since data item IDs are already SHA-256 hashes (uniformly distributed), we considered simplifying the format by using the key bytes directly as hash table indices instead of applying CDB's DJB hash function. However, we decided to keep the standard CDB64 format:

  1. Negligible overhead: DJB's hash (h = ((h << 5) + h) ^ c per byte) adds only nanoseconds for a 32-byte key—completely dwarfed by disk I/O and MessagePack deserialization.

  2. Future flexibility: Using standard CDB64 allows the same format to be used with different key types in the future (e.g., CIDs for IPFS compatibility) without format changes.

  3. Proven format: CDB is battle-tested and well-documented. A custom format would require additional documentation and testing.

  4. No maintenance burden: Standard tooling and references apply directly.

The 32-bit hash output means keys are stored in full alongside values, with the hash only narrowing down which slot(s) to check. This is standard CDB behavior and works well regardless of key distribution.

Requirements

Must Have:

  • TypeScript CDB64 reader that can perform key lookups
  • TypeScript CDB64 writer that can generate index files
  • MessagePack serialization for values using existing msgpackr configuration
  • Support for both simple (root ID only) and complete (with offsets) value formats
  • Cdb64RootTxIndex implementing DataItemRootIndex interface
  • Configuration via CDB64_ROOT_TX_INDEX_PATH environment variable
  • Integration with ROOT_TX_LOOKUP_ORDER (add cdb option)
  • CLI tool to generate CDB64 from legacy Arweave.net exports
  • CLI tool to generate CDB64 from Turbo Parquet exports
  • Unit tests for CDB64 reader and writer
  • Integration tests with composite root TX index

Should Have:

  • Support for multiple CDB64 files (e.g., sharded by time period)
  • Graceful handling of missing/corrupt CDB64 files
  • Metrics for CDB64 lookup hits/misses

References

  • Root TX Index interface: src/types.d.ts (search for DataItemRootIndex)
  • HTTP headers: src/constants.ts (search for rootDataItemOffset, rootDataOffset)
  • MessagePack encoding: src/lib/encoding.ts
  • Composite pattern: src/discovery/composite-root-tx-index.ts
  • Database implementation: src/database/standalone-sqlite.ts:3574
  • CDB64 Rust crate (reference implementation): https://github.com/ever0de/cdb64-rs
  • CDB format spec: https://cr.yp.to/cdb.html

This issue proposes adding CDB64 support without removing or modifying existing lookup methods. The implementation can be introduced incrementally and enabled via configuration.
