
Implement CDB64-based historical root TX index for distributed data item mappings #543

@djwhitt

Description

Problem Statement

The current root transaction index relies primarily on local SQLite databases (bundles.db and data.db) to resolve data item IDs to their root transaction IDs. However, these databases contain only the data that has been indexed locally by the node, not root TX mappings for the entire Arweave network history.

When a data item ID is not found in the local database, the system falls back to external APIs (Gateways, GraphQL, Turbo) to discover the root transaction. While this works, it has several limitations:

  • Rate limiting: All external sources have rate limits to prevent abuse
  • Availability dependency: Requires external services to be available
  • Network latency: Each lookup requires network round-trips
  • Incomplete coverage: External APIs may not have all historical data either
  • Cascading failures: Circuit breakers prevent overload but also limit lookup success

This creates a gap for nodes that want to serve historical data without indexing the entire network history themselves.

Current Behavior

The root TX lookup follows this fallback chain (configurable via ROOT_TX_LOOKUP_ORDER):

  1. Local SQLite DB (src/database/standalone-sqlite.ts) - Fastest but incomplete
  2. Gateways (src/discovery/gateways-root-tx-index.ts) - HEAD requests to /raw/{id}
  3. GraphQL (src/discovery/graphql-root-tx-index.ts) - Traverses bundle parent chain
  4. Turbo (src/discovery/turbo-root-tx-index.ts) - Queries Turbo offsets API

Relevant code: src/discovery/composite-root-tx-index.ts
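
For reference, the composite index tries each configured source in order and returns the first hit, roughly as sketched below (a simplified illustration; the actual interface and method names live in src/types.d.ts and src/discovery/composite-root-tx-index.ts):

// Simplified sketch of the composite fallback pattern. The interface shape and
// method names here are illustrative, not the exact ones in the codebase.
interface RootTxLookupResult {
  rootTxId: string; // base64url-encoded root transaction ID
}

interface RootTxIndex {
  getRootTxId(dataItemId: string): Promise<RootTxLookupResult | undefined>;
}

class CompositeRootTxIndex implements RootTxIndex {
  constructor(private readonly sources: RootTxIndex[]) {}

  async getRootTxId(
    dataItemId: string,
  ): Promise<RootTxLookupResult | undefined> {
    for (const source of this.sources) {
      try {
        const result = await source.getRootTxId(dataItemId);
        if (result !== undefined) {
          return result; // first source that resolves the ID wins
        }
      } catch {
        // Swallow per-source failures (rate limits, timeouts, open circuit
        // breakers) and continue down the fallback chain.
      }
    }
    return undefined;
  }
}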

Proposed Solution

Introduce a CDB64-based historical root TX index that can be distributed as a static file containing historical data item ID → root transaction ID mappings.

CDB64 (Constant Database 64-bit) is a fast, read-only key-value store format:

  • Optimized for extremely fast lookups (average 2 disk accesses)
  • Immutable format perfect for historical data that doesn't change
  • 64-bit offsets support files >4GB (can store millions of mappings)
  • Single-file format makes distribution simple
  • Reference: https://docs.rs/cdb64/latest/cdb64/
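
The "average 2 disk accesses" claim comes from this structure: after a small fixed header that selects one of 256 hash tables, a lookup typically touches one hash-table slot region and one record. The sketch below outlines such a lookup in TypeScript; the field widths, ordering, and endianness are assumptions modeled on classic CDB widened to 64-bit offsets and would need to be confirmed against the cdb64 reference implementation above.

import { open } from 'node:fs/promises';

// Standard CDB hash: h starts at 5381; for each byte, h = ((h << 5) + h) ^ byte,
// kept within 32 bits.
function cdbHash(key: Uint8Array): number {
  let h = 5381;
  for (const byte of key) {
    h = (((h << 5) + h) ^ byte) >>> 0;
  }
  return h;
}

// Assumed layout (to confirm against cdb64-rs): a 4096-byte header of 256
// { tableOffset: u64 LE, slotCount: u64 LE } entries, hash tables made of
// { hash: u64 LE, recordOffset: u64 LE } slots, and records stored as
// { keyLength: u64 LE, valueLength: u64 LE, key bytes, value bytes }.
export async function cdb64Get(
  path: string,
  key: Uint8Array,
): Promise<Buffer | undefined> {
  const fh = await open(path, 'r');
  try {
    const h = cdbHash(key);

    // Read the header entry that selects this key's hash table.
    const header = Buffer.alloc(16);
    await fh.read(header, 0, 16, (h & 0xff) * 16);
    const tableOffset = header.readBigUInt64LE(0);
    const slotCount = header.readBigUInt64LE(8);
    if (slotCount === 0n) return undefined;

    // Probe hash-table slots linearly, starting at (h >> 8) % slotCount.
    let slot = BigInt(h >>> 8) % slotCount;
    for (let probes = 0n; probes < slotCount; probes++) {
      const slotBuf = Buffer.alloc(16);
      await fh.read(slotBuf, 0, 16, Number(tableOffset + slot * 16n));
      const slotHash = slotBuf.readBigUInt64LE(0);
      const recordOffset = slotBuf.readBigUInt64LE(8);
      if (recordOffset === 0n) return undefined; // empty slot: key not present

      if (Number(slotHash) === h) {
        // Hash match: read the record and confirm the full key matches.
        const recordHeader = Buffer.alloc(16);
        await fh.read(recordHeader, 0, 16, Number(recordOffset));
        const keyLength = Number(recordHeader.readBigUInt64LE(0));
        const valueLength = Number(recordHeader.readBigUInt64LE(8));
        const record = Buffer.alloc(keyLength + valueLength);
        await fh.read(record, 0, record.length, Number(recordOffset) + 16);
        if (record.subarray(0, keyLength).equals(Buffer.from(key))) {
          return record.subarray(keyLength);
        }
      }
      slot = (slot + 1n) % slotCount;
    }
    return undefined;
  } finally {
    await fh.close();
  }
}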

The CDB64 index would act as a middle tier in the lookup chain:

Local SQLite DB → CDB64 Historical Index → External APIs

Benefits

  1. Distributable historical data: Operators can download a pre-built CDB64 file instead of re-indexing all historical data
  2. Fast lookups: CDB64 provides O(1) lookups with minimal disk I/O
  3. Reduced API dependency: Fewer fallbacks to rate-limited external services
  4. Offline capability: Works without network access to external APIs
  5. Lower barrier to entry: New nodes can serve historical data immediately
  6. Scalable: 64-bit format can handle the entire Arweave history
  7. Simple updates: CDB64 files can be periodically regenerated and redistributed

Scope

This issue focuses on the minimal implementation needed to read and generate CDB64 files.

In Scope:

  • Pure TypeScript CDB64 reader implementation
  • Reading CDB64 files from local filesystem paths
  • Cdb64RootTxIndex implementing the DataItemRootIndex interface
  • Integration with composite root TX index chain via ROOT_TX_LOOKUP_ORDER
  • Configuration via environment variables for CDB64 file path(s)
  • Minimal CDB64 generation tooling that reads from:
    • Legacy Arweave.net exports
    • Turbo Parquet exports

Out of Scope (future work):

  • Distribution mechanism for CDB64 files
  • Lazy/streaming reads directly from Arweave
  • Bloom filters for CDB64 files (may be added later as an optimization to reduce disk I/O for missing keys)

Why TypeScript?

The implementation will be pure TypeScript (no Rust bindings or FFI) to enable a future stream-based implementation that can read CDB64 data directly from Arweave contiguous data sources. A TypeScript implementation provides the foundation for this by allowing us to work with arbitrary byte streams rather than just filesystem handles.
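
Concretely, the reader can be written against a small random-access abstraction instead of a filesystem handle, so a remote, range-request-backed implementation can be swapped in later. The interface below is purely illustrative, not an existing type in the codebase:

// Illustrative abstraction a CDB64 reader could target. A local implementation
// would wrap a FileHandle; a future one could issue range reads against
// Arweave contiguous data sources.
interface RandomAccessSource {
  // Read exactly `length` bytes starting at byte `offset`.
  readAt(offset: bigint, length: number): Promise<Buffer>;
  close(): Promise<void>;
}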

Implementation Approach

  1. CDB64 Reader: Implement a TypeScript CDB64 file reader based on the CDB format specification with 64-bit offset support
  2. CDB64 Writer: Implement a TypeScript CDB64 file writer for generating index files
  3. Root TX Index: Create Cdb64RootTxIndex class implementing DataItemRootIndex interface
  4. Integration: Add cdb as an option in ROOT_TX_LOOKUP_ORDER
  5. Configuration: Add CDB64_ROOT_TX_INDEX_PATH environment variable for specifying the CDB64 file location
  6. Generation Tooling: Create CLI tool to generate CDB64 files from legacy Arweave.net and Turbo Parquet exports
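
As a rough sketch of the generation tooling in step 6: stream (data item ID, root TX ID, offsets) rows out of an export and feed them to the writer from step 2. The Cdb64Writer interface and the export row shape below are hypothetical placeholders, and values are packed with the msgpackr configuration described under Value Format below:

import { Packr } from 'msgpackr';

// Hypothetical writer interface (step 2 above); not existing code.
interface Cdb64Writer {
  add(key: Buffer, value: Buffer): Promise<void>;
  finalize(): Promise<void>; // writes the hash tables and header
}

// One row from a legacy Arweave.net or Turbo Parquet export (shape assumed).
interface ExportRow {
  dataItemId: string; // base64url data item ID
  rootTxId: string; // base64url root transaction ID
  rootDataItemOffset?: number; // only present in Turbo Parquet exports
  rootDataOffset?: number; // only present in Turbo Parquet exports
}

// Same msgpackr configuration as src/lib/encoding.ts.
const packr = new Packr({ useRecords: false, variableMapSize: true });

export async function generateCdb64Index(
  rows: AsyncIterable<ExportRow>,
  writer: Cdb64Writer,
): Promise<void> {
  for await (const row of rows) {
    // Use the complete format when offsets are available, else the simple one.
    const value =
      row.rootDataItemOffset !== undefined && row.rootDataOffset !== undefined
        ? {
            rootTxId: Buffer.from(row.rootTxId, 'base64url'),
            rootDataItemOffset: row.rootDataItemOffset,
            rootDataOffset: row.rootDataOffset,
          }
        : { rootTxId: Buffer.from(row.rootTxId, 'base64url') };
    await writer.add(Buffer.from(row.dataItemId, 'base64url'), packr.pack(value));
  }
  await writer.finalize();
}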

Value Format

CDB64 stores key-value pairs where:

  • Keys: Data item IDs (32-byte binary, decoded from base64url)
  • Values: MessagePack-encoded objects containing root transaction information

MessagePack Configuration

Values are serialized using msgpackr with the same configuration used throughout the codebase (src/lib/encoding.ts):

import { Packr } from 'msgpackr';

const packr = new Packr({
  useRecords: false,    // compatible with other MessagePack implementations
  variableMapSize: true // sacrifice speed for space efficiency
});
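
For example, a complete-format value round-trips through this configuration as follows (placeholder bytes for illustration):

// Round-trip of a complete-format value using the configuration above.
const rootTxId = Buffer.alloc(32, 1); // placeholder 32-byte root transaction ID
const encoded = packr.pack({
  rootTxId,
  rootDataItemOffset: 1024,
  rootDataOffset: 4096,
});
const decoded = packr.unpack(encoded);
// decoded.rootTxId comes back as bytes (a Buffer/Uint8Array view) and the
// offsets as plain numbers; with useRecords disabled the value is encoded as a
// plain MessagePack map that other implementations can read.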

Value Formats

The implementation supports two value formats to accommodate different data sources:

  1. Simple format (root ID only):

    {
      rootTxId: Buffer  // 32-byte root transaction ID
    }

    Use case: Legacy Arweave.net exports that only provide the root transaction ID mapping.

  2. Complete format (with offsets):

    {
      rootTxId: Buffer           // 32-byte root transaction ID
      rootDataItemOffset: number // byte offset of data item header in root TX data
      rootDataOffset: number     // byte offset of data payload in root TX data
    }

    Use case: Turbo Parquet exports that include offset information for direct data retrieval.

These offsets match the HTTP headers returned by the gateway:

  • X-AR-IO-Root-Data-Item-Offset: Enables direct byte-range requests to data item headers
  • X-AR-IO-Root-Data-Offset: Enables direct byte-range requests to data payloads
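
For illustration, a client holding a complete-format value could in principle fetch a payload directly with a range request along these lines. This sketch assumes the gateway's /raw/{id} endpoint honors Range headers and that the payload size is known from elsewhere; neither is specified by this issue:

// Sketch: fetch a data item's payload from its root transaction using the
// rootDataOffset from a complete-format value. Assumes /raw/{id} accepts
// Range requests (not verified here).
async function fetchDataItemPayload(
  gatewayUrl: string,
  rootTxId: string, // base64url
  rootDataOffset: number,
  payloadSize: number, // must be known from another source
): Promise<ArrayBuffer> {
  const response = await fetch(`${gatewayUrl}/raw/${rootTxId}`, {
    headers: {
      Range: `bytes=${rootDataOffset}-${rootDataOffset + payloadSize - 1}`,
    },
  });
  if (response.status !== 206) {
    throw new Error(`expected a partial response, got ${response.status}`);
  }
  return response.arrayBuffer();
}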

The Cdb64RootTxIndex implementation will deserialize values and return them in the DataItemRootIndex interface format, converting binary IDs to base64url strings. When complete offset information is available, the composite root TX index can use it directly without needing to fall back to other sources.
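
A minimal sketch of that conversion, assuming a reader lookup function like the one sketched earlier; the result shape here is illustrative, with the real contract defined by DataItemRootIndex in src/types.d.ts:

import { Packr } from 'msgpackr';

const packr = new Packr({ useRecords: false, variableMapSize: true });

// Reader dependency: returns the packed value bytes for a binary key, if present.
type Cdb64Lookup = (key: Buffer) => Promise<Buffer | undefined>;

// Illustrative result shape; the real one is defined by DataItemRootIndex.
interface RootTxResult {
  rootTxId: string; // base64url
  rootDataItemOffset?: number;
  rootDataOffset?: number;
}

export class Cdb64RootTxIndex {
  constructor(private readonly lookup: Cdb64Lookup) {}

  async getRootTxId(dataItemId: string): Promise<RootTxResult | undefined> {
    // Keys are stored as 32-byte binary IDs, so decode the base64url input.
    const key = Buffer.from(dataItemId, 'base64url');
    const packed = await this.lookup(key);
    if (packed === undefined) return undefined;

    const value = packr.unpack(packed);
    return {
      // Convert the stored binary root TX ID back to a base64url string.
      rootTxId: Buffer.from(value.rootTxId).toString('base64url'),
      rootDataItemOffset: value.rootDataItemOffset,
      rootDataOffset: value.rootDataOffset,
    };
  }
}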

Design Decisions

Why use standard CDB64 hashing instead of leveraging SHA-256 keys directly?

Since data item IDs are already SHA-256 hashes (uniformly distributed), we considered simplifying the format by using the key bytes directly as hash table indices instead of applying CDB's DJB hash function. However, we decided to keep the standard CDB64 format:

  1. Negligible overhead: DJB's hash (h = ((h << 5) + h) ^ c per byte) adds only nanoseconds for a 32-byte key—completely dwarfed by disk I/O and MessagePack deserialization.

  2. Future flexibility: Using standard CDB64 allows the same format to be used with different key types in the future (e.g., CIDs for IPFS compatibility) without format changes.

  3. Proven format: CDB is battle-tested and well-documented. A custom format would require additional documentation and testing.

  4. No maintenance burden: Standard tooling and references apply directly.

The 32-bit hash output means keys are stored in full alongside values, with the hash only narrowing down which slot(s) to check. This is standard CDB behavior and works well regardless of key distribution.

Requirements

Must Have:

  • TypeScript CDB64 reader that can perform key lookups
  • TypeScript CDB64 writer that can generate index files
  • MessagePack serialization for values using existing msgpackr configuration
  • Support for both simple (root ID only) and complete (with offsets) value formats
  • Cdb64RootTxIndex implementing DataItemRootIndex interface
  • Configuration via CDB64_ROOT_TX_INDEX_PATH environment variable
  • Integration with ROOT_TX_LOOKUP_ORDER (add cdb option)
  • CLI tool to generate CDB64 from legacy Arweave.net exports
  • CLI tool to generate CDB64 from Turbo Parquet exports
  • Unit tests for CDB64 reader and writer
  • Integration tests with composite root TX index

Should Have:

  • Support for multiple CDB64 files (e.g., sharded by time period)
  • Graceful handling of missing/corrupt CDB64 files
  • Metrics for CDB64 lookup hits/misses

References

  • Root TX Index interface: src/types.d.ts (search for DataItemRootIndex)
  • HTTP headers: src/constants.ts (search for rootDataItemOffset, rootDataOffset)
  • MessagePack encoding: src/lib/encoding.ts
  • Composite pattern: src/discovery/composite-root-tx-index.ts
  • Database implementation: src/database/standalone-sqlite.ts:3574
  • CDB64 Rust crate (reference implementation): https://github.com/ever0de/cdb64-rs
  • CDB format spec: https://cr.yp.to/cdb.html

This issue proposes adding CDB64 support without removing or modifying existing lookup methods. The implementation can be introduced incrementally and enabled via configuration.
