
CDB64 Root TX Index: Automated Export, Serving, and Compaction #570


Problem

Currently, CDB64 root TX index files must be manually generated and distributed. Operators who want to share their indexed data with other nodes have no automated way to:

  1. Export - Periodically generate CDB64 files from the local SQLite database
  2. Serve - Make CDB64 files available for other nodes to download
  3. Compact - Merge multiple smaller CDB64 files into larger ones to reduce query overhead

This limits the ability to create a distributed network of index sharing between AR.IO nodes.

Proposed Solution

1. HTTP Endpoint for CDB64 Distribution

Expose the existing data/cdb64-root-tx-index/ directory via HTTP, similar to the experimental datasets endpoint:

GET /local/cdb64-root-tx-index/              # Directory listing
GET /local/cdb64-root-tx-index/*.cdb64       # File download

Implementation:

  • Add route using serve-index + express.static (same pattern as datasets)
  • Reuse the existing directory that Cdb64RootTxIndex already watches
  • Add configuration: ENABLE_CDB64_ENDPOINT=true/false (default: false)
  • Include appropriate cache headers
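
A minimal sketch of the route wiring, assuming the same express.static + serve-index pattern as the datasets endpoint; the router export name and the src/config.ts wiring are illustrative, not the actual implementation:

import express from 'express';
import serveIndex from 'serve-index';

// Illustrative names; the real flag and path would come from src/config.ts.
const CDB64_DIR = 'data/cdb64-root-tx-index';
const ENABLE_CDB64_ENDPOINT = process.env.ENABLE_CDB64_ENDPOINT === 'true';

export const cdb64Router = express.Router();

if (ENABLE_CDB64_ENDPOINT) {
  cdb64Router.use(
    '/local/cdb64-root-tx-index',
    // Serve the .cdb64 files with cache headers; express.static falls through
    // to the next handler when the request is for the directory itself.
    express.static(CDB64_DIR, { maxAge: '1h' }),
    // Plain directory listing so other nodes can discover available files.
    serveIndex(CDB64_DIR, { icons: false }),
  );
}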

Benefits:

  • Files served are the same files loaded for lookups
  • File watching already handles hot-reload when files change
  • Simple static file serving, no custom logic needed

2. Automated CDB64 Export Worker

A background worker that periodically exports the root TX index from SQLite to CDB64:

Configuration:

CDB64_EXPORT_ENABLED=true/false           # Default: false
CDB64_EXPORT_INTERVAL_BLOCKS=10000        # Export every N blocks
CDB64_EXPORT_INTERVAL_MS=86400000         # OR export on time interval (daily)
CDB64_EXPORT_MIN_RECORDS=1000             # Minimum new records before export

Behavior:

  • Export to data/cdb64-root-tx-index/root-tx-index-{blockHeight}.cdb64
  • Use atomic write (temp file + rename) for crash safety
  • Log export duration and record count
  • Trigger compaction check after export (if enabled)

Implementation:

  • Leverage existing tools/lib/export-sqlite-to-cdb64.ts logic
  • Create new worker in src/workers/cdb64-export-worker.ts
  • Query current block height for filename versioning
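
A rough sketch of the export step itself, assuming a helper adapted from the existing tools/lib/export-sqlite-to-cdb64.ts logic; the helper signature and log wording below are assumptions:

import fs from 'node:fs/promises';
import path from 'node:path';

// Assumed shape of a helper adapted from tools/lib/export-sqlite-to-cdb64.ts;
// it writes the full root TX index to outputPath and returns the record count.
declare function exportRootTxIndexToCdb64(outputPath: string): Promise<number>;

const OUTPUT_DIR = 'data/cdb64-root-tx-index';

export async function runExport(currentBlockHeight: number): Promise<void> {
  const finalPath = path.join(
    OUTPUT_DIR,
    `root-tx-index-${currentBlockHeight}.cdb64`,
  );
  // Write to a temp name first so a crash mid-export never leaves a partial
  // .cdb64 where the directory watcher could load it.
  const tempPath = `${finalPath}.tmp`;

  const start = Date.now();
  const recordCount = await exportRootTxIndexToCdb64(tempPath);

  // rename() is atomic on the same filesystem, completing the export in one step.
  await fs.rename(tempPath, finalPath);

  console.log(
    `CDB64 export: ${recordCount} records in ${Date.now() - start}ms -> ${finalPath}`,
  );
}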

3. CDB64 Compaction (Merge)

Merge multiple smaller CDB64 files into fewer larger files to reduce lookup overhead.

Why compaction matters:

  • Cdb64RootTxIndex searches files in alphabetical order, first match wins
  • With many small files, a lookup may have to check several files before finding a match
  • A single large file provides O(1) lookup with no file iteration

Configuration:

CDB64_COMPACT_ENABLED=true/false          # Default: false
CDB64_COMPACT_THRESHOLD_FILES=5           # Compact when >N files exist
CDB64_COMPACT_STRATEGY=auto/memory/external  # Default: auto
CDB64_COMPACT_MEMORY_THRESHOLD=10000000   # Switch to external sort above N keys

Two merge strategies:

Strategy A: In-Memory Deduplication

Best for datasets <10M keys (~600MB memory for dedup set)

Read all files → Track seen keys in Set → Write deduplicated output
  • Simple implementation
  • Memory usage: O(unique keys) for dedup tracking + O(records) for writer
  • Suitable for most gateway use cases
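
A sketch of Strategy A under assumed reader/writer shapes (the entries() iterator matches the Cdb64Reader behavior documented in Technical Details below; the add()/finalize() writer methods are assumptions):

// Assumed minimal shapes for the existing CDB64 reader and writer classes.
interface Cdb64Entry { key: Buffer; value: Buffer }
interface Cdb64ReaderLike { entries(): AsyncGenerator<Cdb64Entry> }
interface Cdb64WriterLike {
  add(key: Buffer, value: Buffer): Promise<void>;
  finalize(): Promise<void>;
}

// readers must be ordered alphabetically by filename so that, like lookups,
// the earliest file wins when the same key appears in several files.
async function mergeInMemory(
  readers: Cdb64ReaderLike[],
  writer: Cdb64WriterLike,
): Promise<void> {
  const seen = new Set<string>(); // roughly 60 bytes of overhead per key

  for (const reader of readers) {
    for await (const { key, value } of reader.entries()) {
      const keyHex = key.toString('hex');
      if (seen.has(keyHex)) continue; // duplicate from a later file; skip it
      seen.add(keyHex);
      await writer.add(key, value); // values are copied as opaque bytes
    }
  }

  await writer.finalize();
}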

Strategy B: External Sort

For massive datasets (100M+ keys)

1. Extract: Stream each file → sorted temp chunks
2. Merge: K-way merge sorted chunks with dedup
3. Write: Stream to new CDB64
  • Memory efficient: only needs buffers for K file handles
  • Handles arbitrary scale
  • More complex implementation
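
A compressed sketch of step 2 (the k-way merge with dedup); the chunk-sorting step is elided, and each temp chunk is assumed to replay its entries in ascending key order. Ties between chunks would need to be broken by source-file priority to preserve first-match-wins semantics; that bookkeeping is omitted here:

interface SortedEntry { key: Buffer; value: Buffer }

// Each sorted temp chunk replays its entries in ascending key order.
type SortedChunk = AsyncIterator<SortedEntry>;

async function kWayMergeDedup(
  chunks: SortedChunk[],
  emit: (entry: SortedEntry) => Promise<void>,
): Promise<void> {
  // One buffered head per chunk, so memory stays O(k) regardless of key count.
  const heads: (SortedEntry | null)[] = [];
  for (const chunk of chunks) {
    const first = await chunk.next();
    heads.push(first.done ? null : first.value);
  }

  let lastEmittedKey: Buffer | null = null;
  for (;;) {
    // Linear scan for the smallest live head; a heap would give O(log k) instead.
    let minIdx = -1;
    for (let i = 0; i < heads.length; i++) {
      const head = heads[i];
      if (
        head !== null &&
        (minIdx === -1 || Buffer.compare(head.key, heads[minIdx]!.key) < 0)
      ) {
        minIdx = i;
      }
    }
    if (minIdx === -1) break; // every chunk is exhausted

    const entry = heads[minIdx]!;
    // Duplicates sort adjacently, so comparing against the last emitted key
    // is enough to deduplicate across all chunks.
    if (lastEmittedKey === null || !entry.key.equals(lastEmittedKey)) {
      await emit(entry);
      lastEmittedKey = entry.key;
    }

    const next = await chunks[minIdx].next();
    heads[minIdx] = next.done ? null : next.value;
  }
}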

Auto strategy selection:

  • Estimate key count from file sizes (~90 bytes/record average)
  • Use in-memory below threshold, external sort above
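
A sketch of that estimate, using the ~90 bytes/record figure above; the default threshold mirrors CDB64_COMPACT_MEMORY_THRESHOLD:

import fs from 'node:fs/promises';

const APPROX_BYTES_PER_RECORD = 90; // rough average from the estimate above

async function chooseStrategy(
  filePaths: string[],
  memoryThresholdKeys = 10_000_000, // CDB64_COMPACT_MEMORY_THRESHOLD
): Promise<'memory' | 'external'> {
  let totalBytes = 0;
  for (const filePath of filePaths) {
    totalBytes += (await fs.stat(filePath)).size;
  }
  const estimatedKeys = Math.floor(totalBytes / APPROX_BYTES_PER_RECORD);
  return estimatedKeys <= memoryThresholdKeys ? 'memory' : 'external';
}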

Naming convention:

Before compaction:
  root-tx-index-1450000.cdb64
  root-tx-index-1460000.cdb64
  root-tx-index-1470000.cdb64

After compaction:
  root-tx-index-1450000-1470000.cdb64  # Merged file
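
A sketch of deriving the merged filename from its inputs, following the single-height and ranged patterns shown above:

// Parses root-tx-index-<start>.cdb64 or root-tx-index-<start>-<end>.cdb64.
const NAME_RE = /^root-tx-index-(\d+)(?:-(\d+))?\.cdb64$/;

function mergedFileName(inputNames: string[]): string {
  let min = Infinity;
  let max = -Infinity;
  for (const name of inputNames) {
    const m = NAME_RE.exec(name);
    if (m === null) continue; // ignore files that don't follow the convention
    const start = Number(m[1]);
    const end = m[2] !== undefined ? Number(m[2]) : start;
    min = Math.min(min, start);
    max = Math.max(max, end);
  }
  return `root-tx-index-${min}-${max}.cdb64`;
}

// mergedFileName(['root-tx-index-1450000.cdb64', 'root-tx-index-1470000.cdb64'])
//   => 'root-tx-index-1450000-1470000.cdb64'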

Compaction behavior:

  • Only delete source files after successful merge + verification
  • Verify merged file is readable before cleanup
  • Log compaction stats (files merged, keys deduplicated, size reduction)
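
A sketch of the verify-then-cleanup step, under an assumed reader shape; the get() method name and the idea of spot-checking sampled keys are assumptions:

import fs from 'node:fs/promises';

interface Cdb64ReaderLike {
  get(key: Buffer): Promise<Buffer | undefined>;
}

// Only remove the inputs once the merged file opens and returns values for a
// handful of keys sampled from the source files.
async function verifyAndCleanup(
  openReader: (path: string) => Promise<Cdb64ReaderLike>,
  mergedPath: string,
  sourcePaths: string[],
  sampleKeys: Buffer[],
): Promise<void> {
  const merged = await openReader(mergedPath);
  for (const key of sampleKeys) {
    if ((await merged.get(key)) === undefined) {
      throw new Error('merged file missing sampled key; keeping source files');
    }
  }
  // Verification passed; the source files are now redundant.
  await Promise.all(sourcePaths.map((p) => fs.unlink(p)));
}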

Technical Details

CDB64 Iteration

The Cdb64Reader.entries() async generator yields key-value pairs in write order (not sorted):

for await (const { key, value } of reader.entries()) {
  // Streams through file sequentially
}

This is memory-efficient but means external sort requires actual sorting, not just streaming.

Memory Constraints

Two memory considerations for large merges:

  1. Deduplication Set: ~60 bytes per key held in the Set
    • 1M keys = ~60MB
    • 100M keys = ~6GB
  2. Cdb64Writer internals: ~32 bytes per record during finalize()
    • Holds a { hash: bigint, position: bigint } array
    • 100M records = ~3.2GB

Multi-File Lookup Behavior

Current Cdb64RootTxIndex behavior with multiple files:

  • Files searched in alphabetical order
  • First match wins (no merging of values)
  • Compaction preserves this by processing files in alphabetical order
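
A sketch of that lookup order under an assumed reader shape (the get() method name is an assumption):

interface Cdb64ReaderLike {
  get(key: Buffer): Promise<Buffer | undefined>;
}

// Readers keyed by filename; the sort makes lookup order match the
// alphabetical order that compaction must also preserve.
async function lookupRootTx(
  readersByFileName: Map<string, Cdb64ReaderLike>,
  key: Buffer,
): Promise<Buffer | undefined> {
  const fileNames = [...readersByFileName.keys()].sort();
  for (const fileName of fileNames) {
    const value = await readersByFileName.get(fileName)!.get(key);
    if (value !== undefined) return value; // first match wins
  }
  return undefined;
}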

Value Format

MessagePack encoded with compact keys:

{ r: Buffer }                           // Simple: just root TX ID
{ r: Buffer, i: number, d: number }     // Complete: with offsets
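
A small round-trip example of these value shapes, using the @msgpack/msgpack package as a stand-in for whichever MessagePack library the codebase actually uses; the offset numbers are illustrative:

import { encode, decode } from '@msgpack/msgpack';

type SimpleValue = { r: Uint8Array };                         // root TX ID only
type CompleteValue = { r: Uint8Array; i: number; d: number }; // with offsets

const rootTxId = Buffer.alloc(32, 1); // placeholder 32-byte TX ID
const value: CompleteValue = { r: rootTxId, i: 1024, d: 512 };
const encoded: Uint8Array = encode(value);

// Lookups decode the value; merge/compaction never needs to and can copy
// the encoded bytes verbatim.
const roundTripped = decode(encoded) as CompleteValue;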

Merge treats values as opaque bytes; no interpretation is needed.

Implementation Plan

Phase 1: HTTP Endpoint

  • Add route in src/routes/ for /local/cdb64-root-tx-index
  • Add ENABLE_CDB64_ENDPOINT config option
  • Add to app.ts route registration
  • Document in docs/envs.md

Phase 2: Compaction Tool

  • Create src/lib/cdb64-merge.ts with merge logic
  • Implement in-memory strategy
  • Implement external sort strategy
  • Add auto-selection based on estimated size
  • Create CLI tool tools/compact-cdb64-root-tx-index
  • Add tests for merge correctness and deduplication

Phase 3: Export Worker

  • Create src/workers/cdb64-export-worker.ts
  • Integrate with block import events or timer
  • Add configuration options
  • Wire into src/system.ts

Phase 4: Compaction Worker (Optional)

  • Add automatic compaction triggering after exports
  • Configurable thresholds
  • Background execution to avoid blocking

Alternatives Considered

  1. Use datasets endpoint: Initially considered serving via /local/datasets/cdb64/, but serving the existing watched directory directly is simpler and ensures consistency between served and loaded files.

  2. Delete old files instead of merging: Simpler but loses historical data. Merging preserves all indexed data while reducing file count.

  3. Bloom filter for dedup: Lower memory than a Set, but false positives would occasionally cause unique keys to be skipped. Exact dedup is preferred for index integrity.

Related

  • Existing export tool: tools/export-sqlite-to-cdb64
  • CDB64 format docs: docs/cdb64-format.md
  • CDB64 tools docs: docs/cdb64-tools.md
  • Root TX index: src/discovery/cdb64-root-tx-index.ts
