Add DuckDB + Parquet GraphQL Backend #571

@djwhitt

DuckDB + Parquet GraphQL Implementation

Summary

Add a DuckDB-based GraphQL query backend that queries Parquet files directly, providing an alternative to ClickHouse for gateways that want analytics-grade query performance without running a separate database server.

Motivation

The current architecture offers two GraphQL backends:

  1. SQLite - Default, works everywhere, but slower for large datasets
  2. ClickHouse - Fast analytics queries, but requires running a separate server

DuckDB offers a middle ground:

  • Embedded - No separate server process, runs in-process
  • Parquet-native - Queries existing Parquet exports directly
  • Simpler deployment - Single binary, no infrastructure overhead
  • Good performance - Columnar engine optimized for analytics queries

The Parquet export infrastructure already exists (scripts/parquet-export, src/workers/parquet-exporter.ts), making this a natural extension.

Technical Background

Existing Parquet Schema

The export produces three tables with height-based partitioning:

transactions (data/datasets/default/transactions/data/height={start}-{end}/*.parquet):

id                      BLOB        -- Transaction/data item ID
height                  UINT64      -- Block height
block_transaction_index UINT16      -- Position in block
is_data_item            BOOLEAN     -- L1 transaction vs L2 data item
owner_address           BLOB        -- Owner wallet address
target                  BLOB        -- Recipient (nullable)
quantity                DECIMAL     -- Transfer amount
reward                  DECIMAL     -- Transaction fee
anchor                  BLOB        -- Last TX anchor
data_size               UINT64      -- Data size in bytes
content_type            VARCHAR     -- MIME type
parent                  BLOB        -- Parent bundle ID (for data items)
offset, size, data_offset, owner_offset, signature_offset, etc.

tags (data/datasets/default/tags/data/height={start}-{end}/*.parquet):

height                  UINT64      -- Block height (for partition pruning)
id                      BLOB        -- Transaction/data item ID (FK)
tag_index               UINT16      -- Tag position
tag_name                BLOB        -- Tag name (UTF-8)
tag_value               BLOB        -- Tag value (UTF-8)
is_data_item            BOOLEAN     -- L1 vs L2

blocks (data/datasets/default/blocks/data/height={start}-{end}/*.parquet):

indep_hash              BLOB        -- Block hash
height                  UINT64      -- Block height
previous_block          BLOB        -- Previous block hash
block_timestamp         INTEGER     -- Unix timestamp
tx_count                INTEGER     -- Transaction count
block_size              UINT64      -- Block size

Key Differences from ClickHouse Schema

Aspect             | ClickHouse                                                | Parquet
-------------------|-----------------------------------------------------------|-------------------------------
Tags               | Denormalized as Array(Tuple(BLOB, BLOB))                  | Normalized separate table
Block info         | Embedded in transactions                                  | Separate table, requires JOIN
Specialized tables | id_transactions, owner_transactions, target_transactions | Single transactions table
Partitioning       | Sorted by primary key                                     | Hive-style height ranges

Partition Structure

Files are organized in Hive-style directories:

data/datasets/default/
├── blocks/data/
│   ├── height=0-999/
│   ├── height=1000-1999/
│   └── ...
├── transactions/data/
│   └── (same structure)
└── tags/data/
    └── (same structure)

Important: reads must pass hive_partitioning=false, since DuckDB would otherwise replace the height column with the partition range string taken from the directory name (e.g. 0-999) instead of the per-row height.

Proposed Implementation

Architecture

src/database/
├── duckdb-gql/
│   ├── index.ts                    # DuckDbGqlDatabase class implementing GqlQueryable
│   ├── partition-resolver.ts       # Map height ranges → Parquet file globs
│   ├── query-builders/
│   │   ├── transactions.ts         # Transaction query builder with tag JOINs
│   │   ├── blocks.ts               # Block query builder
│   │   └── common.ts               # Shared cursor, ordering, filtering logic
│   └── result-mappers.ts           # Map DuckDB results → GQL types
└── composite-duckdb-gql.ts         # DuckDB + SQLite fallback for new data

Core Interface

Implement GqlQueryable interface (same as ClickHouse):

interface GqlQueryable {
  getGqlTransaction(args: { id: string }): Promise<GqlTransaction | null>;
  getGqlTransactions(args: GqlTransactionArgs): Promise<GqlTransactionsResult>;
  getGqlBlock(args: { id: string }): Promise<GqlBlock | undefined>;
  getGqlBlocks(args: GqlBlockArgs): Promise<GqlBlocksResult>;
}
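
A minimal sketch of how DuckDbGqlDatabase might wrap DuckDB (assuming the duckdb npm package's callback-style API; runQuery and toStandardBase64 are hypothetical helpers, and result mapping to the GQL types is elided):

import duckdb from 'duckdb';

// Sketch only: the real class would implement the full GqlQueryable interface
// and map rows to the GQL types from src/types.d.ts.
export class DuckDbGqlDatabase {
  private db: duckdb.Database;

  constructor(private parquetDataPath: string) {
    // In-memory instance; Parquet files are scanned lazily at query time
    this.db = new duckdb.Database(':memory:');
  }

  // Promise wrapper around the callback-based Database#all API
  private runQuery(sql: string): Promise<Record<string, unknown>[]> {
    return new Promise((resolve, reject) => {
      this.db.all(sql, (err, rows) => (err ? reject(err) : resolve(rows)));
    });
  }

  async getGqlBlock({ id }: { id: string }): Promise<Record<string, unknown> | undefined> {
    // GraphQL block IDs are base64url; indep_hash is stored as raw bytes
    const glob = `${this.parquetDataPath}/blocks/data/*/*.parquet`;
    const rows = await this.runQuery(`
      SELECT height, indep_hash, previous_block, block_timestamp
      FROM read_parquet('${glob}', hive_partitioning=false)
      WHERE to_base64(indep_hash) = '${toStandardBase64(id)}'
      LIMIT 1
    `);
    return rows[0]; // mapping to GqlBlock omitted in this sketch
  }
}

// Convert a base64url ID (as used in GraphQL) to standard base64 for comparison
function toStandardBase64(b64url: string): string {
  return Buffer.from(b64url, 'base64url').toString('base64');
}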

Query Patterns

Basic transaction query with block info:

SELECT
  t.*,
  b.indep_hash AS block_indep_hash,
  b.block_timestamp,
  b.previous_block
FROM read_parquet('${glob}', hive_partitioning=false) t
LEFT JOIN read_parquet('${blocksGlob}', hive_partitioning=false) b
  ON t.height = b.height
WHERE t.height BETWEEN ${minHeight} AND ${maxHeight}
ORDER BY t.height DESC, t.block_transaction_index DESC, t.is_data_item DESC, t.id DESC
LIMIT ${pageSize + 1}
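
Cursor-based pagination (Phase 2) would pair that ORDER BY with a matching compound predicate. A sketch of the idea, assuming a hypothetical decoded cursor carrying height and blockTransactionIndex (the real cursor would also need is_data_item and id to make the ordering total):

// Hypothetical helper: translate a decoded HEIGHT_DESC cursor into a predicate
// that matches ORDER BY height DESC, block_transaction_index DESC above.
function buildCursorClause(cursor?: {
  height: number;
  blockTransactionIndex: number;
}): string {
  if (cursor === undefined) {
    return '';
  }
  return `AND (t.height < ${cursor.height}
    OR (t.height = ${cursor.height}
        AND t.block_transaction_index < ${cursor.blockTransactionIndex}))`;
}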

Tag filtering (normalized JOIN):

SELECT t.* FROM transactions t
WHERE EXISTS (
  SELECT 1 FROM tags tg
  WHERE tg.id = t.id
    AND tg.tag_name = '\x41\x70\x70\x2D\x4E\x61\x6D\x65'::BLOB   -- 'App-Name' as bytes
    AND tg.tag_value IN ('\x41\x72\x44\x72\x69\x76\x65'::BLOB)   -- 'ArDrive' as bytes
)

Multiple tag filters (AND semantics):

WHERE EXISTS (SELECT 1 FROM tags WHERE id = t.id AND tag_name = X AND tag_value IN (...))
  AND EXISTS (SELECT 1 FROM tags WHERE id = t.id AND tag_name = Y AND tag_value IN (...))
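
A sketch of how these clauses could be generated in TypeScript (tagsGlob, toBlobLiteral, and buildTagClauses are hypothetical names; the BLOB-literal encoding assumes DuckDB's '\xNN' escape syntax):

// Encode a UTF-8 string as a DuckDB BLOB literal, e.g. 'App-Name' -> '\x41\x70...'::BLOB
function toBlobLiteral(value: string): string {
  const hex = [...Buffer.from(value, 'utf8')]
    .map((b) => `\\x${b.toString(16).padStart(2, '0')}`)
    .join('');
  return `'${hex}'::BLOB`;
}

// One EXISTS clause per tag name (AND semantics); values within a tag use IN (OR semantics)
function buildTagClauses(
  tagsGlob: string,
  filters: { name: string; values: string[] }[],
): string[] {
  return filters.map(({ name, values }) => {
    const valueList = values.map(toBlobLiteral).join(', ');
    return `EXISTS (
  SELECT 1
  FROM read_parquet('${tagsGlob}', hive_partitioning=false) tg
  WHERE tg.id = t.id
    AND tg.tag_name = ${toBlobLiteral(name)}
    AND tg.tag_value IN (${valueList})
)`;
  });
}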

Partition Resolution

Map query height ranges to file globs for efficient partition pruning:

function resolvePartitionGlobs(
  basePath: string,
  minHeight?: number,
  maxHeight?: number,
  partitionSize: number = 1000
): string[] {
  // Without an upper bound we cannot enumerate partitions, so scan everything
  // (covers both "no height filter" and "only minHeight provided")
  if (maxHeight === undefined) {
    return [`${basePath}/*/*.parquet`];
  }

  // Calculate which partitions overlap the query range
  const partitions: string[] = [];
  const start = Math.floor((minHeight ?? 0) / partitionSize) * partitionSize;

  for (let h = start; h <= maxHeight; h += partitionSize) {
    const partEnd = h + partitionSize - 1;
    partitions.push(`${basePath}/height=${h}-${partEnd}/*.parquet`);
  }

  return partitions;
}
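
For example, with the default partition size, minHeight = 1500 and maxHeight = 3200 resolve to just the height=1000-1999, height=2000-2999, and height=3000-3999 globs, so DuckDB never opens files outside the requested range. One caveat for implementation: a glob that matches no files (e.g. a partition that has not been exported yet) makes read_parquet error, so globs may need to be checked against the filesystem first.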

Composite Pattern (DuckDB + SQLite)

Same pattern as CompositeClickHouseDatabase:

class CompositeDuckDbDatabase implements GqlQueryable {
  constructor(
    private duckdb: DuckDbGqlDatabase,   // Queries Parquet (stable data)
    private sqlite: GqlQueryable,        // Fallback for new/pending data
  ) {}

  async getGqlTransactions(args: GqlTransactionArgs): Promise<GqlTransactionsResult> {
    // Query DuckDB for stable data
    const duckdbResults = await this.duckdb.getGqlTransactions(args);

    // Query SQLite for new data
    const sqliteResults = await this.sqlite.getGqlTransactions(args);

    // Merge, dedupe, sort, paginate
    return this.mergeResults(duckdbResults, sqliteResults, args);
  }

  // getGqlTransaction, getGqlBlock, and getGqlBlocks follow the same pattern
}
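
Most of the subtlety is in the merge step. A rough sketch using simplified local types rather than the real GqlTransactionsResult (assumptions: each backend returns at most one page of edges already sorted HEIGHT_DESC, and the SQLite row wins on duplicate IDs):

// Simplified shapes for the sketch; the real types live in src/types.d.ts
interface Edge {
  cursor: string;
  node: { id: string; height: number; blockTransactionIndex: number };
}
interface Page {
  edges: Edge[];
  pageInfo: { hasNextPage: boolean };
}

function mergeResults(duckdbPage: Page, sqlitePage: Page, pageSize: number): Page {
  // Dedupe by ID; SQLite edges are inserted last so they win on conflicts
  const byId = new Map<string, Edge>();
  for (const edge of [...duckdbPage.edges, ...sqlitePage.edges]) {
    byId.set(edge.node.id, edge);
  }

  // Re-sort in HEIGHT_DESC order and trim back down to a single page
  const edges = [...byId.values()]
    .sort(
      (a, b) =>
        b.node.height - a.node.height ||
        b.node.blockTransactionIndex - a.node.blockTransactionIndex,
    )
    .slice(0, pageSize);

  return {
    edges,
    pageInfo: {
      hasNextPage:
        duckdbPage.pageInfo.hasNextPage ||
        sqlitePage.pageInfo.hasNextPage ||
        byId.size > pageSize, // more merged rows than fit on one page
    },
  };
}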

Implementation Plan

Phase 1: Core Query Engine

  • Create DuckDbGqlDatabase class with connection management
  • Implement partition resolver for height-based glob generation
  • Implement getGqlBlocks / getGqlBlock (simpler, good starting point)
  • Add basic tests with sample Parquet data

Phase 2: Transaction Queries

  • Implement getGqlTransactions with height/ID filtering
  • Add block info JOIN for transaction results
  • Implement cursor-based pagination
  • Add owner/recipient filtering

Phase 3: Tag Queries

  • Implement tag filtering with EXISTS subqueries
  • Handle multiple tag filters (AND semantics)
  • Handle tag value arrays (OR semantics within a tag)
  • Optimize tag query performance

Phase 4: Integration

  • Create CompositeDuckDbDatabase for SQLite fallback
  • Wire into system.ts with config flag
  • Add environment variables (DUCKDB_GQL_ENABLED, PARQUET_DATA_PATH)
  • Update documentation

Phase 5: Optimization (Future)

  • Connection pooling / persistent DuckDB instance
  • Query result caching
  • Consider additional Parquet exports (owner-partitioned, target-partitioned)
  • Bloom filters for ID lookups

Configuration

# Enable DuckDB GraphQL backend
DUCKDB_GQL_ENABLED=true

# Path to Parquet datasets (default: data/datasets/default)
PARQUET_DATA_PATH=data/datasets/default

# Partition size (must match export settings, default: 1000)
PARQUET_PARTITION_SIZE=1000
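
Reading these in the gateway config module could look roughly like this (a sketch following the existing process.env pattern; exact names and defaults would need to match docs/envs.md):

// Hypothetical additions to the gateway config module
export const DUCKDB_GQL_ENABLED =
  (process.env.DUCKDB_GQL_ENABLED ?? 'false') === 'true';

export const PARQUET_DATA_PATH =
  process.env.PARQUET_DATA_PATH ?? 'data/datasets/default';

export const PARQUET_PARTITION_SIZE = Number.parseInt(
  process.env.PARQUET_PARTITION_SIZE ?? '1000',
  10,
);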

Performance Considerations

Query Pattern   | Expected Performance | Notes
----------------|----------------------|-----------------------------------------------
By height range | Good                 | Partition pruning eliminates irrelevant files
By ID           | Moderate             | Full scan unless bloom filters added
By owner/target | Moderate-Slow        | Full scan of transactions table
By tags         | Moderate             | JOIN to tags table, depends on selectivity
Pagination      | Good                 | DuckDB handles LIMIT/OFFSET efficiently

For high-volume owner/target queries, consider generating additional Parquet exports partitioned by those fields (similar to ClickHouse's specialized tables).

Open Questions

  1. Worker thread? Should DuckDB queries run in a worker thread like SQLite to avoid blocking the event loop?

  2. Connection lifecycle? Keep a persistent DuckDB connection vs create per-query?

  3. Memory limits? Should we configure DuckDB memory limits to prevent OOM on large queries?

  4. Parquet caching? DuckDB caches Parquet metadata - is this sufficient or do we need additional caching?

  5. ID lookup optimization? Accept slower ID lookups, or invest in bloom filters / ID-partitioned exports?

Acceptance Criteria

  • All existing GraphQL queries work with DuckDB backend
  • Results match SQLite/ClickHouse for the same data
  • Cursor pagination works correctly
  • Tag filtering works with AND/OR semantics
  • New/pending data served from SQLite fallback
  • Performance acceptable for typical gateway workloads
  • Configuration documented in docs/envs.md

Related

  • Parquet export: scripts/parquet-export
  • DuckDB schema: src/database/duckdb/schema.sql
  • ClickHouse implementation: src/database/composite-clickhouse.ts
  • GqlQueryable interface: src/types.d.ts
