Add DuckDB + Parquet GraphQL Backend #571

@djwhitt

DuckDB + Parquet GraphQL Implementation

Summary

Add a DuckDB-based GraphQL query backend that queries Parquet files directly, providing an alternative to ClickHouse for gateways that want analytics-grade query performance without running a separate database server.

Motivation

The current architecture offers two GraphQL backends:

  1. SQLite - Default, works everywhere, but slower for large datasets
  2. ClickHouse - Fast analytics queries, but requires running a separate server

DuckDB offers a middle ground:

  • Embedded - No separate server process, runs in-process
  • Parquet-native - Queries existing Parquet exports directly
  • Simpler deployment - Single binary, no infrastructure overhead
  • Good performance - Columnar engine optimized for analytics queries

The Parquet export infrastructure already exists (scripts/parquet-export, src/workers/parquet-exporter.ts), making this a natural extension.

Technical Background

Existing Parquet Schema

The export produces three tables with height-based partitioning:

transactions (data/datasets/default/transactions/data/height={start}-{end}/*.parquet):

id                      BLOB        -- Transaction/data item ID
height                  UINT64      -- Block height
block_transaction_index UINT16      -- Position in block
is_data_item            BOOLEAN     -- L1 transaction vs L2 data item
owner_address           BLOB        -- Owner wallet address
target                  BLOB        -- Recipient (nullable)
quantity                DECIMAL     -- Transfer amount
reward                  DECIMAL     -- Transaction fee
anchor                  BLOB        -- Last TX anchor
data_size               UINT64      -- Data size in bytes
content_type            VARCHAR     -- MIME type
parent                  BLOB        -- Parent bundle ID (for data items)
offset, size, data_offset, owner_offset, signature_offset, etc.

tags (data/datasets/default/tags/data/height={start}-{end}/*.parquet):

height                  UINT64      -- Block height (for partition pruning)
id                      BLOB        -- Transaction/data item ID (FK)
tag_index               UINT16      -- Tag position
tag_name                BLOB        -- Tag name (UTF-8)
tag_value               BLOB        -- Tag value (UTF-8)
is_data_item            BOOLEAN     -- L1 vs L2

blocks (data/datasets/default/blocks/data/height={start}-{end}/*.parquet):

indep_hash              BLOB        -- Block hash
height                  UINT64      -- Block height
previous_block          BLOB        -- Previous block hash
block_timestamp         INTEGER     -- Unix timestamp
tx_count                INTEGER     -- Transaction count
block_size              UINT64      -- Block size

Key Differences from ClickHouse Schema

Aspect             | ClickHouse                                                | Parquet
-------------------|-----------------------------------------------------------|-------------------------------
Tags               | Denormalized as Array(Tuple(BLOB, BLOB))                  | Normalized separate table
Block info         | Embedded in transactions                                  | Separate table, requires JOIN
Specialized tables | id_transactions, owner_transactions, target_transactions | Single transactions table
Partitioning       | Sorted by primary key                                     | Hive-style height ranges

Partition Structure

Files are organized in Hive-style directories:

data/datasets/default/
├── blocks/data/
│   ├── height=0-999/
│   ├── height=1000-1999/
│   └── ...
├── transactions/data/
│   └── (same structure)
└── tags/data/
    └── (same structure)

Important: reads must pass hive_partitioning=false, since DuckDB would otherwise replace the height column with the partition range string taken from the directory name (e.g. 0-999) instead of the per-row height.

Proposed Implementation

Architecture

src/database/
├── duckdb-gql/
│   ├── index.ts                    # DuckDbGqlDatabase class implementing GqlQueryable
│   ├── partition-resolver.ts       # Map height ranges → Parquet file globs
│   ├── query-builders/
│   │   ├── transactions.ts         # Transaction query builder with tag JOINs
│   │   ├── blocks.ts               # Block query builder
│   │   └── common.ts               # Shared cursor, ordering, filtering logic
│   └── result-mappers.ts           # Map DuckDB results → GQL types
└── composite-duckdb-gql.ts         # DuckDB + SQLite fallback for new data

Core Interface

Implement GqlQueryable interface (same as ClickHouse):

interface GqlQueryable {
  getGqlTransaction(args: { id: string }): Promise<GqlTransaction | null>;
  getGqlTransactions(args: GqlTransactionArgs): Promise<GqlTransactionsResult>;
  getGqlBlock(args: { id: string }): Promise<GqlBlock | undefined>;
  getGqlBlocks(args: GqlBlockArgs): Promise<GqlBlocksResult>;
}
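
A minimal sketch of how DuckDbGqlDatabase might wrap DuckDB (assuming the duckdb npm package's callback-style API; runQuery and toStandardBase64 are hypothetical helpers, and result mapping to the GQL types is elided):

import duckdb from 'duckdb';

// Sketch only: the real class would implement the full GqlQueryable interface
// and map rows to the GQL types from src/types.d.ts.
export class DuckDbGqlDatabase {
  private db: duckdb.Database;

  constructor(private parquetDataPath: string) {
    // In-memory instance; Parquet files are scanned lazily at query time
    this.db = new duckdb.Database(':memory:');
  }

  // Promise wrapper around the callback-based Database#all API
  private runQuery(sql: string): Promise<Record<string, unknown>[]> {
    return new Promise((resolve, reject) => {
      this.db.all(sql, (err, rows) => (err ? reject(err) : resolve(rows)));
    });
  }

  async getGqlBlock({ id }: { id: string }): Promise<Record<string, unknown> | undefined> {
    // GraphQL block IDs are base64url; indep_hash is stored as raw bytes
    const glob = `${this.parquetDataPath}/blocks/data/*/*.parquet`;
    const rows = await this.runQuery(`
      SELECT height, indep_hash, previous_block, block_timestamp
      FROM read_parquet('${glob}', hive_partitioning=false)
      WHERE to_base64(indep_hash) = '${toStandardBase64(id)}'
      LIMIT 1
    `);
    return rows[0]; // mapping to GqlBlock omitted in this sketch
  }
}

// Convert a base64url ID (as used in GraphQL) to standard base64 for comparison
function toStandardBase64(b64url: string): string {
  return Buffer.from(b64url, 'base64url').toString('base64');
}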

Query Patterns

Basic transaction query with block info:

SELECT
  t.*,
  b.indep_hash AS block_indep_hash,
  b.block_timestamp,
  b.previous_block
FROM read_parquet('${glob}', hive_partitioning=false) t
LEFT JOIN read_parquet('${blocksGlob}', hive_partitioning=false) b
  ON t.height = b.height
WHERE t.height BETWEEN ${minHeight} AND ${maxHeight}
ORDER BY t.height DESC, t.block_transaction_index DESC, t.is_data_item DESC, t.id DESC
LIMIT ${pageSize + 1}
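
Cursor-based pagination (Phase 2) would pair that ORDER BY with a matching compound predicate. A sketch of the idea, assuming a hypothetical decoded cursor carrying height and blockTransactionIndex (the real cursor would also need is_data_item and id to make the ordering total):

// Hypothetical helper: translate a decoded HEIGHT_DESC cursor into a predicate
// that matches ORDER BY height DESC, block_transaction_index DESC above.
function buildCursorClause(cursor?: {
  height: number;
  blockTransactionIndex: number;
}): string {
  if (cursor === undefined) {
    return '';
  }
  return `AND (t.height < ${cursor.height}
    OR (t.height = ${cursor.height}
        AND t.block_transaction_index < ${cursor.blockTransactionIndex}))`;
}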

Tag filtering (normalized JOIN):

SELECT t.* FROM transactions t
WHERE EXISTS (
  SELECT 1 FROM tags tg
  WHERE tg.id = t.id
    AND tg.tag_name = '\x41\x70\x70\x2D\x4E\x61\x6D\x65'::BLOB   -- 'App-Name' as bytes
    AND tg.tag_value IN ('\x41\x72\x44\x72\x69\x76\x65'::BLOB)   -- 'ArDrive' as bytes
)

Multiple tag filters (AND semantics):

WHERE EXISTS (SELECT 1 FROM tags WHERE id = t.id AND tag_name = X AND tag_value IN (...))
  AND EXISTS (SELECT 1 FROM tags WHERE id = t.id AND tag_name = Y AND tag_value IN (...))
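
A sketch of how these clauses could be generated in TypeScript (tagsGlob, toBlobLiteral, and buildTagClauses are hypothetical names; the BLOB-literal encoding assumes DuckDB's '\xNN' escape syntax):

// Encode a UTF-8 string as a DuckDB BLOB literal, e.g. 'App-Name' -> '\x41\x70...'::BLOB
function toBlobLiteral(value: string): string {
  const hex = [...Buffer.from(value, 'utf8')]
    .map((b) => `\\x${b.toString(16).padStart(2, '0')}`)
    .join('');
  return `'${hex}'::BLOB`;
}

// One EXISTS clause per tag name (AND semantics); values within a tag use IN (OR semantics)
function buildTagClauses(
  tagsGlob: string,
  filters: { name: string; values: string[] }[],
): string[] {
  return filters.map(({ name, values }) => {
    const valueList = values.map(toBlobLiteral).join(', ');
    return `EXISTS (
  SELECT 1
  FROM read_parquet('${tagsGlob}', hive_partitioning=false) tg
  WHERE tg.id = t.id
    AND tg.tag_name = ${toBlobLiteral(name)}
    AND tg.tag_value IN (${valueList})
)`;
  });
}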

Partition Resolution

Map query height ranges to file globs for efficient partition pruning:

function resolvePartitionGlobs(
  basePath: string,
  minHeight?: number,
  maxHeight?: number,
  partitionSize: number = 1000
): string[] {
  // Without an upper bound we cannot enumerate partitions, so scan everything
  // (covers both "no height filter" and "only minHeight provided")
  if (maxHeight === undefined) {
    return [`${basePath}/*/*.parquet`];
  }

  // Calculate which partitions overlap the query range
  const partitions: string[] = [];
  const start = Math.floor((minHeight ?? 0) / partitionSize) * partitionSize;

  for (let h = start; h <= maxHeight; h += partitionSize) {
    const partEnd = h + partitionSize - 1;
    partitions.push(`${basePath}/height=${h}-${partEnd}/*.parquet`);
  }

  return partitions;
}
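
For example, with the default partition size, minHeight = 1500 and maxHeight = 3200 resolve to just the height=1000-1999, height=2000-2999, and height=3000-3999 globs, so DuckDB never opens files outside the requested range. One caveat for implementation: a glob that matches no files (e.g. a partition that has not been exported yet) makes read_parquet error, so globs may need to be checked against the filesystem first.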

Composite Pattern (DuckDB + SQLite)

Same pattern as CompositeClickHouseDatabase:

class CompositeDuckDbDatabase implements GqlQueryable {
  constructor(
    private duckdb: DuckDbGqlDatabase,   // Queries Parquet (stable data)
    private sqlite: GqlQueryable,        // Fallback for new/pending data
  ) {}

  async getGqlTransactions(args: GqlTransactionArgs): Promise<GqlTransactionsResult> {
    // Query DuckDB for stable data
    const duckdbResults = await this.duckdb.getGqlTransactions(args);

    // Query SQLite for new data
    const sqliteResults = await this.sqlite.getGqlTransactions(args);

    // Merge, dedupe, sort, paginate
    return this.mergeResults(duckdbResults, sqliteResults, args);
  }

  // getGqlTransaction, getGqlBlock, and getGqlBlocks follow the same pattern
}
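
Most of the subtlety is in the merge step. A rough sketch using simplified local types rather than the real GqlTransactionsResult (assumptions: each backend returns at most one page of edges already sorted HEIGHT_DESC, and the SQLite row wins on duplicate IDs):

// Simplified shapes for the sketch; the real types live in src/types.d.ts
interface Edge {
  cursor: string;
  node: { id: string; height: number; blockTransactionIndex: number };
}
interface Page {
  edges: Edge[];
  pageInfo: { hasNextPage: boolean };
}

function mergeResults(duckdbPage: Page, sqlitePage: Page, pageSize: number): Page {
  // Dedupe by ID; SQLite edges are inserted last so they win on conflicts
  const byId = new Map<string, Edge>();
  for (const edge of [...duckdbPage.edges, ...sqlitePage.edges]) {
    byId.set(edge.node.id, edge);
  }

  // Re-sort in HEIGHT_DESC order and trim back down to a single page
  const edges = [...byId.values()]
    .sort(
      (a, b) =>
        b.node.height - a.node.height ||
        b.node.blockTransactionIndex - a.node.blockTransactionIndex,
    )
    .slice(0, pageSize);

  return {
    edges,
    pageInfo: {
      hasNextPage:
        duckdbPage.pageInfo.hasNextPage ||
        sqlitePage.pageInfo.hasNextPage ||
        byId.size > pageSize, // more merged rows than fit on one page
    },
  };
}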

Implementation Plan

Phase 1: Core Query Engine

  • Create DuckDbGqlDatabase class with connection management
  • Implement partition resolver for height-based glob generation
  • Implement getGqlBlocks / getGqlBlock (simpler, good starting point)
  • Add basic tests with sample Parquet data

Phase 2: Transaction Queries

  • Implement getGqlTransactions with height/ID filtering
  • Add block info JOIN for transaction results
  • Implement cursor-based pagination
  • Add owner/recipient filtering

Phase 3: Tag Queries

  • Implement tag filtering with EXISTS subqueries
  • Handle multiple tag filters (AND semantics)
  • Handle tag value arrays (OR semantics within a tag)
  • Optimize tag query performance

Phase 4: Integration

  • Create CompositeDuckDbDatabase for SQLite fallback
  • Wire into system.ts with config flag
  • Add environment variables (DUCKDB_GQL_ENABLED, PARQUET_DATA_PATH)
  • Update documentation

Phase 5: Optimization (Future)

  • Connection pooling / persistent DuckDB instance
  • Query result caching
  • Consider additional Parquet exports (owner-partitioned, target-partitioned)
  • Bloom filters for ID lookups

Configuration

# Enable DuckDB GraphQL backend
DUCKDB_GQL_ENABLED=true

# Path to Parquet datasets (default: data/datasets/default)
PARQUET_DATA_PATH=data/datasets/default

# Partition size (must match export settings, default: 1000)
PARQUET_PARTITION_SIZE=1000
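
Reading these in the gateway config module could look roughly like this (a sketch following the existing process.env pattern; exact names and defaults would need to match docs/envs.md):

// Hypothetical additions to the gateway config module
export const DUCKDB_GQL_ENABLED =
  (process.env.DUCKDB_GQL_ENABLED ?? 'false') === 'true';

export const PARQUET_DATA_PATH =
  process.env.PARQUET_DATA_PATH ?? 'data/datasets/default';

export const PARQUET_PARTITION_SIZE = Number.parseInt(
  process.env.PARQUET_PARTITION_SIZE ?? '1000',
  10,
);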

Performance Considerations

Query Pattern   | Expected Performance | Notes
----------------|----------------------|-----------------------------------------------
By height range | Good                 | Partition pruning eliminates irrelevant files
By ID           | Moderate             | Full scan unless bloom filters added
By owner/target | Moderate-Slow        | Full scan of transactions table
By tags         | Moderate             | JOIN to tags table, depends on selectivity
Pagination      | Good                 | DuckDB handles LIMIT/OFFSET efficiently

For high-volume owner/target queries, consider generating additional Parquet exports partitioned by those fields (similar to ClickHouse's specialized tables).

Open Questions

  1. Worker thread? Should DuckDB queries run in a worker thread like SQLite to avoid blocking the event loop?

  2. Connection lifecycle? Keep a persistent DuckDB connection vs create per-query?

  3. Memory limits? Should we configure DuckDB memory limits to prevent OOM on large queries?

  4. Parquet caching? DuckDB caches Parquet metadata - is this sufficient or do we need additional caching?

  5. ID lookup optimization? Accept slower ID lookups, or invest in bloom filters / ID-partitioned exports?

Acceptance Criteria

  • All existing GraphQL queries work with DuckDB backend
  • Results match SQLite/ClickHouse for the same data
  • Cursor pagination works correctly
  • Tag filtering works with AND/OR semantics
  • New/pending data served from SQLite fallback
  • Performance acceptable for typical gateway workloads
  • Configuration documented in docs/envs.md

Related

  • Parquet export: scripts/parquet-export
  • DuckDB schema: src/database/duckdb/schema.sql
  • ClickHouse implementation: src/database/composite-clickhouse.ts
  • GqlQueryable interface: src/types.d.ts
