DuckDB + Parquet GraphQL Implementation
Summary
Add a DuckDB-based GraphQL query backend that queries Parquet files directly, providing an alternative to ClickHouse for gateways that want analytics-grade query performance without running a separate database server.
Motivation
The current architecture offers two GraphQL backends:
- SQLite - Default, works everywhere, but slower for large datasets
- ClickHouse - Fast analytics queries, but requires running a separate server
DuckDB offers a middle ground:
- Embedded - No separate server process, runs in-process
- Parquet-native - Queries existing Parquet exports directly
- Simpler deployment - Single binary, no infrastructure overhead
- Good performance - Columnar engine optimized for analytics queries
The Parquet export infrastructure already exists (`scripts/parquet-export`, `src/workers/parquet-exporter.ts`), making this a natural extension.
Technical Background
Existing Parquet Schema
The export produces three tables with height-based partitioning:
`transactions` (`data/datasets/default/transactions/data/height={start}-{end}/*.parquet`):
```
id                       BLOB     -- Transaction/data item ID
height                   UINT64   -- Block height
block_transaction_index  UINT16   -- Position in block
is_data_item             BOOLEAN  -- L1 transaction vs L2 data item
owner_address            BLOB     -- Owner wallet address
target                   BLOB     -- Recipient (nullable)
quantity                 DECIMAL  -- Transfer amount
reward                   DECIMAL  -- Transaction fee
anchor                   BLOB     -- Last TX anchor
data_size                UINT64   -- Data size in bytes
content_type             VARCHAR  -- MIME type
parent                   BLOB     -- Parent bundle ID (for data items)
offset, size, data_offset, owner_offset, signature_offset, etc.
```
`tags` (`data/datasets/default/tags/data/height={start}-{end}/*.parquet`):
```
height        UINT64   -- Block height (for partition pruning)
id            BLOB     -- Transaction/data item ID (FK)
tag_index     UINT16   -- Tag position
tag_name      BLOB     -- Tag name (UTF-8)
tag_value     BLOB     -- Tag value (UTF-8)
is_data_item  BOOLEAN  -- L1 vs L2
```
`blocks` (`data/datasets/default/blocks/data/height={start}-{end}/*.parquet`):
```
indep_hash       BLOB     -- Block hash
height           UINT64   -- Block height
previous_block   BLOB     -- Previous block hash
block_timestamp  INTEGER  -- Unix timestamp
tx_count         INTEGER  -- Transaction count
block_size       UINT64   -- Block size
```
Key Differences from ClickHouse Schema
| Aspect | ClickHouse | Parquet |
|---|---|---|
| Tags | Denormalized as `Array(Tuple(BLOB, BLOB))` | Normalized separate table |
| Block info | Embedded in transactions | Separate table, requires JOIN |
| Specialized tables | `id_transactions`, `owner_transactions`, `target_transactions` | Single `transactions` table |
| Partitioning | Sorted by primary key | Hive-style height ranges |
Partition Structure
Files are organized in Hive-style directories:
```
data/datasets/default/
├── blocks/data/
│   ├── height=0-999/
│   ├── height=1000-1999/
│   └── ...
├── transactions/data/
│   └── (same structure)
└── tags/data/
    └── (same structure)
```
Important: `hive_partitioning=false` must be used when reading; otherwise DuckDB derives `height` from the partition directory name and replaces the real column with the range string (e.g. `0-999`).
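As a quick illustration of that setting, a minimal sketch using the `duckdb` Node bindings (the path and query are illustrative, not existing gateway code):

```typescript
import duckdb from 'duckdb';

const db = new duckdb.Database(':memory:');

// With hive_partitioning enabled, DuckDB would derive `height` from the
// directory name (the string '0-999'); disabling it keeps the UINT64
// height column stored in the Parquet files themselves.
db.all(
  `SELECT MIN(height) AS min_height, MAX(height) AS max_height
   FROM read_parquet('data/datasets/default/blocks/data/*/*.parquet',
                     hive_partitioning = false)`,
  (err, rows) => {
    if (err) throw err;
    console.log(rows);
  },
);
```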
Proposed Implementation
Architecture
```
src/database/
├── duckdb-gql/
│   ├── index.ts                 # DuckDbGqlDatabase class implementing GqlQueryable
│   ├── partition-resolver.ts    # Map height ranges → Parquet file globs
│   ├── query-builders/
│   │   ├── transactions.ts      # Transaction query builder with tag JOINs
│   │   ├── blocks.ts            # Block query builder
│   │   └── common.ts            # Shared cursor, ordering, filtering logic
│   └── result-mappers.ts        # Map DuckDB results → GQL types
└── composite-duckdb-gql.ts      # DuckDB + SQLite fallback for new data
```
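For orientation, a rough sketch of how `duckdb-gql/index.ts` could manage the DuckDB connection (assuming the `duckdb` Node bindings; the promisified `query` helper and constructor arguments are illustrative):

```typescript
import duckdb from 'duckdb';

export class DuckDbGqlDatabase {
  private db: duckdb.Database;

  constructor(private parquetDataPath: string) {
    // An in-memory instance is sufficient: all data is read from Parquet
    // files under parquetDataPath by the query builders.
    this.db = new duckdb.Database(':memory:');
  }

  // Promisified wrapper around the callback-style `all` API.
  protected query<T = Record<string, unknown>>(sql: string): Promise<T[]> {
    return new Promise((resolve, reject) => {
      this.db.all(sql, (err, rows) => {
        if (err) reject(err);
        else resolve(rows as T[]);
      });
    });
  }

  // The GqlQueryable methods (getGqlTransactions, getGqlBlocks, ...) would be
  // built on top of query() using the query builders described below.
}
```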
Core Interface
Implement the `GqlQueryable` interface (same as ClickHouse):

```typescript
interface GqlQueryable {
  getGqlTransaction(args: { id: string }): Promise<GqlTransaction | null>;
  getGqlTransactions(args: GqlTransactionArgs): Promise<GqlTransactionsResult>;
  getGqlBlock(args: { id: string }): Promise<GqlBlock | undefined>;
  getGqlBlocks(args: GqlBlockArgs): Promise<GqlBlocksResult>;
}
```

Query Patterns
Basic transaction query with block info:
```sql
SELECT
  t.*,
  b.indep_hash AS block_indep_hash,
  b.block_timestamp,
  b.previous_block
FROM read_parquet('${glob}', hive_partitioning=false) t
LEFT JOIN read_parquet('${blocksGlob}', hive_partitioning=false) b
  ON t.height = b.height
WHERE t.height BETWEEN ${minHeight} AND ${maxHeight}
ORDER BY t.height DESC, t.block_transaction_index DESC, t.is_data_item DESC, t.id DESC
LIMIT ${pageSize + 1}
```
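Cursor-based pagination would extend this with a keyset predicate on the same sort key. A sketch of what `query-builders/common.ts` might emit for an `after` cursor (the decoded cursor fields and literals are assumptions based on the ordering above):

```typescript
// Illustrative decoded cursor; the real cursor encoding is an open design detail.
const cursor = {
  height: 1_004_321,
  blockTransactionIndex: 17,
  idBlobLiteral: `'\\x0A\\x1B\\x2C'::BLOB`, // placeholder ID bytes
};

// Keyset ("seek") predicate appended to the WHERE clause when an `after`
// cursor is present, matching ORDER BY height DESC, block_transaction_index DESC,
// id DESC (the is_data_item tie-break is omitted here for brevity).
const afterCursorPredicate = `
  AND (
    t.height < ${cursor.height}
    OR (t.height = ${cursor.height}
        AND t.block_transaction_index < ${cursor.blockTransactionIndex})
    OR (t.height = ${cursor.height}
        AND t.block_transaction_index = ${cursor.blockTransactionIndex}
        AND t.id < ${cursor.idBlobLiteral})
  )
`;
```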
Tag filtering (normalized JOIN):

```sql
SELECT t.* FROM transactions t
WHERE EXISTS (
  SELECT 1 FROM tags tg
  WHERE tg.id = t.id
    AND tg.tag_name = '\x41\x70\x70\x2D\x4E\x61\x6D\x65'::BLOB    -- 'App-Name' as hex
    AND tg.tag_value IN ('\x41\x72\x44\x72\x69\x76\x65'::BLOB)    -- 'ArDrive' as hex
)
```

Multiple tag filters (AND semantics):

```sql
WHERE EXISTS (SELECT 1 FROM tags WHERE id = t.id AND tag_name = X AND tag_value IN (...))
  AND EXISTS (SELECT 1 FROM tags WHERE id = t.id AND tag_name = Y AND tag_value IN (...))
```
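A possible shape for the tag clause builder in `query-builders/transactions.ts` (the helper names and the filter input shape are assumptions; only the SQL pattern above is from the proposal):

```typescript
// Render a UTF-8 string as a DuckDB BLOB literal,
// e.g. 'App-Name' -> '\x41\x70\x70\x2D\x4E\x61\x6D\x65'::BLOB.
function toBlobLiteral(value: string): string {
  const escaped = [...Buffer.from(value, 'utf8')]
    .map((b) => `\\x${b.toString(16).toUpperCase().padStart(2, '0')}`)
    .join('');
  return `'${escaped}'::BLOB`;
}

// One EXISTS predicate per tag filter (AND across filters); values within a
// single filter are ORed via IN.
function buildTagPredicates(
  tagsGlob: string,
  filters: { name: string; values: string[] }[],
): string[] {
  return filters.map(({ name, values }) => {
    const valueList = values.map(toBlobLiteral).join(', ');
    return `EXISTS (
      SELECT 1
      FROM read_parquet('${tagsGlob}', hive_partitioning = false) tg
      WHERE tg.id = t.id
        AND tg.tag_name = ${toBlobLiteral(name)}
        AND tg.tag_value IN (${valueList})
    )`;
  });
}
```

The resulting predicates would be joined with AND and appended to the main transaction query's WHERE clause.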
Partition Resolution
Map query height ranges to file globs for efficient partition pruning:
```typescript
function resolvePartitionGlobs(
  basePath: string,
  minHeight?: number,
  maxHeight?: number,
  partitionSize: number = 1000,
): string[] {
  // Without an upper bound we cannot enumerate partitions, so scan all files;
  // the WHERE clause on height still filters rows.
  if (maxHeight === undefined) {
    return [`${basePath}/*/*.parquet`];
  }

  // Calculate which partitions overlap the query range
  const partitions: string[] = [];
  const start = Math.floor((minHeight ?? 0) / partitionSize) * partitionSize;
  for (let h = start; h <= maxHeight; h += partitionSize) {
    const partEnd = h + partitionSize - 1;
    partitions.push(`${basePath}/height=${h}-${partEnd}/*.parquet`);
  }
  return partitions;
}
```
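For example, the resolved globs could be passed to `read_parquet` as a list (DuckDB accepts a list of paths/globs); the paths below are illustrative:

```typescript
const globs = resolvePartitionGlobs(
  'data/datasets/default/transactions/data',
  1_000_000,
  1_004_999,
);
// -> ['.../height=1000000-1000999/*.parquet', ..., '.../height=1004000-1004999/*.parquet']

const fromClause = `read_parquet(
  [${globs.map((g) => `'${g}'`).join(', ')}],
  hive_partitioning = false
) t`;
```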
Composite Pattern (DuckDB + SQLite)
Same pattern as `CompositeClickHouseDatabase`:
```typescript
class CompositeDuckDbDatabase implements GqlQueryable {
  constructor(
    private duckdb: DuckDbGqlDatabase, // Queries Parquet (stable data)
    private sqlite: GqlQueryable,      // Fallback for new/pending data
  ) {}

  async getGqlTransactions(args: GqlTransactionArgs): Promise<GqlTransactionsResult> {
    // Query DuckDB for stable data
    const duckdbResults = await this.duckdb.getGqlTransactions(args);
    // Query SQLite for new data
    const sqliteResults = await this.sqlite.getGqlTransactions(args);
    // Merge, dedupe, sort, paginate
    return this.mergeResults(duckdbResults, sqliteResults, args);
  }
}
```
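A rough sketch of the merge step, assuming a result shape of `{ pageInfo, edges: [{ cursor, node }] }` and cursors that sort in the same order as the underlying sort key (both are assumptions; the real types live in `src/types.d.ts`):

```typescript
function mergeResults(
  duckdbResults: GqlTransactionsResult,
  sqliteResults: GqlTransactionsResult,
  args: { first?: number },
): GqlTransactionsResult {
  const pageSize = args.first ?? 10; // assumed default page size
  const seen = new Set<string>();

  const edges = [...sqliteResults.edges, ...duckdbResults.edges]
    // Dedupe on node ID; SQLite (fresher data) wins when both backends
    // return the same transaction.
    .filter((edge) => {
      if (seen.has(edge.node.id)) return false;
      seen.add(edge.node.id);
      return true;
    })
    // Re-sort on the cursor so pagination stays stable across both sources.
    .sort((a, b) => (a.cursor < b.cursor ? 1 : a.cursor > b.cursor ? -1 : 0))
    .slice(0, pageSize);

  return {
    pageInfo: { hasNextPage: edges.length === pageSize },
    edges,
  };
}
```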
Implementation Plan
Phase 1: Core Query Engine
- Create `DuckDbGqlDatabase` class with connection management
- Implement partition resolver for height-based glob generation
- Implement `getGqlBlocks`/`getGqlBlock` (simpler, good starting point)
- Add basic tests with sample Parquet data
Phase 2: Transaction Queries
- Implement `getGqlTransactions` with height/ID filtering
- Add block info JOIN for transaction results
- Implement cursor-based pagination
- Add owner/recipient filtering
Phase 3: Tag Queries
- Implement tag filtering with EXISTS subqueries
- Handle multiple tag filters (AND semantics)
- Handle tag value arrays (OR semantics within a tag)
- Optimize tag query performance
Phase 4: Integration
- Create `CompositeDuckDbDatabase` for SQLite fallback
- Wire into `system.ts` with config flag
- Add environment variables (`DUCKDB_GQL_ENABLED`, `PARQUET_DATA_PATH`)
- Update documentation
Phase 5: Optimization (Future)
- Connection pooling / persistent DuckDB instance
- Query result caching
- Consider additional Parquet exports (owner-partitioned, target-partitioned)
- Bloom filters for ID lookups
Configuration
```sh
# Enable DuckDB GraphQL backend
DUCKDB_GQL_ENABLED=true

# Path to Parquet datasets (default: data/datasets/default)
PARQUET_DATA_PATH=data/datasets/default

# Partition size (must match export settings, default: 1000)
PARQUET_PARTITION_SIZE=1000
```
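A hedged sketch of how the flag could be wired in `system.ts` (variable names and the existing SQLite instance name are illustrative):

```typescript
const duckDbGqlEnabled = process.env.DUCKDB_GQL_ENABLED === 'true';
const parquetDataPath =
  process.env.PARQUET_DATA_PATH ?? 'data/datasets/default';

// Fall through to the existing SQLite-backed GqlQueryable when disabled;
// otherwise put DuckDB in front with SQLite as the fallback for new data.
export const gqlQueryable: GqlQueryable = duckDbGqlEnabled
  ? new CompositeDuckDbDatabase(
      new DuckDbGqlDatabase(parquetDataPath),
      sqliteGqlDatabase, // assumed name of the existing SQLite backend
    )
  : sqliteGqlDatabase;
```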
Performance Considerations
| Query Pattern | Expected Performance | Notes |
|---|---|---|
| By height range | Good | Partition pruning eliminates irrelevant files |
| By ID | Moderate | Full scan unless bloom filters added |
| By owner/target | Moderate-Slow | Full scan of transactions table |
| By tags | Moderate | JOIN to tags table, depends on selectivity |
| Pagination | Good | DuckDB handles LIMIT/OFFSET efficiently |
For high-volume owner/target queries, consider generating additional Parquet exports partitioned by those fields (similar to ClickHouse's specialized tables).
Open Questions
- Worker thread? Should DuckDB queries run in a worker thread like SQLite to avoid blocking the event loop?
- Connection lifecycle? Keep a persistent DuckDB connection, or create one per query?
- Memory limits? Should we configure DuckDB memory limits to prevent OOM on large queries?
- Parquet caching? DuckDB caches Parquet metadata - is this sufficient, or do we need additional caching?
- ID lookup optimization? Accept slower ID lookups, or invest in bloom filters / ID-partitioned exports?
Acceptance Criteria
- All existing GraphQL queries work with DuckDB backend
- Results match SQLite/ClickHouse for the same data
- Cursor pagination works correctly
- Tag filtering works with AND/OR semantics
- New/pending data served from SQLite fallback
- Performance acceptable for typical gateway workloads
- Configuration documented in `docs/envs.md`
Related
- Parquet export: `scripts/parquet-export`
- DuckDB schema: `src/database/duckdb/schema.sql`
- ClickHouse implementation: `src/database/composite-clickhouse.ts`
- GqlQueryable interface: `src/types.d.ts`