Description
Problem
The current parquet-repartition script only supports single-dimension partitioning - either by owner address OR by a tag value, but not both simultaneously. This limits analytical use cases that need to query data filtered by multiple dimensions.
Current behavior:
# This works - single dimension
./scripts/parquet-repartition --partition-by-owner ...
./scripts/parquet-repartition --tag-name "Drive-Id" ...
# This fails - multi-dimension
./scripts/parquet-repartition --partition-by-owner --tag-name "Drive-Id" ...
# Error: Cannot specify both --tag-name and --partition-by-owner
Use Cases
- Wallet + Application Analysis: Analyze a specific wallet's activity within a particular application (e.g., all ArDrive transactions for wallet X)
- Owner + Drive Analytics: Query all data items for a specific owner within a specific drive, enabling efficient per-user drive browsing
- Multi-tag Partitioning: Partition by multiple tags (e.g., App-Name+Content-Type) for application-specific content analysis
- Hierarchical Data Organization: Create datasets optimized for drill-down queries (e.g., start with owner, then filter by drive, then by content type)
Proposed Solution
Option A: Ordered Partition Keys (Recommended)
Support an ordered list of partition keys that creates nested directory structures:
./scripts/parquet-repartition \
--input-dir data/datasets/default \
--output-dir data/datasets/by-owner-drive \
--partition-keys "owner_address,Drive-Id" \
--min-occurrences 100
Output structure:
output/
└── transactions/data/
    └── owner_address=9_666Wkk2GzL.../
        ├── drive_id=abc123/
        │   └── *.parquet
        └── drive_id=xyz789/
            └── *.parquet
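As a rough illustration of how an ordered key list could map to the nested layout above, here is a minimal Python sketch. It is not the script's actual code: `normalize_key` and `nested_partition_path` are hypothetical helpers, and the lowercase/underscore normalization is only inferred from the `Drive-Id` to `drive_id` example in this issue. The wallet and drive values are placeholders.

```python
from pathlib import PurePosixPath


def normalize_key(key: str) -> str:
    """Assumed normalization: 'Drive-Id' becomes 'drive_id'."""
    return key.strip().lower().replace("-", "_")


def nested_partition_path(base: str, keys: list[str], values: dict[str, str]) -> PurePosixPath:
    """Append one Hive-style key=value directory per partition key, in order."""
    path = PurePosixPath(base)
    for key in keys:
        path = path / f"{normalize_key(key)}={values[key]}"
    return path


keys = "owner_address,Drive-Id".split(",")
row = {"owner_address": "wallet1", "Drive-Id": "abc123"}  # placeholder values
print(nested_partition_path("output/transactions/data", keys, row))
# output/transactions/data/owner_address=wallet1/drive_id=abc123
```

The order of `keys` determines the directory depth, so the first key becomes the outermost directory.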
Option B: Composite Partition Key
Create a single partition level with composite keys:
./scripts/parquet-repartition \
--partition-composite "owner_address,Drive-Id"
Output structure:
output/
└── transactions/data/
    └── owner_address=9_666Wkk2GzL...__drive_id=abc123/
        └── *.parquet
Option A is preferred as it enables partition pruning at each level.
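To make the pruning argument concrete, the following sketch models, in a simplified way, how a Hive-style reader derives partition columns from directory names. `parse_hive_path` is a hypothetical helper, not any engine's real implementation, and the paths use placeholder values.

```python
def parse_hive_path(path: str) -> dict[str, str]:
    """Read key=value directory segments into partition columns (simplified model)."""
    columns = {}
    for segment in path.strip("/").split("/"):
        if "=" in segment:
            key, value = segment.split("=", 1)
            columns[key] = value
    return columns


nested = parse_hive_path("transactions/data/owner_address=wallet1/drive_id=abc123")
composite = parse_hive_path("transactions/data/owner_address=wallet1__drive_id=abc123")

print(nested)     # {'owner_address': 'wallet1', 'drive_id': 'abc123'}
print(composite)  # {'owner_address': 'wallet1__drive_id=abc123'}
```

With the nested layout each dimension surfaces as its own column, so a filter on either one can skip directories. Under this model the composite layout yields a single opaque value, and filtering by drive alone would need string matching on directory names.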
Requirements
Must Have
- Support 2+ partition dimensions in a single pass
- Maintain Hive-style directory naming (key=value/)
- Generate valid Apache Iceberg metadata for multi-level partitions
- Preserve existing single-dimension functionality (backward compatible)
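One possible way to keep the existing flags working is to translate them into the same ordered key list that --partition-keys would produce. The sketch below uses Python's argparse purely for illustration; it does not reflect how the real script parses arguments, `resolve_partition_keys` is a hypothetical name, and keeping the current "cannot specify both" error for the legacy flags is an assumption about the desired behavior.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="parquet-repartition", allow_abbrev=False)
    parser.add_argument("--partition-keys", default=None)            # proposed flag
    parser.add_argument("--partition-by-owner", action="store_true")  # existing flag
    parser.add_argument("--tag-name", default=None)                   # existing flag
    return parser


def resolve_partition_keys(argv: list[str]) -> list[str]:
    """Return the ordered partition keys for either the new or the legacy flags."""
    args = build_parser().parse_args(argv)
    legacy_used = args.partition_by_owner or args.tag_name is not None
    if args.partition_keys is not None:
        if legacy_used:
            raise ValueError("Use either --partition-keys or the legacy flags, not both")
        keys = [k.strip() for k in args.partition_keys.split(",") if k.strip()]
    elif args.partition_by_owner and args.tag_name is not None:
        # Assumed: the current single-dimension error is kept for legacy flags.
        raise ValueError("Cannot specify both --tag-name and --partition-by-owner")
    elif args.partition_by_owner:
        keys = ["owner_address"]
    elif args.tag_name is not None:
        keys = [args.tag_name]
    else:
        keys = []
    if not keys:
        raise ValueError("No partition key given")
    return keys


print(resolve_partition_keys(["--partition-by-owner"]))
print(resolve_partition_keys(["--partition-keys", "owner_address,Drive-Id"]))
```

Downstream code would then only ever deal with a list of keys, with the single-dimension case being a list of length one.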
Should Have
- Allow mixing owner-based and tag-based partition keys
- Support --preserve-height as an additional (final) partition level
- Configurable partition key ordering
- Per-dimension filtering (e.g., --min-occurrences per dimension)
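Per-dimension filtering could work roughly as sketched below, applying one threshold per level in key order. This is only one interpretation: the issue does not say whether rows below a threshold are dropped or routed elsewhere, so the sketch simply drops them. `filter_by_min_occurrences` and the sample rows are hypothetical.

```python
from collections import Counter


def filter_by_min_occurrences(rows, keys, min_occurrences):
    """Keep rows whose partition prefix meets the threshold at every level.

    Levels are applied in order, so counts at a deeper level only consider
    rows that survived the shallower levels. Keys without a threshold default to 1.
    """
    kept = list(rows)
    for depth in range(1, len(keys) + 1):
        level_keys = keys[:depth]
        threshold = min_occurrences.get(keys[depth - 1], 1)
        counts = Counter(tuple(r[k] for k in level_keys) for r in kept)
        kept = [r for r in kept if counts[tuple(r[k] for k in level_keys)] >= threshold]
    return kept


rows = (
    [{"owner_address": "A", "Drive-Id": "d1"}] * 3
    + [{"owner_address": "A", "Drive-Id": "d2"}]
    + [{"owner_address": "B", "Drive-Id": "d3"}]
)
kept = filter_by_min_occurrences(
    rows, ["owner_address", "Drive-Id"], {"owner_address": 2, "Drive-Id": 2}
)
print(len(kept))  # 3
```

In this example owner B is removed at the first level and the (A, d2) combination at the second, leaving only the three (A, d1) rows.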
Could Have
- Support for more than 2 custom dimensions (3+ level nesting)
- Partition key validation (warn if cardinality is too high)
- Automatic partition key ordering optimization based on cardinality
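The validation and ordering ideas above could be prototyped along these lines. This is a sketch under stated assumptions: it works on in-memory rows rather than Parquet files, and `order_keys_by_cardinality`, `partition_warnings`, and the threshold value are all hypothetical.

```python
def order_keys_by_cardinality(rows, keys):
    """Order keys from fewest to most distinct values (low cardinality first)."""
    cardinality = {k: len({r[k] for r in rows}) for k in keys}
    ordered = sorted(keys, key=lambda k: cardinality[k])
    return ordered, cardinality


def partition_warnings(rows, keys, max_partitions):
    """Return warning strings if the number of leaf partitions exceeds the limit."""
    leaf_partitions = {tuple(r[k] for k in keys) for r in rows}
    if len(leaf_partitions) > max_partitions:
        return [f"{len(leaf_partitions)} partitions exceeds limit of {max_partitions}"]
    return []


rows = [
    {"owner_address": "A", "Content-Type": "image/png"},
    {"owner_address": "B", "Content-Type": "image/png"},
    {"owner_address": "C", "Content-Type": "text/plain"},
]
ordered, cardinality = order_keys_by_cardinality(rows, ["owner_address", "Content-Type"])
print(ordered)      # ['Content-Type', 'owner_address']
print(cardinality)  # {'owner_address': 3, 'Content-Type': 2}
print(partition_warnings(rows, ordered, max_partitions=2))
```

On real datasets the distinct counts would more likely come from a query over the input files than from loading rows into memory.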
Technical Considerations
- Cardinality Management: High-cardinality combinations (e.g., owner x drive) could create millions of partitions. Need safeguards:
  - Require --min-occurrences for at least the leaf dimension
  - Warn when partition count exceeds threshold
  - Support --max-partitions at each level
- Memory Efficiency: Current height-chunking approach should extend to multi-dimensional processing
- Iceberg Schema: PyIceberg metadata generation needs updates to handle:
  - Multiple partition transforms
  - Nested partition specs
  - Proper field IDs for each partition column
- Query Performance: Document recommended partition key ordering (low cardinality first)
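For the Iceberg point, it may help to see what a multi-column partition spec looks like in table metadata. In Iceberg a partition spec is a flat, ordered list of fields, each pointing at a source column by ID, rather than a nested structure. The sketch below builds that JSON shape with the standard library only. It does not call PyIceberg, `build_partition_spec` is a hypothetical helper, and the schema field IDs are made up for illustration.

```python
import json


def build_partition_spec(schema_field_ids: dict[str, int], partition_columns: list[str], spec_id: int = 0) -> dict:
    """Build an Iceberg-style partition spec with one identity field per column.

    Partition field IDs are assigned sequentially starting at 1000,
    following the Iceberg convention.
    """
    fields = []
    for offset, column in enumerate(partition_columns):
        fields.append(
            {
                "source-id": schema_field_ids[column],  # ID of the column in the table schema
                "field-id": 1000 + offset,              # unique ID of this partition field
                "transform": "identity",
                "name": column,
            }
        )
    return {"spec-id": spec_id, "fields": fields}


# Hypothetical schema field IDs; real values come from the table schema.
schema_field_ids = {"owner_address": 3, "drive_id": 7}
spec = build_partition_spec(schema_field_ids, ["owner_address", "drive_id"])
print(json.dumps(spec, indent=2))
```

The order of entries in `fields` would need to match the directory nesting order produced by the repartition step.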
Example CLI Design
# Two-level: owner then drive
./scripts/parquet-repartition \
--input-dir data/datasets/default \
--output-dir data/datasets/analytics \
--partition-keys "owner_address,Drive-Id" \
--min-occurrences 50 \
--generate-iceberg
# Three-level: owner, drive, then height
./scripts/parquet-repartition \
--input-dir data/datasets/default \
--output-dir data/datasets/analytics \
--partition-keys "owner_address,Drive-Id" \
--preserve-height \
--height-partition-size 10000
# Two tags
./scripts/parquet-repartition \
--input-dir data/datasets/default \
--output-dir data/datasets/by-app-content \
--partition-keys "App-Name,Content-Type" \
--min-occurrences 100
Related Files
- scripts/parquet-repartition - Main repartitioning script
- scripts/parquet-export - Initial export (height partitioning)
- scripts/generate-iceberg-metadata - Iceberg metadata generation
- scripts/README-repartition.md - Documentation