feat: Support multi-dimensional Parquet partitioning (owner + tag combinations) #573

@djwhitt

Problem

The current parquet-repartition script supports only single-dimension partitioning: either by owner address or by a tag value, but not both at once. This limits analytical use cases that need to filter data on multiple dimensions.

Current behavior:

# This works - single dimension
./scripts/parquet-repartition --partition-by-owner ...
./scripts/parquet-repartition --tag-name "Drive-Id" ...

# This fails - multi-dimension
./scripts/parquet-repartition --partition-by-owner --tag-name "Drive-Id" ...
# Error: Cannot specify both --tag-name and --partition-by-owner

Use Cases

  1. Wallet + Application Analysis: Analyze a specific wallet's activity within a particular application (e.g., all ArDrive transactions for wallet X)

  2. Owner + Drive Analytics: Query all data items for a specific owner within a specific drive, enabling efficient per-user drive browsing

  3. Multi-tag Partitioning: Partition by multiple tags (e.g., App-Name + Content-Type) for application-specific content analysis

  4. Hierarchical Data Organization: Create datasets optimized for drill-down queries (e.g., start with owner, then filter by drive, then by content type)

Proposed Solution

Option A: Ordered Partition Keys (Recommended)

Support an ordered list of partition keys that creates nested directory structures:

./scripts/parquet-repartition \
  --input-dir data/datasets/default \
  --output-dir data/datasets/by-owner-drive \
  --partition-keys "owner_address,Drive-Id" \
  --min-occurrences 100

Output structure:

output/
└── transactions/data/
    └── owner_address=9_666Wkk2GzL.../
        ├── drive_id=abc123/
        │   └── *.parquet
        └── drive_id=xyz789/
            └── *.parquet
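A minimal sketch of how an ordered key list could map to the nested directories shown above. The function names and the normalization rule (lowercase, non-alphanumerics to underscores, so that Drive-Id becomes drive_id) are assumptions inferred from the example output, not the script's actual behavior.

```python
import re
from pathlib import PurePosixPath
from urllib.parse import quote


def column_name(key: str) -> str:
    """Normalize a partition key, e.g. 'Drive-Id' -> 'drive_id' (assumed rule)."""
    return re.sub(r"[^0-9a-z]+", "_", key.lower()).strip("_")


def nested_partition_path(root: str, keys: list[str], row: dict[str, str]) -> PurePosixPath:
    """Build one Hive-style key=value directory level per partition key, in order."""
    path = PurePosixPath(root)
    for key in keys:
        # Percent-encode the value so characters like '/' cannot break the layout.
        path /= f"{column_name(key)}={quote(row[key], safe='')}"
    return path


row = {"owner_address": "OWNER1", "Drive-Id": "abc123"}  # hypothetical values
print(nested_partition_path("output/transactions/data", ["owner_address", "Drive-Id"], row))
# output/transactions/data/owner_address=OWNER1/drive_id=abc123
```

Because the keys are walked in the order given, reordering --partition-keys would change the directory nesting without any other code change.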

Option B: Composite Partition Key

Create a single partition level with composite keys:

./scripts/parquet-repartition \
  --partition-composite "owner_address,Drive-Id"

Output structure:

output/
└── transactions/data/
    └── owner_address=9_666Wkk2GzL...__drive_id=abc123/
        └── *.parquet

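Option A is preferred because it enables partition pruning at each level: a query that filters only on owner_address can skip every other owner's directory without inspecting the drive_id level.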
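To illustrate pruning on the nested layout, here is a toy sketch that models the directory tree as nested dictionaries and skips whole subtrees when a level's filter does not match. It is an illustration of the idea only; real query engines implement pruning themselves, and the tree contents and function name are hypothetical.

```python
def prune(tree: dict, keys: list[str], filters: dict[str, str]) -> list[str]:
    """Return leaf partition paths matching the filters, skipping subtrees early."""
    results: list[str] = []

    def walk(node: dict, depth: int, prefix: list[str]) -> None:
        if depth == len(keys):
            results.append("/".join(prefix))
            return
        key = keys[depth]
        wanted = filters.get(key)
        for value, child in node.items():
            if wanted is not None and value != wanted:
                continue  # the whole subtree is skipped without being visited
            walk(child, depth + 1, prefix + [f"{key}={value}"])

    walk(tree, 0, [])
    return results


# Hypothetical layout: two owners, three leaf partitions.
tree = {"ownerA": {"abc123": {}, "xyz789": {}}, "ownerB": {"abc123": {}}}
print(prune(tree, ["owner_address", "drive_id"], {"owner_address": "ownerA"}))
# ['owner_address=ownerA/drive_id=abc123', 'owner_address=ownerA/drive_id=xyz789']
```

With the composite layout of Option B, a filter on owner_address alone would first have to split each directory name back into its parts, which is the extra step the nested layout avoids.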

Requirements

Must Have

  • Support 2+ partition dimensions in a single pass
  • Maintain Hive-style directory naming (key=value/)
  • Generate valid Apache Iceberg metadata for multi-level partitions
  • Preserve existing single-dimension functionality (backward compatible)

Should Have

  • Allow mixing owner-based and tag-based partition keys
  • Support --preserve-height as an additional (final) partition level
  • Configurable partition key ordering
  • Per-dimension filtering (e.g., --min-occurrences per dimension)

Could Have

  • Support for more than 2 custom dimensions (3+ level nesting)
  • Partition key validation (warn if cardinality is too high)
  • Automatic partition key ordering optimization based on cardinality
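The per-dimension filtering, cardinality validation, and automatic ordering items could look roughly like the sketch below. It works on in-memory rows for clarity; the function names are hypothetical, and applying the minimums sequentially (each count taken over the rows that survived the previous key) is one possible semantics, not a settled design.

```python
from collections import Counter

Row = dict[str, str]


def filter_min_occurrences(rows: list[Row], min_occurrences: dict[str, int]) -> list[Row]:
    """Drop rows whose value for a key occurs fewer times than that key's minimum."""
    kept = rows
    for key, minimum in min_occurrences.items():
        counts = Counter(row[key] for row in kept)
        kept = [row for row in kept if counts[row[key]] >= minimum]
    return kept


def order_by_cardinality(rows: list[Row], keys: list[str]) -> list[str]:
    """Order partition keys from fewest to most distinct values."""
    return sorted(keys, key=lambda key: len({row[key] for row in rows}))


def partition_count(rows: list[Row], keys: list[str]) -> int:
    """Number of distinct key combinations, i.e. leaf partitions that would be written."""
    return len({tuple(row[key] for key in keys) for row in rows})
```

A caller could compare partition_count against a configurable threshold and emit a warning before any files are written.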

Technical Considerations

  1. Cardinality Management: High-cardinality combinations (e.g., owner × drive) could create millions of partitions. Safeguards are needed:

    • Require --min-occurrences for at least the leaf dimension
    • Warn when partition count exceeds threshold
    • Support --max-partitions at each level
  2. Memory Efficiency: The current height-chunking approach should be extended to multi-dimensional processing

  3. Iceberg Schema: PyIceberg metadata generation needs updates to handle:

    • Multiple partition transforms
    • Nested partition specs
    • Proper field IDs for each partition column
  4. Query Performance: Document recommended partition key ordering (low cardinality first)
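For the Iceberg point, the sketch below builds a partition spec in the JSON shape described by the Iceberg table spec, where a multi-level layout is expressed as a flat, ordered list of fields and partition field IDs conventionally start at 1000. It deliberately uses plain dictionaries rather than PyIceberg calls; the helper name and the source column IDs (3 and 7) are hypothetical and would have to come from the real table schema.

```python
import json


def identity_partition_spec(columns: dict[str, int], spec_id: int = 0) -> dict:
    """Build an identity-transform partition spec.

    `columns` maps each partition column name to the field ID of its source
    column in the table schema, in partition order (outermost level first).
    """
    return {
        "spec-id": spec_id,
        "fields": [
            {
                "source-id": source_id,
                "field-id": 1000 + index,  # one distinct ID per partition column
                "name": name,
                "transform": "identity",
            }
            for index, (name, source_id) in enumerate(columns.items())
        ],
    }


# Hypothetical schema field IDs for the two partition columns.
spec = identity_partition_spec({"owner_address": 3, "drive_id": 7})
print(json.dumps(spec, indent=2))
```

Keeping the field order identical to the --partition-keys order would make the metadata match the directory nesting on disk.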

Example CLI Design

# Two-level: owner then drive
./scripts/parquet-repartition \
  --input-dir data/datasets/default \
  --output-dir data/datasets/analytics \
  --partition-keys "owner_address,Drive-Id" \
  --min-occurrences 50 \
  --generate-iceberg

# Three-level: owner, drive, then height
./scripts/parquet-repartition \
  --input-dir data/datasets/default \
  --output-dir data/datasets/analytics \
  --partition-keys "owner_address,Drive-Id" \
  --preserve-height \
  --height-partition-size 10000

# Two tags
./scripts/parquet-repartition \
  --input-dir data/datasets/default \
  --output-dir data/datasets/by-app-content \
  --partition-keys "App-Name,Content-Type" \
  --min-occurrences 100
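One way the proposed --partition-keys flag could coexist with the existing flags is to resolve everything to a single ordered key list. The sketch below uses Python's argparse purely for illustration; the existing script may not be written in Python, and treating the legacy flags as shorthand for a one- or two-element list is an assumption about the desired backward-compatible behavior.

```python
import argparse


def split_keys(value: str) -> list[str]:
    """Parse 'owner_address,Drive-Id' into ['owner_address', 'Drive-Id']."""
    return [part.strip() for part in value.split(",") if part.strip()]


def resolve_partition_keys(argv: list[str]) -> list[str]:
    """Resolve new and legacy flags to one ordered list of partition keys."""
    parser = argparse.ArgumentParser(prog="parquet-repartition")
    parser.add_argument("--partition-keys", type=split_keys, default=None)
    parser.add_argument("--partition-by-owner", action="store_true")  # legacy flag
    parser.add_argument("--tag-name", default=None)  # legacy flag
    args = parser.parse_args(argv)

    if args.partition_keys is not None:
        if args.partition_by_owner or args.tag_name:
            raise ValueError("use either --partition-keys or the legacy flags, not both")
        keys = args.partition_keys
    else:
        # Assumed mapping: legacy flags become a short ordered key list.
        keys = (["owner_address"] if args.partition_by_owner else []) + (
            [args.tag_name] if args.tag_name else []
        )

    if not keys:
        raise ValueError("at least one partition key is required")
    if len(set(keys)) != len(keys):
        raise ValueError("partition keys must be unique")
    return keys
```

Under this mapping, `--partition-by-owner` alone and `--tag-name "Drive-Id"` alone keep producing single-dimension output, while `--partition-keys "owner_address,Drive-Id"` yields the two-level case.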

Related Files

  • scripts/parquet-repartition - Main repartitioning script
  • scripts/parquet-export - Initial export (height partitioning)
  • scripts/generate-iceberg-metadata - Iceberg metadata generation
  • scripts/README-repartition.md - Documentation
