Add Parquet File Compaction to Automated Export/ETL Pipeline #574

@djwhitt

Problem

The automated Parquet export pipeline (clickhouse-auto-import) creates files incrementally based on height intervals (default: 10,000 blocks). Over time, this can result in:

  • Many small Parquet files accumulating in each partition
  • Suboptimal query performance (more files = more I/O overhead)
  • Inefficient storage utilization compared to properly-sized files
  • No consolidation when the same height range is re-exported

The recommended file size for analytics workloads is typically 128MB-256MB per file, but incremental exports may produce files that are much smaller or inconsistently sized.

Current Behavior

  • clickhouse-auto-import exports data in fixed height intervals
  • Files are organized by height partitions (e.g., height=0-9999/)
  • CLICKHOUSE_AUTO_IMPORT_MAX_ROWS_PER_FILE controls max rows but not minimum
  • Files are written once and never consolidated
  • No mechanism to merge small files or optimize file layout (see the illustration below)
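
For illustration only (the file names and sizes here are hypothetical), a partition that has accumulated several incremental exports and re-exports might look like this:

height=0-9999/
  blocks-export-1.parquet    # ~40 MB
  blocks-export-2.parquet    # ~35 MB
  blocks-export-3.parquet    # ~45 MB
  blocks-export-4.parquet    # ~40 MB
  blocks-export-5.parquet    # ~38 MB

Every query touching this height range opens five files where a single consolidated file would do.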

Proposed Solution

Add automatic compaction as part of the export/ETL process:

Compaction Strategy Options

  1. Post-export compaction: After each export cycle, check partitions for compaction opportunities
  2. Periodic compaction: Run compaction on a separate schedule (e.g., daily)
  3. Threshold-based: Compact when a partition exceeds a file-count or total-size threshold (a minimal check is sketched after this list)
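
As a rough sketch of the threshold-based trigger (option 3), assuming the PARQUET_COMPACTION_MIN_FILES variable proposed below, an example partition path, and a hypothetical compact_partition helper:

# Sketch: trigger compaction once a partition accumulates too many files
partition_dir="data/parquet/blocks/height=0-9999"   # example partition path
file_count=$(find "$partition_dir" -maxdepth 1 -name '*.parquet' | wc -l)

if [ "$file_count" -ge "${PARQUET_COMPACTION_MIN_FILES:-5}" ]; then
  compact_partition "$partition_dir"   # hypothetical helper; see the Compaction Logic sketch
fi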

Compaction Logic

  • Merge multiple small Parquet files within a partition into larger consolidated files
  • Target file size: configurable (e.g., 128MB-256MB, or based on row count)
  • Preserve partitioning structure (height-based)
  • Use DuckDB for efficient read/write operations (already a dependency)
  • Atomic replacement: write new files, then remove old ones (see the sketch after this list)
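
A minimal sketch of the merge-and-swap step using the DuckDB CLI; the partition path, staging file name, and output file name are assumptions, not a final design:

# Record the files that will be replaced before writing the consolidated file
partition_dir="data/parquet/blocks/height=0-9999"   # example partition path
staging="$partition_dir/.compaction-staging.tmp"    # hypothetical staging file name
old_files=$(find "$partition_dir" -maxdepth 1 -name '*.parquet')

# Merge every Parquet file in the partition into one consolidated file via DuckDB
duckdb <<SQL
COPY (SELECT * FROM read_parquet('${partition_dir}/*.parquet'))
TO '${staging}' (FORMAT PARQUET);
SQL

# Publish the consolidated file first, then delete the originals it replaces
mv "$staging" "$partition_dir/blocks-0-9999-compacted.parquet"
echo "$old_files" | xargs -r rm -f

Writing to a staging name that the read_parquet glob does not match keeps the merge from reading its own output, and renaming only after the write completes avoids exposing a partially written file. A reader listing the directory between the mv and the rm could still see the data twice, which is one reason the Iceberg metadata update under Implementation Considerations matters for true atomicity.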

Configuration Options

# Enable/disable compaction
PARQUET_COMPACTION_ENABLED=true

# Target file size in bytes (default: 256MB)
PARQUET_COMPACTION_TARGET_SIZE=268435456

# Minimum files in partition before considering compaction
PARQUET_COMPACTION_MIN_FILES=5

# Compaction interval in seconds, if run on a separate schedule from export (86400 = daily)
PARQUET_COMPACTION_INTERVAL=86400

Implementation Considerations

  1. Locking/Concurrency: Ensure compaction doesn't interfere with ongoing exports or ClickHouse imports (a locking sketch follows this list)
  2. Iceberg Metadata: Update Iceberg metadata after compaction if enabled
  3. Disk Space: Compaction temporarily requires ~2x partition space during file rewrite
  4. Progress Tracking: Track which partitions have been compacted to avoid redundant work
  5. Error Recovery: Handle partial compaction failures gracefully
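
For the locking concern (item 1), one possible shape, assuming a hypothetical lock file path, is for the export loop and the compaction step to take the same exclusive lock via flock:

# Sketch: serialize compaction against exports with an exclusive file lock
lock_file="data/parquet/.compaction.lock"   # hypothetical lock file path

(
  # Fail fast instead of blocking if an export or another compaction run holds the lock
  flock -n 9 || { echo "lock held; skipping compaction this cycle"; exit 0; }

  # ... compact eligible partitions here ...
) 9>"$lock_file"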

Acceptance Criteria

  • Compaction can be enabled/disabled via environment variable
  • Configurable target file size and compaction thresholds
  • Compaction integrates with existing clickhouse-auto-import pipeline
  • Iceberg metadata is updated after compaction (when enabled)
  • Compaction is idempotent and safe to run concurrently with exports
  • Logging provides visibility into compaction activity and results
  • Documentation updated with new configuration options

Related

  • scripts/clickhouse-auto-import - Automated export pipeline
  • scripts/parquet-export - Base export script
  • scripts/parquet-repartition - Repartitioning (similar file manipulation patterns)
