Add Parquet File Compaction to Automated Export/ETL Pipeline
Problem
The automated Parquet export pipeline (clickhouse-auto-import) creates files incrementally based on height intervals (default: 10,000 blocks). Over time, this can result in:
- Many small Parquet files accumulating in each partition
- Suboptimal query performance (more files = more I/O overhead)
- Inefficient storage utilization compared to properly-sized files
- No consolidation when the same height range is re-exported
The recommended file size for analytics workloads is typically 128MB-256MB per file, but incremental exports may produce files that are far smaller or that vary widely in size.
Current Behavior
- `clickhouse-auto-import` exports data in fixed height intervals
- Files are organized by height partitions (e.g., `height=0-9999/`)
- `CLICKHOUSE_AUTO_IMPORT_MAX_ROWS_PER_FILE` controls the maximum rows per file but not a minimum
- Files are written once and never consolidated
- No mechanism to merge small files or optimize file layout
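As an illustration, a partition might look like this after several export cycles (file names and sizes here are hypothetical):

```
height=0-9999/
├── blocks-00001.parquet   (14 MB)
├── blocks-00002.parquet   (9 MB)
├── blocks-00003.parquet   (11 MB)
└── blocks-00004.parquet   (6 MB)
```

Every file sits well below the 128MB-256MB sweet spot, so a query over this range pays the file open/read overhead four times.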
Proposed Solution
Add automatic compaction as part of the export/ETL process:
Compaction Strategy Options
- Post-export compaction: After each export cycle, check partitions for compaction opportunities
- Periodic compaction: Run compaction on a separate schedule (e.g., daily)
- Threshold-based: Compact when a partition exceeds N files or a total-size threshold (see the check sketched below)
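A minimal sketch of the threshold check, assuming the environment variables proposed under Configuration Options below (the function name and the averaging heuristic are illustrative):

```python
import os
from pathlib import Path

# Thresholds mirror the proposed env vars below.
MIN_FILES = int(os.environ.get("PARQUET_COMPACTION_MIN_FILES", "5"))
TARGET_SIZE = int(os.environ.get("PARQUET_COMPACTION_TARGET_SIZE", str(256 * 1024 * 1024)))

def needs_compaction(partition_dir: Path) -> bool:
    """Return True when a partition has accumulated enough small files
    that merging them would move it toward the target file size."""
    files = list(partition_dir.glob("*.parquet"))
    if len(files) < MIN_FILES:
        return False
    # Only worth compacting if the average file is well below target.
    avg_size = sum(f.stat().st_size for f in files) / len(files)
    return avg_size < TARGET_SIZE / 2
```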
Compaction Logic
- Merge multiple small Parquet files within a partition into larger consolidated files
- Target file size: configurable (e.g., 128MB-256MB, or row-count based)
- Preserve partitioning structure (height-based)
- Use DuckDB for efficient read/write operations (already a dependency)
- Atomic replacement: write new files first, then remove the old ones (see the DuckDB sketch below)
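A sketch of the merge step using DuckDB's Python API; the output file name is hypothetical, and locking/error handling are deferred to the considerations below:

```python
import duckdb
from pathlib import Path

def compact_partition(partition_dir: Path) -> None:
    """Merge every Parquet file in one height partition into a single
    consolidated file, then swap it into place: new file first, old
    files removed after, per the atomic-replacement point above."""
    old_files = sorted(partition_dir.glob("*.parquet"))
    if len(old_files) < 2:
        return  # nothing to merge
    tmp_out = partition_dir / "compacted.tmp"  # .tmp so the glob above ignores it
    # DuckDB reads the small files via a glob and rewrites them as one file.
    duckdb.sql(
        f"COPY (SELECT * FROM read_parquet('{partition_dir}/*.parquet')) "
        f"TO '{tmp_out}' (FORMAT PARQUET)"
    )
    final = partition_dir / "compacted-00000.parquet"  # hypothetical naming scheme
    tmp_out.rename(final)
    for f in old_files:
        if f != final:  # remove inputs only after the new file is in place
            f.unlink()
```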
Configuration Options
```
# Enable/disable compaction
PARQUET_COMPACTION_ENABLED=true

# Target file size in bytes (default: 256MB)
PARQUET_COMPACTION_TARGET_SIZE=268435456

# Minimum files in partition before considering compaction
PARQUET_COMPACTION_MIN_FILES=5

# Compaction schedule in seconds (if separate from export)
PARQUET_COMPACTION_INTERVAL=86400
```
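For the periodic-compaction option, a simple runner tying these variables to the earlier sketches might look like the following (`PARQUET_EXPORT_DIR` is an assumed name for the export root; a cron entry would work just as well):

```python
import os
import time
from pathlib import Path

EXPORT_ROOT = Path(os.environ.get("PARQUET_EXPORT_DIR", "/data/parquet"))  # assumed variable name
INTERVAL = int(os.environ.get("PARQUET_COMPACTION_INTERVAL", "86400"))

def run_compaction_loop() -> None:
    """Scan all height partitions on a fixed interval, compacting any
    that pass the threshold check from the earlier sketch."""
    while os.environ.get("PARQUET_COMPACTION_ENABLED", "false") == "true":
        for partition in sorted(EXPORT_ROOT.glob("height=*")):
            if needs_compaction(partition):
                compact_partition(partition)
        time.sleep(INTERVAL)
```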
Implementation Considerations
- Locking/Concurrency: Ensure compaction doesn't interfere with ongoing exports or ClickHouse imports (a lock-file sketch follows this list)
- Iceberg Metadata: Update Iceberg metadata after compaction if enabled
- Disk Space: Compaction temporarily requires ~2x partition space during file rewrite
- Progress Tracking: Track which partitions have been compacted to avoid redundant work
- Error Recovery: Handle partial compaction failures gracefully
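One way to address the locking and idempotency points, assuming exporter and compactor run on the same host, is an exclusive non-blocking lock file per partition:

```python
import fcntl
from contextlib import contextmanager
from pathlib import Path

@contextmanager
def partition_lock(partition_dir: Path):
    """Exclusive per-partition lock so compaction and export never rewrite
    the same partition at once. Non-blocking: if the lock is already held,
    yield False so the caller skips this partition until the next cycle."""
    lock_path = partition_dir / ".compaction.lock"
    with open(lock_path, "w") as fh:
        try:
            fcntl.flock(fh, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            yield False
            return
        try:
            yield True
        finally:
            fcntl.flock(fh, fcntl.LOCK_UN)

# Usage:
#   with partition_lock(partition) as acquired:
#       if acquired:
#           compact_partition(partition)
```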
Acceptance Criteria
- Compaction can be enabled/disabled via an environment variable
- Configurable target file size and compaction thresholds
- Compaction integrates with the existing `clickhouse-auto-import` pipeline
- Iceberg metadata is updated after compaction (when enabled)
- Compaction is idempotent and safe to run concurrently with exports
- Logging provides visibility into compaction activity and results
- Documentation updated with new configuration options
Related
- `scripts/clickhouse-auto-import` - Automated export pipeline
- `scripts/parquet-export` - Base export script
- `scripts/parquet-repartition` - Repartitioning (similar file manipulation patterns)