Conversation

@vamsi-op vamsi-op commented Nov 5, 2025

Description

This PR implements the feature requested in #572 to provide configuration options for optimizing Parquet file output.

Changes Implemented

  1. Compression Codec Configuration

    • Added support for multiple compression algorithms: snappy (default), gzip, zstd, lz4, and uncompressed
    • Configurable via compression field in the writer config
    • Includes validation for supported codecs
  2. Maximum File Size Control

    • Added max_file_size configuration to control Parquet file size
    • Implements automatic file rotation when size limit is reached
    • Default: 128MB, can be disabled by setting to 0
    • Tracks approximate file size and record count per partition
  3. Row Group Size Configuration

    • Added row_group_size option for fine-tuning row group size
    • Allows users to balance compression vs memory usage
    • Uses parquet-go default if not specified
  4. Sort Columns (Prepared for future)

    • Added sort_columns field in config and spec
    • Marked for future implementation with TODO comments
    • Requires buffering and sorting logic to be implemented
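
As a rough illustration of how the options above could be modeled and validated, here is a minimal sketch. It is not the code from this PR: the struct layout, JSON tags, the pointer type used to tell "unset" apart from "0", the reading of 128MB as 128 MiB, and the helper names `maxFileSize` and `supportedCodecs` are all assumptions made for the example.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// Config is a hypothetical writer config. The option names follow the
// description above; the Go types and JSON tags are assumptions.
type Config struct {
	Compression  string   `json:"compression"`
	MaxFileSize  *int64   `json:"max_file_size"`  // bytes; nil = use default, 0 = rotation disabled
	RowGroupSize int64    `json:"row_group_size"` // 0 = leave the library default in place
	SortColumns  []string `json:"sort_columns"`   // accepted but not acted on yet
}

// defaultMaxFileSize assumes "128MB" means 128 MiB.
const defaultMaxFileSize int64 = 128 * 1024 * 1024

// supportedCodecs lists the codec names from the description.
var supportedCodecs = map[string]bool{
	"snappy": true, "gzip": true, "zstd": true, "lz4": true, "uncompressed": true,
}

// Validate applies the snappy default and rejects unknown codecs and
// negative sizes.
func (c *Config) Validate() error {
	if c.Compression == "" {
		c.Compression = "snappy"
	}
	c.Compression = strings.ToLower(c.Compression)
	if !supportedCodecs[c.Compression] {
		return fmt.Errorf("unsupported compression codec %q", c.Compression)
	}
	if c.MaxFileSize != nil && *c.MaxFileSize < 0 {
		return errors.New("max_file_size must not be negative")
	}
	if c.RowGroupSize < 0 {
		return errors.New("row_group_size must not be negative")
	}
	return nil
}

// maxFileSize returns the configured limit, falling back to the default
// when the option was not set at all.
func (c *Config) maxFileSize() int64 {
	if c.MaxFileSize == nil {
		return defaultMaxFileSize
	}
	return *c.MaxFileSize
}

func main() {
	cfg := Config{Compression: "ZSTD"}
	if err := cfg.Validate(); err != nil {
		fmt.Println("invalid config:", err)
		return
	}
	fmt.Println(cfg.Compression, cfg.maxFileSize())
}
```

Using a pointer for `MaxFileSize` is one way to keep "not set" (fall back to the default) distinct from an explicit `0` (rotation disabled); a plain integer cannot express both.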

Technical Details

  • Updated Config struct with new fields and validation
  • Refactored file closing logic into reusable closeAndUploadPartitionFile method
  • Enhanced Write method to check file size and rotate files automatically
  • Updated spec.json with new configuration options including descriptions and defaults
  • Added helper methods: getCompressionCodec(), getMaxFileSize(), getRowGroupSize()
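
The size check and rotation described above can be sketched roughly as follows. This is a simplified model, not the PR's implementation: it only tracks approximate sizes in memory, and `rotatingWriter`, `partitionFile`, `closeFile`, and the `<partition>-<n>.parquet` naming are hypothetical. In the real writer, the close step would correspond to `closeAndUploadPartitionFile`.

```go
package main

import "fmt"

// partitionFile tracks the approximate size and record count of the
// file currently open for one partition.
type partitionFile struct {
	name       string
	approxSize int64
	records    int64
}

// rotatingWriter is a hypothetical stand-in for the Parquet writer.
type rotatingWriter struct {
	maxFileSize int64                     // 0 disables rotation
	open        map[string]*partitionFile // current file per partition
	seq         map[string]int            // next file index per partition
	closed      []string                  // names of files already closed
}

func newRotatingWriter(maxFileSize int64) *rotatingWriter {
	return &rotatingWriter{
		maxFileSize: maxFileSize,
		open:        map[string]*partitionFile{},
		seq:         map[string]int{},
	}
}

// Write appends one record of the given approximate size and closes the
// partition's file once the size limit has been reached.
func (w *rotatingWriter) Write(partition string, recordSize int64) {
	f := w.open[partition]
	if f == nil {
		f = &partitionFile{name: fmt.Sprintf("%s-%d.parquet", partition, w.seq[partition])}
		w.seq[partition]++
		w.open[partition] = f
	}
	f.approxSize += recordSize
	f.records++
	if w.maxFileSize > 0 && f.approxSize >= w.maxFileSize {
		w.closeFile(partition)
	}
}

// closeFile marks the partition's current file as finished; the next
// Write to that partition opens a new file.
func (w *rotatingWriter) closeFile(partition string) {
	if f := w.open[partition]; f != nil {
		w.closed = append(w.closed, f.name)
		delete(w.open, partition)
	}
}

func main() {
	w := newRotatingWriter(100)
	for i := 0; i < 3; i++ {
		w.Write("a", 60)
	}
	fmt.Println("closed files:", w.closed)
}
```

In this sketch a file is closed right after the write that reaches the limit, so a file can exceed `max_file_size` by up to one record; checking before the write instead would keep files under the limit at the cost of needing the record size up front.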

Benefits

  • Smaller file sizes through optimized compression
  • Better query performance
  • Fine-grained control for specific use cases
  • Lower storage and network costs

Closes #572

@CLAassistant

CLAassistant commented Nov 5, 2025

CLA assistant check
All committers have signed the CLA.

@vaibhav-datazip
Collaborator

vaibhav-datazip commented Nov 5, 2025

@vamsi-op, please rebase your branch onto staging; it's currently raised against master.

… Parquet files

- Add compression codec configuration (snappy, gzip, zstd, lz4, none)
- Implement max file size with automatic file rotation
- Add row group size configuration
- Prepare sort_columns field for future implementation
- Update spec.json with new configuration options
- Add validation for new config fields
- Refactor file closing logic for reusability

Closes datazip-inc#572
@vamsi-op vamsi-op force-pushed the feat/parquet-config-options branch from 8aa25fa to ee8a45a Compare November 5, 2025 16:34
@vamsi-op vamsi-op changed the base branch from master to staging November 5, 2025 16:34
@vamsi-op
Author

vamsi-op commented Nov 5, 2025

@vaibhav-datazip Rebased to staging branch

Development

Successfully merging this pull request may close these issues.

Ability to configure Compression, Sorting, FileSize and Row Group size for output Parquet files

3 participants