Ability to configure Compression, Sorting, File Size and Row Group size for output Parquet files #572

@lkzcgfvf

Description

Problem

The size of a Parquet file depends heavily on the data and on how that data is laid out inside the file. Once a file lands in the data lake it is hard to change, and large, suboptimally generated Parquet files lead to higher costs (storage, network, CPU) and lower read performance.

Solution

The Parquet spec provides several settings that can be used to make files more efficient, such as:

  • different compression algorithms (zstd, lz4, etc.)
  • different encoding types (plain/RLE dictionary, delta, etc.)
  • row group size

It would be hard to implement every Parquet feature, but supporting the few that bring the most value would be very worthwhile.

Configure default values, with the ability to override these settings per table:

  1. Specify compression codec
  2. Specify max Parquet file size
  3. Specify row group size
  4. Specify a sort column and order the data by it

These settings will give users more control and help produce Parquet files that are smaller and more efficient to read, bringing better performance and lower infrastructure costs.
