Problem
The size of a Parquet file depends heavily on the data and on how that data is laid out inside the file. Once a file lands in the data lake it is hard to change, and working with large, suboptimally generated Parquet files leads to higher costs (storage, network, CPU) and poor read performance.
Solution
The Parquet spec provides a number of settings that can be used to make files more efficient, for example (a short sketch follows the list):
- different compression algorithms (zstd, lz4, etc.)
- different encoding types (plain/RLE dictionary, delta, etc.)
- row group size
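As an illustration of how these knobs look in practice, here is a minimal sketch using pyarrow (not necessarily the library this project would use); the table contents and column names are hypothetical.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical sample data; in practice this would be the exported table.
table = pa.table({
    "event_time": pa.array([1_700_000_000 + i for i in range(1_000)], type=pa.int64()),
    "user_id": pa.array([i % 50 for i in range(1_000)], type=pa.int32()),
})

pq.write_table(
    table,
    "events.parquet",
    compression="zstd",           # compression codec (zstd, lz4, snappy, ...)
    compression_level=3,          # codec-specific compression level
    use_dictionary=["user_id"],   # dictionary (RLE_DICTIONARY) encoding for low-cardinality columns
    row_group_size=100_000,       # max rows per row group
)
```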
Implementing every Parquet feature would be impractical, but a few high-value settings are important to support. The defaults should be configurable, with the ability to override them per table:
- Specify compression codec
- Specify max Parquet file size
- Specify row group size
- Specify a sort column and sort the output data by that column
These settings give users more control and help produce Parquet files that are smaller and more efficient to read, which means better performance and lower infrastructure costs. A sketch of what the per-table configuration could look like is shown below.
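The following is a minimal, hypothetical sketch of the defaults-plus-override idea in Python; the setting names, table names, and values are illustrative assumptions, not a proposed API.

```python
from dataclasses import dataclass, replace
from typing import Optional

@dataclass(frozen=True)
class ParquetSettings:
    """Hypothetical per-table Parquet writer settings."""
    compression: str = "zstd"          # compression codec
    max_file_size_mb: int = 512        # roll over to a new file past this size
    row_group_size: int = 100_000      # rows per row group
    sort_column: Optional[str] = None  # sort output data by this column, if set

# Global defaults, overridable per table.
DEFAULTS = ParquetSettings()
PER_TABLE = {
    "events": replace(DEFAULTS, sort_column="event_time", row_group_size=500_000),
    "users": replace(DEFAULTS, compression="lz4"),
}

def settings_for(table: str) -> ParquetSettings:
    """Return the effective settings: per-table override or global defaults."""
    return PER_TABLE.get(table, DEFAULTS)

print(settings_for("events"))  # sorted by event_time, larger row groups
print(settings_for("orders"))  # no override, falls back to the defaults
```

Keeping the overrides keyed by table name means a single set of defaults covers most tables, while hot or very large tables can get their own codec, row group size, or sort column.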