Problem
The size of a Parquet file depends heavily on the data and on how that data is laid out inside the file. Once a file lands in the data lake it is hard to change, and working with large, suboptimally generated Parquet files leads to higher costs (storage, network, CPU) and poor read performance.
Solution
The Parquet spec provides a number of settings that can be used to make files more efficient, for example (a short sketch follows the list):
- different compression algorithms (zstd, lz4, etc.)
- different encoding types (plain/RLE dictionary, delta, etc.)
- row group size
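As an illustration of how these knobs look in practice, here is a minimal sketch using pyarrow (not necessarily the library this project would use); the table contents and column names are hypothetical.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical sample data; in practice this would be the exported table.
table = pa.table({
    "event_time": pa.array([1_700_000_000 + i for i in range(1_000)], type=pa.int64()),
    "user_id": pa.array([i % 50 for i in range(1_000)], type=pa.int32()),
})

pq.write_table(
    table,
    "events.parquet",
    compression="zstd",           # compression codec (zstd, lz4, snappy, ...)
    compression_level=3,          # codec-specific compression level
    use_dictionary=["user_id"],   # dictionary (RLE_DICTIONARY) encoding for low-cardinality columns
    row_group_size=100_000,       # max rows per row group
)
```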
Implementing every Parquet feature would be impractical, but a few high-value settings are important to support. The defaults should be configurable, with the ability to override them per table:
- Specify compression codec
- Specify max Parquet file size
- Specify row group size
- Specify a sort column and sort the output data by that column
These settings give users more control and help produce Parquet files that are smaller and more efficient to read, which means better performance and lower infrastructure costs. A sketch of what the per-table configuration could look like is shown below.
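The following is a minimal, hypothetical sketch of the defaults-plus-override idea in Python; the setting names, table names, and values are illustrative assumptions, not a proposed API.

```python
from dataclasses import dataclass, replace
from typing import Optional

@dataclass(frozen=True)
class ParquetSettings:
    """Hypothetical per-table Parquet writer settings."""
    compression: str = "zstd"          # compression codec
    max_file_size_mb: int = 512        # roll over to a new file past this size
    row_group_size: int = 100_000      # rows per row group
    sort_column: Optional[str] = None  # sort output data by this column, if set

# Global defaults, overridable per table.
DEFAULTS = ParquetSettings()
PER_TABLE = {
    "events": replace(DEFAULTS, sort_column="event_time", row_group_size=500_000),
    "users": replace(DEFAULTS, compression="lz4"),
}

def settings_for(table: str) -> ParquetSettings:
    """Return the effective settings: per-table override or global defaults."""
    return PER_TABLE.get(table, DEFAULTS)

print(settings_for("events"))  # sorted by event_time, larger row groups
print(settings_for("orders"))  # no override, falls back to the defaults
```

Keeping the overrides keyed by table name means a single set of defaults covers most tables, while hot or very large tables can get their own codec, row group size, or sort column.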