Efficient reads from a Parquet file are directly influenced by the number of row groups. Parallel compute engines use row groups to read distinct chunks of the file in parallel. In addition, Parquet stores metadata statistics for each row group that can be used for predicate pushdown, allowing a reader to skip IO entirely based on that metadata alone.
Currently, Pandas writes all the data into a single row group, or picks a value so large that it limits parallelization and blunts the effectiveness of filter pushdown. This blog post compares Pandas to PyArrow directly and shows the impact of Pandas failing to set the row group size. We would like to see Pandas adopt a much smaller default row group size, for example one equivalent to 100MB per row group (or simply 1 million rows if that is not feasible). This would make files written by Pandas more efficient for other compute engines to read.