Set row_group_size parameter automatically in read_parquet  #4

@ehariri

Description

Efficiently reading a parquet file is directly influenced by the number of row groups. Parallel computing engines use a parquet file's row groups to read distinct chunks of data in parallel. In addition, parquet stores metadata statistics for each row group that can be used for predicate pushdown, allowing a reader to skip IO entirely based solely on this metadata, as sketched below.

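As a minimal sketch of these mechanics (assuming PyArrow is installed; the file name `data.parquet` and the column name `value` are hypothetical):

```python
import pyarrow.dataset as ds
import pyarrow.parquet as pq

# Row-group layout and per-column statistics live in the file footer,
# so they can be inspected without reading any data pages.
pf = pq.ParquetFile("data.parquet")
print("row groups:", pf.metadata.num_row_groups)

# Each row group carries min/max statistics per column; a reader can
# skip any row group whose statistics rule out a filter match.
for i in range(pf.metadata.num_row_groups):
    stats = pf.metadata.row_group(i).column(0).statistics
    if stats is not None and stats.has_min_max:
        print(f"row group {i}: min={stats.min} max={stats.max}")

# PyArrow's dataset reader uses these statistics for predicate pushdown,
# reading only the row groups that might satisfy the filter.
table = ds.dataset("data.parquet", format="parquet").to_table(
    filter=ds.field("value") > 100
)
```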
Currently, Pandas writes all the data into a single row group, or picks such a large value that it limits parallelization and reduces the effectiveness of filter pushdown. This blog post compares Pandas to PyArrow directly and shows the impact of Pandas failing to set the row group size. We would like Pandas to adopt a much smaller row group size, for example one targeting roughly 100MB of data per row group (or simply 1 million rows, if sizing by bytes is not feasible). This would make files written by Pandas more efficient for other compute engines to read.

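In the meantime, one workaround is to pass the row group size through explicitly: with `engine="pyarrow"`, extra keyword arguments to `DataFrame.to_parquet` are forwarded to `pyarrow.parquet.write_table`, which accepts `row_group_size`. A short sketch, using the 1-million-row value suggested above (`data.parquet` is again a hypothetical file name):

```python
import numpy as np
import pandas as pd
import pyarrow.parquet as pq

df = pd.DataFrame({"value": np.arange(10_000_000)})

# Extra keyword arguments are forwarded to the parquet engine, so
# row_group_size reaches pyarrow.parquet.write_table.
df.to_parquet("data.parquet", engine="pyarrow", row_group_size=1_000_000)

# The file now has 10 row groups of 1 million rows each, rather than
# one oversized group.
print(pq.ParquetFile("data.parquet").metadata.num_row_groups)  # 10
```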