Efficient reads from a Parquet file are directly influenced by the number of row groups. Parallel compute engines use row groups to read distinct chunks of the file in parallel. In addition, Parquet stores metadata statistics for each row group that can be used for predicate pushdown, allowing a reader to skip IO entirely based on that metadata alone.
Currently, Pandas writes all the data into a single row group, or picks a value so large that it limits parallelization and blunts the effectiveness of filter pushdown. This blog post compares Pandas to PyArrow directly and shows the impact of Pandas failing to set the row group size. We would like to see Pandas adopt a much smaller default row group size, for example one equivalent to 100MB per row group (or simply 1 million rows if that is not feasible). This would make files written by Pandas more efficient for other compute engines to read.