Open
Description
To make performance improvements measurable, add benchmarks that can be run against the project.
This requires benchmark programs (see https://github.com/apache/arrow-rs/tree/master/parquet/benches) and large data files, ideally covering all supported data types.
Note for the data files: completely random data may not be sufficient, because some encodings take advantage of patterns in the data (e.g. integer RLE v2). Keep this in mind if generating data for the benchmarks.
Could also use existing datasets such as TPC-H, TPC-DS, or NYC taxi data for more variety in the data.
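To illustrate the point about random versus patterned data, here is a minimal sketch of generating integer columns with different shapes. Everything in it is an assumption for illustration: the function names (`repeated_runs`, `fixed_delta`, `random_values`, `count_runs`) and the small xorshift generator are hypothetical and not part of any existing benchmark code. It uses only the standard library.

```rust
/// Tiny xorshift64 generator so the sketch needs no external crates.
/// (Hypothetical helper; a real benchmark would likely use the `rand` crate.)
struct XorShift64(u64);

impl XorShift64 {
    fn next(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }
}

/// Long runs of the same value, which suit run-length style encodings.
/// `run_len` must be non-zero.
fn repeated_runs(len: usize, run_len: usize) -> Vec<i64> {
    (0..len).map(|i| (i / run_len) as i64).collect()
}

/// Constant step between neighbours, which suits delta style encodings.
fn fixed_delta(len: usize, start: i64, step: i64) -> Vec<i64> {
    (0..len).map(|i| start + step * i as i64).collect()
}

/// Pseudo-random values with no exploitable pattern.
/// The same seed always yields the same column, so runs are reproducible.
fn random_values(len: usize, seed: u64) -> Vec<i64> {
    // xorshift must not start from a zero state.
    let mut rng = XorShift64(seed.max(1));
    (0..len).map(|_| rng.next() as i64).collect()
}

/// Number of maximal runs of equal adjacent values.
fn count_runs(values: &[i64]) -> usize {
    if values.is_empty() {
        return 0;
    }
    1 + values.windows(2).filter(|w| w[0] != w[1]).count()
}

fn main() {
    let n = 1_000;
    let columns = [
        ("repeated_runs", repeated_runs(n, 100)),
        ("fixed_delta", fixed_delta(n, 0, 3)),
        ("random_values", random_values(n, 42)),
    ];
    for (name, column) in &columns {
        println!("{name}: {} values, {} runs", column.len(), count_runs(column));
    }
}
```

Benchmarking each shape separately would show whether a change helps the pattern-aware code paths or only the fallback path. Fixed seeds keep results comparable between runs.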