Description
Background
When evaluating the compression ratio of the Velox parquet writer against the Presto parquet writer, I created a table with a single SMALLINT column and inserted 10K randomly generated records. Both writers enable dictionary encoding by default, but the Presto parquet writer appeared to trigger the fallback mechanism and switch to RLE/BIT_PACKED
encoding, while the Velox parquet writer did not trigger the fallback mechanism implemented in the arrow
library here.
Issues
I found three configurations hardcoded in the WriterOptions
here.
The first one, enableDictionary,
is hardcoded to true.
Although there is code for handling the non-dictionary case, it is never reached because the value is hardcoded here. Or is it determined dynamically at runtime? (I may have missed something.)
The second one, dataPageSize,
is hardcoded to 1MB, whereas in Presto it is configurable here.
The third one, dictionaryPageSizeLimit,
is also hardcoded to 1MB (the default value used by arrow
is also 1MB, but it is odd to hardcode a magic number when no code actually reads this variable). It is supposed to trigger arrow's
fallback mechanism, which ensures the dictionary does not grow too large. However, nothing uses this variable outside of unit tests. This may be the root cause of the behavioural difference I observed when comparing compression ratios between Presto and Velox, because the two writers may effectively use different thresholds.
Potential Enhancement
I think the enhancement would be simple, in two steps:
- Step 1: Add three options in hive.properties to make enableDictionary, dataPageSize and dictionaryPageSizeLimit configurable in Velox.
- Step 2: Correctly pass dictionaryPageSizeLimit to the arrow API, so that the fallback mechanism can work correctly.
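For Step 1, the configuration could look something like the fragment below. The property keys are purely hypothetical, chosen to mirror Presto's existing hive.parquet.writer.* naming; the actual names would be decided in the PR.

```properties
# Hypothetical keys -- actual names to be decided in the PR.
hive.parquet.writer.dictionary-enabled=true
hive.parquet.writer.page-size=1MB
hive.parquet.writer.dictionary-page-size-limit=1MB
```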
If it looks good I can submit a PR.