Description
Background
When evaluating the compression ratio of the Velox parquet writer against the Presto parquet writer, I created a table with a single SMALLINT column and inserted 10K randomly generated records. Both writers enable dictionary encoding by default, but the Presto parquet writer appeared to trigger the fallback mechanism and switch to RLE/BIT_PACKED
encoding, while the Velox parquet writer did not trigger the fallback mechanism implemented in the arrow
library here.
Issues
I found three configurations hardcoded in the WriterOptions
here.
The first one, enableDictionary,
is hardcoded to true.
Although there is code for handling the non-dictionary case, it is never reached because the value is hardcoded here. Or is it determined dynamically at runtime? (I may have missed something.)
The second one, dataPageSize,
is hardcoded to 1MB, whereas in Presto it is configurable here.
The third one, dictionaryPageSizeLimit,
is also hardcoded to 1MB (the default value used by arrow
is also 1MB, but it is odd to hardcode a magic number when no code actually reads this variable). It is supposed to trigger arrow's
fallback mechanism, which ensures the dictionary does not grow too large. However, nothing uses this variable outside of unit tests. This may be the root cause of the behavioural difference I observed when comparing compression ratios between Presto and Velox, because the two writers may effectively use different thresholds.
Potential Enhancement
I think the enhancement would be simple, in two steps:
- Step 1: Add three options in hive.properties to make enableDictionary, dataPageSize and dictionaryPageSizeLimit configurable in Velox.
- Step 2: Correctly pass dictionaryPageSizeLimit to the arrow API, so that the fallback mechanism can work correctly.
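For Step 1, the configuration could look something like the fragment below. The property keys are purely hypothetical, chosen to mirror Presto's existing hive.parquet.writer.* naming; the actual names would be decided in the PR.

```properties
# Hypothetical keys -- actual names to be decided in the PR.
hive.parquet.writer.dictionary-enabled=true
hive.parquet.writer.page-size=1MB
hive.parquet.writer.dictionary-page-size-limit=1MB
```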
If it looks good I can submit a PR.