Make the parquet writer option configurable #12734

Open
@anlowee

Description

Background

While comparing the compression ratio of the Velox parquet writer against the Presto parquet writer, I created a table with a single SMALLINT column and inserted 10K randomly generated records. Both writers enable dictionary encoding by default. The Presto parquet writer appeared to trigger its fallback mechanism and switch to RLE/BIT_PACKED encoding, while the Velox parquet writer did not trigger the fallback mechanism implemented in the arrow library.

Issues

I found three settings that are hardcoded in WriterOptions.

The first, enableDictionary, is hardcoded to true. There is code that handles the non-dictionary case, but because the value is hardcoded, that code is never exercised. Or is it determined dynamically at runtime? (I may have missed something.)

The second, dataPageSize, is hardcoded to 1MB. In Presto, however, it is configurable.

The third, dictionaryPageSizeLimit, is also hardcoded to 1MB. Arrow's default is 1MB as well, but it is odd to set a magic number here when no code actually reads the variable. It is supposed to trigger arrow's fallback mechanism so that the dictionary does not grow too large, yet nothing uses it outside of a unit test. This might be the root cause of the behavioral difference I saw when evaluating the compression ratio between Presto and Velox, because the two might use different thresholds.
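
To make the role of the limit concrete, here is a minimal sketch of a dictionary encoder that falls back to plain encoding once the dictionary reaches a size limit. It is a simplified model for illustration only, not the arrow or Velox code: the class `DictFallbackWriter`, its members, and the `>=` comparison are assumptions. It also assumes 4-byte dictionary entries, since Parquet stores SMALLINT values in the INT32 physical type.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>

// Encoding currently used by the simplified column writer.
enum class Encoding { kDictionary, kPlain };

// Hypothetical model of a dictionary encoder with a size-based fallback.
// It only tracks sizes and counts; it does not produce Parquet pages.
class DictFallbackWriter {
 public:
  explicit DictFallbackWriter(int64_t dictionaryPageSizeLimit)
      : limit_(dictionaryPageSizeLimit) {
    assert(dictionaryPageSizeLimit > 0);
  }

  void write(int32_t value) {
    if (encoding_ == Encoding::kPlain) {
      // After the fallback, values bypass the dictionary entirely.
      ++plainValues_;
      return;
    }
    // Insert the value if unseen; its index is the current dictionary size.
    dict_.emplace(value, static_cast<int32_t>(dict_.size()));
    ++dictValues_;
    // Assumed rule: fall back once the dictionary reaches the limit.
    if (dictionaryBytes() >= limit_) {
      encoding_ = Encoding::kPlain;
    }
  }

  // Dictionary size in bytes, assuming 4-byte (INT32) entries.
  int64_t dictionaryBytes() const {
    return static_cast<int64_t>(dict_.size() * sizeof(int32_t));
  }

  Encoding encoding() const { return encoding_; }
  int64_t dictValues() const { return dictValues_; }
  int64_t plainValues() const { return plainValues_; }

 private:
  int64_t limit_;
  Encoding encoding_ = Encoding::kDictionary;
  std::unordered_map<int32_t, int32_t> dict_;
  int64_t dictValues_ = 0;
  int64_t plainValues_ = 0;
};
```

Under these assumptions, 10K distinct values occupy at most 10,000 × 4 = 40,000 bytes of dictionary, far below a 1MB limit, so this model would never fall back for the test table. A writer that did fall back on the same data would have to be using a much smaller threshold or a different criterion.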

Potential Enhancement

I think the enhancement can be done in two simple steps:

  • Step 1: Add three options in hive.properties to make enableDictionary, dataPageSize and dictionaryPageSizeLimit configurable in Velox.
  • Step 2: Correctly pass dictionaryPageSizeLimit to the arrow API so that the fallback mechanism works as intended.

If this looks good, I can submit a PR.
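
As a rough sketch of Step 1, the snippet below reads the three settings from a key/value map and falls back to the current hardcoded values when a key is absent. Everything here is hypothetical: the property names, the `ParquetWriterOptions` struct, and `parseWriterOptions` are placeholders for illustration and are not existing Velox or Presto identifiers. Sizes are assumed to be given in plain bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <unordered_map>

// Hypothetical property names; the real keys would be decided in the PR.
const char* const kEnableDictionary = "hive.parquet.writer.enable-dictionary";
const char* const kDataPageSize = "hive.parquet.writer.data-page-size";
const char* const kDictPageSizeLimit =
    "hive.parquet.writer.dictionary-page-size-limit";

// Stand-in for the three settings; defaults mirror the hardcoded values.
struct ParquetWriterOptions {
  bool enableDictionary = true;
  int64_t dataPageSize = 1024 * 1024;
  int64_t dictionaryPageSizeLimit = 1024 * 1024;
};

using Properties = std::unordered_map<std::string, std::string>;

// Accepts only the literal strings "true" and "false".
inline bool parseBool(const std::string& s) {
  if (s == "true") return true;
  if (s == "false") return false;
  throw std::invalid_argument("expected true or false, got: " + s);
}

// Parses a strictly positive byte count with no unit suffix.
inline int64_t parsePositiveBytes(const std::string& s) {
  std::size_t pos = 0;
  const long long v = std::stoll(s, &pos);
  if (pos != s.size() || v <= 0) {
    throw std::invalid_argument("expected a positive byte count, got: " + s);
  }
  return static_cast<int64_t>(v);
}

// Overrides each default only when the corresponding key is present.
inline ParquetWriterOptions parseWriterOptions(const Properties& props) {
  ParquetWriterOptions o;
  Properties::const_iterator it = props.find(kEnableDictionary);
  if (it != props.end()) o.enableDictionary = parseBool(it->second);
  it = props.find(kDataPageSize);
  if (it != props.end()) o.dataPageSize = parsePositiveBytes(it->second);
  it = props.find(kDictPageSizeLimit);
  if (it != props.end()) {
    o.dictionaryPageSizeLimit = parsePositiveBytes(it->second);
  }
  // Invariant: both sizes are positive, from the defaults or the parser.
  assert(o.dataPageSize > 0 && o.dictionaryPageSizeLimit > 0);
  return o;
}
```

Keeping the defaults equal to today's hardcoded values means existing deployments behave the same unless a property is set explicitly. Step 2 would then forward the parsed values to the arrow writer properties, which this sketch does not cover.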
