Enable access to column dictionaries in async reader #9010

@DarkWanderer

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

Some databases, Grafana Tempo being one example, use column dictionaries as makeshift column indexes to speed up ad-hoc filtering. Checking whether a low-cardinality value is present in a column's dictionary makes it possible to pre-filter data by skipping whole row groups.
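To illustrate the idea, here is a minimal std-only sketch of the pruning decision. It does not use the parquet crate: `ColumnChunkDictionary`, `chunk`, `can_skip_row_group` and `row_groups_to_read` are hypothetical names invented for this example. It assumes the dictionary values for each row group have already been extracted, and that the caller knows whether the column chunk is entirely dictionary-encoded. Writers may fall back to plain encoding when a dictionary grows too large, in which case the dictionary no longer lists every value and the row group cannot be skipped safely.

```rust
use std::collections::HashSet;

/// Hypothetical summary of one column chunk (one column in one row group).
struct ColumnChunkDictionary {
    /// Distinct values found in the chunk's dictionary page.
    values: HashSet<String>,
    /// True only if every data page in the chunk is dictionary-encoded.
    /// If the writer fell back to plain encoding, the dictionary is incomplete.
    fully_dictionary_encoded: bool,
}

/// Hypothetical helper to build a summary from string literals.
fn chunk(values: &[&str], fully_dictionary_encoded: bool) -> ColumnChunkDictionary {
    ColumnChunkDictionary {
        values: values.iter().map(|v| v.to_string()).collect(),
        fully_dictionary_encoded,
    }
}

/// Returns true when the row group provably contains no row equal to `needle`.
fn can_skip_row_group(chunk: &ColumnChunkDictionary, needle: &str) -> bool {
    chunk.fully_dictionary_encoded && !chunk.values.contains(needle)
}

/// Returns the indices of row groups that still have to be read
/// for an equality predicate `column == needle`.
fn row_groups_to_read(chunks: &[ColumnChunkDictionary], needle: &str) -> Vec<usize> {
    chunks
        .iter()
        .enumerate()
        .filter(|(_, c)| !can_skip_row_group(c, needle))
        .map(|(i, _)| i)
        .collect()
}
```

With an API like the one requested here, the resulting indices could be passed to the reader's row group selection so that only the matching row groups are fetched and decoded.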

Describe the solution you'd like

Add an API to ParquetRecordBatchStreamBuilder that allows inspecting the contents of a column's dictionary.

Describe alternatives you've considered

Column indexes have a high expected size cost and are not always available (e.g. for legacy data).

Additional context

It is already possible to access this information in SerializedFileReader by using a "peekable" page iterator.
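Once the raw dictionary page has been obtained, its contents still need to be decoded. The following is a rough std-only sketch, not the parquet crate's own decoder, and `decode_plain_byte_array_dictionary` is a hypothetical name. It assumes the page bytes are already decompressed, that the column's physical type is BYTE_ARRAY, and that the dictionary page is PLAIN-encoded, where each value is stored as a 4-byte little-endian length followed by that many bytes.

```rust
/// Decodes `num_values` PLAIN-encoded BYTE_ARRAY values from `buf`.
/// Each value is a 4-byte little-endian length prefix followed by the bytes.
/// Returns an error instead of panicking if the buffer is truncated.
fn decode_plain_byte_array_dictionary(
    buf: &[u8],
    num_values: usize,
) -> Result<Vec<Vec<u8>>, String> {
    // Cap the preallocation so a bogus value count cannot force a huge allocation.
    let mut values = Vec::with_capacity(num_values.min(buf.len() / 4));
    let mut pos = 0usize;
    for i in 0..num_values {
        // Read the length prefix, checking that 4 bytes are available.
        let len_end = match pos.checked_add(4) {
            Some(end) if end <= buf.len() => end,
            _ => return Err(format!("truncated length prefix for value {}", i)),
        };
        let len = u32::from_le_bytes([buf[pos], buf[pos + 1], buf[pos + 2], buf[pos + 3]]) as usize;
        // Read the value bytes, checking that `len` bytes are available.
        let val_end = match len_end.checked_add(len) {
            Some(end) if end <= buf.len() => end,
            _ => return Err(format!("truncated data for value {}", i)),
        };
        values.push(buf[len_end..val_end].to_vec());
        pos = val_end;
    }
    Ok(values)
}
```

The decoded values can then be compared against the filter value to decide whether a row group is worth reading.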

Labels: enhancement (any new improvement worthy of an entry in the changelog)
