Description
Previous discussion: apache/datafusion#4707
Though the ORC format is not as widely used as Parquet in arrow-rs and datafusion related projects, there is still some (and, it seems to me, growing) interest in and demand for this format. As @Jefffrey said here, a notable and viable milestone for this project is getting it merged into arrow-rs. This draft roadmap is meant to help us discuss, organize, and direct our efforts toward that milestone.
Although the ORC format is less complex than Parquet, there is still a lot of work to do in various areas. Here is a list of the functionality needed if we treat making ORC files queryable from datafusion as the primary use case at this stage. Please feel free to add, remove, or reprioritize items. We likely can't finish all of them in the short term, so marking which ones are going to be done is also important.
- primitive data types (ORC refs)
  - tiny int: datafusion-orc#22 (feat: support to read tinyint)
  - timestamp with local time zone: datafusion-orc#13 (Timestamp instant support)
  - decimal: datafusion-orc#18 (Decimal support)
- common compression methods: datafusion-orc#10 (Support all compression methods)
- user metadata
- other encodings: datafusion-orc#24 (Support for Int RLE v1 encoding)
- benchmarks: datafusion-orc#8 (Benchmarks)
The items below are also relevant but have lower priority:
- compound data types: datafusion-orc#14 (Compound type support)
  - struct: datafusion-orc#26 (feat: support to struct datatype)
  - list
  - map
  - union
- file metadata and statistics
- pruning: #17 (Support selection pruning)
Long term items:
- encryption
Then there are some things I'm not sure about and am looking for more information on. Also feel free to change the previous two lists.