Write ORC from arrow recordbatches #15

Open
@Jefffrey

Not a focus now, just raising issue here for tracking

Currently in progress.

Initial support

Tracked by initial-write-support branch

  • Merged to main; further development is done directly on main

Checklist:

  • High level ArrowWriter synchronous interface (accepts RecordBatches to write)
  • Basic configuration via builder
  • Stripe writer
  • Metadata writer
  • Value encoding
    • Integer RLEv2
      • Short repeat
      • Direct
      • Delta
      • Patched base
    • Base 128 varint
    • Byte RLE
  • Encode nullability
  • Float/Double array
  • Short/Int/Long array
  • String/Binary array
  • Boolean array
  • Byte array
  • Basic struct array support (for root)
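As an illustration of the "Base 128 varint" item above, here is a minimal sketch using only the standard library. The function names (`write_varint_u64`, `zigzag_encode`, `write_varint_i64`, `read_varint_u64`) are hypothetical and are not taken from this crate; the sketch assumes the usual scheme of 7-bit groups emitted least significant first, with the high bit marking continuation and zigzag mapping for signed values.

```rust
/// Append `value` to `out` as a base 128 varint: 7 bits per byte,
/// least significant group first, high bit set on every byte but the last.
/// (Hypothetical helper, not this crate's API.)
fn write_varint_u64(mut value: u64, out: &mut Vec<u8>) {
    while value >= 0x80 {
        out.push((value as u8 & 0x7f) | 0x80);
        value >>= 7;
    }
    out.push(value as u8);
}

/// Zigzag-map a signed value so that small magnitudes (positive or
/// negative) become small unsigned values: 0, -1, 1, -2 -> 0, 1, 2, 3.
fn zigzag_encode(value: i64) -> u64 {
    ((value << 1) ^ (value >> 63)) as u64
}

/// Signed variant: zigzag first, then varint.
fn write_varint_i64(value: i64, out: &mut Vec<u8>) {
    write_varint_u64(zigzag_encode(value), out);
}

/// Decode one varint from the front of `bytes`, returning the value and
/// the number of bytes consumed, or `None` if the input is truncated or
/// longer than the 10 bytes a u64 can need.
fn read_varint_u64(bytes: &[u8]) -> Option<(u64, usize)> {
    let mut result = 0u64;
    for (i, &b) in bytes.iter().enumerate() {
        if i >= 10 {
            return None;
        }
        result |= u64::from(b & 0x7f) << (7 * i);
        if b & 0x80 == 0 {
            return Some((result, i + 1));
        }
    }
    None
}
```

With these helpers, 128 encodes to the two bytes `0x80, 0x01`, and decoding those bytes gives back 128 with two bytes consumed.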

Once complete, I will raise a PR for all of the above to provide a complete and usable writer (though lacking in features; see below).

Subsequent features

The following items will be added in smaller PRs once the base code of the writer is merged to main.

  • Asynchronous interface
  • Compression
    • Zlib
    • Snappy
    • Lzo
    • Lz4
    • Zstd
  • Statistics
    • Int
    • Double
    • String
    • Bucket
    • Decimal
    • Date
    • Binary
    • Timestamp
  • Dictionary array
  • Run length array
  • Decimal array
  • Date array
  • Timestamp array
  • Compound array
    • Union array
    • Map array
    • List array
    • Struct array
  • Index streams
    • Row group index
    • Bloom filters
  • Extension configuration (see Java config for examples)
  • User metadata
  • Arrow type hint (when writing with this Arrow -> ORC writer, encode the original Arrow type in metadata so that, when reading, we can recreate the original Arrow array)
  • TODO: other Arrow types
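For the "Statistics" item, a rough sketch of what an integer column accumulator might look like, using only the standard library. The `IntStatistics` type and its fields are hypothetical and not part of this crate. It assumes integer column statistics carry a value count, a has-null flag, minimum, maximum and sum, and that the sum is dropped once it overflows a 64-bit signed integer. Starting the sum at zero for an empty column is a choice made for this sketch.

```rust
/// Running statistics for one integer column.
/// (Hypothetical type, for illustration only.)
#[derive(Debug, Clone, PartialEq)]
struct IntStatistics {
    /// Number of non-null values seen so far.
    count: u64,
    /// Whether at least one null was seen.
    has_null: bool,
    min: Option<i64>,
    max: Option<i64>,
    /// `None` once the running sum has overflowed i64.
    sum: Option<i64>,
}

impl IntStatistics {
    fn new() -> Self {
        Self {
            count: 0,
            has_null: false,
            min: None,
            max: None,
            sum: Some(0),
        }
    }

    /// Fold one (possibly null) value into the statistics.
    fn update(&mut self, value: Option<i64>) {
        match value {
            None => self.has_null = true,
            Some(v) => {
                self.count += 1;
                self.min = Some(self.min.map_or(v, |m| m.min(v)));
                self.max = Some(self.max.map_or(v, |m| m.max(v)));
                // checked_add yields None on overflow, and and_then
                // keeps the sum at None from then on.
                self.sum = self.sum.and_then(|s| s.checked_add(v));
            }
        }
    }
}
```

A writer could keep one such accumulator per column per stripe and call `update` for each slot of the incoming array, though how statistics are merged at file level is left out here.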

Labels: enhancement
