JuliaCN/arrow-julia

Arrow

This is a pure Julia implementation of the Apache Arrow data standard. This package provides Julia AbstractVector objects for referencing data that conforms to the Arrow standard. This allows users to seamlessly interface Arrow formatted data with a great deal of existing Julia code.
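As a minimal sketch of what "AbstractVector objects" means in practice, the snippet below round-trips a small column table through an in-memory buffer and then treats the resulting columns like any other Julia vector. The column names `id` and `name` are made up for illustration.

```julia
using Arrow

# Serialize a small column table (a NamedTuple of vectors) to an in-memory buffer.
buf = Arrow.tobuffer((id = [1, 2, 3], name = ["a", "b", "c"]))

# Arrow.Table references the Arrow-formatted bytes held in the buffer.
tbl = Arrow.Table(buf)

# Each column is an AbstractVector, so generic Julia code works on it directly.
is_vector = tbl.id isa AbstractVector
total = sum(tbl.id)
names = collect(tbl.name)
```

Because the columns are ordinary `AbstractVector`s, they can be passed to existing functions that know nothing about Arrow.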

Please see the Apache Arrow columnar format specification (https://arrow.apache.org/docs/format/Columnar.html) for a description of the Arrow memory layout.

Installation

The package can be installed by typing the following in a Julia REPL:

julia> using Pkg; Pkg.add("Arrow")

Arrow.jl currently requires Julia 1.12+.

Local Development

When developing on Arrow.jl it is recommended that you run the following to ensure that any changes to ArrowTypes.jl are immediately available to Arrow.jl without requiring a release:

julia --project -e 'using Pkg; Pkg.develop(path="src/ArrowTypes")'

Current write-path notes:

  • Arrow.tobuffer includes a direct single-partition fast path for eligible inputs
  • Arrow.tobuffer(Tables.partitioner(...)) also includes a targeted direct multi-record-batch path for single-column top-level strings and single-column non-missing binary/code-units columns
  • Arrow.write(io, Tables.partitioner(...)) now reuses that same targeted direct multi-record-batch path instead of always going through the legacy Writer orchestration
  • multi-column partitions, dictionary-encoded top-level columns, map-heavy inputs, and missing-binary partitions retain the existing writer path
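The partitioned write described above can be sketched as follows. This is only an illustration of the public calls involved: it assumes Tables.jl is available in the active environment, the column name `s` is made up, and whether a given input actually takes the direct multi-record-batch path is decided internally by the package.

```julia
using Arrow, Tables

# Three single-column partitions of non-missing strings.
parts = Tables.partitioner([
    (s = ["a", "b"],),
    (s = ["c"],),
    (s = ["d", "e", "f"],),
])

# Each partition is written as its own record batch.
buf = Arrow.tobuffer(parts)

# Arrow.Stream yields one Arrow.Table per record batch.
batch_lengths = [length(batch.s) for batch in Arrow.Stream(buf)]
```

Reading the buffer back with `Arrow.Stream` is a simple way to confirm that partition boundaries survive as record batches.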

Format Support

This implementation supports the 1.0 version of the specification, including support for:

  • All primitive data types
  • All nested data types
  • Dictionary encodings and messages
  • Dictionary-encoded CategoricalArray interop, including missing-value roundtrips through Arrow.Table, copy, and DataFrame(...; copycols=true)
  • Extension types
  • Lightweight schema/field metadata overlays via Arrow.withmetadata(...) for Tables.jl-compatible sources before serialization
  • Base Julia Enum logical types via the JuliaLang.Enum extension label, with native Julia roundtrips back to the original enum type; readers using convert=false and non-Julia consumers still see the primitive storage type
  • View-backed Utf8/Binary columns, including recovery from under-reported variadic buffer counts by inferring the required external buffers from valid view elements
  • Streaming, file, record batch, and replacement and isdelta dictionary messages
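To make a couple of the items above concrete, here is a small sketch that writes one dictionary-encoded column together with schema-level metadata and reads both back. It uses the `Arrow.DictEncode` wrapper and the `metadata` keyword of `Arrow.write` rather than the `Arrow.withmetadata(...)` overlay mentioned above, whose exact signature is not shown here. The column names and the metadata key are made up for illustration.

```julia
using Arrow

cities = ["Oslo", "Lima", "Oslo", "Oslo"]
tbl = (
    city = Arrow.DictEncode(cities),  # request dictionary encoding for this column
    temp = [1.5, 22.0, 2.0, 0.5],
)

io = IOBuffer()
# Write in the streaming format with schema-level custom metadata.
Arrow.write(io, tbl; metadata = ["source" => "example"])
seekstart(io)

rt = Arrow.Table(io)
meta = Arrow.getmetadata(rt)      # schema-level key/value metadata
roundtripped = collect(rt.city)   # decoded values of the dictionary-encoded column
```

Passing `dictencode = true` to `Arrow.write` instead would request dictionary encoding for every column rather than a single one.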

It currently doesn't include support for:

  • Tensor or sparse tensor IPC payload semantics; Arrow.jl now recognizes those message headers explicitly and rejects them with precise errors instead of falling through to a generic unsupported-message path
  • C data interface
  • Writing Run-End Encoded arrays; Arrow.jl now reads REE arrays and exposes them as read-only vectors, but still rejects REE on write paths

Flight RPC status:

  • Experimental Arrow.Flight support is available in-tree
  • Requires Julia 1.12+
  • Includes generated protocol bindings and complete client constructors for the FlightService RPC surface
  • Keeps the top-level Flight module shell thin, with exports and generated-protocol setup split out of src/flight/Flight.jl
  • Includes high-level helpers around FlightData and Arrow IPC:
      • FlightData <-> Arrow IPC helpers for Arrow.Table, Arrow.Stream, and DoPut/DoExchange payload generation
      • Arrow.Flight.pathdescriptor(...) for PATH descriptors without manual proto assembly
      • opt-in app_metadata surfacing through include_app_metadata=true on Arrow.Flight.stream(...) / Arrow.Flight.table(...)
      • explicit batch-wise app_metadata=... emission on Arrow.Flight.flightdata(...), Arrow.Flight.putflightdata!(...), and source-based Arrow.Flight.doexchange(...)
      • a reusable Arrow.Flight.withappmetadata(...) wrapper so source-level batch metadata can stay attached without manual keyword threading
  • Keeps the Flight IPC conversion layer modular under src/flight/convert/, with src/flight/convert.jl retained as a thin entrypoint
  • Includes client helpers for request headers, binary metadata, handshake token reuse, and TLS configuration via withheaders, withtoken, and authenticate
  • Keeps the Flight client implementation modular under src/flight/client/, with thin entrypoints at src/flight/client.jl and src/flight/client/rpc_methods.jl
  • Includes a transport-agnostic server core (Service, ServerCallContext, ServiceDescriptor, MethodDescriptor) for:
      • local Flight method dispatch, path lookup, and handler testing
      • high-level DoExchange assembly through Arrow.Flight.exchangeservice(...), Arrow.Flight.tableservice(...), and Arrow.Flight.streamservice(...)
      • source-based local invocation through Arrow.Flight.doexchange(service, context, source; ...), Arrow.Flight.table(service, context, source; ...), and Arrow.Flight.stream(service, context, source; ...)
  • Keeps the transport-agnostic server core modular under src/flight/server/, with src/flight/server.jl retained as a thin entrypoint
  • Includes an optional gRPCServer.jl package extension that maps Arrow.Flight.Service into gRPCServer.ServiceDescriptor and registers Flight proto types with the external server package when it is present
  • Keeps the optional gRPCServer.jl bridge modular under ext/arrowgrpcserverext/, with ext/ArrowgRPCServerExt.jl retained as a thin entrypoint
  • Includes optional live interoperability coverage for Handshake, authenticated token propagation, PollFlightInfo, and TLS via dedicated Python reference servers
  • Includes optional live pyarrow.flight interoperability coverage for ListFlights, GetFlightInfo, GetSchema, DoGet, DoPut, DoExchange, ListActions, and DoAction
  • Keeps targeted Flight verification modular under test/flight/, with test/flight.jl retained as a thin entrypoint for local and CI invocation stability:
      • client-constructor/protocol-wrapper checks under test/flight/client_surface/
      • optional gRPCServer extension scenarios under test/flight/grpcserver_extension/
      • pyarrow.flight interop scenarios under test/flight/pyarrow_interop/
      • transport-agnostic server-core checks under test/flight/server_core/
  • Includes test/flight_grpcserver.jl as a temporary-environment runner for optional native gRPCServer coverage without mutating test/Project.toml
  • Dedicated CI jobs now exercise the Flight interop suite on stable and nightly Linux; native Julia server transport remains optional/experimental and is not part of the default Flight suite

Third-party data formats:

Canonical extension highlights:

  • UUID now writes the canonical arrow.uuid extension name by default while retaining reader compatibility with legacy JuliaLang.UUID metadata
  • Arrow.TimestampWithOffset{U} provides a canonical arrow.timestamp_with_offset logical type without conflating offset-only semantics with ZonedDateTime
  • Arrow.Bool8 provides an explicit opt-in writer/reader surface for the canonical arrow.bool8 extension without changing the default packed-bit Bool path
  • Arrow.JSONText{String} provides a text-backed logical type for the canonical arrow.json extension without parsing payloads during read or write
  • arrow.opaque now reads as the underlying storage type without warning, and explicit writer metadata can be generated with Arrow.opaquemetadata(type_name, vendor_name)
  • Arrow.variantmetadata(), Arrow.fixedshapetensormetadata(...), and Arrow.variableshapetensormetadata(...) generate canonical metadata strings for advanced canonical extensions
  • arrow.fixed_shape_tensor and arrow.variable_shape_tensor are recognized on read as canonical passthrough extensions over their storage types, and Arrow.jl now validates their canonical metadata plus top-level storage shape before accepting them
  • arrow.parquet.variant is recognized on read as a canonical passthrough extension over its storage type; Arrow.jl currently validates that its canonical metadata is the required empty string, but does not yet implement deeper variant semantics or an automatic writer surface
  • Legacy JuliaLang.ZonedDateTime-UTC and JuliaLang.ZonedDateTime files remain readable for backward compatibility
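As a small illustration of how an extension-backed logical type behaves from the user's side, the sketch below round-trips a UUID column. It only shows that the values come back as `UUID`; it does not inspect which extension name was written, and the column name `id` is made up for illustration.

```julia
using Arrow, UUIDs

ids = [uuid4(), uuid4()]

# Default read: extension metadata is used to convert back to the Julia type.
rt = Arrow.Table(Arrow.tobuffer((id = ids,)))
roundtripped = collect(rt.id)

# With convert=false the column is exposed as its underlying storage type
# instead of being converted back to UUID.
raw = Arrow.Table(Arrow.tobuffer((id = ids,)); convert = false)
raw_eltype = eltype(raw.id)
```

Comparing `eltype(rt.id)` with `raw_eltype` is a quick way to see the difference between the logical type and the storage type that non-Julia consumers would observe.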

See the full documentation for details on reading and writing arrow data.

About

Fork of the official Julia implementation of Apache Arrow, for real production use [maintainer @GTrunSec]