Skip to content

chore: upgrade arrow-rs from 57.1.0 to 59.0.0#7141

Open
mikedep333 wants to merge 1 commit into
Eventual-Inc:mainfrom
mikedep333:arrow-59-upgrade
Open

chore: upgrade arrow-rs from 57.1.0 to 59.0.0#7141
mikedep333 wants to merge 1 commit into
Eventual-Inc:mainfrom
mikedep333:arrow-59-upgrade

Conversation

@mikedep333

@mikedep333 mikedep333 commented Jun 18, 2026

Copy link
Copy Markdown
Contributor

Summary

  • Upgrade all arrow-rs workspace dependencies from 57.1.0 to 59.0.0
  • Update Cargo.lock with surgical arrow/parquet-only dependency resolution
  • Fix daft-parquet API compatibility with arrow-rs 59.0.0:
    • Add batch_size parameter to PrimitiveArrayReader::new, NullArrayReader::new, make_byte_view_array_reader, make_byte_array_reader, and make_fixed_len_byte_array_reader calls
    • Update LogicalType::Timestamp and LogicalType::Decimal pattern matching from struct variants to tuple variants (wrapping TimestampType and DecimalType)
  • Update daft-sketch serde_arrow feature from arrow-57 to arrow-59
  • Add arrow-59 feature to daft-ext extension SDK for FFI compatibility

Motivation

arrow-rs 59.0.0 includes an architecture-aware size assertion for the parquet-variant Variant enum, which fixes a compile-time failure on s390x (IBM Z). The previous assertion expected 80 bytes on all 64-bit platforms, but s390x produces 72 bytes due to different
alignment/padding rules on big-endian architectures.

Test plan

  • CI passes on all platforms
  • Existing arrow-dependent tests (daft-sketch, daft-parquet, daft-ext) pass
  • s390x cross-compilation succeeds without parquet-variant patching

@mikedep333 mikedep333 requested a review from a team as a code owner June 18, 2026 18:33
@github-actions github-actions Bot added the chore label Jun 18, 2026
@mikedep333 mikedep333 marked this pull request as draft June 18, 2026 18:58
@mikedep333 mikedep333 force-pushed the arrow-59-upgrade branch 2 times, most recently from c7c1260 to b32cd08 Compare June 18, 2026 20:17
@mikedep333 mikedep333 marked this pull request as ready for review June 18, 2026 20:46
@greptile-apps

greptile-apps Bot commented Jun 18, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

Upgrades all arrow-rs workspace crates and parquet from 57.1.0 to 59.0.0, adapting three downstream crates (daft-parquet, daft-sketch, daft-ext) to the new API surface. The primary motivation is fixing a compile-time size assertion failure on s390x introduced in parquet-variant.

  • daft-parquet: migrates WriterProperties::set_max_row_group_sizeset_max_row_group_row_count(Some(...)) in benchmarks; adds the new required batch_size parameter to all PrimitiveArrayReader/NullArrayReader/byte-reader constructors (passing total_rows as the hint); updates LogicalType::Timestamp and LogicalType::Decimal destructuring to match the new tuple-variant API.
  • daft-sketch: advances serde_arrow's feature flag from arrow-57 to arrow-59; Cargo.lock confirms serde_arrow 0.14.2 resolves correctly against arrow 59.
  • daft-ext: adds the arrow-59 optional feature and extends the cfg-gated "prefer highest version" chain, following the existing pattern for arrow-56/57/58.

Confidence Score: 4/5

Safe to merge; all changes are mechanical API adaptations driven by the arrow-rs 59.0.0 breaking changes, with no logic alterations.

The upgrade is well-scoped and the Cargo.lock confirms all transitive dependencies resolve correctly. The one open question is whether passing the full row-group size as the internal batch_size hint to every array-reader constructor is optimal — it could cause larger-than-necessary buffer pre-allocations for row groups with millions of rows — but it does not affect correctness.

src/daft-parquet/src/reader/field_reader.rs — specifically the choice of total_rows as the batch_size argument to all array reader constructors.

Important Files Changed

Filename Overview
Cargo.toml Bumps all arrow-* workspace dependencies and parquet from 57.1.0 to 59.0.0; straightforward version-pin update.
src/daft-ext/Cargo.toml Adds arrow-59 optional feature (schema/data/array deps) and extends the ignored list, following the exact same pattern as arrow-56/57/58.
src/daft-ext/src/helpers.rs Extends the prefer-highest-version cfg chain to include arrow-59; all existing conditions correctly updated with not(feature = arrow-59) guards.
src/daft-parquet/src/reader/field_reader.rs Adds total_rows (row group size) as the new required batch_size parameter to PrimitiveArrayReader::new, NullArrayReader::new, and the byte reader helpers; may over-allocate for large row groups.
src/daft-parquet/src/statistics/column_range.rs Updates LogicalType::Timestamp and LogicalType::Decimal destructuring from struct variants to tuple variants to match arrow-rs 59 API.
src/daft-parquet/benches/parquet_read.rs Migrates set_max_row_group_size to set_max_row_group_row_count(Some(...)) in six bench helpers; correct API migration for parquet 59.0.0.
src/daft-sketch/Cargo.toml Updates serde_arrow feature from arrow-57 to arrow-59; Cargo.lock confirms serde_arrow 0.14.2 resolves arrow-array/schema 59.0.0 correctly.
src/daft-ext/src/abi/compat.rs Adds cfg(feature = arrow-59) impl_arrow_conversions! block and updates the doc comment; mirrors existing arrow-56/57/58 entries exactly.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[decode_one_streaming chunk_size available] --> B[build_top_field_reader]
    B --> C[build_primitive_leaf_reader total_rows = rg.num_rows]
    C --> D{physical type}
    D -->|BOOLEAN/INT32/INT64/FLOAT/DOUBLE/INT96| E[PrimitiveArrayReader::new pages col_descr arrow_type total_rows]
    D -->|NULL| F[NullArrayReader::new pages col_descr total_rows]
    D -->|BYTE_ARRAY Utf8View/BinaryView| G[make_byte_view_array_reader total_rows]
    D -->|BYTE_ARRAY other| H[make_byte_array_reader total_rows]
    D -->|FIXED_LEN_BYTE_ARRAY| I[make_fixed_len_byte_array_reader total_rows]
    E & F & G & H & I --> J[Box dyn ArrayReader]
    J --> K[read_records capped at chunk_size]
    K --> L[consume_batch send ArrayRef]
Loading
%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
flowchart TD
    A[decode_one_streaming chunk_size available] --> B[build_top_field_reader]
    B --> C[build_primitive_leaf_reader total_rows = rg.num_rows]
    C --> D{physical type}
    D -->|BOOLEAN/INT32/INT64/FLOAT/DOUBLE/INT96| E[PrimitiveArrayReader::new pages col_descr arrow_type total_rows]
    D -->|NULL| F[NullArrayReader::new pages col_descr total_rows]
    D -->|BYTE_ARRAY Utf8View/BinaryView| G[make_byte_view_array_reader total_rows]
    D -->|BYTE_ARRAY other| H[make_byte_array_reader total_rows]
    D -->|FIXED_LEN_BYTE_ARRAY| I[make_fixed_len_byte_array_reader total_rows]
    E & F & G & H & I --> J[Box dyn ArrayReader]
    J --> K[read_records capped at chunk_size]
    K --> L[consume_batch send ArrayRef]
Loading

Reviews (1): Last reviewed commit: "chore: upgrade arrow-rs from 57.1.0 to 5..." | Re-trigger Greptile

Comment on lines 103 to 155
let reader: Box<dyn ArrayReader> = if matches!(arrow_type, arrow::datatypes::DataType::Null) {
Box::new(NullArrayReader::<Int32Type>::new(pages, col_descr)?)
Box::new(NullArrayReader::<Int32Type>::new(
pages, col_descr, total_rows,
)?)
} else {
match physical_type {
PhysicalType::BOOLEAN => Box::new(PrimitiveArrayReader::<BoolType>::new(
pages,
col_descr,
Some(arrow_type),
total_rows,
)?),
PhysicalType::INT32 => Box::new(PrimitiveArrayReader::<Int32Type>::new(
pages,
col_descr,
Some(arrow_type),
total_rows,
)?),
PhysicalType::INT64 => Box::new(PrimitiveArrayReader::<Int64Type>::new(
pages,
col_descr,
Some(arrow_type),
total_rows,
)?),
PhysicalType::FLOAT => Box::new(PrimitiveArrayReader::<FloatType>::new(
pages,
col_descr,
Some(arrow_type),
total_rows,
)?),
PhysicalType::DOUBLE => Box::new(PrimitiveArrayReader::<DoubleType>::new(
pages,
col_descr,
Some(arrow_type),
total_rows,
)?),
PhysicalType::INT96 => Box::new(PrimitiveArrayReader::<Int96Type>::new(
pages,
col_descr,
Some(arrow_type),
total_rows,
)?),
PhysicalType::BYTE_ARRAY => match arrow_type {
arrow::datatypes::DataType::Utf8View | arrow::datatypes::DataType::BinaryView => {
make_byte_view_array_reader(pages, col_descr, Some(arrow_type))?
make_byte_view_array_reader(pages, col_descr, Some(arrow_type), total_rows)?
}
_ => make_byte_array_reader(pages, col_descr, Some(arrow_type))?,
_ => make_byte_array_reader(pages, col_descr, Some(arrow_type), total_rows)?,
},
PhysicalType::FIXED_LEN_BYTE_ARRAY => {
make_fixed_len_byte_array_reader(pages, col_descr, Some(arrow_type))?
make_fixed_len_byte_array_reader(pages, col_descr, Some(arrow_type), total_rows)?
}
}
};

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 total_rows as batch_size may cause over-allocation

total_rows here is rg.num_rows() as usize — the full row-group size — yet the actual reads are chunked at the chunk_size level in decode_one_streaming. If batch_size controls internal buffer pre-allocation inside parquet-rs (as is typical), passing the full row-group count (potentially millions of rows) instead of the chunk_size hint could cause the reader to pre-allocate far more memory than is ever needed per decode cycle.

Consider threading chunk_size down through build_top_field_readerbuild_primitive_leaf_reader so each reader pre-allocates only what will actually be consumed per batch.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this seems like a reasonable point. wdyt?

Upgrade all arrow-rs workspace dependencies (arrow, arrow-array,
arrow-buffer, arrow-csv, arrow-data, arrow-flight, arrow-ipc,
arrow-json, arrow-row, arrow-schema, arrow-select, parquet) from
57.1.0 to 59.0.0.

API changes addressed:
- PrimitiveArrayReader::new, NullArrayReader::new, and byte array
  reader constructors now require a batch_size parameter
- LogicalType::Timestamp and LogicalType::Decimal changed from struct
  variants to tuple variants wrapping TimestampType and DecimalType

This upgrade brings native s390x-aware size assertions for the
parquet-variant Variant enum, eliminating the need for architecture-
specific patches.

Also updates daft-ext to support arrow-59 feature alongside existing
arrow-56/57/58, and updates serde_arrow feature from arrow-57 to
arrow-59 in daft-sketch.

Signed-off-by: Mike DePaulo <mdepaulo@redhat.com>
Signed-off-by: Mike DePaulo <mikedep333@redhat.com>

@srilman srilman left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

overall, lgtm. just one thing that greptile points out

Comment on lines 103 to 155
let reader: Box<dyn ArrayReader> = if matches!(arrow_type, arrow::datatypes::DataType::Null) {
Box::new(NullArrayReader::<Int32Type>::new(pages, col_descr)?)
Box::new(NullArrayReader::<Int32Type>::new(
pages, col_descr, total_rows,
)?)
} else {
match physical_type {
PhysicalType::BOOLEAN => Box::new(PrimitiveArrayReader::<BoolType>::new(
pages,
col_descr,
Some(arrow_type),
total_rows,
)?),
PhysicalType::INT32 => Box::new(PrimitiveArrayReader::<Int32Type>::new(
pages,
col_descr,
Some(arrow_type),
total_rows,
)?),
PhysicalType::INT64 => Box::new(PrimitiveArrayReader::<Int64Type>::new(
pages,
col_descr,
Some(arrow_type),
total_rows,
)?),
PhysicalType::FLOAT => Box::new(PrimitiveArrayReader::<FloatType>::new(
pages,
col_descr,
Some(arrow_type),
total_rows,
)?),
PhysicalType::DOUBLE => Box::new(PrimitiveArrayReader::<DoubleType>::new(
pages,
col_descr,
Some(arrow_type),
total_rows,
)?),
PhysicalType::INT96 => Box::new(PrimitiveArrayReader::<Int96Type>::new(
pages,
col_descr,
Some(arrow_type),
total_rows,
)?),
PhysicalType::BYTE_ARRAY => match arrow_type {
arrow::datatypes::DataType::Utf8View | arrow::datatypes::DataType::BinaryView => {
make_byte_view_array_reader(pages, col_descr, Some(arrow_type))?
make_byte_view_array_reader(pages, col_descr, Some(arrow_type), total_rows)?
}
_ => make_byte_array_reader(pages, col_descr, Some(arrow_type))?,
_ => make_byte_array_reader(pages, col_descr, Some(arrow_type), total_rows)?,
},
PhysicalType::FIXED_LEN_BYTE_ARRAY => {
make_fixed_len_byte_array_reader(pages, col_descr, Some(arrow_type))?
make_fixed_len_byte_array_reader(pages, col_descr, Some(arrow_type), total_rows)?
}
}
};

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this seems like a reasonable point. wdyt?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants