chore: upgrade arrow-rs from 57.1.0 to 59.0.0#7141
Conversation
c7c1260 to
b32cd08
Compare
Greptile SummaryUpgrades all
Confidence Score: 4/5Safe to merge; all changes are mechanical API adaptations driven by the arrow-rs 59.0.0 breaking changes, with no logic alterations. The upgrade is well-scoped and the Cargo.lock confirms all transitive dependencies resolve correctly. The one open question is whether passing the full row-group size as the internal batch_size hint to every array-reader constructor is optimal — it could cause larger-than-necessary buffer pre-allocations for row groups with millions of rows — but it does not affect correctness. src/daft-parquet/src/reader/field_reader.rs — specifically the choice of total_rows as the batch_size argument to all array reader constructors. Important Files Changed
Flowchart%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[decode_one_streaming chunk_size available] --> B[build_top_field_reader]
B --> C[build_primitive_leaf_reader total_rows = rg.num_rows]
C --> D{physical type}
D -->|BOOLEAN/INT32/INT64/FLOAT/DOUBLE/INT96| E[PrimitiveArrayReader::new pages col_descr arrow_type total_rows]
D -->|NULL| F[NullArrayReader::new pages col_descr total_rows]
D -->|BYTE_ARRAY Utf8View/BinaryView| G[make_byte_view_array_reader total_rows]
D -->|BYTE_ARRAY other| H[make_byte_array_reader total_rows]
D -->|FIXED_LEN_BYTE_ARRAY| I[make_fixed_len_byte_array_reader total_rows]
E & F & G & H & I --> J[Box dyn ArrayReader]
J --> K[read_records capped at chunk_size]
K --> L[consume_batch send ArrayRef]
%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
flowchart TD
A[decode_one_streaming chunk_size available] --> B[build_top_field_reader]
B --> C[build_primitive_leaf_reader total_rows = rg.num_rows]
C --> D{physical type}
D -->|BOOLEAN/INT32/INT64/FLOAT/DOUBLE/INT96| E[PrimitiveArrayReader::new pages col_descr arrow_type total_rows]
D -->|NULL| F[NullArrayReader::new pages col_descr total_rows]
D -->|BYTE_ARRAY Utf8View/BinaryView| G[make_byte_view_array_reader total_rows]
D -->|BYTE_ARRAY other| H[make_byte_array_reader total_rows]
D -->|FIXED_LEN_BYTE_ARRAY| I[make_fixed_len_byte_array_reader total_rows]
E & F & G & H & I --> J[Box dyn ArrayReader]
J --> K[read_records capped at chunk_size]
K --> L[consume_batch send ArrayRef]
Reviews (1): Last reviewed commit: "chore: upgrade arrow-rs from 57.1.0 to 5..." | Re-trigger Greptile |
| let reader: Box<dyn ArrayReader> = if matches!(arrow_type, arrow::datatypes::DataType::Null) { | ||
| Box::new(NullArrayReader::<Int32Type>::new(pages, col_descr)?) | ||
| Box::new(NullArrayReader::<Int32Type>::new( | ||
| pages, col_descr, total_rows, | ||
| )?) | ||
| } else { | ||
| match physical_type { | ||
| PhysicalType::BOOLEAN => Box::new(PrimitiveArrayReader::<BoolType>::new( | ||
| pages, | ||
| col_descr, | ||
| Some(arrow_type), | ||
| total_rows, | ||
| )?), | ||
| PhysicalType::INT32 => Box::new(PrimitiveArrayReader::<Int32Type>::new( | ||
| pages, | ||
| col_descr, | ||
| Some(arrow_type), | ||
| total_rows, | ||
| )?), | ||
| PhysicalType::INT64 => Box::new(PrimitiveArrayReader::<Int64Type>::new( | ||
| pages, | ||
| col_descr, | ||
| Some(arrow_type), | ||
| total_rows, | ||
| )?), | ||
| PhysicalType::FLOAT => Box::new(PrimitiveArrayReader::<FloatType>::new( | ||
| pages, | ||
| col_descr, | ||
| Some(arrow_type), | ||
| total_rows, | ||
| )?), | ||
| PhysicalType::DOUBLE => Box::new(PrimitiveArrayReader::<DoubleType>::new( | ||
| pages, | ||
| col_descr, | ||
| Some(arrow_type), | ||
| total_rows, | ||
| )?), | ||
| PhysicalType::INT96 => Box::new(PrimitiveArrayReader::<Int96Type>::new( | ||
| pages, | ||
| col_descr, | ||
| Some(arrow_type), | ||
| total_rows, | ||
| )?), | ||
| PhysicalType::BYTE_ARRAY => match arrow_type { | ||
| arrow::datatypes::DataType::Utf8View | arrow::datatypes::DataType::BinaryView => { | ||
| make_byte_view_array_reader(pages, col_descr, Some(arrow_type))? | ||
| make_byte_view_array_reader(pages, col_descr, Some(arrow_type), total_rows)? | ||
| } | ||
| _ => make_byte_array_reader(pages, col_descr, Some(arrow_type))?, | ||
| _ => make_byte_array_reader(pages, col_descr, Some(arrow_type), total_rows)?, | ||
| }, | ||
| PhysicalType::FIXED_LEN_BYTE_ARRAY => { | ||
| make_fixed_len_byte_array_reader(pages, col_descr, Some(arrow_type))? | ||
| make_fixed_len_byte_array_reader(pages, col_descr, Some(arrow_type), total_rows)? | ||
| } | ||
| } | ||
| }; |
There was a problem hiding this comment.
total_rows as batch_size may cause over-allocation
total_rows here is rg.num_rows() as usize — the full row-group size — yet the actual reads are chunked at the chunk_size level in decode_one_streaming. If batch_size controls internal buffer pre-allocation inside parquet-rs (as is typical), passing the full row-group count (potentially millions of rows) instead of the chunk_size hint could cause the reader to pre-allocate far more memory than is ever needed per decode cycle.
Consider threading chunk_size down through build_top_field_reader → build_primitive_leaf_reader so each reader pre-allocates only what will actually be consumed per batch.
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
There was a problem hiding this comment.
this seems like a reasonable point. wdyt?
Upgrade all arrow-rs workspace dependencies (arrow, arrow-array, arrow-buffer, arrow-csv, arrow-data, arrow-flight, arrow-ipc, arrow-json, arrow-row, arrow-schema, arrow-select, parquet) from 57.1.0 to 59.0.0. API changes addressed: - PrimitiveArrayReader::new, NullArrayReader::new, and byte array reader constructors now require a batch_size parameter - LogicalType::Timestamp and LogicalType::Decimal changed from struct variants to tuple variants wrapping TimestampType and DecimalType This upgrade brings native s390x-aware size assertions for the parquet-variant Variant enum, eliminating the need for architecture- specific patches. Also updates daft-ext to support arrow-59 feature alongside existing arrow-56/57/58, and updates serde_arrow feature from arrow-57 to arrow-59 in daft-sketch. Signed-off-by: Mike DePaulo <mdepaulo@redhat.com> Signed-off-by: Mike DePaulo <mikedep333@redhat.com>
b32cd08 to
ed618b6
Compare
srilman
left a comment
There was a problem hiding this comment.
overall, lgtm. just one thing that greptile points out
| let reader: Box<dyn ArrayReader> = if matches!(arrow_type, arrow::datatypes::DataType::Null) { | ||
| Box::new(NullArrayReader::<Int32Type>::new(pages, col_descr)?) | ||
| Box::new(NullArrayReader::<Int32Type>::new( | ||
| pages, col_descr, total_rows, | ||
| )?) | ||
| } else { | ||
| match physical_type { | ||
| PhysicalType::BOOLEAN => Box::new(PrimitiveArrayReader::<BoolType>::new( | ||
| pages, | ||
| col_descr, | ||
| Some(arrow_type), | ||
| total_rows, | ||
| )?), | ||
| PhysicalType::INT32 => Box::new(PrimitiveArrayReader::<Int32Type>::new( | ||
| pages, | ||
| col_descr, | ||
| Some(arrow_type), | ||
| total_rows, | ||
| )?), | ||
| PhysicalType::INT64 => Box::new(PrimitiveArrayReader::<Int64Type>::new( | ||
| pages, | ||
| col_descr, | ||
| Some(arrow_type), | ||
| total_rows, | ||
| )?), | ||
| PhysicalType::FLOAT => Box::new(PrimitiveArrayReader::<FloatType>::new( | ||
| pages, | ||
| col_descr, | ||
| Some(arrow_type), | ||
| total_rows, | ||
| )?), | ||
| PhysicalType::DOUBLE => Box::new(PrimitiveArrayReader::<DoubleType>::new( | ||
| pages, | ||
| col_descr, | ||
| Some(arrow_type), | ||
| total_rows, | ||
| )?), | ||
| PhysicalType::INT96 => Box::new(PrimitiveArrayReader::<Int96Type>::new( | ||
| pages, | ||
| col_descr, | ||
| Some(arrow_type), | ||
| total_rows, | ||
| )?), | ||
| PhysicalType::BYTE_ARRAY => match arrow_type { | ||
| arrow::datatypes::DataType::Utf8View | arrow::datatypes::DataType::BinaryView => { | ||
| make_byte_view_array_reader(pages, col_descr, Some(arrow_type))? | ||
| make_byte_view_array_reader(pages, col_descr, Some(arrow_type), total_rows)? | ||
| } | ||
| _ => make_byte_array_reader(pages, col_descr, Some(arrow_type))?, | ||
| _ => make_byte_array_reader(pages, col_descr, Some(arrow_type), total_rows)?, | ||
| }, | ||
| PhysicalType::FIXED_LEN_BYTE_ARRAY => { | ||
| make_fixed_len_byte_array_reader(pages, col_descr, Some(arrow_type))? | ||
| make_fixed_len_byte_array_reader(pages, col_descr, Some(arrow_type), total_rows)? | ||
| } | ||
| } | ||
| }; |
There was a problem hiding this comment.
this seems like a reasonable point. wdyt?
Summary
Motivation
arrow-rs 59.0.0 includes an architecture-aware size assertion for the parquet-variant Variant enum, which fixes a compile-time failure on s390x (IBM Z). The previous assertion expected 80 bytes on all 64-bit platforms, but s390x produces 72 bytes due to different
alignment/padding rules on big-endian architectures.
Test plan