Bump Arrow and Parquet to 59.0.0 #278

Merged
kevinjqliu merged 1 commit into clflushopt:main from kevinjqliu:kevinjqliu/codex-arrow-parquet-59
Jun 25, 2026

@kevinjqliu kevinjqliu commented Jun 24, 2026

Collaborator

Summary

This bumps the Arrow and Parquet dependencies from 58 to 59.

The only test updates are the Parquet row-group byte-size snapshots. Those numbers come from Parquet's physical metadata, and Parquet 59 changed the writer's page batching for variable-width columns in apache/arrow-rs#9972. That shifts a few encoded byte totals for string/comment columns, but the generated data still round-trips correctly.

Validation

  • cargo check --workspace --all-targets
  • cargo fmt --all -- --check
  • cargo test -p tpchgen-cli --test cli_integration
  • cargo test --workspace
  • cargo clippy -p tpchgen-arrow -p tpchgen-cli --all-targets -- -D warnings
  • git diff --check

Benchmark

I also reran the lineitem Parquet benchmark shape from the previous Arrow bump with hyperfine, using the current parquet subcommand and fresh output dirs. I ran it twice with the command order reversed, so the result is less sensitive to ordering/warmup noise. To reproduce locally, build the two release binaries and point these variables at them:

TPCHGEN_58=/path/to/tpchgen-cli-58.3.0
TPCHGEN_59=/path/to/tpchgen-cli-59.0.0
OUT=/tmp/tpchgen-bench-out

hyperfine --runs 5 \
  --prepare "rm -rf $OUT/sf100-p10-58 $OUT/sf100-p10-59" \
  "$TPCHGEN_58 parquet --scale-factor=100 --tables=lineitem --parts=10 --output-dir $OUT/sf100-p10-58" \
  "$TPCHGEN_59 parquet --scale-factor=100 --tables=lineitem --parts=10 --output-dir $OUT/sf100-p10-59"

hyperfine --runs 5 \
  --prepare "rm -rf $OUT/sf100-p10-59-rev $OUT/sf100-p10-58-rev" \
  "$TPCHGEN_59 parquet --scale-factor=100 --tables=lineitem --parts=10 --output-dir $OUT/sf100-p10-59-rev" \
  "$TPCHGEN_58 parquet --scale-factor=100 --tables=lineitem --parts=10 --output-dir $OUT/sf100-p10-58-rev"

The baseline binary resolved Arrow/Parquet to 58.3.0, and the upgraded binary resolved them to 59.0.0.

58 first:
  58.3.0: 32.109 s +/- 0.354 s  [range: 31.702 s ... 32.521 s]
  59.0.0: 33.235 s +/- 0.315 s  [range: 32.832 s ... 33.613 s]
  Summary: 58.3.0 ran 1.04 +/- 0.02x faster

59 first:
  59.0.0: 32.675 s +/- 0.325 s  [range: 32.348 s ... 33.176 s]
  58.3.0: 33.085 s +/- 0.283 s  [range: 32.805 s ... 33.557 s]
  Summary: 59.0.0 ran 1.01 +/- 0.01x faster
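
The two summary ratios above can be re-derived from the reported means. This is a minimal sketch, assuming the summary ratio is simply the slower mean divided by the faster mean; the helper name `speedup` is hypothetical and not part of hyperfine or tpchgen-cli.

```python
def speedup(mean_a, mean_b):
    """Return (index_of_faster, ratio), where ratio = slower mean / faster mean.

    Hypothetical helper for illustration only.
    """
    if mean_a <= mean_b:
        return 0, mean_b / mean_a
    return 1, mean_a / mean_b


# Mean wall-clock times in seconds, copied from the two runs above.
runs = {
    "58 first": {"58.3.0": 32.109, "59.0.0": 33.235},
    "59 first": {"59.0.0": 32.675, "58.3.0": 33.085},
}

for order, means in runs.items():
    (name_a, mean_a), (name_b, mean_b) = means.items()
    idx, ratio = speedup(mean_a, mean_b)
    faster = (name_a, name_b)[idx]
    print(f"{order}: {faster} ran {ratio:.2f}x faster")
```

In both orderings the binary that ran first comes out ahead, which is consistent with ordering or warmup effects rather than a version difference.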

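Combined across both orderings:

  • 58.3.0 averaged 32.597 s
  • 59.0.0 averaged 32.955 s

That is a difference of about 1%, and whichever binary ran first was the faster one in each ordering, so I would read this as no material performance change on my loaded machine.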
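
The combined averages can be checked directly from the four reported means. This is a small sketch using only those published numbers; the variable names are illustrative.

```python
from statistics import mean

# Per-ordering mean times in seconds, copied from the hyperfine output above.
means_58 = [32.109, 33.085]  # "58 first" run, then "59 first" run
means_59 = [33.235, 32.675]

avg_58 = mean(means_58)
avg_59 = mean(means_59)
rel_diff = (avg_59 - avg_58) / avg_58  # positive means 59.0.0 was slower

print(f"58.3.0: {avg_58:.3f} s")
print(f"59.0.0: {avg_59:.3f} s")
print(f"relative difference: {rel_diff:.1%}")
```

The absolute gap between the two averages (about 0.36 s) is roughly the size of a single run's reported standard deviation.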

Output size was effectively unchanged too:

  • 27,146,187,982 bytes for 58.3.0
  • 27,146,169,702 bytes for 59.0.0
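
For scale, the size difference works out as follows. This is a quick arithmetic sketch on the two totals above, nothing more.

```python
# Total output bytes for the SF100 lineitem run, copied from the list above.
size_58 = 27_146_187_982
size_59 = 27_146_169_702

delta = size_59 - size_58        # negative: 59.0.0 output is slightly smaller
fraction = abs(delta) / size_58  # relative change in total size

print(f"delta: {delta} bytes")
print(f"relative change: {fraction:.2e}")
```

The 59.0.0 output is 18,280 bytes smaller, which is less than one part per million of the total.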

@kevinjqliu kevinjqliu marked this pull request as ready for review June 24, 2026 06:15
@kevinjqliu kevinjqliu requested review from alamb and clflushopt June 24, 2026 06:16
@kevinjqliu kevinjqliu changed the title from "Bump arrow-rs version to 59.0.0" to "Bump Arrow and Parquet to 59.0.0" Jun 24, 2026
alamb commented Jun 24, 2026

Collaborator

YAAAAS! I am testing this out locally

alamb commented Jun 24, 2026

Collaborator

My numbers are about the same as yours (basically no difference)

andrewlamb@Andrews-MacBook-Pro-3:~/Downloads$ hyperfine --runs 5 --prepare "rm -rf out" "./tpchgen-cli-58.3 parquet --scale-factor=10 --tables=lineitem --parts=10 --output-dir out"
Benchmark 1: ./tpchgen-cli-58.3 parquet --scale-factor=10 --tables=lineitem --parts=10 --output-dir out
Time (mean ± σ): 3.861 s ± 0.041 s [User: 33.798 s, System: 0.819 s]
Range (min … max): 3.814 s … 3.915 s 5 runs

andrewlamb@Andrews-MacBook-Pro-3:~/Downloads$ hyperfine --runs 5 --prepare "rm -rf out" "./tpchgen-cli-60 parquet --scale-factor=10 --tables=lineitem --parts=10 --output-dir out"
Benchmark 1: ./tpchgen-cli-60 parquet --scale-factor=10 --tables=lineitem --parts=10 --output-dir out
Time (mean ± σ): 3.850 s ± 0.032 s [User: 33.971 s, System: 0.831 s]
Range (min … max): 3.820 s … 3.891 s 5 runs

@alamb alamb left a comment

Collaborator

Thank you @kevinjqliu

Comment thread tpchgen-cli/tests/cli_integration.rs
@kevinjqliu kevinjqliu merged commit 47f96a2 into clflushopt:main Jun 25, 2026
23 checks passed
@kevinjqliu kevinjqliu deleted the kevinjqliu/codex-arrow-parquet-59 branch June 25, 2026 00:32
@kevinjqliu

Collaborator Author

Thanks for the review @alamb!

@clflushopt

Owner

Thanks @kevinjqliu
