Skip to content

Conversation

@kevinjqliu
Copy link
Collaborator

@kevinjqliu kevinjqliu commented Nov 1, 2025

Closes #197 #198 #199

@kevinjqliu
Copy link
Collaborator Author

looks like we need to refactor get_column_writers since its now deprecated

@clflushopt
Copy link
Owner

I am wondering what the impact is of moving to 57 for all three deps

@kevinjqliu
Copy link
Collaborator Author

I am wondering what the impact is of moving to 57 for all three deps

in terms of performance?

Copy link
Collaborator

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @kevinjqliu and @clflushopt

I ran some unscientific benchmarks on my laptop and indeed it seems like the upgrade makes writing 10% slower for some reason 🤔

Benchmark

rm -rf lineitem && time tpchgen-cli --scale-factor=100 --tables=lineitem --parts=10 --format=parquet
Release Time
main 0m25.369s
main 0m25.729s
this PR 0m28.516s
this PR 0m28.682s

I'll do some profiling and see if I can see any reason

@kevinjqliu
Copy link
Collaborator Author

Not scientific either, this shows improvements for this PR vs last release (2.0.1)
only changes since 2.0.1 release are related to github actions, v2.0.1...main

➜ hyperfine --warmup 5 --runs 30 './target/release/tpchgen-cli -s 10'
Benchmark 1: ./target/release/tpchgen-cli -s 10
  Time (mean ± σ):     729.7 ms ±   7.6 ms    [User: 702.9 ms, System: 20.1 ms]
  Range (min … max):   714.5 ms … 748.4 ms    30 runs
 
➜ hyperfine --warmup 5 --runs 30 'uvx tpchgen-cli -s 10'
Benchmark 1: uvx tpchgen-cli -s 10
  Time (mean ± σ):     825.4 ms ±  12.3 ms    [User: 706.8 ms, System: 29.4 ms]
  Range (min … max):   812.7 ms … 873.6 ms    30 runs

@alamb
Copy link
Collaborator

alamb commented Nov 4, 2025

Not scientific either, this shows improvements for this PR vs last release (2.0.1) only changes since 2.0.1 release are related to github actions, v2.0.1...main

➜ hyperfine --warmup 5 --runs 30 './target/release/tpchgen-cli -s 10'
Benchmark 1: ./target/release/tpchgen-cli -s 10
  Time (mean ± σ):     729.7 ms ±   7.6 ms    [User: 702.9 ms, System: 20.1 ms]
  Range (min … max):   714.5 ms … 748.4 ms    30 runs
 
➜ hyperfine --warmup 5 --runs 30 'uvx tpchgen-cli -s 10'
Benchmark 1: uvx tpchgen-cli -s 10
  Time (mean ± σ):     825.4 ms ±  12.3 ms    [User: 706.8 ms, System: 29.4 ms]
  Range (min … max):   812.7 ms … 873.6 ms    30 runs

I think by default tpchgen-cli makes TBL files (not parquet) so this command is likely not testing any changes related to arrow/parquet

@alamb
Copy link
Collaborator

alamb commented Nov 4, 2025

I filed a ticket upstream to investigate and will post my findings there

@kevinjqliu
Copy link
Collaborator Author

I think by default tpchgen-cli makes TBL files (not parquet) so this command is likely not testing any changes related to arrow/parquet

oh yea, good point. I did it agains with the same options you used above. This PR is faster than v2.0.1

➜ hyperfine --warmup 5 --runs 30 './target/release/tpchgen-cli --scale-factor=100 --tables=lineitem --parts=10 --format=parquet'
Benchmark 1: ./target/release/tpchgen-cli --scale-factor=100 --tables=lineitem --parts=10 --format=parquet
  Time (mean ± σ):     784.5 ms ±   3.4 ms    [User: 757.3 ms, System: 20.9 ms]
  Range (min … max):   777.0 ms … 791.4 ms    30 runs

➜ hyperfine --warmup 5 --runs 30 'uvx tpchgen-cli --scale-factor=100 --tables=lineitem --parts=10 --format=parquet'
Benchmark 1: uvx tpchgen-cli --scale-factor=100 --tables=lineitem --parts=10 --format=parquet
  Time (mean ± σ):     869.3 ms ±   5.3 ms    [User: 753.4 ms, System: 29.8 ms]
  Range (min … max):   860.0 ms … 882.8 ms    30 runs

@alamb
Copy link
Collaborator

alamb commented Nov 4, 2025

I think we have figured out what was going on

@clflushopt
Copy link
Owner

I wanted to confirm similar numbers, I spent a little while thinking I was doing something wrong 😮‍💨

@alamb
Copy link
Collaborator

alamb commented Nov 5, 2025

I wanted to confirm similar numbers, I spent a little while thinking I was doing something wrong 😮‍💨

The good (amazing ❤️ ) news is that @etseidl has a fix already (will be released in 57.1.0):

I verified with the code in apache/arrow-rs#8786, this PR now has roughly similar performance to 56

It still seems a few percent slower (0m25.957s vs 0m25.378s) but I will file some issues upstream to further optimize things

@alamb
Copy link
Collaborator

alamb commented Nov 30, 2025

I reran the benchmarks with arrow 57.1.0 and the good news is that the difference is now much less

Release Time
main 0m25.369s
main 0m25.729s
this PR 0m26.552s
this PR 0m26.886s

But it is still like 4-5% slower

@kevinjqliu kevinjqliu force-pushed the kevinjqliu/update-parquet-arrow-57 branch from c38f69b to 57e1b3b Compare December 20, 2025 03:45
@alamb
Copy link
Collaborator

alamb commented Dec 20, 2025

(welcome back @kevinjqliu ❤️ )

@kevinjqliu kevinjqliu changed the title chore(deps): update parquet/arrow/arrow-csv from 56 to 57 chore(deps): update parquet/arrow/arrow-csv from 56 to 57.1 Dec 22, 2025
@kevinjqliu
Copy link
Collaborator Author

updated this PR to use v57.1

I'm trying to reproduce the benchmark between using 56 (main), 57 (previous change), and 57.1 (current change).
So far, I haven't been able to reproduce the numbers alamb reported above...

@alamb were you running this command for the benchmark?

rm -rf lineitem && time tpchgen-cli --scale-factor=100 --tables=lineitem --parts=10 --format=parquet

@kevinjqliu
Copy link
Collaborator Author

Turns out I did not build the binaries correctly. Fixed it using this hyperfine benchmark script.

Now I'm seeing similar benchmark numbers:

📊 Hyperfine Benchmark Summary (v56.2.0 baseline)

Version Average Time (s) Difference from v56.2.0
v56.2.0 30.832 0.000 (0.00%)
v57.0.0 35.380 +4.548 (+14.76%)
v57.1.0 31.744 +0.912 (+2.96%)

@kevinjqliu
Copy link
Collaborator Author

@alamb @clflushopt i think this is good to go. WYDT?

@clflushopt
Copy link
Owner

@alamb @clflushopt i think this is good to go. WYDT?

+1 for bumping to 57.1.0

Copy link
Owner

@clflushopt clflushopt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM thanks @kevinjqliu

@alamb
Copy link
Collaborator

alamb commented Dec 22, 2025

@alamb @clflushopt i think this is good to go. WYDT?

I agree -- I think it is time to merge!

@alamb alamb merged commit 1f1bea1 into clflushopt:main Dec 22, 2025
21 checks passed
@kevinjqliu kevinjqliu deleted the kevinjqliu/update-parquet-arrow-57 branch December 22, 2025 21:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants