Skip to content

Alamb/smaller row groups#151

Closed
alamb wants to merge 2 commits into
clflushopt:mainfrom
alamb:alamb/smaller_row_groups
Closed

Alamb/smaller row groups#151
alamb wants to merge 2 commits into
clflushopt:mainfrom
alamb:alamb/smaller_row_groups

Conversation

@alamb

@alamb alamb commented Jun 30, 2025

Copy link
Copy Markdown
Collaborator

This is a PR to test if making smaller row groups solves the problem reported in #150

I tested this PR using

RUST_LOG=debug cargo run --release --  -s 100000 --format=parquet --tables lineitem --parts 100000 --part 1

Before this PR the output parquet file has a single row group

After this PR the output parquet file has 6 row groups (which makes sense for 6M row file)

> select count(*) from 'lineitem-1.parquet';
+----------+
| count(*) |
+----------+
| 6001215  |
+----------+
1 row(s) fetched.
Elapsed 0.051 seconds.

BTW you can look at the output parquet using the very cool Parquet Viewer from @XiangpengHao : https://parquet-viewer.xiangpeng.systems/

@mrendi29

mrendi29 commented Jun 30, 2025

Copy link
Copy Markdown

The Parquet viewer looks really neat!

I can confirm that the PR fixes the issue of having a single rowgroup:

root@localhost:~/tpchgen-rs# RUST_LOG=debug cargo run --release --  -s 750000 --format=parquet --tables lineitem --parts 150000 --part 1 --parquet-compression 'zstd(3)'
    Finished `release` profile [optimized] target(s) in 0.09s
     Running `target/release/tpchgen-cli -s 750000 --format=parquet --tables lineitem --parts 150000 --part 1 --parquet-compression 'zstd(3)'`
[2025-06-30T21:35:01Z DEBUG tpchgen_cli] Logging configured from environment variables
[2025-06-30T21:35:01Z DEBUG tpchgen_cli] Creating distributions and text pool
[2025-06-30T21:35:02Z INFO  tpchgen_cli] Created static distributions and text pools in 1.275865785s
[2025-06-30T21:35:02Z INFO  tpchgen_cli] Writing table lineitem (SF=750000) to lineitem.parquet
[2025-06-30T21:35:02Z DEBUG tpchgen_cli] Generating 150000 parts in total
[2025-06-30T21:35:02Z DEBUG tpchgen_cli::parquet] Generating Parquet with 56 threads, using ZSTD(ZstdLevel(3)) compression
[2025-06-30T21:35:36Z INFO  tpchgen_cli::statistics] Created 0.89 GB in 34.424730852s (0.03 GB/sec)
[2025-06-30T21:35:36Z DEBUG tpchgen_cli::statistics] Wrote 950384341 bytes in 29 row groups  31.25 MB/row groups
[2025-06-30T21:35:36Z INFO  tpchgen_cli] Generation complete!

root@localhost:~/tpchgen-rs# parquet-tools inspect lineitem.parquet  | head
############ file meta data ############
created_by: parquet-rs version 54.3.1
num_columns: 16
num_rows: 29999795
num_row_groups: 29
format_version: 1.0
serialized_size: 51889

However, high memory usage still seems to be an issue. It seems that the file is still created in memory first and only then flushed to disk. Not 100% sure if that is still expected after having separate row_groups.

@alamb

alamb commented Jun 30, 2025

Copy link
Copy Markdown
Collaborator Author

However, high memory usage still seems to be an issue. It seems that the file is still created in memory first and only then flushed to disk. Not 100% sure if that is still expected after having separate row_groups.

Yeah, you are right.

I think @clflushopt 's solution in #150 (comment) might be the actual fix (this row group limiting might be good in its own right, but we can discuss that separately)

@mrendi29

mrendi29 commented Jul 1, 2025

Copy link
Copy Markdown

this row group limiting might be good in its own right, but we can discuss that separately

I definitely think this should be merged as well so that we also have proper row groups.

Would you like me to open a separate issue to cover this?

@alamb

alamb commented Jul 1, 2025

Copy link
Copy Markdown
Collaborator Author

this row group limiting might be good in its own right, but we can discuss that separately

I definitely think this should be merged as well so that we also have proper row groups.

Would you like me to open a separate issue to cover this?

Thank you -- that would be amazing 🙏

@mrendi29

mrendi29 commented Jul 1, 2025

Copy link
Copy Markdown

@alamb created #152 to track this issue

@kevinjqliu kevinjqliu mentioned this pull request Jul 7, 2025
5 tasks
@mwc360

mwc360 commented Jul 8, 2025

Copy link
Copy Markdown

My 2 cents: setting a target number of rows per row group isn't ideal when generating benchmark data as depending on the size of each row you could have wildly different sizes of row groups in MB (i.e. lineitem will have almost 2x bigger RGs in MB compared to orders). Knowing how many rows to target also becomes a game of trial and error. On the inverse, having row groups that target a given size in MB offers more predictable chunks of data to parallelize, prune, etc.

In LakeBench the data generator is currently backed by DuckDB (I only didn't use tpchgen-rs due to the RGs not being configurable). Since DuckDB only supports target RGs by rows, I have to write out a sample file, calculate the size of each row and then use that to automatically calculate the number of rows in a RG that approximates my target size in MB.

@alamb

alamb commented Jul 8, 2025

Copy link
Copy Markdown
Collaborator Author

I'll see what I can do to make the target size configurable

@mrendi29

Copy link
Copy Markdown

I happened to meet @alamb in person last week and one of the things we discussed was this PR. I will open a separate PR to add some test coverage for row groups.

@kevinjqliu

Copy link
Copy Markdown
Collaborator

@mrendi29 would you like to collaborate? I have a branch with tests that I havent pushed yet

@mrendi29

Copy link
Copy Markdown

@mrendi29 would you like to collaborate? I have a branch with tests that I havent pushed yet

Absolutely! Let me know whenever you push it and we can tackle it together

@alamb

alamb commented Jul 14, 2025

Copy link
Copy Markdown
Collaborator Author

Yes that would be amazing -- if you could make a PR with some tests that showed a single large row group, I would be happy to help port this code over / update the tests

@alamb

alamb commented Jul 15, 2025

Copy link
Copy Markdown
Collaborator Author

I am starting to hack out some tests with the help of copilot

@alamb

alamb commented Jul 15, 2025

Copy link
Copy Markdown
Collaborator Author

Integration tests here:

@kevinjqliu

Copy link
Copy Markdown
Collaborator

@mrendi29 i pushed the branch at #158

@mrendi29

Copy link
Copy Markdown

Thanks @kevinjqliu , i will take a look at it in a couple of hours

@alamb

alamb commented Jul 29, 2025

Copy link
Copy Markdown
Collaborator Author

Let's go with an alternate approach instead:

I think that is a lot simpler

@alamb alamb closed this Jul 29, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants