Alamb/smaller row groups by alamb · Pull Request #151 · clflushopt/tpchgen-rs

alamb · 2025-06-30T21:01:48Z

Related to [BUG] Growing Memory when using partitioned writer #150
Related to [FEATURE] add an option for row group size #146

This is a PR to test if making smaller row groups solves the problem reported in #150

I tested this PR using

RUST_LOG=debug cargo run --release --  -s 100000 --format=parquet --tables lineitem --parts 100000 --part 1

Before this PR the output parquet file has a single row group

After this PR the output parquet file has 6 row groups (which makes sense for a 6M-row file)

> select count(*) from 'lineitem-1.parquet';
+----------+
| count(*) |
+----------+
| 6001215  |
+----------+
1 row(s) fetched.
Elapsed 0.051 seconds.
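
As a rough check of the "6 row groups for 6M rows" observation, the sketch below computes how many row groups a file needs when each group is capped at a fixed number of rows. The cap of 1024 * 1024 rows is an assumption (it matches the default maximum row group size in the Rust parquet crate); the thread does not state the value this PR uses, and `row_group_count` is a hypothetical helper, not part of tpchgen-cli.

```rust
/// Number of row groups needed for `total_rows` when each group holds
/// at most `max_rows_per_group` rows (ceiling division).
fn row_group_count(total_rows: u64, max_rows_per_group: u64) -> u64 {
    assert!(max_rows_per_group > 0, "row group cap must be positive");
    total_rows.div_ceil(max_rows_per_group)
}

fn main() {
    // Assumed cap: 1024 * 1024 rows per row group (not confirmed by the PR).
    let cap: u64 = 1024 * 1024;
    // Row count taken from the `select count(*)` output above.
    println!("{}", row_group_count(6_001_215, cap)); // prints 6
}
```

Note that a cap of exactly 1,000,000 rows would give 7 groups for this file, so the reported count fits a power-of-two cap better.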

BTW, you can inspect the output parquet file using the very cool Parquet Viewer from @XiangpengHao: https://parquet-viewer.xiangpeng.systems/

mrendi29 · 2025-06-30T21:41:08Z

The Parquet viewer looks really neat!

I can confirm that the PR fixes the issue of having a single rowgroup:

root@localhost:~/tpchgen-rs# RUST_LOG=debug cargo run --release --  -s 750000 --format=parquet --tables lineitem --parts 150000 --part 1 --parquet-compression 'zstd(3)'
    Finished `release` profile [optimized] target(s) in 0.09s
     Running `target/release/tpchgen-cli -s 750000 --format=parquet --tables lineitem --parts 150000 --part 1 --parquet-compression 'zstd(3)'`
[2025-06-30T21:35:01Z DEBUG tpchgen_cli] Logging configured from environment variables
[2025-06-30T21:35:01Z DEBUG tpchgen_cli] Creating distributions and text pool
[2025-06-30T21:35:02Z INFO  tpchgen_cli] Created static distributions and text pools in 1.275865785s
[2025-06-30T21:35:02Z INFO  tpchgen_cli] Writing table lineitem (SF=750000) to lineitem.parquet
[2025-06-30T21:35:02Z DEBUG tpchgen_cli] Generating 150000 parts in total
[2025-06-30T21:35:02Z DEBUG tpchgen_cli::parquet] Generating Parquet with 56 threads, using ZSTD(ZstdLevel(3)) compression
[2025-06-30T21:35:36Z INFO  tpchgen_cli::statistics] Created 0.89 GB in 34.424730852s (0.03 GB/sec)
[2025-06-30T21:35:36Z DEBUG tpchgen_cli::statistics] Wrote 950384341 bytes in 29 row groups  31.25 MB/row groups
[2025-06-30T21:35:36Z INFO  tpchgen_cli] Generation complete!

root@localhost:~/tpchgen-rs# parquet-tools inspect lineitem.parquet  | head
############ file meta data ############
created_by: parquet-rs version 54.3.1
num_columns: 16
num_rows: 29999795
num_row_groups: 29
format_version: 1.0
serialized_size: 51889
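
The figures in the log and the `parquet-tools` output can be cross-checked with a little arithmetic. The sketch below assumes that "MB" and "GB" in the log mean binary units (2^20 and 2^30 bytes); that is an inference from the numbers, not something the log states. The helper names are hypothetical.

```rust
const MIB: f64 = 1024.0 * 1024.0;

/// Average row group size in MiB.
fn mib_per_row_group(total_bytes: u64, row_groups: u64) -> f64 {
    total_bytes as f64 / row_groups as f64 / MIB
}

/// Average number of rows per row group.
fn rows_per_row_group(total_rows: u64, row_groups: u64) -> f64 {
    total_rows as f64 / row_groups as f64
}

fn main() {
    // Values copied from the log and parquet-tools output above.
    let bytes: u64 = 950_384_341;
    let rows: u64 = 29_999_795;
    let groups: u64 = 29;

    println!("{:.2} GiB total", bytes as f64 / (MIB * 1024.0));
    println!("{:.2} MiB per row group", mib_per_row_group(bytes, groups));
    println!("{:.0} rows per row group", rows_per_row_group(rows, groups));
}
```

Under that assumption the results agree with the logged 0.89 GB and 31.25 MB per row group, and the average of roughly one million rows per group is consistent with a row-count cap.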

However, high memory usage still seems to be an issue. It seems that the file is still created in memory first and only then flushed to disk. Not 100% sure if that is still expected after having separate row_groups.
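
To make the expectation concrete: if each row group were written to the output as soon as it filled up, peak memory would be bounded by one row group rather than the whole file. The sketch below is a toy illustration of that behavior using only the standard library. `RowGroupWriter` is hypothetical and is not how tpchgen-cli or the parquet crate is implemented; rows are just byte slices.

```rust
use std::io::{self, Write};

/// Hypothetical writer that buffers rows and flushes them to the sink
/// as one "row group" whenever `max_rows` rows have accumulated.
struct RowGroupWriter<W: Write> {
    sink: W,
    max_rows: usize,
    buffered: Vec<Vec<u8>>,
    peak_buffered_rows: usize,
    groups_flushed: usize,
}

impl<W: Write> RowGroupWriter<W> {
    fn new(sink: W, max_rows: usize) -> Self {
        assert!(max_rows > 0, "row group cap must be positive");
        Self { sink, max_rows, buffered: Vec::new(), peak_buffered_rows: 0, groups_flushed: 0 }
    }

    fn write_row(&mut self, row: &[u8]) -> io::Result<()> {
        self.buffered.push(row.to_vec());
        // Track the largest number of rows held in memory at once.
        self.peak_buffered_rows = self.peak_buffered_rows.max(self.buffered.len());
        if self.buffered.len() >= self.max_rows {
            self.flush_group()?;
        }
        Ok(())
    }

    fn flush_group(&mut self) -> io::Result<()> {
        if self.buffered.is_empty() {
            return Ok(());
        }
        // Writing drains the buffer, so memory is released per group.
        for row in self.buffered.drain(..) {
            self.sink.write_all(&row)?;
        }
        self.groups_flushed += 1;
        Ok(())
    }

    /// Flushes the final partial group and returns
    /// (sink, groups flushed, peak rows buffered).
    fn finish(mut self) -> io::Result<(W, usize, usize)> {
        self.flush_group()?;
        self.sink.flush()?;
        Ok((self.sink, self.groups_flushed, self.peak_buffered_rows))
    }
}

fn main() -> io::Result<()> {
    let mut writer = RowGroupWriter::new(Vec::new(), 4);
    for i in 0..10u8 {
        writer.write_row(&[i])?;
    }
    let (out, groups, peak) = writer.finish()?;
    println!("{} bytes, {} groups, peak {} rows buffered", out.len(), groups, peak);
    Ok(())
}
```

With 10 rows and a cap of 4, the toy writer emits 3 groups and never holds more than 4 rows. Whether the real writer can behave this way depends on where the encoded data is buffered before it reaches the file, which is the open question in this comment.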

alamb · 2025-06-30T22:54:37Z

> However, high memory usage still seems to be an issue. It seems that the file is still created in memory first and only then flushed to disk. Not 100% sure if that is still expected after having separate row_groups.

Yeah, you are right.

I think @clflushopt's solution in #150 (comment) might be the actual fix (this row group limiting might be good in its own right, but we can discuss that separately).

mrendi29 · 2025-07-01T00:57:47Z

> this row group limiting might be good in its own right, but we can discuss that separately

I definitely think this should be merged as well so that we also have proper row groups.

Would you like me to open a separate issue to cover this?

alamb · 2025-07-01T14:16:14Z

> > this row group limiting might be good in its own right, but we can discuss that separately
>
> I definitely think this should be merged as well so that we also have proper row groups.
>
> Would you like me to open a separate issue to cover this?

Thank you -- that would be amazing 🙏

mrendi29 · 2025-07-01T14:45:49Z

@alamb, I created #152 to track this issue.

mwc360 · 2025-07-08T14:31:24Z

My 2 cents: targeting a fixed number of rows per row group isn't ideal when generating benchmark data. Depending on the size of each row, row groups can end up with wildly different sizes in MB (e.g. lineitem will have almost 2x bigger RGs in MB than orders), and finding the right row count becomes a game of trial and error. Conversely, row groups that target a given size in MB offer more predictable chunks of data to parallelize, prune, etc.

In LakeBench the data generator is currently backed by DuckDB (the only reason I didn't use tpchgen-rs is that the RGs are not configurable). Since DuckDB only supports targeting RGs by row count, I have to write out a sample file, calculate the size of each row, and then use that to automatically calculate the number of rows per RG that approximates my target size in MB.
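
The sample-file workaround described above boils down to one division. The sketch below is a minimal illustration, assuming a hypothetical helper `rows_for_target_size` and made-up sample sizes (32 and 16 bytes per row, chosen only to mirror the "almost 2x" difference mentioned); it is not LakeBench's or DuckDB's actual code.

```rust
/// Estimate how many rows fit in a row group of `target_bytes`, given a
/// sample file holding `sample_rows` rows in `sample_bytes` bytes.
fn rows_for_target_size(sample_bytes: u64, sample_rows: u64, target_bytes: u64) -> u64 {
    assert!(sample_bytes > 0 && sample_rows > 0, "sample must be non-empty");
    // rows = target_bytes / (sample_bytes / sample_rows), kept in integer
    // arithmetic; u128 avoids overflow in the intermediate product.
    let rows = (target_bytes as u128 * sample_rows as u128) / sample_bytes as u128;
    (rows as u64).max(1)
}

fn main() {
    let target: u64 = 128 * 1024 * 1024; // 128 MiB target per row group

    // Hypothetical samples: 1M rows each, one table twice as wide as the other.
    let wide = rows_for_target_size(32_000_000, 1_000_000, target);
    let narrow = rows_for_target_size(16_000_000, 1_000_000, target);

    println!("wide table:   {wide} rows per row group");
    println!("narrow table: {narrow} rows per row group");
}
```

The same byte target yields twice as many rows per group for the narrower table, which is why a single row-count setting produces uneven row group sizes across tables.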

alamb · 2025-07-08T14:45:58Z

I'll see what I can do to make the target size configurable

mrendi29 · 2025-07-13T16:26:03Z

I happened to meet @alamb in person last week and one of the things we discussed was this PR. I will open a separate PR to add some test coverage for row groups.

kevinjqliu · 2025-07-13T16:33:51Z

@mrendi29 would you like to collaborate? I have a branch with tests that I haven't pushed yet.

mrendi29 · 2025-07-13T17:06:24Z

> @mrendi29 would you like to collaborate? I have a branch with tests that I haven't pushed yet

Absolutely! Let me know whenever you push it and we can tackle it together

alamb · 2025-07-14T20:11:00Z

Yes that would be amazing -- if you could make a PR with some tests that showed a single large row group, I would be happy to help port this code over / update the tests

alamb · 2025-07-15T12:47:04Z

I am starting to hack out some tests with the help of Copilot.

alamb · 2025-07-15T13:01:52Z

Integration tests here:

feat: Add integration tests for tpchgen-cli #156

kevinjqliu · 2025-07-15T17:26:47Z

@mrendi29 I pushed the branch at #158

mrendi29 · 2025-07-15T17:31:36Z

Thanks @kevinjqliu, I will take a look at it in a couple of hours.

alamb · 2025-07-29T14:03:40Z

Let's go with an alternate approach instead:

fix: Create multiple row groups when writing single --parts #168

I think that is a lot simpler

Encode using smaller row groups

b2aaf73

alamb mentioned this pull request Jun 30, 2025

[BUG] Growing Memory when using partitioned writer #150

Closed

fix

114096f

mrendi29 mentioned this pull request Jul 1, 2025

[BUG] Partitioned writes produce parquet files with a single row group. #152

Closed

kevinjqliu mentioned this pull request Jul 7, 2025

[DISCUSS] v2.0.0 #154

Closed


alamb mentioned this pull request Jul 8, 2025

[FEATURE] add an option for row group size #146

Closed

alamb mentioned this pull request Jul 15, 2025

feat: Add integration tests for tpchgen-cli #156

Merged

alamb mentioned this pull request Jul 15, 2025

feat: Chunkify single parts to generate them in parallel #155

Merged

alamb closed this Jul 29, 2025
