Alamb/smaller row groups#151
|
The Parquet viewer looks really neat! I can confirm that the PR fixes the issue of having a single row group. However, high memory usage still seems to be an issue: it looks like the file is still built entirely in memory and only then flushed to disk. I'm not 100% sure whether that is still expected now that there are separate row groups. |
Yeah, you are right. I think @clflushopt's solution in #150 (comment) might be the actual fix (this row group limiting might be good in its own right, but we can discuss that separately). |
I definitely think this should be merged as well so that we also have proper row groups. Would you like me to open a separate issue to cover this? |
Thank you -- that would be amazing 🙏 |
|
My 2 cents: targeting a fixed number of rows per row group isn't ideal for generating benchmark data, because row widths differ between tables and you can end up with wildly different row group sizes in MB (e.g. lineitem will have almost 2x bigger RGs in MB than orders). Picking the row count also becomes a game of trial and error. Conversely, row groups that target a given size in MB give more predictable chunks of data to parallelize, prune, etc. In LakeBench the data generator is currently backed by DuckDB (the only reason I didn't use tpchgen-rs is that the RGs aren't configurable). Since DuckDB only supports targeting RGs by row count, I have to write out a sample file, calculate the size of each row, and then use that to calculate the number of rows per RG that approximates my target size in MB. |
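For anyone following along, here is a rough sketch of the sample-file workaround described above. It is not LakeBench's actual code: the function name, parameters, and numbers are hypothetical and only illustrate the arithmetic (measure bytes per row from a sample file, then derive the row count that approximates a target row group size).

```python
MIB = 1024 * 1024


def rows_per_row_group(sample_file_bytes: int, sample_rows: int, target_mib: int) -> int:
    """Estimate the row count that yields row groups of roughly `target_mib` MiB.

    Hypothetical helper: `sample_file_bytes` is the on-disk size of a small
    sample file and `sample_rows` is the number of rows written into it.
    """
    if sample_file_bytes <= 0 or sample_rows <= 0 or target_mib <= 0:
        raise ValueError("all inputs must be positive")
    target_bytes = target_mib * MIB
    # rows = target_bytes / (sample_file_bytes / sample_rows), kept in integer
    # arithmetic so the result is not affected by floating-point rounding.
    return max(1, target_bytes * sample_rows // sample_file_bytes)


# Illustrative numbers only (not measurements): a 100,000-row sample that
# occupies 4 MiB, and the same row count occupying 8 MiB for a wider table.
narrow = rows_per_row_group(4 * MIB, 100_000, target_mib=128)
wide = rows_per_row_group(8 * MIB, 100_000, target_mib=128)
print(narrow, wide)
```

The wider table gets half as many rows per row group for the same target size, which is exactly why a single row-count setting produces uneven row groups across tables. The estimate is only approximate, since compression ratios can differ between the sample and the full data.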
|
I'll see what I can do to make the target size configurable |
|
I happened to meet @alamb in person last week and one of the things we discussed was this PR. I will open a separate PR to add some test coverage for row groups. |
|
@mrendi29 would you like to collaborate? I have a branch with tests that I haven't pushed yet.
Absolutely! Let me know whenever you push it and we can tackle it together |
|
Yes that would be amazing -- if you could make a PR with some tests that showed a single large row group, I would be happy to help port this code over / update the tests |
|
I am starting to hack out some tests with the help of Copilot.
|
Integration tests here: |
|
Thanks @kevinjqliu, I will take a look at it in a couple of hours.
|
Let's go with an alternate approach instead: I think that is a lot simpler |
This is a PR to test whether making smaller row groups solves the problem reported in #150.
I tested this PR using
Before this PR, the output Parquet file has a single row group.
After this PR, the output Parquet file has 6 row groups (which makes sense for a 6M-row file).
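As a quick sanity check on that count, here is a minimal sketch of the arithmetic. It assumes the writer caps each row group at roughly 1M rows; the exact limit used in this PR is not stated here, so both values below are assumptions, and the function name is hypothetical.

```python
def row_group_count(total_rows: int, max_rows_per_group: int) -> int:
    """Number of row groups needed when each one holds at most `max_rows_per_group` rows."""
    if total_rows < 0 or max_rows_per_group <= 0:
        raise ValueError("total_rows must be >= 0 and max_rows_per_group must be > 0")
    # Ceiling division without going through floating point.
    return (total_rows + max_rows_per_group - 1) // max_rows_per_group


# Assumed caps: an even 1,000,000 rows, and 1024 * 1024 rows.
print(row_group_count(6_000_000, 1_000_000))
print(row_group_count(6_000_000, 1024 * 1024))
```

Either assumed cap gives 6 row groups for a 6M-row file, which matches the output above.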
BTW you can look at the output parquet using the very cool Parquet Viewer from @XiangpengHao : https://parquet-viewer.xiangpeng.systems/