GH-37630: [C++][Python][Dataset] Allow disabling fragment metadata caching #45330
Conversation
Force-pushed: 561a9f2 → 6058a91
Force-pushed: 6058a91 → 01bb19e
In #45287 (comment) it was mentioned that clearing …
I wonder if we could have a mode that releases each fragment once its data has been read, for a "scan once" usage pattern. But I also don't know how hard it would be to change that. Per …
I posted some thoughts about clearing …
Force-pushed: 01bb19e → 09755d6
I added a change that clears the cached physical schema, but keeps the original schema when it was passed via the constructor.
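The comment above describes a keep-if-explicit rule: a lazily inferred physical schema can be dropped to reclaim memory, while a schema supplied by the caller is part of the fragment's definition and must survive. A minimal sketch of that logic (the class and method names here are hypothetical, not the actual Arrow implementation) might look like:

```python
# Hypothetical sketch of the "clear cached schema, keep explicit schema"
# behavior described in the comment above. Names are illustrative only.
class Fragment:
    def __init__(self, path, physical_schema=None):
        self.path = path
        # Remember whether the schema was supplied by the caller.
        self._schema_was_given = physical_schema is not None
        self._physical_schema = physical_schema

    def physical_schema(self):
        # Lazily infer the schema on first access.
        if self._physical_schema is None:
            # Stand-in for inspecting the file footer to infer a schema.
            self._physical_schema = ["a", "b", "c"]
        return self._physical_schema

    def clear_cached_metadata(self):
        # Only drop schemas that were inferred lazily; a caller-provided
        # schema is kept since the fragment was constructed around it.
        if not self._schema_was_given:
            self._physical_schema = None
```

With this rule, calling `clear_cached_metadata()` frees an inferred schema but leaves a constructor-provided one untouched.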
@zanmato1984 Would you like to take a quick look at this PR? (Sorry if you received a ping on the issue; I initially chose the wrong tab.)
Hi @pitrou, I'm on vacation right now and will be able to review it next week. Will that be too late for you? Thanks.
@zanmato1984 No, that's fine of course! Thank you.
A quick question: by "very common", do you mean scanning once or repeatedly?
I mean scanning once.
Force-pushed: 09755d6 → 918c50e
@github-actions crossbow submit -g cpp

Revision: 918c50e. Submitted crossbow builds: ursacomputing/crossbow @ actions-36b0a9da17
@github-actions crossbow submit -g cpp

Revision: 47e56fa. Submitted crossbow builds: ursacomputing/crossbow @ actions-a9b6c9cf52
I plan to merge this soon if there are no further comments.
Thanks for the reviews!
After merging your PR, Conbench analyzed the 4 benchmark runs performed so far on merge-commit f8a0902. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 2 possible false positives from unstable benchmarks that are known to sometimes produce them.
Rationale for this change
Parquet file fragments currently cache their Parquet metadata so that it can be reused after scanning has finished.
This can produce surprisingly high memory consumption, for example when scanning wide datasets made up of many fragments, where the retained metadata is large but never needed again.
What changes are included in this PR?
Add an option to disable metadata caching on Parquet file fragments.
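To illustrate the caching behavior being made optional, here is a minimal, self-contained sketch of the pattern (this is not the actual Arrow C++ implementation, and the `cache_metadata` flag name is illustrative): the fragment lazily reads its file metadata, and the new option controls whether the parsed metadata is retained after the read.

```python
# Hypothetical sketch of a file fragment with optional metadata caching.
# Names (FileFragment, cache_metadata) are illustrative, not Arrow's API.
class FileFragment:
    def __init__(self, path, cache_metadata=True):
        self.path = path
        self.cache_metadata = cache_metadata
        self._metadata = None  # cached metadata, only if caching is on

    def _read_metadata(self):
        # Stand-in for parsing the Parquet footer; returns a dict here.
        return {"path": self.path, "num_columns": 10_000}

    def metadata(self):
        if self._metadata is not None:
            return self._metadata
        meta = self._read_metadata()
        if self.cache_metadata:
            # Default behavior: keep the parsed metadata alive for later
            # accesses, which can be costly for wide schemas spread over
            # many fragments.
            self._metadata = meta
        return meta
```

With `cache_metadata=False`, each call re-reads the footer instead of retaining it, trading repeated parsing cost for a much smaller steady-state memory footprint in "scan once" workloads.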
Are these changes tested?
Yes, by new unit tests. Also, reading a wide dataset locally was confirmed to consume much less memory with the new option enabled.
Are there any user-facing changes?
No.