Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-37630: [C++][Python][Dataset] Allow disabling fragment metadata caching #45330

Merged
merged 3 commits into from
Feb 11, 2025

Conversation

pitrou
Copy link
Member

@pitrou pitrou commented Jan 22, 2025

Rationale for this change

Parquet file fragments currently cache their (Parquet) metadata for later accesses when scanning has finished.
This can produce surprisingly high memory consumption in cases where:

  1. the dataset is only scanned once, rather than repeatedly (this is very common)
  2. there is a high metadata-to-data ratio; this can happen when the schemas on disk are very wide, with few rows per file and/or a low number of columns selected for reading

What changes are included in this PR?

Add an option to disable metadata caching on Parquet file fragments.

Are these changes tested?

Yes, by new unit tests. Also, reading a wide dataset locally has been confirmed to consume much less memory when the new option is toggled.

Are there any user-facing changes?

No.

Copy link

⚠️ GitHub issue #37630 has been automatically assigned in GitHub to PR creator.

@pitrou pitrou force-pushed the gh37630-dataset-cached-metadata branch 4 times, most recently from 561a9f2 to 6058a91 Compare January 28, 2025 17:43
@pitrou pitrou force-pushed the gh37630-dataset-cached-metadata branch from 6058a91 to 01bb19e Compare January 29, 2025 16:56
@pitrou pitrou marked this pull request as ready for review January 29, 2025 16:56
@pitrou pitrou requested a review from westonpace as a code owner January 29, 2025 16:56
@pitrou pitrou requested a review from zanmato1984 January 29, 2025 17:00
@pitrou
Copy link
Member Author

pitrou commented Jan 29, 2025

In #45287 (comment) it was mentioned that clearing physical_schema_ further improved the memory profile, but it's not obvious that it would be correct.

@icexelloss
Copy link
Contributor

I wonder if we can have a mode to release the fragment once the data for that fragment has been read in a "scan once" usage pattern. But also I don't know how hard it is to change that.

Per physical_schema_ - I am not familiar with the details but it seems fine to me if this is "scan once" and the fragment is not used again?

@timothydijamco
Copy link

I posted some thoughts about clearing physical_schema_ in #45287 (comment)

@pitrou pitrou force-pushed the gh37630-dataset-cached-metadata branch from 01bb19e to 09755d6 Compare February 4, 2025 09:59
@pitrou
Copy link
Member Author

pitrou commented Feb 4, 2025

I added a change that clears the cached physical schema, but keeps the original schema when it was passed via the constructor.

@pitrou
Copy link
Member Author

pitrou commented Feb 6, 2025

@zanmato1984 I don't know if you would like to take a quick look at this PR.

(sorry if you received a ping on the issue, I initially chose the wrong tab :))

@zanmato1984
Copy link
Contributor

Hi @pitrou , I'm on vacation for now and I will be able to review it next week. Will that be too late for you? Thanks.

@pitrou
Copy link
Member Author

pitrou commented Feb 6, 2025

@zanmato1984 No, it's ok of course! Thank you.

@zanmato1984
Copy link
Contributor

the dataset is only scanned once, rather than repeatedly (this is very common)

A quick question: by "very common" you mean scan once or repeatedly?

@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Feb 11, 2025
@pitrou
Copy link
Member Author

pitrou commented Feb 11, 2025

A quick question: by "very common" you mean scan once or repeatedly?

I mean scan once.

@pitrou pitrou force-pushed the gh37630-dataset-cached-metadata branch from 09755d6 to 918c50e Compare February 11, 2025 09:37
@pitrou
Copy link
Member Author

pitrou commented Feb 11, 2025

@github-actions crossbow submit -g cpp

Copy link

Revision: 918c50e

Submitted crossbow builds: ursacomputing/crossbow @ actions-36b0a9da17

Task Status
example-cpp-minimal-build-static GitHub Actions
example-cpp-minimal-build-static-system-dependency GitHub Actions
example-cpp-tutorial GitHub Actions
test-alpine-linux-cpp GitHub Actions
test-build-cpp-fuzz GitHub Actions
test-conda-cpp GitHub Actions
test-conda-cpp-valgrind GitHub Actions
test-cuda-cpp-ubuntu-20.04-cuda-11.2.2 GitHub Actions
test-cuda-cpp-ubuntu-22.04-cuda-11.7.1 GitHub Actions
test-debian-12-cpp-amd64 GitHub Actions
test-debian-12-cpp-i386 GitHub Actions
test-fedora-39-cpp GitHub Actions
test-ubuntu-20.04-cpp GitHub Actions
test-ubuntu-20.04-cpp-bundled GitHub Actions
test-ubuntu-22.04-cpp GitHub Actions
test-ubuntu-22.04-cpp-20 GitHub Actions
test-ubuntu-22.04-cpp-emscripten GitHub Actions
test-ubuntu-22.04-cpp-no-threading GitHub Actions
test-ubuntu-24.04-cpp GitHub Actions
test-ubuntu-24.04-cpp-bundled-offline GitHub Actions
test-ubuntu-24.04-cpp-gcc-13-bundled GitHub Actions
test-ubuntu-24.04-cpp-gcc-14 GitHub Actions
test-ubuntu-24.04-cpp-minimal-with-formats GitHub Actions
test-ubuntu-24.04-cpp-thread-sanitizer GitHub Actions

@pitrou
Copy link
Member Author

pitrou commented Feb 11, 2025

@github-actions crossbow submit -g cpp

Copy link

Revision: 47e56fa

Submitted crossbow builds: ursacomputing/crossbow @ actions-a9b6c9cf52

Task Status
example-cpp-minimal-build-static GitHub Actions
example-cpp-minimal-build-static-system-dependency GitHub Actions
example-cpp-tutorial GitHub Actions
test-alpine-linux-cpp GitHub Actions
test-build-cpp-fuzz GitHub Actions
test-conda-cpp GitHub Actions
test-conda-cpp-valgrind GitHub Actions
test-cuda-cpp-ubuntu-20.04-cuda-11.2.2 GitHub Actions
test-cuda-cpp-ubuntu-22.04-cuda-11.7.1 GitHub Actions
test-debian-12-cpp-amd64 GitHub Actions
test-debian-12-cpp-i386 GitHub Actions
test-fedora-39-cpp GitHub Actions
test-ubuntu-20.04-cpp GitHub Actions
test-ubuntu-20.04-cpp-bundled GitHub Actions
test-ubuntu-22.04-cpp GitHub Actions
test-ubuntu-22.04-cpp-20 GitHub Actions
test-ubuntu-22.04-cpp-emscripten GitHub Actions
test-ubuntu-22.04-cpp-no-threading GitHub Actions
test-ubuntu-24.04-cpp GitHub Actions
test-ubuntu-24.04-cpp-bundled-offline GitHub Actions
test-ubuntu-24.04-cpp-gcc-13-bundled GitHub Actions
test-ubuntu-24.04-cpp-gcc-14 GitHub Actions
test-ubuntu-24.04-cpp-minimal-with-formats GitHub Actions
test-ubuntu-24.04-cpp-thread-sanitizer GitHub Actions

@pitrou
Copy link
Member Author

pitrou commented Feb 11, 2025

I plan to merge this soon if there are no further comments.

@pitrou pitrou merged commit f8a0902 into apache:main Feb 11, 2025
42 checks passed
@pitrou pitrou removed the awaiting committer review Awaiting committer review label Feb 11, 2025
@pitrou pitrou deleted the gh37630-dataset-cached-metadata branch February 11, 2025 13:46
@pitrou
Copy link
Member Author

pitrou commented Feb 11, 2025

Thanks for the reviews!

Copy link

After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit f8a0902.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 2 possible false positives for unstable benchmarks that are known to sometimes produce them.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants