Add the option of saving in parquet instead of arrow #6903

Open
arita37 opened this issue May 16, 2024 · 8 comments
Labels
enhancement New feature or request

Comments

arita37 commented May 16, 2024

Feature request

Add an option to dataset.save_to_disk('/path/to/save/dataset') to save in Parquet format, for example:

dataset.save_to_disk('/path/to/save/dataset', format="parquet")

Motivation

Arrow is not used for production big data, only Parquet is.

Your contribution

I can do the testing!

arita37 added the enhancement (New feature or request) label May 16, 2024
Contributor

Dref360 commented May 16, 2024

I think Dataset.to_parquet is what you're looking for.

Let me know if I'm wrong.

Author

arita37 commented May 17, 2024 via email

Member

lhoestq commented Jun 13, 2024

You can use to_parquet to save the data and ds.info.write_to_directory() to save the dataset info.

Author

arita37 commented Jun 13, 2024 via email

Member

lhoestq commented Jun 14, 2024

Yes, and there is DatasetInfo.from_directory() to reload the info.

Author

arita37 commented Jun 14, 2024 via email

Member

lhoestq commented Jun 14, 2024

load_dataset doesn't load the dataset into memory: it progressively writes the data to disk in Arrow format and then memory-maps the Arrow files. This lets you load datasets bigger than memory without filling your RAM.

Author

arita37 commented Jun 14, 2024 via email


3 participants