Avoid downloading the whole dataset when only README.md has been touched on hub. #6929

Open
zinc75 opened this issue May 29, 2024 · 2 comments
Labels
enhancement New feature or request

Comments

zinc75 commented May 29, 2024

Feature request

datasets.load_dataset() triggers a new download of the whole dataset when the README.md file has been touched on the Hugging Face Hub, even if the data files / Parquet files are exactly the same.

I think load_dataset currently re-downloads the dataset whenever the hash of the latest commit on the Hugging Face Hub changes. Is there a clever way to download the dataset again if and only if the data itself has been modified?
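To make the idea concrete, here is a minimal sketch of a cache key built from the data files only. It is not how `datasets` works today: `data_fingerprint`, `needs_download`, and `NON_DATA_FILES` are hypothetical names, and the input is assumed to be a mapping from each file path in the repo to some content hash (for example a blob or LFS hash obtained from the Hub).

```python
import hashlib

# Hypothetical list of files that do not affect the data itself.
NON_DATA_FILES = {"README.md", ".gitattributes"}


def data_fingerprint(file_hashes: dict[str, str]) -> str:
    """Combine the content hashes of data files only, ignoring README.md etc.

    `file_hashes` maps a repo-relative path to a content hash (assumed input).
    """
    digest = hashlib.sha256()
    for path in sorted(file_hashes):  # sorted so the result is order-independent
        if path in NON_DATA_FILES:
            continue
        digest.update(path.encode("utf-8"))
        digest.update(b"\0")
        digest.update(file_hashes[path].encode("utf-8"))
        digest.update(b"\0")
    return digest.hexdigest()


def needs_download(cached: dict[str, str], remote: dict[str, str]) -> bool:
    """Return True only if a data file was added, removed, or changed."""
    return data_fingerprint(cached) != data_fingerprint(remote)


cached = {"README.md": "aaa", "data/train.parquet": "111"}
readme_only = {"README.md": "bbb", "data/train.parquet": "111"}
data_changed = {"README.md": "aaa", "data/train.parquet": "222"}

print(needs_download(cached, readme_only))   # README edit alone
print(needs_download(cached, data_changed))  # parquet file changed
```

With a key like this, a README-only commit would map to the same cache entry, while any change to a data file would still trigger a fresh download.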

Motivation

The current behaviour is a waste of network bandwidth / disk space / research time.

Your contribution

I don't have time to submit a PR, but I hope a simple solution will emerge from this issue!

@zinc75 zinc75 added the enhancement New feature or request label May 29, 2024
@zinc75 zinc75 changed the title Avoid downloading the whole dataset when only README.me has be touched on hub. Avoid downloading the whole dataset when only README.me has been touched on hub. May 29, 2024
Collaborator

severo commented May 29, 2024

You're right, we're tackling this here: huggingface/dataset-viewer#2757

Author

zinc75 commented May 29, 2024

@severo: great!


2 participants