`datasets.load_dataset()` triggers a new download of the whole dataset when the README.md file has been touched on the Hugging Face Hub, even if the data files / parquet files are exactly the same.
I think `load_dataset` currently re-downloads whenever the hash of the latest commit on the Hugging Face Hub changes, but is there a clever way to download the dataset again if and only if the data itself has been modified?
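To make the request concrete, here is a minimal sketch of the check I have in mind: compare per-file blob hashes between the cached revision and the latest one, ignoring README.md. The helper names (`data_fingerprint`, `data_changed`, `list_blob_ids`) are hypothetical and not part of `datasets`. The `list_blob_ids` function assumes that `huggingface_hub.HfApi.dataset_info(..., files_metadata=True)` exposes `rfilename` and `blob_id` on each sibling; I have not checked how this would fit into the `datasets` cache logic.

```python
from typing import Dict, Iterable, Optional

# Files whose changes should NOT trigger a new download (assumption: metadata only).
IGNORED_FILES = frozenset({"README.md", ".gitattributes"})


def data_fingerprint(files: Dict[str, str],
                     ignored: Iterable[str] = IGNORED_FILES) -> Dict[str, str]:
    """Keep only the {path: blob hash} entries that are actual data files."""
    ignored = set(ignored)
    return {path: blob for path, blob in files.items() if path not in ignored}


def data_changed(old: Dict[str, str], new: Dict[str, str],
                 ignored: Iterable[str] = IGNORED_FILES) -> bool:
    """True if any non-ignored file was added, removed, or modified."""
    return data_fingerprint(old, ignored) != data_fingerprint(new, ignored)


def list_blob_ids(repo_id: str, revision: Optional[str] = None) -> Dict[str, str]:
    """Hypothetical helper: fetch {path: blob hash} for one dataset revision.

    Needs network access and the huggingface_hub package (imported lazily).
    """
    from huggingface_hub import HfApi

    info = HfApi().dataset_info(repo_id, revision=revision, files_metadata=True)
    return {s.rfilename: s.blob_id for s in info.siblings}


# Offline example: only README.md differs between the two revisions.
cached = {"README.md": "aaa", "data/train.parquet": "111"}
latest = {"README.md": "bbb", "data/train.parquet": "111"}
print(data_changed(cached, latest))
```

In the meantime, passing a fixed commit hash as `revision=` to `load_dataset` might serve as a workaround, since the resolved commit then no longer changes when the README is edited.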
Motivation
The current behaviour is a waste of network bandwidth / disk space / research time.
Your contribution
I don't have time to submit a PR, but I hope a simple solution will emerge from this issue!
zinc75 changed the title from "Avoid downloading the whole dataset when only README.me has be touched on hub." to "Avoid downloading the whole dataset when only README.me has been touched on hub." on May 29, 2024