
Revisit huggingface cache policy - BREAKING CHANGE #1564

Merged · 35 commits into main from hf-cache · Feb 2, 2025

Conversation

@elronbandel (Member) commented Jan 29, 2025

Changes:

  1. Make loading of external HuggingFace datasets cached by default (via the HuggingFace cache mechanism)
  2. Make re-loading of unitxt datasets not cached by default (unless the user asks for it explicitly)
  3. Make processing of external HuggingFace datasets run without streaming unless streaming is specified explicitly

The disable_cache param of load_dataset is now called use_cache.
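A short sketch of what these defaults mean from user code. The settings flag is the one exercised in the benchmark script later in this thread; the use_cache parameter name follows the note above, so treat the exact load_dataset signature as an assumption:

import unitxt
from unitxt.api import load_dataset

# Change 1: external HF datasets are now cached by default; this settings
# flag opts out (the same flag is used in the benchmark script below).
unitxt.settings.disable_hf_datasets_cache = True

# Change 2: re-loading a unitxt dataset is not cached unless the user asks
# for it explicitly (parameter name per the note above; assumed signature).
ds = load_dataset("card=cards.rag.documents.clap_nq.en", use_cache=True)

# Change 3: external HF datasets are processed without streaming unless a
# loader opts in explicitly, as LoadFromHFSpace does in the diff below.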

@@ -823,6 +823,7 @@ class LoadFromHFSpace(LoadHF):
     use_token: Optional[bool] = None
     token_env: Optional[str] = None
     requirements_list: List[str] = ["huggingface_hub"]
+    streaming = True
Collaborator:
Why? Once the files are downloaded to the local file system, what benefit do you get here?

Member Author:

You don't, it just doesn't work otherwise. We should base this loader on the dictionary loader.
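For context on the line under discussion, a sketch of what the class-level default means for a caller. The space_name and data_files values are hypothetical, and the constructor signature is an assumption based on the fields visible in the diff:

from unitxt.loaders import LoadFromHFSpace

# With streaming = True as a class attribute, every instance of this loader
# takes the streaming path without each caller having to opt in:
loader = LoadFromHFSpace(
    space_name="some-org/some-space",  # hypothetical Space
    data_files={"train": "data/train.json"},  # hypothetical file layout
)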

@dafnapension (Collaborator) commented:

In:

dataset[split] = dataset[split].to_iterable_dataset()

once the dataset is loaded from HF (as a Dataset, not an IterableDataset), why change it to an IterableDataset? I got the impression that each move of each instance in an IterableDataset incurs significant overhead (probably to preserve a state or something). Does Unitxt need the dataset as an iterable?
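A minimal sketch of the conversion being questioned, timing a full pass over the map-style Dataset against the same data after to_iterable_dataset(). The dataset name is borrowed from the comparison script below; the per-instance overhead is the commenter's impression, which this measures rather than confirms:

from time import time

from datasets import load_dataset

ds = load_dataset("PrimeQA/clapnq_passages", split="train")

# Map-style Dataset: random-access rows backed by an Arrow table.
t0 = time()
n = sum(1 for _ in ds)
t1 = time()

# IterableDataset: examples are produced one at a time through the
# streaming pipeline.
it = ds.to_iterable_dataset()
t2 = time()
m = sum(1 for _ in it)
t3 = time()

print(f"Dataset iteration:         {t1 - t0:.2f}s over {n} instances")
print(f"IterableDataset iteration: {t3 - t2:.2f}s over {m} instances")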

@assaftibm (Member) commented Jan 29, 2025

You can use this code to compare the loading time of HF and Unitxt:

import os.path
import shutil
from time import time

from datasets import load_dataset as hf_load_dataset
from unitxt.api import load_dataset
import unitxt

path = "hf_cache"
# path = "/home/dafna/.cache/huggingface"

# Start from a cold cache.
if os.path.exists(path):
    shutil.rmtree(path)

# Time loading directly through HuggingFace datasets.
t0 = time()
ds = hf_load_dataset("PrimeQA/clapnq_passages", cache_dir=path)
print(f"hf ds: {ds}")
t1 = time()
instances = list(ds["train"])
t2 = time()
print(f"num of instances in ds = {len(instances)}")
print(f"from start to after load_dataset: {t1 - t0}")
print(f"from after load_dataset to after list of ds = {t2 - t1}")
print()

if os.path.exists(path):
    shutil.rmtree(path)

# Time loading through Unitxt, with the HF datasets cache disabled.
unitxt.settings.disable_hf_datasets_cache = True
t0 = time()
ds = load_dataset("card=cards.rag.documents.clap_nq.en")
t1 = time()
instances = list(ds["train"])
t2 = time()
print(f"num of instances in ds = {len(instances)}")
print(f"from start to after load_dataset: {t1 - t0}")
print(f"from after load_dataset to after list of ds = {t2 - t1}")
@elronbandel (Member Author) commented Jan 29, 2025

@assaftibm this PR is not yet the solution to the problem you pointed out. We broke it up into two PRs, as these are global changes which we want to introduce and test gradually.

Signed-off-by: elronbandel <[email protected]>
@elronbandel merged commit 7152be4 into main on Feb 2, 2025 (18 of 19 checks passed).
@elronbandel deleted the hf-cache branch on February 2, 2025 at 19:30.
@yoavkatz changed the title from "Revisit huggingface cache policy" to "Revisit huggingface cache policy - BREAKING CHANGE" on Feb 3, 2025.