no main-memory cache to loaders #1624

Open
wants to merge 1 commit into
base: main

Collaborator

@dafnapension dafnapension commented Feb 22, 2025

No main-memory cache for loaders, and no copying=True for the streams generated by loaders that guard the datasets they load and let no one modify them. This is the case for LoadHF, LoadCSV, and LoadSKLearn. For the rest, copying=True remains.

For example, the following demonstrates that the HF dataset returned by LoadHF is not modifiable:

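    from datasets import load_dataset as hf_load_dataset

    ds = hf_load_dataset("PrimeQA/clapnq_passages")
    print(ds)

    # Experiment 1: overwrite a field through one iterator,
    # then read the same rows through a second iterator.
    iter_train = iter(ds["train"])
    iter_train2 = iter(ds["train"])
    for _ in range(5):
        instance = next(iter_train)
        print(instance["id"])
        instance["id"] = 5

    # The original ids are printed again: the overwrite did not reach the dataset.
    for _ in range(5):
        instance = next(iter_train2)
        print(instance["id"])

    # Experiment 2: rename a field through one iterator,
    # then check the same rows through a second iterator.
    iter_train = iter(ds["train"])
    iter_train2 = iter(ds["train"])
    for _ in range(5):
        instance = next(iter_train)
        old_value = instance.pop("id")
        instance["id_new"] = old_value

    # "id" is still present and "id_new" is absent: the rename did not reach the dataset.
    for _ in range(5):
        instance = next(iter_train2)
        print(instance["id"])
        print("id_new" in instance)

Output:

    DatasetDict({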
        train: Dataset({
            features: ['id', 'text', 'title'],
            num_rows: 178890
        })
    })
    827849752_115-357
    827849752_358-743
    827849752_744-1186
    827849752_1187-1426
    827849752_1604-3189
    827849752_115-357
    827849752_358-743
    827849752_744-1186
    827849752_1187-1426
    827849752_1604-3189
    827849752_115-357
    False
    827849752_358-743
    False
    827849752_744-1186
    False
    827849752_1187-1426
    False
    827849752_1604-3189
    False
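
To make the role of copying concrete, here is a minimal pure-Python sketch; it is not unitxt code. `iterate_stream`, its `copying` flag, `fresh_rows`, and `mutate_then_reread` are hypothetical names introduced only for illustration. It assumes that a source which builds a new dict on every iteration (as the HF dataset in the run above appears to do) needs no extra copy, whereas a plain in-memory list of dicts does.

```python
import copy


def iterate_stream(source, copying=False):
    """Hypothetical stand-in for a stream: yield instances, optionally deep-copied."""
    for instance in source:
        yield copy.deepcopy(instance) if copying else instance


def fresh_rows():
    """A source that builds a new dict on every iteration, so callers cannot alter it."""
    for i in range(3):
        yield {"id": f"row-{i}"}


def mutate_then_reread(make_stream):
    """Overwrite 'id' during one pass, then return the ids seen on a second pass."""
    for instance in make_stream():
        instance["id"] = "changed"
    return [instance["id"] for instance in make_stream()]


# Unprotected in-memory source: the mutation leaks into the second pass.
in_memory = [{"id": f"row-{i}"} for i in range(3)]
leaky = mutate_then_reread(lambda: iterate_stream(in_memory))

# Same kind of source with copying enabled: the second pass sees the original ids.
in_memory_copy = [{"id": f"row-{i}"} for i in range(3)]
protected = mutate_then_reread(lambda: iterate_stream(in_memory_copy, copying=True))

# Self-protecting source: no copy is needed.
self_protecting = mutate_then_reread(lambda: iterate_stream(fresh_rows()))
```

Under these assumptions, copying only pays for itself when the underlying source hands out the same mutable objects on every pass.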

@dafnapension dafnapension changed the title no cache to loadHF no cache to loaders Feb 22, 2025
@dafnapension dafnapension changed the title no cache to loaders no main-memory cache to loaders Feb 22, 2025
Comment on lines +199 to +226
# log only once, here:
# log once for all splits, as they are limited the same
if self.get_limit() is not None:
    self.log_limited_loading()
Member

This is unnecessary, as log_limited_loading already has a mechanism to log only once. That means it logs only once, and only when the data is actually loaded.

Collaborator Author

I think that mechanism hides problems, like it hid the loop in LoadCSV from you.
This logging is cleaner, happens just once, and does not introduce new variables.
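
The two approaches under discussion can be sketched as follows. This is an illustration only: `GuardedLoader`, `ExplicitLoader`, their `load` methods, and the message text are hypothetical and do not reproduce the unitxt implementation. The first relies on a flag inside the logging method, the second logs a single time before iterating over the splits.

```python
class GuardedLoader:
    """Hypothetical: a flag inside the logging method ensures a single message."""

    def __init__(self, limit):
        self.limit = limit
        self.messages = []
        self._already_logged = False  # extra state required by the guard

    def log_limited_loading(self):
        if not self._already_logged:
            self._already_logged = True
            self.messages.append(f"loading limited to {self.limit} instances")

    def load(self, splits):
        for _split in splits:
            # Called once per split; the guard silently swallows the repeats.
            self.log_limited_loading()
        return self.messages


class ExplicitLoader:
    """Hypothetical: log once, up front, with no extra state."""

    def __init__(self, limit):
        self.limit = limit
        self.messages = []

    def log_limited_loading(self):
        self.messages.append(f"loading limited to {self.limit} instances")

    def load(self, splits):
        # Log once for all splits, since they are all limited the same way.
        if self.limit is not None:
            self.log_limited_loading()
        for _split in splits:
            pass  # per-split loading would go here
        return self.messages


guarded_messages = GuardedLoader(limit=100).load(["train", "validation", "test"])
explicit_messages = ExplicitLoader(limit=100).load(["train", "validation", "test"])
unlimited_messages = ExplicitLoader(limit=None).load(["train", "validation", "test"])
```

Both variants emit one message for three splits. The difference is where the once-only behaviour lives: in hidden state in the first, in the control flow in the second.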

Comment on lines 173 to 188
  if isoftype(iterables, MultiStream):
      return iterables
- return MultiStream.from_iterables(iterables, copying=True)
+ return MultiStream.from_iterables(iterables)
Member

In the case of a lazy loader we never reach that point, so for most loaders copying=True doesn't affect much, no?

Collaborator Author

@dafnapension dafnapension Feb 23, 2025

Currently, we only reach this line. No Loader returns a MultiStream from its load_iterables.

@dafnapension dafnapension force-pushed the no_loader_cache branch 4 times, most recently from ff9a435 to f8775b9 February 23, 2025 17:57
…rue for LoadH, LoadCSV, and LoadSKLearn, which watch out and do not let anyone modify their loaded dataset

Signed-off-by: dafnapension <[email protected]>