Add test to verify integrity of loaded datasets, Avoid copying=True for LoadHF and two more loaders. Remove loader cache (python knows how to manage main memory). #1624


Closed
wants to merge 1 commit into from

Conversation

@dafnapension (Collaborator) commented Feb 22, 2025

Avoid copying=True (upon iterating over the loaded dataset) for the streams generated by loaders that watch over the datasets they load, letting no one modify them. This is the case for LoadHF, LoadCSV, and LoadSKLearn. For the rest, copying=True remains. Also, remove the loader cache: we are not sure what goes there (a whole listed dataset? just its generators?), and Python manages its main memory better knowing the whole picture (swapping whole pages out to the file system as it sees fit).

An example demonstrating that the HF dataset returned by LoadHF is not modifiable:

    from datasets import load_dataset as hf_load_dataset

    ds = hf_load_dataset("PrimeQA/clapnq_passages")
    print(ds)

    # Mutating an instance yielded by one iterator does not affect
    # what a second iterator over the same split sees:
    iter_train = iter(ds["train"])
    iter_train2 = iter(ds["train"])
    for _ in range(5):
        instance = next(iter_train)
        print(instance["id"])
        instance["id"] = 5  # attempt to overwrite a field

    for _ in range(5):
        instance = next(iter_train2)
        print(instance["id"])  # original values, unchanged

    # Popping a field and adding a new one is equally invisible
    # to a fresh iterator:
    iter_train = iter(ds["train"])
    iter_train2 = iter(ds["train"])
    for _ in range(5):
        instance = next(iter_train)
        old_value = instance.pop("id")
        instance["id_new"] = old_value

    for _ in range(5):
        instance = next(iter_train2)
        print(instance["id"])        # still present
        print("id_new" in instance)  # False
Output:

    DatasetDict({
        train: Dataset({
            features: ['id', 'text', 'title'],
            num_rows: 178890
        })
    })
    827849752_115-357
    827849752_358-743
    827849752_744-1186
    827849752_1187-1426
    827849752_1604-3189
    827849752_115-357
    827849752_358-743
    827849752_744-1186
    827849752_1187-1426
    827849752_1604-3189
    827849752_115-357
    False
    827849752_358-743
    False
    827849752_744-1186
    False
    827849752_1187-1426
    False
    827849752_1604-3189
    False
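For contrast with the guarded HF dataset above, a stream backed by a plain Python list does let consumer mutations leak back into the shared source, which is why copying=True must remain for loaders that hand out shared mutable instances. A minimal sketch in plain Python (illustrative only, not unitxt's actual MultiStream):

```python
from copy import deepcopy

# A generator over a shared list of dicts: without copying, a
# consumer's mutation is visible in the source and to later consumers.
data = [{"id": i} for i in range(3)]

def stream(copying=False):
    for instance in data:
        yield deepcopy(instance) if copying else instance

# Without copying: mutation leaks into the shared source.
for instance in stream(copying=False):
    instance["id"] = -1
assert data[0]["id"] == -1

# Reset, then iterate with copying=True: the source stays intact.
data = [{"id": i} for i in range(3)]
for instance in stream(copying=True):
    instance["id"] = -1
assert data[0]["id"] == 0
```

The deepcopy per instance is exactly the cost the PR avoids for loaders whose backing store is already immutable.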

@dafnapension dafnapension changed the title no cache to loadHF no cache to loaders Feb 22, 2025
@dafnapension dafnapension changed the title no cache to loaders no main-memory cache to loaders Feb 22, 2025
Comment on lines +199 to +242
# log only once, here:
# log once for all splits, as they are limited the same
if self.get_limit() is not None:
self.log_limited_loading()
Member
This is unnecessary, as inside log_limited_loading there is a mechanism to log only once, which means it logs only once, and only when the data is actually loaded.

Collaborator Author

I think that mechanism hides problems, like it hid from you the loop in LoadCSV. This logging is cleaner, happens just once, and does not introduce new variables.
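The "log only once" guard being discussed can be sketched as a simple instance flag (hypothetical names; the actual unitxt mechanism may differ):

```python
import logging

logger = logging.getLogger("limited_loader_sketch")

class LimitedLoader:
    """Sketch of a 'log only once' guard for limited loading."""

    def __init__(self, limit=None):
        self.limit = limit
        self._limit_logged = False  # flips to True after the first log
        self.log_calls = 0          # for illustration: actual log count

    def log_limited_loading(self):
        if self._limit_logged:
            return  # already logged once; stay silent for later splits
        self._limit_logged = True
        self.log_calls += 1
        logger.info("Loading limited to %s instances per split.", self.limit)

loader = LimitedLoader(limit=100)
for _split in ["train", "validation", "test"]:
    loader.log_limited_loading()

# Despite three calls (one per split), the message was logged once.
assert loader.log_calls == 1
```

The trade-off raised above is that such a flag silently swallows repeated calls, which can mask an unintended loop at the call site, whereas logging once at a single, explicit point makes a stray repetition visible.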

Comment on lines 173 to 188
    if isoftype(iterables, MultiStream):
        return iterables
    - return MultiStream.from_iterables(iterables, copying=True)
    + return MultiStream.from_iterables(iterables)
Member
In the case of a lazy loader we never reach that point, so for most loaders the copying=True does not affect much, no?

Collaborator Author

@dafnapension dafnapension Feb 23, 2025

Currently, we only reach this line. No Loader returns a MultiStream from its load_iterables.

@dafnapension dafnapension force-pushed the no_loader_cache branch 5 times, most recently from f8775b9 to 6b29d00 Compare February 24, 2025 08:16
@dafnapension dafnapension changed the title no main-memory cache to loaders Fixed redundant, time-consuming loop at LoadCSV, and avoid copying=True for LoadHF and two more loaders Feb 25, 2025
@dafnapension dafnapension force-pushed the no_loader_cache branch 5 times, most recently from e4ac326 to 72c906a Compare February 25, 2025 20:07
@dafnapension dafnapension changed the title Fixed redundant, time-consuming loop at LoadCSV, and avoid copying=True for LoadHF and two more loaders avoid copying=True for LoadHF and two more loaders. remove loader cache (python knows how to manage main memory). Add test to verify integrity of loaded datasets Feb 25, 2025
@dafnapension dafnapension changed the title avoid copying=True for LoadHF and two more loaders. remove loader cache (python knows how to manage main memory). Add test to verify integrity of loaded datasets Add test to verify integrity of loaded datasets, Avoid copying=True for LoadHF and two more loaders. Remove loader cache (python knows how to manage main memory). Feb 25, 2025
…rue for LoadHF, LoadCSV, and LoadSKLearn, which watch out and do not let anyone modify their loaded dataset.

Signed-off-by: dafnapension <[email protected]>
@dafnapension (Collaborator Author)

closed for no public interest

2 participants