Add test to verify integrity of loaded datasets, Avoid copying=True for LoadHF and two more loaders. Remove loader cache (python knows how to manage main memory). #1624


Closed
wants to merge 1 commit into from

Conversation

@dafnapension (Collaborator) commented Feb 22, 2025

Avoid copying=True (upon iterating over the loaded dataset) for the streams generated by loaders that watch over the datasets they load, letting no one modify them. This is the case for LoadHF, LoadCSV, and LoadSKLearn. For the rest, copying=True remains. Also, remove the loader cache: we are not sure what goes there (a whole listed dataset? just its generators?), and Python manages its main memory better knowing the whole picture (swapping whole pages out to the file system as it sees fit).

An example demonstrating that the HF dataset returned by LoadHF is not modifiable:

    from datasets import load_dataset as hf_load_dataset

    ds = hf_load_dataset("PrimeQA/clapnq_passages")
    print(ds)

    # Mutating an instance yielded by one iterator does not affect
    # what a second iterator over the same split sees:
    iter_train = iter(ds["train"])
    iter_train2 = iter(ds["train"])
    for _ in range(5):
        instance = next(iter_train)
        print(instance["id"])
        instance["id"] = 5  # attempt to overwrite a field

    for _ in range(5):
        instance = next(iter_train2)
        print(instance["id"])  # original values, unchanged

    # Popping a field and adding a new one is equally invisible
    # to a fresh iterator:
    iter_train = iter(ds["train"])
    iter_train2 = iter(ds["train"])
    for _ in range(5):
        instance = next(iter_train)
        old_value = instance.pop("id")
        instance["id_new"] = old_value

    for _ in range(5):
        instance = next(iter_train2)
        print(instance["id"])        # still present
        print("id_new" in instance)  # False
Output:

    DatasetDict({
        train: Dataset({
            features: ['id', 'text', 'title'],
            num_rows: 178890
        })
    })
    827849752_115-357
    827849752_358-743
    827849752_744-1186
    827849752_1187-1426
    827849752_1604-3189
    827849752_115-357
    827849752_358-743
    827849752_744-1186
    827849752_1187-1426
    827849752_1604-3189
    827849752_115-357
    False
    827849752_358-743
    False
    827849752_744-1186
    False
    827849752_1187-1426
    False
    827849752_1604-3189
    False
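For contrast with the guarded HF dataset above, a stream backed by a plain Python list does let consumer mutations leak back into the shared source, which is why copying=True must remain for loaders that hand out shared mutable instances. A minimal sketch in plain Python (illustrative only, not unitxt's actual MultiStream):

```python
from copy import deepcopy

# A generator over a shared list of dicts: without copying, a
# consumer's mutation is visible in the source and to later consumers.
data = [{"id": i} for i in range(3)]

def stream(copying=False):
    for instance in data:
        yield deepcopy(instance) if copying else instance

# Without copying: mutation leaks into the shared source.
for instance in stream(copying=False):
    instance["id"] = -1
assert data[0]["id"] == -1

# Reset, then iterate with copying=True: the source stays intact.
data = [{"id": i} for i in range(3)]
for instance in stream(copying=True):
    instance["id"] = -1
assert data[0]["id"] == 0
```

The deepcopy per instance is exactly the cost the PR avoids for loaders whose backing store is already immutable.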

@dafnapension dafnapension changed the title no cache to loadHF no cache to loaders Feb 22, 2025
@dafnapension dafnapension changed the title no cache to loaders no main-memory cache to loaders Feb 22, 2025
Comment on lines +199 to +242
# log only once, here:
# log once for all splits, as they are limited the same
if self.get_limit() is not None:
self.log_limited_loading()
Member
This is unnecessary, as inside log_limited_loading there is a mechanism to log only once, which means it logs only once, and only when the data is actually loaded.

Collaborator Author

I think that mechanism hides problems, like it hid from you the loop in LoadCSV. This logging is cleaner, happens just once, and does not introduce new variables.
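The "log only once" guard being discussed can be sketched as a simple instance flag (hypothetical names; the actual unitxt mechanism may differ):

```python
import logging

logger = logging.getLogger("limited_loader_sketch")

class LimitedLoader:
    """Sketch of a 'log only once' guard for limited loading."""

    def __init__(self, limit=None):
        self.limit = limit
        self._limit_logged = False  # flips to True after the first log
        self.log_calls = 0          # for illustration: actual log count

    def log_limited_loading(self):
        if self._limit_logged:
            return  # already logged once; stay silent for later splits
        self._limit_logged = True
        self.log_calls += 1
        logger.info("Loading limited to %s instances per split.", self.limit)

loader = LimitedLoader(limit=100)
for _split in ["train", "validation", "test"]:
    loader.log_limited_loading()

# Despite three calls (one per split), the message was logged once.
assert loader.log_calls == 1
```

The trade-off raised above is that such a flag silently swallows repeated calls, which can mask an unintended loop at the call site, whereas logging once at a single, explicit point makes a stray repetition visible.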

Comment on lines 173 to 188
    if isoftype(iterables, MultiStream):
        return iterables
    - return MultiStream.from_iterables(iterables, copying=True)
    + return MultiStream.from_iterables(iterables)
Member
In the case of a lazy loader we never reach that point, so for most loaders the copying=True does not affect much, no?

Collaborator Author

@dafnapension dafnapension Feb 23, 2025

Currently, we only reach this line. No Loader returns a MultiStream from its load_iterables.

@dafnapension dafnapension force-pushed the no_loader_cache branch 5 times, most recently from f8775b9 to 6b29d00 Compare February 24, 2025 08:16
@dafnapension dafnapension changed the title no main-memory cache to loaders Fixed redundant, time-consuming loop at LoadCSV, and avoid copying=True for LoadHF and two more loaders Feb 25, 2025
@dafnapension dafnapension force-pushed the no_loader_cache branch 5 times, most recently from e4ac326 to 72c906a Compare February 25, 2025 20:07
@dafnapension dafnapension changed the title Fixed redundant, time-consuming loop at LoadCSV, and avoid copying=True for LoadHF and two more loaders avoid copying=True for LoadHF and two more loaders. remove loader cache (python knows how to manage main memory). Add test to verify integrity of loaded datasets Feb 25, 2025
@dafnapension dafnapension changed the title avoid copying=True for LoadHF and two more loaders. remove loader cache (python knows how to manage main memory). Add test to verify integrity of loaded datasets Add test to verify integrity of loaded datasets, Avoid copying=True for LoadHF and two more loaders. Remove loader cache (python knows how to manage main memory). Feb 25, 2025
…rue for LoadHF, LoadCSV, and LoadSKLearn, which watch out and do not let anyone modify their loaded dataset.

Signed-off-by: dafnapension <[email protected]>
@dafnapension (Collaborator Author)

closed for no public interest

2 participants