-
Notifications
You must be signed in to change notification settings - Fork 59
Add test to verify integrity of loaded datasets, Avoid copying=True for LoadHF and two more loaders. Remove loader cache (python knows how to manage main memory). #1624
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
# log only once, here: | ||
# log once for all splits, as they are limited the same | ||
if self.get_limit() is not None: | ||
self.log_limited_loading() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is unnecassry as inside the log limited loading there is a mechanism to log only once. Which means it is logging only once and only when the data is actually loaded.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that the mechanism hides problem. Like it hid from you the loop in LoadCSV.
This logging is cleaner, and happens just once, and does not introduce new variables.
src/unitxt/loaders.py
Outdated
if isoftype(iterables, MultiStream): | ||
return iterables | ||
return MultiStream.from_iterables(iterables, copying=True) | ||
return MultiStream.from_iterables(iterables) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the case of lazy loader we never reach that point, so for most of the loaders the copying=True is not affecting much, no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently, we only reach this line. No Loader returns a MultiStream from its load_iterables.
f8775b9
to
6b29d00
Compare
e4ac326
to
72c906a
Compare
72c906a
to
68d6b42
Compare
…rue for LoadH, LoadCSV, and LoadSKLearn, which watch out and do not let anyone modify their loaded dataset. Signed-off-by: dafnapension <[email protected]>
68d6b42
to
ea2f4e6
Compare
closed for no public interest |
avoid copying=true (upon iterating over the loaded dataset) for the streams generated by the loader for loaders which watch out after the datasets they load - letting no one modify them. This is the case for LoadHF, LoadCSV, and LoadSKLearn. For the rest -- copying=True remains. Also, remove the loader cache. We are not sure what goes there (a whole listed dataset? just its generators?) and python better manages its main memory knowing the whole picture (and remove to file system whole pages as it sees fit)
E.g that demonstrates how the HF dataset, returned by LoadHF, is not modifiable.