Refactor Dataset.map to reuse cache files mapped with different num_proc #7434
base: main
Conversation
Fixes huggingface#7433

This refactor unifies the `num_proc is None or num_proc == 1` and `num_proc > 1` code paths. Instead of handling them completely separately (one using a list of kwargs and shards, the other a single set of kwargs and `self`), the `num_proc == 1` case is wrapped in a list, so the only difference is whether or not a pool is used. Either case can then load the other's cache files just by changing `num_shards`; `num_proc == 1` can sequentially load the shards of a dataset mapped with `num_shards > 1` and sequentially map any missing shards.

Other than the structural refactor, the main contribution of this PR is `get_existing_cache_file_map`, which uses a regex built from `cache_file_name` and `suffix_template` to find existing cache files, grouped by their `num_shards`. Using this data structure, we can reset `num_shards` to match an existing set of cache files and load them accordingly.
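A rough sketch of how such a lookup could work (the function name comes from the PR, but this signature and the shard-suffix format — datasets' default `_{rank:05d}_of_{num_proc:05d}` — are assumptions; the real implementation differs):

```python
import os
import re
from collections import defaultdict
from typing import Dict, List


def get_existing_cache_file_map(cache_file_name: str) -> Dict[int, List[str]]:
    """Group existing shard cache files by the shard count in their suffix.

    Assumes shards are named like ``train_00001_of_00010.map``, i.e. the
    default ``_{rank:05d}_of_{num_proc:05d}`` suffix template (an assumption
    for this sketch).
    """
    base, ext = os.path.splitext(cache_file_name)
    # Build a regex from the cache file name and the suffix template.
    pattern = re.compile(
        re.escape(os.path.basename(base))
        + r"_(\d{5})_of_(\d{5})"
        + re.escape(ext)
        + r"$"
    )
    shards: Dict[int, List[str]] = defaultdict(list)
    cache_dir = os.path.dirname(cache_file_name) or "."
    if os.path.isdir(cache_dir):
        for name in sorted(os.listdir(cache_dir)):
            m = pattern.match(name)
            if m:
                # Key by the total shard count encoded in the suffix.
                shards[int(m.group(2))].append(os.path.join(cache_dir, name))
    return dict(shards)
```

With this map in hand, `map` can pick a `num_shards` value for which a complete (or mostly complete) set of cache files already exists, rather than recomputing everything.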
@lhoestq please let me know what you think about this.
This approach looks great! It seems there is one test failing in the CI — can you take a look?
I also added some (minor) comments.
src/datasets/arrow_dataset.py
Outdated
```python
def pbar_total(num_shards: int, batch_size: Optional[int]) -> int:
    total = len(self)
    if len(existing_cache_files) < num_shards:
        total -= len(existing_cache_files) * total // num_shards
```
Shouldn't the total be the same even if some shards have already been computed?
As a user I'd expect the progress bar to resume from where it was in this case.
Instead of subtracting it from the total, we can use the `initial` parameter:

```python
import os
import datasets

dataset = datasets.load_dataset("ylecun/mnist")
cache_file_name = "./cache/train.map"
dataset["train"].map(lambda x: x, cache_file_name=cache_file_name, num_proc=10)
os.remove("./cache/train_00001_of_00010.map")
os.remove("./cache/train_00002_of_00010.map")
dataset["train"].map(lambda x: x, cache_file_name=cache_file_name, num_proc=5)
```

```
Map (num_proc=5):  80%|████████  | 48000/60000 [00:00<?, ? examples/s]
Map (num_proc=5): 100%|██████████| 60000/60000 [00:00<00:00, 28528.31 examples/s]
```
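The arithmetic behind this suggestion can be sketched as a small helper (a simplified model for illustration, not the actual `datasets` code; the function name is made up):

```python
from typing import Tuple


def pbar_initial_and_total(
    num_examples: int, num_shards: int, num_cached_shards: int
) -> Tuple[int, int]:
    """Compute (initial, total) for a resumed map's progress bar.

    Keep ``total`` at the full dataset size and start the bar at the number
    of examples already covered by cached shards, so a resumed map shows
    e.g. 80% -> 100% instead of 0% -> 100% of a smaller total.
    """
    initial = num_cached_shards * num_examples // num_shards
    return initial, num_examples
```

For MNIST's 60000 training examples with 8 of 10 shards already cached, this gives `(48000, 60000)` — the bar starts at 80%, matching the output above.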
See bb7f9b5
also fix spacing in message
… at all and written by the main process or not
…f raising ValueError instead of having the pattern of using try-except to handle when there is no match, we can instead check if the return value is None; we can also assert that the return value should not be None if we know that should be true
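The pattern described in that commit (return `None` on no match instead of raising, then assert non-`None` where a match is guaranteed) might look like this hypothetical helper (the name and regex are assumptions, not the PR's actual code):

```python
import re
from typing import Optional


def shard_count_from_cache_file(file_name: str) -> Optional[int]:
    """Return the shard count encoded in a cache file name, or None if the
    name does not follow the ``_{rank:05d}_of_{num_proc:05d}`` suffix."""
    m = re.search(r"_\d{5}_of_(\d{5})\.[^.]+$", file_name)
    return int(m.group(1)) if m else None


# Caller that tolerates non-matching names checks for None:
count = shard_count_from_cache_file("train.map")
if count is None:
    count = 1  # treat as a single unsharded cache file

# Caller that knows the name was generated with the suffix template
# asserts the invariant instead of wrapping in try/except:
known = shard_count_from_cache_file("train_00003_of_00010.map")
assert known is not None
```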
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Do you mind kicking off CI again?
The change I made to support Windows paths in 637c160 ended up breaking these tests in tests/test_data_files.py. When I removed it, all the tests still pass; I think the unicode escaping must do some of the work that glob_pattern_to_regex was doing here before.