Refactor Dataset.map to reuse cache files mapped with different num_proc #7434

Open · wants to merge 16 commits into base: main
Conversation

@ringohoffman (Contributor) commented on Mar 4, 2025:

Fixes #7433

This refactor unifies the num_proc is None or num_proc == 1 path with the num_proc > 1 path. Previously they were handled completely separately: one used a single set of kwargs and self, the other a list of kwargs and shards. By wrapping the num_proc == 1 case in a list, the only remaining difference is whether or not a pool is used, and either case can load the other's cache files just by changing num_shards; with num_proc == 1, the shards of a dataset mapped with num_shards > 1 are loaded sequentially, and any missing shards are mapped.
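A rough sketch of the unified control flow being described (illustrative names only, not the actual Dataset.map source):

from multiprocessing import Pool
from typing import Any, Optional

def apply_map(shard_kwargs: dict) -> Any:
    # stand-in for the per-shard map worker (hypothetical)
    ...

def run_map(shard_kwargs_list: list, num_proc: Optional[int]) -> list:
    # both paths now operate on a list of per-shard kwargs; the only
    # difference is whether a Pool is used, so either path can pick up
    # cache files written by the other just by changing num_shards
    if num_proc is not None and num_proc > 1:
        with Pool(num_proc) as pool:
            return pool.map(apply_map, shard_kwargs_list)
    return [apply_map(shard_kwargs) for shard_kwargs in shard_kwargs_list]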

Other than the structural refactor, the main contribution of this PR is get_existing_cache_file_map, which uses a regex built from cache_file_name and suffix_template to find existing cache files, grouped by their num_shards. Using this data structure, we can reset num_shards to match an existing set of cache files and load them accordingly.
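A minimal sketch of how such a lookup could work (the helper and regex below are illustrative; datasets' default suffix_template produces suffixes like _00001_of_00010):

import os
import re
from collections import defaultdict

def get_existing_cache_file_map(cache_file_name: str) -> dict:
    # map num_shards -> list of existing shard cache files, e.g.
    # {10: ["./cache/train_00000_of_00010.map", ...]}
    directory = os.path.dirname(cache_file_name) or "."
    base, ext = os.path.splitext(os.path.basename(cache_file_name))
    pattern = re.compile(re.escape(base) + r"_(\d{5})_of_(\d{5})" + re.escape(ext) + r"$")
    existing = defaultdict(list)
    for name in sorted(os.listdir(directory)):
        match = pattern.match(name)
        if match is not None:
            existing[int(match.group(2))].append(os.path.join(directory, name))
    return existing

With a structure like this, map can pick a num_shards value whose cache files already exist (even if it differs from the requested num_proc), load those shards, and only compute the missing ones.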

@ringohoffman (author):
@lhoestq please let me know what you think about this.

@lhoestq (Member) left a comment:

This approach looks great! It seems there is one test failing in the CI, can you take a look?

I also added some (minor) comments.

def pbar_total(num_shards: int, batch_size: Optional[int]) -> int:
    total = len(self)
    if len(existing_cache_files) < num_shards:
        # subtract the examples covered by already-cached shards
        total -= len(existing_cache_files) * total // num_shards
@lhoestq (Member) commented on this hunk:
Shouldn't the total be the same even if some shards have already been computed?

As a user I'd expect the progress bar to resume from where I was in this case

@ringohoffman (author) replied:

Instead of subtracting it from the total, we can use the progress bar's initial parameter:

import os
import datasets

dataset = datasets.load_dataset("ylecun/mnist")
cache_file_name = "./cache/train.map"

# the first map writes 10 shard cache files
dataset["train"].map(lambda x: x, cache_file_name=cache_file_name, num_proc=10)

# delete 2 of the 10 shards
os.remove("./cache/train_00001_of_00010.map")
os.remove("./cache/train_00002_of_00010.map")

# remapping reuses the 8 remaining shards and only maps the missing 2;
# the progress bar resumes at 80% via the initial parameter
dataset["train"].map(lambda x: x, cache_file_name=cache_file_name, num_proc=5)

Map (num_proc=5):  80%|████████  | 48000/60000 [00:00<?, ? examples/s]
Map (num_proc=5): 100%|██████████| 60000/60000 [00:00<00:00, 28528.31 examples/s]

See bb7f9b5
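For reference, this is roughly how tqdm's initial parameter behaves (a standalone illustration with made-up numbers, not the PR's code):

from tqdm import tqdm

total_examples = 60000      # size of the dataset
already_cached = 48000      # examples covered by existing shard caches

# the bar starts at 80% and only advances over the uncached examples
pbar = tqdm(total=total_examples, initial=already_cached, unit=" examples")
for _ in range(total_examples - already_cached):
    pbar.update(1)
pbar.close()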

Commits (titles truncated):
… at all and written by the main process or not
…f raising ValueError

Instead of using the try-except pattern to handle the case where there is no match, we can check whether the return value is None; we can also assert that the return value is not None when we know that must be true.
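A short sketch of the Optional-return pattern being described (the function name and regex are hypothetical, not the PR's actual code):

import re
from typing import Optional

def match_shard_suffix(filename: str) -> Optional[re.Match]:
    # return None when there is no match instead of raising ValueError
    return re.fullmatch(r".*_(\d{5})_of_(\d{5})\.map", filename)

match = match_shard_suffix("train_00001_of_00010.map")
if match is not None:           # caller checks for None...
    num_shards = int(match.group(2))

match = match_shard_suffix("train_00002_of_00010.map")
assert match is not None        # ...or asserts when a match must exist
print(match.group(1))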
@ringohoffman (author):
It looks like I can't change the merge target to #7435, so it will look like there is a bunch of extra stuff until #7435 is in main.

@ringohoffman (author):
@lhoestq Thanks so much for reviewing #7435! Now that that's merged, I think this PR is ready!! Can you kick off CI when you get the chance?

@HuggingFaceDocBuilderDev commented:
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@ringohoffman requested a review from lhoestq on March 13, 2025.
@ringohoffman (author):
Do you mind kicking off CI again?

lhoestq and others added 2 commits March 14, 2025 15:17
All the tests still pass when it is removed; I think the unicode escaping must do some of the work that glob_pattern_to_regex was doing here before
@ringohoffman (author):
The change I made to support Windows paths in 637c160 ended up breaking these tests in tests/test_data_files.py. When I removed glob_pattern_to_regex in 583c28e, none of the tests failed, so I'm thinking the unicode_escape encoding may be handling what glob_pattern_to_regex was doing.
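A small illustration of why that could be the case (this is an assumption about Python's unicode_escape codec, not code from the PR): encoding a Windows path with unicode_escape doubles its backslashes, which is exactly what a regex needs to match them literally.

import re

path = r"C:\cache\train_00001_of_00010.map"   # hypothetical Windows path

# unicode_escape doubles the backslashes...
escaped = path.encode("unicode_escape").decode()
print(escaped)  # C:\\cache\\train_00001_of_00010.map

# ...so the escaped string can be embedded in a regex and still match
# the original single-backslash path
pattern = escaped.replace("00001", r"\d{5}").replace("00010", r"\d{5}")
assert re.fullmatch(pattern, r"C:\cache\train_00007_of_00010.map")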

Successfully merging this pull request may close these issues.

Dataset.map ignores existing caches and remaps when run with different num_proc (#7433)