Refactor Dataset.map to reuse cache files mapped with different num_proc #7434

Open · wants to merge 16 commits into base: main
Conversation

@ringohoffman (Contributor) commented on Mar 4, 2025:

Fixes #7433

This refactor unifies the num_proc is None or num_proc == 1 path with the num_proc > 1 path. Previously they were handled completely separately: one used a single set of kwargs and self, the other a list of kwargs and shards. By wrapping the num_proc == 1 case in a list, the only remaining difference is whether or not a pool is used, and either case can load the other's cache files just by changing num_shards; with num_proc == 1, the shards of a dataset mapped with num_shards > 1 are loaded sequentially, and any missing shards are mapped.
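A rough sketch of the unified control flow being described (illustrative names only, not the actual Dataset.map source):

from multiprocessing import Pool
from typing import Any, Optional

def apply_map(shard_kwargs: dict) -> Any:
    # stand-in for the per-shard map worker (hypothetical)
    ...

def run_map(shard_kwargs_list: list, num_proc: Optional[int]) -> list:
    # both paths now operate on a list of per-shard kwargs; the only
    # difference is whether a Pool is used, so either path can pick up
    # cache files written by the other just by changing num_shards
    if num_proc is not None and num_proc > 1:
        with Pool(num_proc) as pool:
            return pool.map(apply_map, shard_kwargs_list)
    return [apply_map(shard_kwargs) for shard_kwargs in shard_kwargs_list]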

Other than the structural refactor, the main contribution of this PR is get_existing_cache_file_map, which uses a regex built from cache_file_name and suffix_template to find existing cache files, grouped by their num_shards. Using this data structure, we can reset num_shards to match an existing set of cache files and load them accordingly.
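A minimal sketch of how such a lookup could work (the helper and regex below are illustrative; datasets' default suffix_template produces suffixes like _00001_of_00010):

import os
import re
from collections import defaultdict

def get_existing_cache_file_map(cache_file_name: str) -> dict:
    # map num_shards -> list of existing shard cache files, e.g.
    # {10: ["./cache/train_00000_of_00010.map", ...]}
    directory = os.path.dirname(cache_file_name) or "."
    base, ext = os.path.splitext(os.path.basename(cache_file_name))
    pattern = re.compile(re.escape(base) + r"_(\d{5})_of_(\d{5})" + re.escape(ext) + r"$")
    existing = defaultdict(list)
    for name in sorted(os.listdir(directory)):
        match = pattern.match(name)
        if match is not None:
            existing[int(match.group(2))].append(os.path.join(directory, name))
    return existing

With a structure like this, map can pick a num_shards value whose cache files already exist (even if it differs from the requested num_proc), load those shards, and only compute the missing ones.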

@ringohoffman (author):
@lhoestq please let me know what you think about this.

@lhoestq (Member) left a comment:

This approach looks great! It seems there is one test failing in the CI, can you take a look?

I also added some (minor) comments.

def pbar_total(num_shards: int, batch_size: Optional[int]) -> int:
    total = len(self)
    if len(existing_cache_files) < num_shards:
        # subtract the examples covered by already-cached shards
        total -= len(existing_cache_files) * total // num_shards
@lhoestq (Member) commented on this hunk:
Shouldn't the total be the same even if some shards have already been computed?

As a user I'd expect the progress bar to resume from where I was in this case

@ringohoffman (author) replied:

Instead of subtracting it from the total, we can use the progress bar's initial parameter:

import os
import datasets

dataset = datasets.load_dataset("ylecun/mnist")
cache_file_name = "./cache/train.map"

# the first map writes 10 shard cache files
dataset["train"].map(lambda x: x, cache_file_name=cache_file_name, num_proc=10)

# delete 2 of the 10 shards
os.remove("./cache/train_00001_of_00010.map")
os.remove("./cache/train_00002_of_00010.map")

# remapping reuses the 8 remaining shards and only maps the missing 2;
# the progress bar resumes at 80% via the initial parameter
dataset["train"].map(lambda x: x, cache_file_name=cache_file_name, num_proc=5)

Map (num_proc=5):  80%|████████  | 48000/60000 [00:00<?, ? examples/s]
Map (num_proc=5): 100%|██████████| 60000/60000 [00:00<00:00, 28528.31 examples/s]

See bb7f9b5
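For reference, this is roughly how tqdm's initial parameter behaves (a standalone illustration with made-up numbers, not the PR's code):

from tqdm import tqdm

total_examples = 60000      # size of the dataset
already_cached = 48000      # examples covered by existing shard caches

# the bar starts at 80% and only advances over the uncached examples
pbar = tqdm(total=total_examples, initial=already_cached, unit=" examples")
for _ in range(total_examples - already_cached):
    pbar.update(1)
pbar.close()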

Commits (titles truncated):
… at all and written by the main process or not
…f raising ValueError

Instead of using the try-except pattern to handle the case where there is no match, we can check whether the return value is None; we can also assert that the return value is not None when we know that must be true.
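A short sketch of the Optional-return pattern being described (the function name and regex are hypothetical, not the PR's actual code):

import re
from typing import Optional

def match_shard_suffix(filename: str) -> Optional[re.Match]:
    # return None when there is no match instead of raising ValueError
    return re.fullmatch(r".*_(\d{5})_of_(\d{5})\.map", filename)

match = match_shard_suffix("train_00001_of_00010.map")
if match is not None:           # caller checks for None...
    num_shards = int(match.group(2))

match = match_shard_suffix("train_00002_of_00010.map")
assert match is not None        # ...or asserts when a match must exist
print(match.group(1))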
@ringohoffman (author):
It looks like I can't change the merge target to #7435, so it will look like there is a bunch of extra stuff until #7435 is in main.

@ringohoffman (author):
@lhoestq Thanks so much for reviewing #7435! Now that that's merged, I think this PR is ready!! Can you kick off CI when you get the chance?

@HuggingFaceDocBuilderDev commented:
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@ringohoffman requested a review from lhoestq on March 13, 2025.
@ringohoffman (author):
Do you mind kicking off CI again?

lhoestq and others added 2 commits March 14, 2025 15:17
All the tests still pass when it is removed; I think the unicode escaping must do some of the work that glob_pattern_to_regex was doing here before
@ringohoffman (author):
The change I made to support Windows paths in 637c160 ended up breaking these tests in tests/test_data_files.py. When I removed glob_pattern_to_regex in 583c28e, none of the tests failed, so I'm thinking the unicode_escape encoding may be handling what glob_pattern_to_regex was doing.
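A small illustration of why that could be the case (this is an assumption about Python's unicode_escape codec, not code from the PR): encoding a Windows path with unicode_escape doubles its backslashes, which is exactly what a regex needs to match them literally.

import re

path = r"C:\cache\train_00001_of_00010.map"   # hypothetical Windows path

# unicode_escape doubles the backslashes...
escaped = path.encode("unicode_escape").decode()
print(escaped)  # C:\\cache\\train_00001_of_00010.map

# ...so the escaped string can be embedded in a regex and still match
# the original single-backslash path
pattern = escaped.replace("00001", r"\d{5}").replace("00010", r"\d{5}")
assert re.fullmatch(pattern, r"C:\cache\train_00007_of_00010.map")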

Successfully merging this pull request may close these issues.

Dataset.map ignores existing caches and remaps when run with different num_proc (#7433)