`num_proc=os.cpu_count()` when counting unused tokens #104

IsaacBreen · 2025-03-27T07:22:37Z

This just sets `num_proc=os.cpu_count()` when counting unused tokens. Tested on a 64-core machine, where it ran approximately 64 times faster.

IMHO this should be the default for all of the HF `map` functions.

…n-counting-unused-tokens `num_proc=os.cpu_count()` when counting unused tokens

IsaacBreen · 2025-03-27T07:22:54Z

unslothai/unsloth#1125

marcandrelarochelle · 2025-03-27T14:18:27Z

unsloth_zoo/tokenizer_utils.py

@@ -416,7 +417,7 @@ def mapping(examples):
        counter = np.fromiter(itertools.chain.from_iterable(input_ids), dtype = np.int32)
        np.add.at(final_counts, counter, 1)
    pass
-    train_dataset.map(mapping, batched = True, desc = "Counting untrained tokens")
+    train_dataset.map(mapping, batched = True, desc = "Counting untrained tokens", num_proc = os.cpu_count())


This will break `IterableDataset` support, since `IterableDataset.map` does not accept the `num_proc` argument.
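
One possible way to keep both dataset types working, as a rough sketch only: pass `num_proc` solely when the dataset's `map` method declares that parameter. The helper name `num_proc_kwargs` is hypothetical and is not part of `unsloth_zoo` or `datasets`; it relies only on the standard-library `inspect` module.

```python
import inspect
import os


def num_proc_kwargs(dataset):
    """Hypothetical helper: return {"num_proc": N} only if dataset.map accepts it.

    The check looks at the signature of the bound `map` method, so it does not
    need to import `datasets` or know the concrete dataset class.
    """
    params = inspect.signature(dataset.map).parameters
    if "num_proc" in params:
        # os.cpu_count() may return None, so fall back to a single process.
        return {"num_proc": os.cpu_count() or 1}
    return {}


# Intended call site (assumes `train_dataset` and `mapping` from the diff above):
# train_dataset.map(
#     mapping,
#     batched = True,
#     desc = "Counting untrained tokens",
#     **num_proc_kwargs(train_dataset),
# )
```

An explicit `isinstance(train_dataset, IterableDataset)` check would be the more direct alternative; the signature check is used here only to keep the sketch free of a `datasets` import. The sketch also does not address whether counts accumulated into `final_counts` inside `mapping` are visible to the parent process once `mapping` runs in worker processes, which would be worth verifying separately.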

IsaacBreen and others added 4 commits March 21, 2025 21:16

fix

067a83e

Update tokenizer_utils.py

0f31b1a

style

1371921

Merge pull request #1 from IsaacBreen/use-num_proc=os.cpu_count()-whe…

559aecd

…n-counting-unused-tokens `num_proc=os.cpu_count()` when counting unused tokens

IsaacBreen added 3 commits March 27, 2025 19:45

Merge remote-tracking branch 'upstream/main'

361a35d

Merge remote-tracking branch 'origin/main'

2df696e

maybe fix

97a3992

marcandrelarochelle reviewed Mar 27, 2025
