
Resolves #87 - Exposing num_proc in standardize_data_formats #88


Open · wants to merge 1 commit into base: main

Conversation

void-mckenzie (Contributor) commented Mar 20, 2025

This gives the user the option to run the standardize_data_formats() method with a user-defined num_proc. On a machine with a high CPU count and very little free RAM, the default (num_proc=32 on this machine) can crash the function, leaving the user no choice but to edit library code.

Fixes #87
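The change itself is a small pass-through. A hedged sketch of what it looks like, reconstructed from the unsloth_zoo traceback shown below (`_standardize_dataset` here is a stand-in for the library's internal batch function, not its real definition):

```python
import os

def _standardize_dataset(batch):
    # placeholder for unsloth_zoo's real batch-normalizing function
    return batch

def standardize_data_formats(dataset, num_proc=None):
    """Forward num_proc to datasets.Dataset.map instead of hard-coding it."""
    if num_proc is None:
        num_proc = os.cpu_count()  # previous behavior: one worker per CPU
    try:
        return dataset.map(
            _standardize_dataset,
            batched=True,
            desc="Unsloth: Standardizing formats",
            num_proc=num_proc,
        )
    except RuntimeError as e:
        raise RuntimeError(
            f"Unsloth: Process crashed: {e}\n"
            f"Try reducing num_proc (currently {num_proc}) to a lower value."
        ) from e
```

With this shape, callers who hit the crash can simply pass a smaller num_proc rather than patching the installed package.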

Code:

from datasets import load_dataset
dataset = load_dataset("mlabonne/FineTome-100k", split = "train")

from unsloth.chat_templates import standardize_data_formats
dataset = standardize_data_formats(dataset, num_proc=8)  # run with and without num_proc to compare

Output without num_proc (defaults to 32 on this machine):

Unsloth: Standardizing formats (num_proc=32):   0%|          | 0/100000 [00:00<?, ? examples/s]
🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
[🦥 patch line repeated once per spawned worker; repeats trimmed]
Process SpawnPoolWorker-80:
Process SpawnPoolWorker-82:
Exception in thread Thread-1 (accepter):
Traceback (most recent call last):
Traceback (most recent call last):
  File "C:\Users\MukkeshGanesh\miniconda3\envs\unsloth_exp\Lib\threading.py", line 1045, in _bootstrap_inner
  File "C:\Users\MukkeshGanesh\miniconda3\envs\unsloth_exp\Lib\site-packages\multiprocess\process.py", line 314, in _bootstrap
    self.run()
  File "C:\Users\MukkeshGanesh\miniconda3\envs\unsloth_exp\Lib\site-packages\multiprocess\process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "C:\Users\MukkeshGanesh\miniconda3\envs\unsloth_exp\Lib\site-packages\multiprocess\pool.py", line 114, in worker
    task = get()
           ^^^^^
  File "C:\Users\MukkeshGanesh\miniconda3\envs\unsloth_exp\Lib\site-packages\multiprocess\queues.py", line 370, in get
    return _ForkingPickler.loads(res)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\MukkeshGanesh\miniconda3\envs\unsloth_exp\Lib\site-packages\dill\_dill.py", line 303, in loads
    return load(file, ignore, **kwds)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\MukkeshGanesh\miniconda3\envs\unsloth_exp\Lib\site-packages\dill\_dill.py", line 289, in load
    return Unpickler(file, ignore=ignore, **kwds).load()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\MukkeshGanesh\miniconda3\envs\unsloth_exp\Lib\site-packages\dill\_dill.py", line 444, in load
    obj = StockUnpickler.load(self)
          ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\MukkeshGanesh\miniconda3\envs\unsloth_exp\Lib\site-packages\dill\_dill.py", line 434, in find_class
    return StockUnpickler.find_class(self, module, name)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\MukkeshGanesh\miniconda3\envs\unsloth_exp\Lib\site-packages\datasets\__init__.py", line 26, in <module>
    from .inspect import (
  File "C:\Users\MukkeshGanesh\miniconda3\envs\unsloth_exp\Lib\site-packages\datasets\inspect.py", line 26, in <module>
    from .load import (
  File "C:\Users\MukkeshGanesh\miniconda3\envs\unsloth_exp\Lib\site-packages\datasets\load.py", line 72, in <module>
    from .packaged_modules import (
  File "C:\Users\MukkeshGanesh\miniconda3\envs\unsloth_exp\Lib\site-packages\datasets\packaged_modules\__init__.py", line 8, in <module>
    from .audiofolder import audiofolder
  File "C:\Users\MukkeshGanesh\miniconda3\envs\unsloth_exp\Lib\site-packages\datasets\packaged_modules\audiofolder\audiofolder.py", line 3, in <module>
    from ..folder_based_builder import folder_based_builder
  File "C:\Users\MukkeshGanesh\miniconda3\envs\unsloth_exp\Lib\site-packages\datasets\packaged_modules\folder_based_builder\folder_based_builder.py", line 10, in <module>
    import pyarrow.dataset as ds
  File "C:\Users\MukkeshGanesh\miniconda3\envs\unsloth_exp\Lib\site-packages\pyarrow\dataset.py", line 24, in <module>
    from pyarrow._dataset import (  # noqa
  File "pyarrow\\_dataset.pyx", line 159, in init pyarrow._dataset
  File "pyarrow\\_compute.pyx", line 2715, in pyarrow._compute.Expression._scalar
  File "pyarrow\\scalar.pxi", line 1293, in pyarrow.lib.scalar
  File "pyarrow\\error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow\\error.pxi", line 92, in pyarrow.lib.check_status
pyarrow.lib.ArrowMemoryError: malloc of size 64 failed
    self.run()
  File "C:\Users\MukkeshGanesh\miniconda3\envs\unsloth_exp\Lib\threading.py", line 982, in run
    self._target(*self._args, **self._kwargs)
  File "C:\Users\MukkeshGanesh\miniconda3\envs\unsloth_exp\Lib\site-packages\multiprocess\managers.py", line 194, in accepter
    t.start()
  File "C:\Users\MukkeshGanesh\miniconda3\envs\unsloth_exp\Lib\threading.py", line 964, in start
    _start_new_thread(self._bootstrap, ())
RuntimeError: can't start new thread
🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
Microsoft Visual C++ Redistributable is not installed, this may lead to the DLL load failure.
It can be downloaded at https://aka.ms/vs/16/release/vc_redist.x64.exe
Traceback (most recent call last):
  File "C:\Users\MukkeshGanesh\miniconda3\envs\unsloth_exp\Lib\site-packages\multiprocess\process.py", line 314, in _bootstrap
    self.run()
  File "C:\Users\MukkeshGanesh\miniconda3\envs\unsloth_exp\Lib\site-packages\multiprocess\process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "C:\Users\MukkeshGanesh\miniconda3\envs\unsloth_exp\Lib\site-packages\multiprocess\pool.py", line 114, in worker
    task = get()
           ^^^^^
  File "C:\Users\MukkeshGanesh\miniconda3\envs\unsloth_exp\Lib\site-packages\multiprocess\queues.py", line 370, in get
    return _ForkingPickler.loads(res)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\MukkeshGanesh\miniconda3\envs\unsloth_exp\Lib\site-packages\dill\_dill.py", line 303, in loads
    return load(file, ignore, **kwds)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\MukkeshGanesh\miniconda3\envs\unsloth_exp\Lib\site-packages\dill\_dill.py", line 289, in load
    return Unpickler(file, ignore=ignore, **kwds).load()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\MukkeshGanesh\miniconda3\envs\unsloth_exp\Lib\site-packages\dill\_dill.py", line 444, in load
    obj = StockUnpickler.load(self)
          ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\MukkeshGanesh\miniconda3\envs\unsloth_exp\Lib\site-packages\dill\_dill.py", line 434, in find_class
    return StockUnpickler.find_class(self, module, name)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\MukkeshGanesh\miniconda3\envs\unsloth_exp\Lib\site-packages\datasets\__init__.py", line 26, in <module>
    from .inspect import (
  File "C:\Users\MukkeshGanesh\miniconda3\envs\unsloth_exp\Lib\site-packages\datasets\inspect.py", line 26, in <module>
    from .load import (
  File "C:\Users\MukkeshGanesh\miniconda3\envs\unsloth_exp\Lib\site-packages\datasets\load.py", line 72, in <module>
    from .packaged_modules import (
  File "C:\Users\MukkeshGanesh\miniconda3\envs\unsloth_exp\Lib\site-packages\datasets\packaged_modules\__init__.py", line 8, in <module>
    from .audiofolder import audiofolder
  File "C:\Users\MukkeshGanesh\miniconda3\envs\unsloth_exp\Lib\site-packages\datasets\packaged_modules\audiofolder\audiofolder.py", line 3, in <module>
    from ..folder_based_builder import folder_based_builder
  File "C:\Users\MukkeshGanesh\miniconda3\envs\unsloth_exp\Lib\site-packages\datasets\packaged_modules\folder_based_builder\folder_based_builder.py", line 10, in <module>
    import pyarrow.dataset as ds
  File "C:\Users\MukkeshGanesh\miniconda3\envs\unsloth_exp\Lib\site-packages\pyarrow\dataset.py", line 24, in <module>
    from pyarrow._dataset import (  # noqa
  File "pyarrow\\_dataset.pyx", line 159, in init pyarrow._dataset
  File "pyarrow\\_compute.pyx", line 2715, in pyarrow._compute.Expression._scalar
  File "pyarrow\\scalar.pxi", line 1293, in pyarrow.lib.scalar
  File "pyarrow\\error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow\\error.pxi", line 92, in pyarrow.lib.check_status
pyarrow.lib.ArrowMemoryError: malloc of size 64 failed
🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
[🦥 patch line repeated once per spawned worker; repeats trimmed]
Unsloth: Standardizing formats (num_proc=32):   0%|          | 0/100000 [00:25<?, ? examples/s]
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
File ~\miniconda3\envs\unsloth_exp\Lib\site-packages\unsloth_zoo\dataset_utils.py:464, in standardize_data_formats(dataset, tokenizer, aliases_for_system, aliases_for_user, aliases_for_assistant, num_proc)
    463 try:
--> 464     return dataset.map(
    465         _standardize_dataset,
    466         batched=True,
    467         desc="Unsloth: Standardizing formats",
    468         num_proc=num_proc,
    469     )
    470 except RuntimeError as e:

File ~\miniconda3\envs\unsloth_exp\Lib\site-packages\datasets\arrow_dataset.py:557, in transmit_format.<locals>.wrapper(*args, **kwargs)
    556 # apply actual function
--> 557 out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
    558 datasets: list["Dataset"] = list(out.values()) if isinstance(out, dict) else [out]

File ~\miniconda3\envs\unsloth_exp\Lib\site-packages\datasets\arrow_dataset.py:3166, in Dataset.map(self, function, with_indices, with_rank, input_columns, batched, batch_size, drop_last_batch, remove_columns, keep_in_memory, load_from_cache_file, cache_file_name, writer_batch_size, features, disable_nullable, fn_kwargs, num_proc, suffix_template, new_fingerprint, desc)
   3161 with hf_tqdm(
   3162     unit=" examples",
   3163     total=pbar_total,
   3164     desc=(desc or "Map") + f" (num_proc={num_proc})",
   3165 ) as pbar:
-> 3166     for rank, done, content in iflatmap_unordered(
   3167         pool, Dataset._map_single, kwargs_iterable=kwargs_per_job
   3168     ):
   3169         if done:

File ~\miniconda3\envs\unsloth_exp\Lib\site-packages\datasets\utils\py_utils.py:713, in iflatmap_unordered(pool, func, kwargs_iterable)
    712             # One of the subprocesses has died. We should not wait forever.
--> 713             raise RuntimeError(
    714                 "One of the subprocesses has abruptly died during map operation."
    715                 "To debug the error, disable multiprocessing."
    716             )
    717 finally:

RuntimeError: One of the subprocesses has abruptly died during map operation.To debug the error, disable multiprocessing.

The above exception was the direct cause of the following exception:

RuntimeError                              Traceback (most recent call last)
Cell In[4], line 2
      1 from unsloth.chat_templates import standardize_data_formats
----> 2 dataset = standardize_data_formats(dataset)

File ~\miniconda3\envs\unsloth_exp\Lib\site-packages\unsloth_zoo\dataset_utils.py:471, in standardize_data_formats(dataset, tokenizer, aliases_for_system, aliases_for_user, aliases_for_assistant, num_proc)
    464     return dataset.map(
    465         _standardize_dataset,
    466         batched=True,
    467         desc="Unsloth: Standardizing formats",
    468         num_proc=num_proc,
    469     )
    470 except RuntimeError as e:
--> 471     raise RuntimeError(
    472         f"Unsloth: Process crashed: {str(e)}\nTry reducing num_proc (currently {num_proc}) to a lower value."
    473     ) from e

RuntimeError: Unsloth: Process crashed: One of the subprocesses has abruptly died during map operation.To debug the error, disable multiprocessing.
Try reducing num_proc (currently 32) to a lower value.

Output with num_proc=8:

Unsloth: Standardizing formats (num_proc=8): 100%|██████████| 100000/100000 [00:07<00:00, 12975.01 examples/s]
🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
[🦥 patch line repeated once per spawned worker; repeats trimmed]
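Even with the parameter exposed, choosing a value is up to the caller. One defensive pattern (purely illustrative and not part of this PR; `safe_num_proc` is a made-up helper) is to cap the worker count by both the CPU count and a conservative ceiling:

```python
import os

def safe_num_proc(ceiling=8):
    """Cap the map worker count by the CPU count and a caller-chosen ceiling,
    so low-RAM machines never spawn dozens of workers at once."""
    return max(1, min(os.cpu_count() or 1, ceiling))

# dataset = standardize_data_formats(dataset, num_proc=safe_num_proc())
```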

void-mckenzie (Contributor, Author) commented:

@shimmyshimmer Any chance you can merge this? I keep needing to edit library code after every release. Pretty self-explanatory PR.

Successfully merging this pull request may close these issues.

Memory Exhaustion and Multiprocessing Crash in standardize_data_formats() When RAM is Nearly Full