18 changes: 14 additions & 4 deletions unsloth_zoo/dataset_utils.py
@@ -484,10 +484,20 @@ def _standardize_dataset(examples):
}

if not isinstance(dataset, IterableDataset):
from multiprocessing import cpu_count

if num_proc is None or type(num_proc) is not int:
num_proc = cpu_count()
import psutil

if num_proc is None or type(num_proc) is not int:
Contributor

medium

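It's more Pythonic to use isinstance() for type checking than to compare types directly with type(). isinstance() also accepts instances of int subclasses, which the type() comparison rejects. Note that bool is a subclass of int, so True and False would pass this check.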

Suggested change
if num_proc is None or type(num_proc) is not int:
if num_proc is None or not isinstance(num_proc, int):
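To make the difference concrete, here is a minimal sketch comparing the two checks. `WorkerCount` is a hypothetical int subclass invented for this illustration; it does not exist in unsloth_zoo.

```python
class WorkerCount(int):
    """Hypothetical int subclass, used only for illustration."""


def is_int_exact(value):
    # Mirrors the original check: only the exact type int passes.
    return type(value) is int


def is_int_instance(value):
    # Mirrors the suggested check: int and any subclass of int pass.
    return isinstance(value, int)


n = WorkerCount(4)
exact_result = is_int_exact(n)        # the subclass instance is rejected
instance_result = is_int_instance(n)  # the subclass instance is accepted
bool_result = is_int_instance(True)   # bool subclasses int, so this passes too
```

The bool case is worth keeping in mind: with the suggested check, `num_proc=True` would be kept as-is rather than replaced by the default.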

# Use a memory-aware default to prevent OOM with large datasets
num_proc = min(max(psutil.cpu_count()+4, 2), 64)
try:
memory_gb_left = psutil.virtual_memory().available / 1024 / 1024 / 1024
Contributor

medium

For consistency with other parts of the codebase (e.g., the train_on_responses_only function) and for improved readability, it's better to use (1024**3) for converting bytes to gigabytes.

Suggested change
memory_gb_left = psutil.virtual_memory().available / 1024 / 1024 / 1024
memory_gb_left = psutil.virtual_memory().available / (1024**3)
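The two spellings compute the same value, so the suggestion is purely about readability. A small sketch with a fixed byte count (the 6.5 GiB figure and the names `BYTES_PER_GIB` and `bytes_to_gib` are assumptions for illustration, not taken from the codebase):

```python
BYTES_PER_GIB = 1024 ** 3  # 2**30 bytes


def bytes_to_gib(num_bytes):
    # Single division by a named constant, as the review suggests.
    return num_bytes / BYTES_PER_GIB


available = 6 * 1024 ** 3 + 512 * 1024 ** 2  # a made-up reading of 6.5 GiB

chained = available / 1024 / 1024 / 1024  # original spelling
single = bytes_to_gib(available)          # suggested spelling
```

Because 1024 is a power of two, each division is exact in binary floating point, so both forms agree here.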

if memory_gb_left < 4:
num_proc = 1 # Too risky, so set to 1
else:
# Limit based on available memory (assume ~1GB per worker)
num_proc = min(num_proc, max(1, int(memory_gb_left)))
except:
pass
Comment on lines +499 to +500
Contributor

high

Using a bare except: is a dangerous practice as it catches all exceptions, including system-exiting ones like SystemExit and KeyboardInterrupt, which can hide critical issues and make debugging difficult. It's better to catch a more specific exception, such as Exception.

Suggested change
except:
pass
except Exception:
pass
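The following sketch shows why `except Exception:` is the safer guard. `run_guarded` and the two raising helpers are hypothetical names used only for this illustration.

```python
def run_guarded(fn):
    """Run fn, swallowing only ordinary errors, as `except Exception:` does."""
    try:
        fn()
    except Exception:
        return "swallowed"
    return "completed"


def raise_value_error():
    raise ValueError("ordinary failure")


def raise_interrupt():
    raise KeyboardInterrupt


ordinary = run_guarded(raise_value_error)  # the ValueError is swallowed

try:
    run_guarded(raise_interrupt)
    interrupt_propagated = False
except KeyboardInterrupt:
    # KeyboardInterrupt derives from BaseException, not Exception,
    # so it passes straight through the guard.
    interrupt_propagated = True
```

With a bare `except:` in `run_guarded`, the interrupt would be swallowed as well, which is the behavior the comment warns about.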


dataset_map_kwargs['num_proc'] = num_proc
dataset_map_kwargs['desc'] = "Unsloth: Standardizing formats"
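For reference, the memory-aware default in this hunk can be sketched as a pure function with the three review suggestions applied. This is an illustrative sketch, not the merged code: `pick_num_proc`, its parameters, and the fallback to one CPU when the count is unknown are assumptions (psutil.cpu_count() is documented to return None when the count cannot be determined).

```python
BYTES_PER_GIB = 1024 ** 3


def pick_num_proc(cpu_count, available_bytes, min_free_gib=4, max_workers=64):
    """Hypothetical helper mirroring the PR's default worker-count logic."""
    # Assumption: fall back to 1 CPU if the count is missing or invalid.
    cpus = cpu_count if isinstance(cpu_count, int) and cpu_count > 0 else 1

    # CPU-based ceiling, clamped to the range [2, max_workers].
    num_proc = min(max(cpus + 4, 2), max_workers)

    memory_gib_left = available_bytes / BYTES_PER_GIB
    if memory_gib_left < min_free_gib:
        return 1  # too little free memory, so run single-process

    # Assume roughly 1 GiB per worker, as the PR comment states.
    return min(num_proc, max(1, int(memory_gib_left)))


plenty = pick_num_proc(8, 16 * BYTES_PER_GIB)        # CPU-bound: 8 + 4 workers
scarce = pick_num_proc(8, 2 * BYTES_PER_GIB)         # under 4 GiB free
mem_bound = pick_num_proc(32, 8 * BYTES_PER_GIB)     # memory caps the count
capped = pick_num_proc(100, 200 * BYTES_PER_GIB)     # hard cap applies
```

In real use the inputs would come from psutil.cpu_count() and psutil.virtual_memory().available; taking them as parameters keeps the logic testable without psutil installed.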