Skip to content
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
16 changes: 12 additions & 4 deletions unsloth_zoo/dataset_utils.py
Original file line number Diff line number Diff line change
Expand Up @@ -484,10 +484,18 @@ def _standardize_dataset(examples):
}

if not isinstance(dataset, IterableDataset):
from multiprocessing import cpu_count

if num_proc is None or type(num_proc) is not int:
num_proc = cpu_count()
if num_proc is None or type(num_proc) is not int:
# Use memory-aware default to avoid OOM crashes
num_proc = min(max(psutil.cpu_count()+4, 2), 64)
try:
Comment on lines +487 to +490

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Import psutil before using it in num_proc default

When num_proc is left as None (the default) and the dataset is not an IterableDataset, this new path calls psutil.cpu_count() and psutil.virtual_memory() without importing psutil in standardize_data_formats(). Unlike the earlier helper, there is no local import psutil in this function and no module-level import, so this will raise NameError: name 'psutil' is not defined and crash the mapping call on the default path.

Useful? React with 👍 / 👎.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes you'll need to explicitly import psutil. and we can use isinstance instead of type(num_proc). It's also possible psutil.cpu_count() return None so this could might error.

memory = psutil.virtual_memory()
memory_gb_left = memory.available / (1024 ** 3)
if memory_gb_left < 10:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if 10 is the correct threshold either. Would need to be tested, and I think on the T4 would likely force just 1 process.

num_proc = 1 # Too risky with low memory
else:
num_proc = min(num_proc, int(memory_gb_left))
except:
pass
Comment on lines +487 to +498
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

This block has a few areas for improvement:

  • Missing import (Critical): The psutil module is used without being imported, which will cause a NameError at runtime. The import should be added before psutil is used.
  • Broad exception clause (High): The bare except: is too broad and can hide important errors by catching system-level exceptions like KeyboardInterrupt. It's safer to catch a more specific exception, like Exception.
  • Type checking (Medium): Using isinstance() is the recommended, more robust way to check types in Python, as it correctly handles subclasses.

I've provided a single suggestion below that addresses all these points.

Suggested change
if num_proc is None or type(num_proc) is not int:
# Use memory-aware default to avoid OOM crashes
num_proc = min(max(psutil.cpu_count()+4, 2), 64)
try:
memory = psutil.virtual_memory()
memory_gb_left = memory.available / (1024 ** 3)
if memory_gb_left < 10:
num_proc = 1 # Too risky with low memory
else:
num_proc = min(num_proc, int(memory_gb_left))
except:
pass
if num_proc is None or not isinstance(num_proc, int):
import psutil
# Use memory-aware default to avoid OOM crashes
num_proc = min(max(psutil.cpu_count()+4, 2), 64)
try:
memory = psutil.virtual_memory()
memory_gb_left = memory.available / (1024 ** 3)
if memory_gb_left < 10:
num_proc = 1 # Too risky with low memory
else:
num_proc = min(num_proc, int(memory_gb_left))
except Exception:
pass


dataset_map_kwargs['num_proc'] = num_proc
dataset_map_kwargs['desc'] = "Unsloth: Standardizing formats"
Expand Down