Add memory-aware num_proc limiting in standardize_data_formats #418

Open
yurekami wants to merge 1 commit into unslothai:main from yurekami:fix/memory-aware-num-proc

Conversation

@yurekami
Contributor

Summary

  • Adds memory-aware limiting for num_proc in standardize_data_formats() to prevent OOM crashes
  • Uses the same pattern already employed in train_on_responses_only (lines 333-340)
  • Allows users to explicitly set num_proc to override the automatic limiting

Problem

When processing large datasets like mlabonne/FineTome-100k, the standardize_data_formats() function would spawn workers equal to CPU count without considering available RAM. This caused:

  • "malloc of size 64 failed" errors from PyArrow
  • "One of the subprocesses has abruptly died during map operation" crashes
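
A back-of-envelope calculation shows how quickly this adds up. All numbers below are illustrative assumptions, not measurements from the issue:

```python
# Rough memory arithmetic for the failure mode above (illustrative numbers only).
workers = 16                  # cpu_count() on a hypothetical 16-core machine
overhead_per_worker_gb = 0.5  # assumed interpreter + PyArrow buffers per subprocess
dataset_gb = 2.0              # assumed in-memory Arrow table for a large dataset

peak_gb = dataset_gb + workers * overhead_per_worker_gb
print(f"rough peak usage: {peak_gb} GB")
```

Even with these modest assumptions, spawning one worker per core pushes peak usage near the total RAM of a small VM, which is consistent with the malloc failures reported above.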

Solution

# Before
if num_proc is None or type(num_proc) is not int: 
    num_proc = cpu_count()

# After - memory-aware limiting
if num_proc is None or type(num_proc) is not int:
    num_proc = min(max(psutil.cpu_count()+4, 2), 64)
    try:
        memory = psutil.virtual_memory()
        memory_gb_left = memory.available / (1024 ** 3)
        if memory_gb_left < 10:
            num_proc = 1  # Too risky with low memory
        else:
            num_proc = min(num_proc, int(memory_gb_left))
    except:
        pass
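
Pulled out of context, the decision logic above can be sketched as a pure helper (hypothetical name `pick_num_proc`; the CPU count and available memory are injected so the thresholds are easy to check, and `isinstance` is used in place of the `type(...)` check):

```python
def pick_num_proc(num_proc, cpus, available_gb):
    """Mirror of the fallback logic above with inputs injected (sketch only)."""
    if isinstance(num_proc, int):
        return num_proc                     # explicit user override wins
    workers = min(max(cpus + 4, 2), 64)     # CPU-based default, capped at 64
    if available_gb < 10:
        return 1                            # too risky with low memory
    return min(workers, int(available_gb))  # roughly one worker per free GB

print(pick_num_proc(None, 8, 32.0))  # plenty of memory: CPU-based cap applies
print(pick_num_proc(None, 8, 5.0))   # low memory: falls back to 1 worker
print(pick_num_proc(4, 8, 5.0))      # explicit override is respected
```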

Fixes #87

🤖 Generated with Claude Code

The standardize_data_formats function was using cpu_count() without any
memory-based limiting, which could cause OOM crashes when processing large
datasets on systems with limited RAM.

This fix applies the same memory-aware pattern already used elsewhere in
the codebase (e.g., train_on_responses_only at lines 333-340):
- Caps num_proc at 64 to avoid excessive process spawning
- Checks available memory and reduces workers accordingly
- Falls back to 1 worker if available memory < 10GB

Fixes unslothai#87

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@gemini-code-assist
Contributor

Summary of Changes

Hello @yurekami, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request adds memory-aware process limiting to the standardize_data_formats function. Previously, the function could spawn an uncontrolled number of worker processes when handling large datasets, leading to instability and crashes. The new implementation checks available system memory and adjusts the num_proc parameter accordingly, reducing the risk of out-of-memory errors, especially in resource-constrained environments.

Highlights

  • Memory-Aware num_proc Limiting: Implemented a new mechanism in standardize_data_formats() to dynamically limit the number of processes (num_proc) based on available system memory, preventing Out-Of-Memory (OOM) crashes.
  • OOM Crash Prevention: Specifically addresses and resolves issues like 'malloc of size 64 failed' and 'One of the subprocesses has abruptly died during map operation' that occurred when processing large datasets without sufficient memory consideration.
  • User Override: The automatic memory-aware limiting can be explicitly overridden by users who set the num_proc parameter manually, providing flexibility for advanced configurations.
  • Consistent Pattern: The new limiting logic follows a similar pattern already established in the train_on_responses_only function, ensuring consistency across the codebase.


@gemini-code-assist bot left a comment

Code Review

This pull request introduces a memory-aware limit for num_proc in standardize_data_formats to prevent out-of-memory errors, which is a great improvement for handling large datasets. My review includes a critical fix for a missing import that would cause a runtime error, along with suggestions to improve error handling and adhere to Python best practices for type checking.

Comment on lines +487 to +498
if num_proc is None or type(num_proc) is not int:
    # Use memory-aware default to avoid OOM crashes
    num_proc = min(max(psutil.cpu_count()+4, 2), 64)
    try:
        memory = psutil.virtual_memory()
        memory_gb_left = memory.available / (1024 ** 3)
        if memory_gb_left < 10:
            num_proc = 1  # Too risky with low memory
        else:
            num_proc = min(num_proc, int(memory_gb_left))
    except:
        pass
critical

This block has a few areas for improvement:

  • Missing import (Critical): The psutil module is used without being imported, which will cause a NameError at runtime. The import should be added before psutil is used.
  • Broad exception clause (High): The bare except: is too broad and can hide important errors by catching system-level exceptions like KeyboardInterrupt. It's safer to catch a more specific exception, like Exception.
  • Type checking (Medium): Using isinstance() is the recommended, more robust way to check types in Python, as it correctly handles subclasses.

I've provided a single suggestion below that addresses all these points.

Suggested change
-if num_proc is None or type(num_proc) is not int:
-    # Use memory-aware default to avoid OOM crashes
-    num_proc = min(max(psutil.cpu_count()+4, 2), 64)
-    try:
-        memory = psutil.virtual_memory()
-        memory_gb_left = memory.available / (1024 ** 3)
-        if memory_gb_left < 10:
-            num_proc = 1  # Too risky with low memory
-        else:
-            num_proc = min(num_proc, int(memory_gb_left))
-    except:
-        pass
+if num_proc is None or not isinstance(num_proc, int):
+    import psutil
+    # Use memory-aware default to avoid OOM crashes
+    num_proc = min(max(psutil.cpu_count()+4, 2), 64)
+    try:
+        memory = psutil.virtual_memory()
+        memory_gb_left = memory.available / (1024 ** 3)
+        if memory_gb_left < 10:
+            num_proc = 1  # Too risky with low memory
+        else:
+            num_proc = min(num_proc, int(memory_gb_left))
+    except Exception:
+        pass

@chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.


Comment on lines +487 to +490
if num_proc is None or type(num_proc) is not int:
    # Use memory-aware default to avoid OOM crashes
    num_proc = min(max(psutil.cpu_count()+4, 2), 64)
    try:


P1: Import psutil before using it in num_proc default

When num_proc is left as None (the default) and the dataset is not an IterableDataset, this new path calls psutil.cpu_count() and psutil.virtual_memory() without importing psutil in standardize_data_formats(). Unlike the earlier helper, there is no local import psutil in this function and no module-level import, so this will raise NameError: name 'psutil' is not defined and crash the mapping call on the default path.


Comment on lines +487 to +490
if num_proc is None or type(num_proc) is not int:
    # Use memory-aware default to avoid OOM crashes
    num_proc = min(max(psutil.cpu_count()+4, 2), 64)
    try:
Collaborator

Yes, you'll need to explicitly import psutil, and we can use isinstance instead of type(num_proc). It's also possible that psutil.cpu_count() returns None, so this could error.

    try:
        memory = psutil.virtual_memory()
        memory_gb_left = memory.available / (1024 ** 3)
        if memory_gb_left < 10:
Collaborator

I'm not sure if 10 is the correct threshold either. It would need to be tested, and I think on a T4 it would likely force just 1 process.
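
Folding the review feedback into one place — a local psutil import, an isinstance check, a guard for cpu_count() returning None, and a narrower except clause — and making the threshold tunable while the 10 GB cutoff is still under discussion, might look like the following sketch (hypothetical helper, not the actual patch):

```python
import os

def memory_aware_num_proc(num_proc=None, min_free_gb=10):
    """Sketch of the reviewed block with the feedback applied (assumptions noted)."""
    if isinstance(num_proc, int):
        return num_proc                    # explicit user override wins
    cpus = os.cpu_count() or 1             # cpu_count() can return None
    workers = min(max(cpus + 4, 2), 64)    # CPU-based default, capped at 64
    try:
        import psutil                      # keep the optional dependency local
        free_gb = psutil.virtual_memory().available / (1024 ** 3)
    except Exception:                      # narrower than a bare except
        return workers                     # memory unknown: keep the CPU cap
    if free_gb < min_free_gb:
        return 1                           # low-memory fallback
    return min(workers, int(free_gb))      # roughly one worker per free GB
```

The min_free_gb knob would make it straightforward to experiment with thresholds other than 10 GB, e.g. on a T4 instance.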

Development

Successfully merging this pull request may close these issues.

Memory Exhaustion and Multiprocessing Crash in standardize_data_formats() When RAM is Nearly Full
