Add memory-aware num_proc limiting in standardize_data_formats #418

Open
yurekami wants to merge 1 commit into unslothai:main from yurekami:fix/memory-aware-num-proc

Conversation

@yurekami
Contributor

Summary

  • Adds memory-aware limiting for num_proc in standardize_data_formats() to prevent OOM crashes
  • Uses the same pattern already employed in train_on_responses_only (lines 333-340)
  • Allows users to explicitly set num_proc to override the automatic limiting

Problem

When processing large datasets like mlabonne/FineTome-100k, the standardize_data_formats() function would spawn workers equal to CPU count without considering available RAM. This caused:

  • "malloc of size 64 failed" errors from PyArrow
  • "One of the subprocesses has abruptly died during map operation" crashes
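
A back-of-envelope calculation shows how quickly this adds up. All numbers below are illustrative assumptions, not measurements from the issue:

```python
# Rough memory arithmetic for the failure mode above (illustrative numbers only).
workers = 16                  # cpu_count() on a hypothetical 16-core machine
overhead_per_worker_gb = 0.5  # assumed interpreter + PyArrow buffers per subprocess
dataset_gb = 2.0              # assumed in-memory Arrow table for a large dataset

peak_gb = dataset_gb + workers * overhead_per_worker_gb
print(f"rough peak usage: {peak_gb} GB")
```

Even with these modest assumptions, spawning one worker per core pushes peak usage near the total RAM of a small VM, which is consistent with the malloc failures reported above.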

Solution

# Before
if num_proc is None or type(num_proc) is not int: 
    num_proc = cpu_count()

# After - memory-aware limiting
if num_proc is None or type(num_proc) is not int:
    num_proc = min(max(psutil.cpu_count()+4, 2), 64)
    try:
        memory = psutil.virtual_memory()
        memory_gb_left = memory.available / (1024 ** 3)
        if memory_gb_left < 10:
            num_proc = 1  # Too risky with low memory
        else:
            num_proc = min(num_proc, int(memory_gb_left))
    except:
        pass
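
Pulled out of context, the decision logic above can be sketched as a pure helper (hypothetical name `pick_num_proc`; the CPU count and available memory are injected so the thresholds are easy to check, and `isinstance` is used in place of the `type(...)` check):

```python
def pick_num_proc(num_proc, cpus, available_gb):
    """Mirror of the fallback logic above with inputs injected (sketch only)."""
    if isinstance(num_proc, int):
        return num_proc                     # explicit user override wins
    workers = min(max(cpus + 4, 2), 64)     # CPU-based default, capped at 64
    if available_gb < 10:
        return 1                            # too risky with low memory
    return min(workers, int(available_gb))  # roughly one worker per free GB

print(pick_num_proc(None, 8, 32.0))  # plenty of memory: CPU-based cap applies
print(pick_num_proc(None, 8, 5.0))   # low memory: falls back to 1 worker
print(pick_num_proc(4, 8, 5.0))      # explicit override is respected
```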

Fixes #87

🤖 Generated with Claude Code

The standardize_data_formats function was using cpu_count() without any
memory-based limiting, which could cause OOM crashes when processing large
datasets on systems with limited RAM.

This fix applies the same memory-aware pattern already used elsewhere in
the codebase (e.g., train_on_responses_only at lines 333-340):
- Caps num_proc at 64 to avoid excessive process spawning
- Checks available memory and reduces workers accordingly
- Falls back to 1 worker if available memory < 10GB

Fixes unslothai#87

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@gemini-code-assist
Contributor

Summary of Changes

Hello @yurekami, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request adds memory-aware process limiting to the standardize_data_formats function. Previously, the function could spawn an uncontrolled number of worker processes when handling large datasets, leading to instability and crashes. The new implementation checks available system memory and adjusts the num_proc parameter accordingly, reducing the risk of out-of-memory errors, especially in resource-constrained environments.

Highlights

  • Memory-Aware num_proc Limiting: Implemented a new mechanism in standardize_data_formats() to dynamically limit the number of processes (num_proc) based on available system memory, preventing Out-Of-Memory (OOM) crashes.
  • OOM Crash Prevention: Specifically addresses and resolves issues like 'malloc of size 64 failed' and 'One of the subprocesses has abruptly died during map operation' that occurred when processing large datasets without sufficient memory consideration.
  • User Override: The automatic memory-aware limiting can be explicitly overridden by users who set the num_proc parameter manually, providing flexibility for advanced configurations.
  • Consistent Pattern: The new limiting logic follows a similar pattern already established in the train_on_responses_only function, ensuring consistency across the codebase.


@gemini-code-assist bot left a comment

Code Review

This pull request introduces a memory-aware limit for num_proc in standardize_data_formats to prevent out-of-memory errors, which is a great improvement for handling large datasets. My review includes a critical fix for a missing import that would cause a runtime error, along with suggestions to improve error handling and adhere to Python best practices for type checking.

Comment on lines +487 to +498
if num_proc is None or type(num_proc) is not int:
    # Use memory-aware default to avoid OOM crashes
    num_proc = min(max(psutil.cpu_count()+4, 2), 64)
    try:
        memory = psutil.virtual_memory()
        memory_gb_left = memory.available / (1024 ** 3)
        if memory_gb_left < 10:
            num_proc = 1  # Too risky with low memory
        else:
            num_proc = min(num_proc, int(memory_gb_left))
    except:
        pass
critical

This block has a few areas for improvement:

  • Missing import (Critical): The psutil module is used without being imported, which will cause a NameError at runtime. The import should be added before psutil is used.
  • Broad exception clause (High): The bare except: is too broad and can hide important errors by catching system-level exceptions like KeyboardInterrupt. It's safer to catch a more specific exception, like Exception.
  • Type checking (Medium): Using isinstance() is the recommended, more robust way to check types in Python, as it correctly handles subclasses.

I've provided a single suggestion below that addresses all these points.

Suggested change
-if num_proc is None or type(num_proc) is not int:
-    # Use memory-aware default to avoid OOM crashes
-    num_proc = min(max(psutil.cpu_count()+4, 2), 64)
-    try:
-        memory = psutil.virtual_memory()
-        memory_gb_left = memory.available / (1024 ** 3)
-        if memory_gb_left < 10:
-            num_proc = 1  # Too risky with low memory
-        else:
-            num_proc = min(num_proc, int(memory_gb_left))
-    except:
-        pass
+if num_proc is None or not isinstance(num_proc, int):
+    import psutil
+    # Use memory-aware default to avoid OOM crashes
+    num_proc = min(max(psutil.cpu_count()+4, 2), 64)
+    try:
+        memory = psutil.virtual_memory()
+        memory_gb_left = memory.available / (1024 ** 3)
+        if memory_gb_left < 10:
+            num_proc = 1  # Too risky with low memory
+        else:
+            num_proc = min(num_proc, int(memory_gb_left))
+    except Exception:
+        pass

@chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.


Comment on lines +487 to +490
if num_proc is None or type(num_proc) is not int:
    # Use memory-aware default to avoid OOM crashes
    num_proc = min(max(psutil.cpu_count()+4, 2), 64)
    try:


P1: Import psutil before using it in num_proc default

When num_proc is left as None (the default) and the dataset is not an IterableDataset, this new path calls psutil.cpu_count() and psutil.virtual_memory() without importing psutil in standardize_data_formats(). Unlike the earlier helper, there is no local import psutil in this function and no module-level import, so this will raise NameError: name 'psutil' is not defined and crash the mapping call on the default path.


Comment on lines +487 to +490
if num_proc is None or type(num_proc) is not int:
    # Use memory-aware default to avoid OOM crashes
    num_proc = min(max(psutil.cpu_count()+4, 2), 64)
    try:
Collaborator

Yes, you'll need to explicitly import psutil, and we can use isinstance instead of type(num_proc). It's also possible that psutil.cpu_count() returns None, so this could error.

    try:
        memory = psutil.virtual_memory()
        memory_gb_left = memory.available / (1024 ** 3)
        if memory_gb_left < 10:
Collaborator

I'm not sure if 10 is the correct threshold either. It would need to be tested, and I think on a T4 it would likely force just 1 process.
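
Folding the review feedback into one place — a local psutil import, an isinstance check, a guard for cpu_count() returning None, and a narrower except clause — and making the threshold tunable while the 10 GB cutoff is still under discussion, might look like the following sketch (hypothetical helper, not the actual patch):

```python
import os

def memory_aware_num_proc(num_proc=None, min_free_gb=10):
    """Sketch of the reviewed block with the feedback applied (assumptions noted)."""
    if isinstance(num_proc, int):
        return num_proc                    # explicit user override wins
    cpus = os.cpu_count() or 1             # cpu_count() can return None
    workers = min(max(cpus + 4, 2), 64)    # CPU-based default, capped at 64
    try:
        import psutil                      # keep the optional dependency local
        free_gb = psutil.virtual_memory().available / (1024 ** 3)
    except Exception:                      # narrower than a bare except
        return workers                     # memory unknown: keep the CPU cap
    if free_gb < min_free_gb:
        return 1                           # low-memory fallback
    return min(workers, int(free_gb))      # roughly one worker per free GB
```

The min_free_gb knob would make it straightforward to experiment with thresholds other than 10 GB, e.g. on a T4 instance.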

Development

Successfully merging this pull request may close these issues.

Memory Exhaustion and Multiprocessing Crash in standardize_data_formats() When RAM is Nearly Full
