Add memory-aware num_proc limiting in standardize_data_formats #418
yurekami wants to merge 1 commit into unslothai:main
Conversation
The standardize_data_formats function was using cpu_count() without any memory-based limiting, which could cause OOM crashes when processing large datasets on systems with limited RAM.

This fix applies the same memory-aware pattern already used elsewhere in the codebase (e.g., train_on_responses_only at lines 333-340):

- Caps num_proc at 64 to avoid excessive process spawning
- Checks available memory and reduces workers accordingly
- Falls back to 1 worker if available memory < 10GB

Fixes unslothai#87

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
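For readers skimming the diff, here is a minimal, self-contained sketch of the pattern the description refers to, assuming psutil is installed. The helper name `memory_aware_num_proc` and its parameters are illustrative only; the PR inlines this logic inside standardize_data_formats rather than adding a new function.

```python
import psutil

def memory_aware_num_proc(requested=None, cap=64, min_free_gb=10):
    # Illustrative sketch of the memory-aware default described in the PR.
    if isinstance(requested, int):
        return requested  # an explicit user value is respected as-is
    cpus = psutil.cpu_count() or 1
    num_proc = min(max(cpus + 4, 2), cap)
    try:
        free_gb = psutil.virtual_memory().available / (1024 ** 3)
        if free_gb < min_free_gb:
            return 1  # too little headroom to fan out safely
        num_proc = min(num_proc, int(free_gb))
    except Exception:
        pass  # keep the CPU-based default if the memory probe fails
    return num_proc
```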
Code Review
This pull request introduces a memory-aware limit for num_proc in standardize_data_formats to prevent out-of-memory errors, which is a great improvement for handling large datasets. My review includes a critical fix for a missing import that would cause a runtime error, along with suggestions to improve error handling and adhere to Python best practices for type checking.
```python
if num_proc is None or type(num_proc) is not int:
    # Use memory-aware default to avoid OOM crashes
    num_proc = min(max(psutil.cpu_count()+4, 2), 64)
    try:
        memory = psutil.virtual_memory()
        memory_gb_left = memory.available / (1024 ** 3)
        if memory_gb_left < 10:
            num_proc = 1 # Too risky with low memory
        else:
            num_proc = min(num_proc, int(memory_gb_left))
    except:
        pass
```
This block has a few areas for improvement:

- Missing import (Critical): The `psutil` module is used without being imported, which will cause a `NameError` at runtime. The import should be added before `psutil` is used.
- Broad exception clause (High): The bare `except:` is too broad and can hide important errors by catching system-level exceptions like `KeyboardInterrupt`. It's safer to catch a more specific exception, like `Exception`.
- Type checking (Medium): Using `isinstance()` is the recommended, more robust way to check types in Python, as it correctly handles subclasses.
I've provided a single suggestion below that addresses all these points.
Suggested change:

```diff
-if num_proc is None or type(num_proc) is not int:
-    # Use memory-aware default to avoid OOM crashes
-    num_proc = min(max(psutil.cpu_count()+4, 2), 64)
-    try:
-        memory = psutil.virtual_memory()
-        memory_gb_left = memory.available / (1024 ** 3)
-        if memory_gb_left < 10:
-            num_proc = 1 # Too risky with low memory
-        else:
-            num_proc = min(num_proc, int(memory_gb_left))
-    except:
-        pass
+if num_proc is None or not isinstance(num_proc, int):
+    import psutil
+    # Use memory-aware default to avoid OOM crashes
+    num_proc = min(max(psutil.cpu_count()+4, 2), 64)
+    try:
+        memory = psutil.virtual_memory()
+        memory_gb_left = memory.available / (1024 ** 3)
+        if memory_gb_left < 10:
+            num_proc = 1 # Too risky with low memory
+        else:
+            num_proc = min(num_proc, int(memory_gb_left))
+    except Exception:
+        pass
```
💡 Codex Review
Here are some automated review suggestions for this pull request.
```python
if num_proc is None or type(num_proc) is not int:
    # Use memory-aware default to avoid OOM crashes
    num_proc = min(max(psutil.cpu_count()+4, 2), 64)
    try:
```
Import psutil before using it in num_proc default
When num_proc is left as None (the default) and the dataset is not an IterableDataset, this new path calls psutil.cpu_count() and psutil.virtual_memory() without importing psutil in standardize_data_formats(). Unlike the earlier helper, there is no local import psutil in this function and no module-level import, so this will raise NameError: name 'psutil' is not defined and crash the mapping call on the default path.
Yes, you'll need to explicitly import psutil, and we can use isinstance instead of type(num_proc). It's also possible that psutil.cpu_count() returns None, so this could error.
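A hedged sketch of one way to guard against the None case mentioned above; the name `_default_num_proc` and the fallback chain are illustrative assumptions, not what the PR currently does.

```python
import os
import psutil

def _default_num_proc(num_proc=None, cap=64):
    # psutil.cpu_count() can return None when the CPU count is
    # undeterminable, so fall back to os.cpu_count() and finally to 1
    # before doing arithmetic on it.
    if num_proc is None or not isinstance(num_proc, int):
        cpus = psutil.cpu_count() or os.cpu_count() or 1
        num_proc = min(max(cpus + 4, 2), cap)
    return num_proc
```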
```python
try:
    memory = psutil.virtual_memory()
    memory_gb_left = memory.available / (1024 ** 3)
    if memory_gb_left < 10:
```
I'm not sure if 10 is the correct threshold either. It would need to be tested, and I think on a T4 instance it would likely force just 1 process.
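As a point of discussion only, here is a sketch of an alternative that budgets RAM per worker instead of using a hard cut-off; the 2 GB-per-worker figure and the function name are assumptions and would need the same testing.

```python
import psutil

def num_proc_from_budget(cap=64, gb_per_worker=2.0):
    # Alternative to a fixed 10 GB threshold: size the worker pool from an
    # assumed per-worker memory budget, so low-RAM machines (e.g. a Colab T4
    # instance) can still use a couple of processes when they can afford it.
    cpus = psutil.cpu_count() or 1
    by_cpu = min(max(cpus + 4, 2), cap)
    try:
        free_gb = psutil.virtual_memory().available / (1024 ** 3)
        by_mem = max(int(free_gb / gb_per_worker), 1)
        return min(by_cpu, by_mem)
    except Exception:
        return by_cpu
```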
Summary
- Adds memory-aware limiting of `num_proc` in `standardize_data_formats()` to prevent OOM crashes
- Applies the same pattern as `train_on_responses_only` (lines 333-340)
- Users can still pass an explicit `num_proc` to override the automatic limiting

Problem

When processing large datasets like `mlabonne/FineTome-100k`, the `standardize_data_formats()` function would spawn workers equal to CPU count without considering available RAM. This caused out-of-memory crashes on systems with limited RAM.

Solution

`num_proc` now defaults to a memory-aware value: it is capped at 64, reduced according to available memory, and falls back to a single worker when less than 10 GB is free. A usage sketch is included below.
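The sketch below assumes standardize_data_formats is imported from unsloth.chat_templates and accepts the num_proc keyword this PR adds; treat the import path and keyword as assumptions rather than documented API.

```python
from datasets import load_dataset
from unsloth.chat_templates import standardize_data_formats

dataset = load_dataset("mlabonne/FineTome-100k", split="train")

# Default path: num_proc is picked automatically with the memory-aware limit.
dataset = standardize_data_formats(dataset)

# Explicit override, assuming the num_proc keyword added in this PR.
# dataset = standardize_data_formats(dataset, num_proc=8)
```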
Fixes #87
🤖 Generated with Claude Code