fix: multi-GPU training support for vision models #3809
Vinayyyy7 wants to merge 2 commits into unslothai:main
Conversation
Summary of Changes
Hello @Vinayyyy7, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request resolves critical issues preventing successful multi-GPU training for vision models when using automatic or balanced device mapping. By intelligently detecting distributed training environments, the system now defaults to a data-parallel strategy, ensuring each GPU receives a complete model copy. This prevents model splitting across devices, which previously led to gradient calculation errors.
Code Review
This pull request addresses a critical issue with multi-GPU training for vision models by correctly handling device_map in distributed environments. The fix prevents model splitting across GPUs, which was causing runtime errors. The overall approach is sound. However, the implementation introduces code duplication between FastModel.from_pretrained and FastBaseModel.from_pretrained. My review includes suggestions to refactor this by removing the redundant code and centralizing the logic in the base model, which will improve maintainability and efficiency.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Hey @Vinayyyy7, thanks for the contribution. Thoughts @danielhanchen?
Agree that erroring out is cleaner than silently overriding. Users should explicitly know what's happening. Clarifications: splitting the model across GPUs always crashed due to gradient mismatch errors; the companion PR in unsloth-zoo (unslothai/unsloth-zoo#423) handles this by tracking the original device and moving gradients back. But even with that fix, model splitting has other issues with distributed training and quantization. What device_map we can suggest users:
Proposed error message:

raise ValueError(
    f"Unsloth: device_map='{device_map}' is not supported for FastVisionModel in multi-GPU training. "
    f"Model splitting across GPUs causes gradient device mismatch errors. "
    f"For multi-GPU training, remove device_map or set device_map=None to use data-parallel mode "
    f"where each GPU loads a full copy of the model. "
    f"Note: This uses more VRAM per GPU but provides equivalent training to single GPU."
)

Is this warning and error message approach okay?
Hey @Vinayyyy7
It's just an if condition, since we might warn users that auto/balanced is not supported. Note: right now this is not implemented in the PR; currently it just directly does data parallel if it detects auto/balanced.
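Roughly the shape of that check (a simplified sketch only, not the exact code in the PR; the function name and env-var handling here are approximate):

```python
import os
import torch.distributed as dist

def _resolve_device_map(device_map):
    # Simplified sketch: detect a torchrun/accelerate-style distributed run
    # and avoid splitting the vision model across GPUs.
    is_distributed = (
        int(os.environ.get("WORLD_SIZE", "1")) > 1
        or (dist.is_available() and dist.is_initialized())
    )
    if is_distributed and device_map in ("auto", "balanced", "balanced_low_0"):
        # Current PR behavior: silently fall back to data parallel by loading
        # the full model on this rank's GPU. The alternative discussed above
        # is to raise the proposed ValueError instead.
        local_rank = int(os.environ.get("LOCAL_RANK", "0"))
        return {"": f"cuda:{local_rank}"}
    return device_map
```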
Force-pushed from 3a47492 to 2550f94
SOME UPDATES: as tested by another user, their issues were addressed. Fixes:

AttributeError: 'DistributedDataParallel' ... has no attribute 'config'
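The underlying cause is that DistributedDataParallel hides the real model behind .module, so attribute lookups like model.config fail on the wrapper. A minimal sketch of the unwrapping idea (not the exact patch in rl_replacements.py):

```python
from torch.nn.parallel import DistributedDataParallel

def unwrap_ddp(model):
    # DDP wraps the real model; unwrap before reading attributes like .config.
    # Guarded so it is a no-op for non-DDP models.
    if isinstance(model, DistributedDataParallel):
        return model.module
    return model

# e.g. config = unwrap_ddp(trainer.model).config
```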
/gemini review
Code Review
This pull request introduces fixes to support multi-GPU training for vision models. The change in unsloth/models/rl_replacements.py correctly unwraps the model in a distributed setting, which is a good fix. The main change in unsloth/models/vision.py adds a check to prevent model splitting across GPUs, which is known to cause issues. My primary feedback is on the implementation of this check. Instead of raising an error, which requires manual intervention from the user, I recommend automatically overriding the device_map and issuing a warning. This would be more user-friendly and aligns better with the pull request's description of overriding the setting.
/gemini review
Code Review
This pull request introduces a fix for multi-GPU training with vision models by preventing the use of device_map settings that lead to model splitting, which is unsupported. The changes correctly raise an informative error to guide users. Additionally, it includes patches to unwrap DDP models for GRPO training. My review focuses on ensuring the robustness of these DDP patches. I've identified one potential issue where a model is unwrapped without a necessary safety check, which could lead to runtime errors.
Datta0 left a comment
Sorry for the delay. I somehow forgot about this PR.
I have a similar problem: I want to train Qwen3 VL 30B on 2x L40S. I want to split the model to fit on my 2 GPUs, but the exact error here appears. Does this solve the problem?
Yes, I have personally tested it with Qwen3-VL models and InternVL models and it worked on multi-GPU. Another user reported some issues with GRPO, I remember, so those were also fixed; they did some detailed tests which showed DDP worked out.
So it'll work for model sharding too? Because I want to split the model across 2x L40S since I can't fit it on 1x L40S. Sorry for asking a lot of questions, I just want to make sure before trying it.
If you are trying to do model parallelism, i.e. sharding the model across multiple GPUs (that's what Unsloth does for normal text-generation models, which works nicely), vision models like Qwen etc. don't support it (I might do another PR to support that in the future). So this implementation is made to make it work by using data parallelism instead. As a little trade-off it splits the training examples from the dataset: although it loads the model entirely on both GPUs, training is faster compared to a single GPU. Setting device_map=None enables this. I would recommend having GPUs that can fit the model fully; I know this approach is kinda dumb, but good at the same time.
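Something along these lines is the data-parallel setup I mean (illustrative only; the model name and arguments are just an example):

```python
# Launch with e.g.: torchrun --nproc_per_node=2 train.py
# Each rank loads a full copy of the model on its own GPU (data parallel),
# so device_map is left as None instead of "auto"/"balanced".
from unsloth import FastVisionModel

model, tokenizer = FastVisionModel.from_pretrained(
    "Qwen/Qwen3-VL-2B-Instruct",  # illustrative model; must fit on one GPU
    load_in_4bit = True,
    device_map = None,            # data parallel: no model splitting
)
```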
@Vinayyyy7 I just tried testing this PR with Qwen3 VL GSPO and was still getting an error. Can you show me the script you used to test?
Force-pushed from fec27be to 9e854a9
for more information, see https://pre-commit.ci
Yes, this PR was made a few days before Unsloth released GSPO training support, and since then a lot of changes were made, plus different migration and dependency updates, so the code likely broke. Not even my old training code is working right now, SO I've decided to start fresh and apply the fix again.
Thanks! Should we close this one and start a fresh one?
Sure, a NEW PR would be better.
This PR fixes multi-GPU training for vision models when using device_map="auto" or device_map="balanced".

When running on multiple GPUs, setting device_map="auto"/"balanced" with the FastVisionModel class causes the model to be split across devices. During training, this results in hidden_states being on one GPU (e.g., cuda:1) while lm_head is on another (e.g., cuda:0). The fused cross-entropy loss then computes gradients on the lm_head device, but PyTorch expects them back on the original hidden_states device, causing a RuntimeError.

Errors we see:

OR
This fix adds distributed training detection in FastBaseModel.from_pretrained(). When distributed training is detected and device_map is set to auto/balanced, it raises a ValueError telling the user to use data-parallel mode, where each GPU loads a full copy of the model.

Note: This PR works together with a corresponding fix in unsloth-zoo that handles the gradient device mismatch in the fused CE loss: unslothai/unsloth-zoo#423
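For reference, a minimal sketch of the gradient-device idea that the unsloth-zoo fix addresses (illustration only; this is not the actual fused-CE code, and the function name is made up):

```python
import torch
import torch.nn.functional as F

def ce_loss_with_device_tracking(hidden_states, lm_head, labels):
    # Illustration only: remember which device hidden_states came from,
    # compute the projection + loss on the lm_head device, and rely on the
    # differentiable .to() so gradients flow back to the original device.
    original_device = hidden_states.device        # e.g. cuda:1
    lm_head_device  = lm_head.weight.device       # e.g. cuda:0

    hs = hidden_states if original_device == lm_head_device else hidden_states.to(lm_head_device)
    logits = lm_head(hs)
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        labels.to(lm_head_device).reshape(-1),
        ignore_index = -100,
    )
```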
Tested on Kaggle with 2x T4 GPUs using Qwen/Qwen3-VL-2B-Instruct. Result: IT WORKED