fix: return gradients on original device for multi-GPU training #423

Open

Vinayyyy7 wants to merge 1 commit into unslothai:main from Vinayyyy7:fix/multi-gpu-gradient-device-mismatch

Conversation

@Vinayyyy7 Vinayyyy7 commented Dec 31, 2025

This PR fixes a gradient device mismatch error in UnslothFusedLoss that occurs during multi-GPU training.

When the model is split across GPUs via device_map="auto" or "balanced", hidden_states might be on cuda:1 while lm_head_weight is on cuda:0, for example. The forward pass moves hidden_states to the lm_head device for computation, but the backward pass was returning gradients on that same device instead of on the original hidden_states device. This caused PyTorch to raise "RuntimeError: expected device cuda:1 but got cuda:0".

This fix tracks the original device of hidden_states before any tensor movement, saves it to the autograd context, and in backward() moves grad_inputs back to the original device before returning.
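
For reference, here is a minimal sketch of that pattern, not the actual UnslothFusedLoss code; the class name, tensor shapes, and loss are simplified illustrations. It records the device hidden_states arrived on, runs the loss on the lm_head device, and returns the input gradient on the recorded device from backward():

import torch

# Illustrative sketch only -- not the real UnslothFusedLoss implementation.
# hidden_states is [N, hidden_dim], lm_head_weight is [vocab, hidden_dim], labels is [N].
class FusedLossDeviceSketch(torch.autograd.Function):
    @staticmethod
    def forward(ctx, hidden_states, lm_head_weight, labels):
        original_device = hidden_states.device          # e.g. cuda:1, recorded before any move
        compute_device = lm_head_weight.device          # e.g. cuda:0

        # Align inputs with the lm_head weight before the fused computation.
        hidden_states = hidden_states.to(compute_device)
        labels = labels.to(compute_device)

        loss = torch.nn.functional.cross_entropy(hidden_states @ lm_head_weight.t(), labels)

        ctx.save_for_backward(hidden_states, lm_head_weight, labels)
        ctx.original_device = original_device           # saved on the autograd context
        return loss

    @staticmethod
    def backward(ctx, grad_output):
        hidden_states, lm_head_weight, labels = ctx.saved_tensors
        # Recompute the loss under autograd to obtain the input and weight gradients.
        with torch.enable_grad():
            hs = hidden_states.detach().requires_grad_(True)
            w = lm_head_weight.detach().requires_grad_(True)
            loss = torch.nn.functional.cross_entropy(hs @ w.t(), labels)
        grad_hidden, grad_weight = torch.autograd.grad(loss, (hs, w), grad_output)

        # The fix: hand the input gradient back on the device hidden_states originally
        # came from; otherwise autograd raises
        # "RuntimeError: expected device cuda:1 but got cuda:0".
        return grad_hidden.to(ctx.original_device), grad_weight, None

The key line is the final .to(ctx.original_device): the gradient for each forward input must come back on the device that input was on when forward() was called.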

Note: This PR works together with a corresponding fix in unsloth that adds distributed-training detection for vision models. See the linked PR below for details; both PRs are required for this fix to work.

unslothai/unsloth#3809

Tested on Kaggle with 2x T4 GPUs using Qwen2.5-VL-2B.

@gemini-code-assist
Contributor

Summary of Changes

Hello @Vinayyyy7, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a critical gradient device mismatch error encountered during multi-GPU training with UnslothFusedLoss. By carefully managing tensor device placement in the forward pass and ensuring gradients are returned to their originating device in the backward pass, this change enables robust and error-free distributed training setups, particularly when models are split across multiple GPUs.

Highlights

  • Multi-GPU Gradient Device Mismatch Fix: Resolved a RuntimeError in UnslothFusedLoss that occurred during multi-GPU training when device_map was set to 'auto' or 'balanced'. This error stemmed from gradients being returned on the wrong device during the backward pass.
  • Tensor Device Alignment in Forward Pass: Ensured that hidden_states and labels are explicitly moved to the same device as lm_head_weight before computation in the forward pass, preventing torch.func.grad_and_value failures due to device mismatches (illustrated in the sketch after this list).
  • Preserving Original Device Context: The original device of hidden_states is now tracked and saved within the autograd context during the forward pass.
  • Correct Gradient Device Assignment in Backward Pass: In the backward pass, grad_inputs are now moved back to their original hidden_states device before being returned, rectifying the device mismatch error.
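
The following sketch illustrates the scenario behind the first two highlights; it assumes two visible GPUs, and the tensor shapes and sizes are invented for illustration. Without device alignment the matmul inside the loss fails under torch.func.grad_and_value; with the inputs moved to the lm_head device the computation succeeds, but the gradient then lives on cuda:0 and still has to be moved back before backward() returns it:

import torch
from torch.func import grad_and_value

# Hypothetical reproduction of the mismatch; shapes are arbitrary.
def loss_fn(hidden_states, lm_head_weight, labels):
    logits = hidden_states @ lm_head_weight.t()
    return torch.nn.functional.cross_entropy(logits, labels)

hidden_states  = torch.randn(4, 64, device="cuda:1")     # produced by layers placed on cuda:1
lm_head_weight = torch.randn(128, 64, device="cuda:0")   # lm_head placed on cuda:0
labels         = torch.randint(0, 128, (4,), device="cuda:0")

try:
    grad_and_value(loss_fn)(hidden_states, lm_head_weight, labels)
except RuntimeError as e:
    print("without aligning devices:", e)                # device mismatch inside the matmul

# Aligning inputs to the lm_head device (as the forward pass now does) lets the fused
# computation run, but the returned gradient lives on cuda:0 and must be moved back
# to cuda:1 before being handed to the layer that produced hidden_states.
grad, loss = grad_and_value(loss_fn)(hidden_states.to("cuda:0"), lm_head_weight, labels)
print(grad.device)                                        # cuda:0 -> .to("cuda:1") before returning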

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request effectively resolves a critical multi-GPU training issue by ensuring that gradients are returned to their original device. The changes correctly identify the need to synchronize tensor devices during the forward pass and, crucially, to restore the grad_inputs to their initial device during the backward pass. This prevents RuntimeError due to device mismatches in scenarios involving device_map="auto/balanced". The implementation is clear, well-commented, and directly addresses the problem described.

alien087 commented Feb 16, 2026

File "/home/pitai/.conda/envs/ft-io/lib/python3.11/site-packages/unsloth/models/rl_replacements.py", line 33, in
from unsloth_zoo.device_type import device_synchronize
ImportError: cannot import name 'device_synchronize' from 'unsloth_zoo.device_type

Hi, can you help me? I got this error when importing UnslothVisionDataCollator:
from unsloth.trainer import UnslothVisionDataCollator

@Vinayyyy7
Author

I've updated the branch with all the changes unsloth has made upstream over the past two months, so it should work now.

Try:

pip install git+https://github.com/Vinayyyy7/unsloth-zoo.git@fix/multi-gpu-gradient-device-mismatch
pip install git+https://github.com/Vinayyyy7/unsloth.git@fix/multi-gpu-vision-model-training

Or simply clone both repos, build the .whl files, and install them that way.

Vinayyyy7 force-pushed the fix/multi-gpu-gradient-device-mismatch branch from 97490bf to d9c1b3e on February 18, 2026 at 12:59.