fix: return gradients on original device for multi-GPU training #423

Open

Vinayyyy7 wants to merge 1 commit into unslothai:main from Vinayyyy7:fix/multi-gpu-gradient-device-mismatch

Conversation

@Vinayyyy7 Vinayyyy7 commented Dec 31, 2025

This PR fixes a gradient device mismatch error in UnslothFusedLoss that occurs during multi-GPU training.

When the model is split across GPUs via device_map="auto" or "balanced", hidden_states might be on cuda:1 while lm_head_weight is on cuda:0, for example. The forward pass moves hidden_states to the lm_head device for computation, but the backward pass was returning gradients on that same device instead of on the original hidden_states device. This caused PyTorch to raise "RuntimeError: expected device cuda:1 but got cuda:0".

This fix tracks the original device of hidden_states before any tensor movement, saves it to the autograd context, and in backward() moves grad_inputs back to the original device before returning.
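
For reference, here is a minimal sketch of that pattern, not the actual UnslothFusedLoss code; the class name, tensor shapes, and loss are simplified illustrations. It records the device hidden_states arrived on, runs the loss on the lm_head device, and returns the input gradient on the recorded device from backward():

import torch

# Illustrative sketch only -- not the real UnslothFusedLoss implementation.
# hidden_states is [N, hidden_dim], lm_head_weight is [vocab, hidden_dim], labels is [N].
class FusedLossDeviceSketch(torch.autograd.Function):
    @staticmethod
    def forward(ctx, hidden_states, lm_head_weight, labels):
        original_device = hidden_states.device          # e.g. cuda:1, recorded before any move
        compute_device = lm_head_weight.device          # e.g. cuda:0

        # Align inputs with the lm_head weight before the fused computation.
        hidden_states = hidden_states.to(compute_device)
        labels = labels.to(compute_device)

        loss = torch.nn.functional.cross_entropy(hidden_states @ lm_head_weight.t(), labels)

        ctx.save_for_backward(hidden_states, lm_head_weight, labels)
        ctx.original_device = original_device           # saved on the autograd context
        return loss

    @staticmethod
    def backward(ctx, grad_output):
        hidden_states, lm_head_weight, labels = ctx.saved_tensors
        # Recompute the loss under autograd to obtain the input and weight gradients.
        with torch.enable_grad():
            hs = hidden_states.detach().requires_grad_(True)
            w = lm_head_weight.detach().requires_grad_(True)
            loss = torch.nn.functional.cross_entropy(hs @ w.t(), labels)
        grad_hidden, grad_weight = torch.autograd.grad(loss, (hs, w), grad_output)

        # The fix: hand the input gradient back on the device hidden_states originally
        # came from; otherwise autograd raises
        # "RuntimeError: expected device cuda:1 but got cuda:0".
        return grad_hidden.to(ctx.original_device), grad_weight, None

The key line is the final .to(ctx.original_device): the gradient for each forward input must come back on the device that input was on when forward() was called.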

Note: This PR works together with a corresponding fix in unsloth that adds distributed-training detection for vision models. See the linked PR below for details; both PRs are required for this fix to work.

unslothai/unsloth#3809

Tested on Kaggle with 2x T4 GPUs using Qwen2.5-VL-2B.

@gemini-code-assist
Contributor

Summary of Changes

Hello @Vinayyyy7, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a critical gradient device mismatch error encountered during multi-GPU training with UnslothFusedLoss. By carefully managing tensor device placement in the forward pass and ensuring gradients are returned to their originating device in the backward pass, this change enables robust and error-free distributed training setups, particularly when models are split across multiple GPUs.

Highlights

  • Multi-GPU Gradient Device Mismatch Fix: Resolved a RuntimeError in UnslothFusedLoss that occurred during multi-GPU training when device_map was set to 'auto' or 'balanced'. This error stemmed from gradients being returned on the wrong device during the backward pass.
  • Tensor Device Alignment in Forward Pass: Ensured that hidden_states and labels are explicitly moved to the same device as lm_head_weight before computation in the forward pass, preventing torch.func.grad_and_value failures due to device mismatches (illustrated in the sketch after this list).
  • Preserving Original Device Context: The original device of hidden_states is now tracked and saved within the autograd context during the forward pass.
  • Correct Gradient Device Assignment in Backward Pass: In the backward pass, grad_inputs are now moved back to their original hidden_states device before being returned, rectifying the device mismatch error.
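
The following sketch illustrates the scenario behind the first two highlights; it assumes two visible GPUs, and the tensor shapes and sizes are invented for illustration. Without device alignment the matmul inside the loss fails under torch.func.grad_and_value; with the inputs moved to the lm_head device the computation succeeds, but the gradient then lives on cuda:0 and still has to be moved back before backward() returns it:

import torch
from torch.func import grad_and_value

# Hypothetical reproduction of the mismatch; shapes are arbitrary.
def loss_fn(hidden_states, lm_head_weight, labels):
    logits = hidden_states @ lm_head_weight.t()
    return torch.nn.functional.cross_entropy(logits, labels)

hidden_states  = torch.randn(4, 64, device="cuda:1")     # produced by layers placed on cuda:1
lm_head_weight = torch.randn(128, 64, device="cuda:0")   # lm_head placed on cuda:0
labels         = torch.randint(0, 128, (4,), device="cuda:0")

try:
    grad_and_value(loss_fn)(hidden_states, lm_head_weight, labels)
except RuntimeError as e:
    print("without aligning devices:", e)                # device mismatch inside the matmul

# Aligning inputs to the lm_head device (as the forward pass now does) lets the fused
# computation run, but the returned gradient lives on cuda:0 and must be moved back
# to cuda:1 before being handed to the layer that produced hidden_states.
grad, loss = grad_and_value(loss_fn)(hidden_states.to("cuda:0"), lm_head_weight, labels)
print(grad.device)                                        # cuda:0 -> .to("cuda:1") before returning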

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request effectively resolves a critical multi-GPU training issue by ensuring that gradients are returned to their original device. The changes correctly identify the need to synchronize tensor devices during the forward pass and, crucially, to restore the grad_inputs to their initial device during the backward pass. This prevents RuntimeError due to device mismatches in scenarios involving device_map="auto/balanced". The implementation is clear, well-commented, and directly addresses the problem described.

alien087 commented Feb 16, 2026

File "/home/pitai/.conda/envs/ft-io/lib/python3.11/site-packages/unsloth/models/rl_replacements.py", line 33, in
from unsloth_zoo.device_type import device_synchronize
ImportError: cannot import name 'device_synchronize' from 'unsloth_zoo.device_type

Hi, can you help me? I got this error when importing UnslothVisionDataCollator:
from unsloth.trainer import UnslothVisionDataCollator

@Vinayyyy7
Author

I've updated the branch with all the changes unsloth has made upstream over the past two months, so it should work now.

Try:

pip install git+https://github.com/Vinayyyy7/unsloth-zoo.git@fix/multi-gpu-gradient-device-mismatch
pip install git+https://github.com/Vinayyyy7/unsloth.git@fix/multi-gpu-vision-model-training

Or simply clone both repos, build the .whl files, and install them that way.

Vinayyyy7 force-pushed the fix/multi-gpu-gradient-device-mismatch branch from 97490bf to d9c1b3e on February 18, 2026 at 12:59.