Debug: num_items_in_batch on a different device from loss. #147

Open · wants to merge 1 commit into main

Conversation

@KimRass commented on May 23, 2025

The code I committed will resolve errors such as:

Traceback (most recent call last):
    File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
        return _run_code(code, main_globals, None,
    File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
        exec(code, run_globals)
    File "/home/eric/workspace/Qwen-SFT/sft_unsloth.py", line 121, in <module>
        trainer_stats = trainer.train()
    File "/home/eric/workspace/venv/vitlp/lib/python3.10/site-packages/transformers/trainer.py", line 2245, in train
        return inner_training_loop(
    File "<string>", line 315, in _fast_inner_training_loop
    File "<string>", line 31, in _unsloth_training_step
    File "/home/eric/workspace/Qwen-SFT/unsloth_compiled_cache/UnslothSFTTrainer.py", line 748, in compute_loss
        outputs = super().compute_loss(
    File "/home/eric/workspace/venv/vitlp/lib/python3.10/site-packages/unsloth/models/_utils.py", line 1043, in _unsloth_pre_compute_loss
        outputs = self._old_compute_loss(model, inputs, *args, **kwargs)
    File "/home/eric/workspace/venv/vitlp/lib/python3.10/site-packages/transformers/trainer.py", line 3801, in compute_loss
        outputs = model(**inputs)
    File "/home/eric/workspace/venv/vitlp/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
        return self._call_impl(*args, **kwargs)
    File "/home/eric/workspace/venv/vitlp/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
        return forward_call(*args, **kwargs)
    File "/home/eric/workspace/venv/vitlp/lib/python3.10/site-packages/accelerate/utils/operations.py", line 818, in forward
        return model_forward(*args, **kwargs)
    File "/home/eric/workspace/venv/vitlp/lib/python3.10/site-packages/accelerate/utils/operations.py", line 806, in __call__
        return convert_to_fp32(self.model_forward(*args, **kwargs))
    File "/home/eric/workspace/venv/vitlp/lib/python3.10/site-packages/torch/amp/autocast_mode.py", line 44, in decorate_autocast
        return func(*args, **kwargs)
    File "/home/eric/workspace/venv/vitlp/lib/python3.10/site-packages/peft/peft_model.py", line 1757, in forward
        return self.base_model(
    File "/home/eric/workspace/venv/vitlp/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
        return self._call_impl(*args, **kwargs)
    File "/home/eric/workspace/venv/vitlp/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1845, in _call_impl
        return inner()
    File "/home/eric/workspace/venv/vitlp/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1793, in inner
        result = forward_call(*args, **kwargs)
    File "/home/eric/workspace/venv/vitlp/lib/python3.10/site-packages/peft/tuners/tuners_utils.py", line 193, in forward
        return self.model.forward(*args, **kwargs)
    File "/home/eric/workspace/venv/vitlp/lib/python3.10/site-packages/accelerate/hooks.py", line 175, in new_forward
        output = module._old_forward(*args, **kwargs)
    File "/home/eric/workspace/Qwen-SFT/unsloth_compiled_cache/unsloth_compiled_module_qwen2_5_vl.py", line 1365, in forward
        return Qwen2_5_VLForConditionalGeneration_forward(self, input_ids, attention_mask, position_ids, past_key_values, inputs_embeds, labels, use_cache, output_attentions, output_hidden_states, return_dict, pixel_values, pixel_values_videos, image_grid_thw, video_grid_thw, rope_deltas, cache_position, second_per_grid_ts, **loss_kwargs)
    File "/home/eric/workspace/Qwen-SFT/unsloth_compiled_cache/unsloth_compiled_module_qwen2_5_vl.py", line 1028, in Qwen2_5_VLForConditionalGeneration_forward
        loss = fused_linear_cross_entropy(
    File "/home/eric/workspace/venv/vitlp/lib/python3.10/site-packages/unsloth_zoo/loss_utils.py", line 188, in fused_linear_cross_entropy
        if num_items_in_batch is not None: loss = loss / num_items_in_batch
    RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0!
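
For context, the failing line in unsloth_zoo/loss_utils.py divides the loss by num_items_in_batch, which can land on a different GPU than the loss (cuda:0 vs. cuda:1 in the traceback) when the model is sharded across devices. A minimal sketch of the kind of device-alignment guard that resolves this, assuming num_items_in_batch may be passed as a tensor (the helper name _align_num_items is illustrative, not necessarily what the commit uses):

    import torch

    def _align_num_items(loss: torch.Tensor, num_items_in_batch):
        # num_items_in_batch may arrive as a tensor placed on a different GPU
        # (e.g. cuda:0) than the loss (e.g. cuda:1) when the model is split
        # across devices, which triggers the RuntimeError above.
        if num_items_in_batch is None:
            return loss
        if torch.is_tensor(num_items_in_batch):
            # Move the scalar count next to the loss before dividing.
            num_items_in_batch = num_items_in_batch.to(loss.device)
        return loss / num_items_in_batch

Dividing by a plain Python int already works regardless of device, so the .to() call only matters when the trainer passes num_items_in_batch as a tensor.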

@danielhanchen (Contributor)

Thanks @KimRass !

@Erland366 Was this relevant to the multi-GPU issues you were experiencing?

@Erland366 (Collaborator)

I already explained the problem in this comment; kindly check it and continue the discussion there -> #139 (comment)
