
Convert n_items type from torch.Tensor to int #139


Open
MoonRainy21 wants to merge 3 commits into main from fix/num_batch_items-type

Conversation

MoonRainy21

Change the type of n_items from a tensor to an integer.
In compiler.py, the type of n_items is annotated as int, but the variable passed into the model was actually a torch.Tensor.
This is not a problem in most cases, but it can cause device-related issues (such as n_items sitting on cuda:0 while the loss it is combined with is on cuda:7).
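For context, a minimal sketch of the failure mode described above (illustrative only; the devices and values are hypothetical and assume a multi-GPU machine):

import torch

loss = torch.tensor(2.5, device="cuda:7")               # loss computed on one GPU
num_items_in_batch = torch.tensor(8, device="cuda:0")   # count created on another GPU

# loss / num_items_in_batch would raise:
#   RuntimeError: Expected all tensors to be on the same device ...
# Converting the count to a plain Python int, as this PR does, sidesteps the device question:
loss = loss / int(num_items_in_batch.item())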

@MoonRainy21 MoonRainy21 force-pushed the fix/num_batch_items-type branch 3 times, most recently from b87adbf to d8bad22 on May 15, 2025 08:04
@danielhanchen
Contributor

Thanks for the PR! Would the previous line, if device is not None and torch.is_tensor(num_items_in_batch): num_items_in_batch = num_items_in_batch.to(device), solve that issue?

@MoonRainy21
Author

If there's only a single device, yes.
However, when I run with multiple GPUs (not officially supported, I know), this caused an error, since num_items_in_batch is used at loss_utils.py:182, where loss could be on cuda:7 while num_items_in_batch would be on cuda:0.
Also, the function expects num_items_in_batch to be an int!
Thank you

torch_cuda_device = torch.cuda.device
def fused_linear_cross_entropy(
    hidden_states      : torch.Tensor,
    lm_weight          : torch.Tensor,
    labels             : torch.Tensor,
    num_items_in_batch : int = None,
    ignore_index       : int = -100,
    reduction          : str = "mean",
    logit_softcapping  : float = 0,
    accuracy_threshold : str = "auto",
):
    # All Unsloth Zoo code licensed under LGPLv3
    reduction = "sum" if num_items_in_batch is not None else "mean"
    if logit_softcapping == 0: logit_softcapping = None
    with torch_cuda_device(lm_weight.device):
        loss = linear_cross_entropy(
            hidden_states.to(lm_weight.dtype),
            lm_weight,
            targets      = labels,
            ignore_index = ignore_index,
            softcap      = logit_softcapping,
            reduction    = reduction,
            shift        = True,
            filter_eps   = accuracy_threshold,
        )
    if num_items_in_batch is not None: loss = loss / num_items_in_batch
    return loss
pass
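
As a rough illustration of the caller-side coercion this PR is after (hypothetical helper, not the actual compiler.py code):

import torch

def as_python_int(num_items_in_batch):
    # Turn a 0-dim tensor count into a plain Python int so the fused loss
    # never has to reconcile its device with the loss tensor's device.
    if torch.is_tensor(num_items_in_batch):
        return int(num_items_in_batch.item())
    return num_items_in_batch

Note that .item() copies the value back to the host and blocks until the GPU has produced it; that is the communication cost raised later in this thread.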

@MoonRainy21
Author

I’ve tested multi-GPU training using my fork, and everything seems to be working well. @danielhanchen, do you have any additional comments or suggestions?

@danielhanchen
Contributor

@MoonRainy21 Apologies for the delay - your PR is correct, yes, but I'm worried this will make training slower due to CPU->GPU communication. @Erland366 was working on seeing whether we can remove this bottleneck.

@MoonRainy21
Author

Then we might have to keep num_items_in_batch typed as torch.Tensor everywhere and move it to the device where the loss is calculated, or wherever else it is used. Do you think that would work?

@danielhanchen
Contributor

@MoonRainy21 I'm assuming if num_items_in_batch is not None: loss = loss / num_items_in_batch.to(loss.device) might work, maybe?

@danielhanchen
Contributor

You can ignore the num_items_in_batch : int = None,

Maybe better to do:

    with torch_cuda_device(lm_weight.device):
        loss = linear_cross_entropy(
            hidden_states.to(lm_weight.dtype),
            lm_weight,
            targets      = labels,
            ignore_index = ignore_index,
            softcap      = logit_softcapping,
            reduction    = reduction,
            shift        = True,
            filter_eps   = accuracy_threshold,
        )
    if num_items_in_batch is not None:
        if torch.is_tensor(num_items_in_batch):
            num_items_in_batch = num_items_in_batch.to(loss.device)
        loss = loss / num_items_in_batch
    return loss
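
For what it's worth, a hedged sketch of the two options discussed here side by side (the helper name is illustrative):

import torch

def normalise_loss(loss, num_items_in_batch):
    if num_items_in_batch is None:
        return loss
    if torch.is_tensor(num_items_in_batch):
        # Option A (above): move the count onto the loss's device; the value
        # never has to be read back to the CPU.
        num_items_in_batch = num_items_in_batch.to(loss.device)
        # Option B (this PR): num_items_in_batch = int(num_items_in_batch.item()),
        # which yields a plain int but synchronises with the GPU to read the value.
    return loss / num_items_in_batch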

@Erland366
Collaborator

I want to discuss this a bit.

I tested the behavior in vanilla Hugging Face and it hits the same issue:

Here's my test on a Kaggle notebook with 2 T4s -> https://www.kaggle.com/code/erlandpg/test-multigpu-bitsandbytes

I tried moving num_items_in_batch to the loss device and it works, but GPU utilization is around 20%. I need more testing to tell whether that is a reasonable utilization number, and also whether we want to do this at all, since HF itself does not support it.

cc: @danielhanchen

@MoonRainy21
Author

It seems there's already code handling the device of num_items_in_batch at loss_utils.py:304. I wasn't able to work out when _unsloth_get_batch_samples is called, but it looks like that code would become unnecessary.

@MoonRainy21
Author

@Erland366 Regarding utilization: GPU utilization of the running GPU was pretty high for me (~80%) when I tried a larger batch size, but only one of the GPUs was running at a time. We might need pipeline parallelism support for better utilization.

cc: @danielhanchen

@MoonRainy21 MoonRainy21 force-pushed the fix/num_batch_items-type branch from 4dbb0da to facf34d on May 26, 2025 00:19
@MoonRainy21
Author

I have tested the latest commit with 8 GPUs training Qwen3 235B A22B, but I'm not really sure about some of the code, which seems to be a string template used for building the Unsloth compiled cache.

@Erland366
Collaborator

Erland366 commented May 27, 2025

@Erland366 Regarding utilization: GPU utilization of the running GPU was pretty high for me (~80%) when I tried a larger batch size, but only one of the GPUs was running at a time. We might need pipeline parallelism support for better utilization.

cc: @danielhanchen

Communication is really slow, especially in my setup where it goes over PCIe rather than NVLink or InfiniBand. If I use only 1 GPU I get around 75% or so, but with 2 GPUs it drops down to 20%. I need to investigate what the GPU utilization is on an NVLink system, which based on this paper should be around 60% -> https://arxiv.org/abs/2505.12832v1

We cannot move forward with using an integer, since num_items_in_batch needs to support an all-gather when the number of tokens differs across GPUs. This exists because of the gradient accumulation bug found by the Unsloth team (see this article -> https://muellerzr.github.io/blog/gradient_accumulation_part2.html#problem-distributed-training ; you can see that they call accelerator.gather on num_items_in_batch).

Moving it to loss.device seems like the option here, but I don't know if we're moving forward with that solution (since HF itself did not do that, perhaps for a reason).
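
To make that constraint concrete, a rough stand-in for the accelerator.gather pattern from the linked article (simplified; assumes torch.distributed has been initialised by the trainer):

import torch
import torch.distributed as dist

def total_items_across_ranks(num_items_in_batch: torch.Tensor) -> torch.Tensor:
    # Each rank can see a different number of non-padding tokens, so the
    # per-rank counts are summed over all processes before the loss is normalised.
    if dist.is_available() and dist.is_initialized():
        dist.all_reduce(num_items_in_batch, op=dist.ReduceOp.SUM)
    return num_items_in_batch  # this only works if the count stays a tensor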

@MoonRainy21
Author

In case you are interested, my setup was 8 A100 SXM GPUs (NVLink connection) and utilization was around 50% to 70% once the tensor arrives. I'm trying to use FSDP for pipeline parallelism in order to see the utilization on multiple GPUs.

@danielhanchen
Contributor

Wait @Erland366 currently the code is:

torch_cuda_device = torch.cuda.device
def fused_linear_cross_entropy(
    hidden_states      : torch.Tensor,
    lm_weight          : torch.Tensor,
    labels             : torch.Tensor,
    num_items_in_batch : int = None,
    ignore_index       : int = -100,
    reduction          : str = "mean",
    logit_softcapping  : float = 0,
    accuracy_threshold : str = "auto",
):
    # All Unsloth Zoo code licensed under LGPLv3
    reduction = "sum" if num_items_in_batch is not None else "mean"
    if logit_softcapping == 0: logit_softcapping = None
    with torch_cuda_device(lm_weight.device):
        loss = linear_cross_entropy(
            hidden_states.to(lm_weight.dtype),
            lm_weight,
            targets      = labels,
            ignore_index = ignore_index,
            softcap      = logit_softcapping,
            reduction    = reduction,
            shift        = True,
            filter_eps   = accuracy_threshold,
        )
    if num_items_in_batch is not None: loss = loss / num_items_in_batch
    return loss
pass

Are you saying you also managed to test:

torch_cuda_device = torch.cuda.device
def fused_linear_cross_entropy(
    hidden_states      : torch.Tensor,
    lm_weight          : torch.Tensor,
    labels             : torch.Tensor,
    num_items_in_batch : int = None,
    ignore_index       : int = -100,
    reduction          : str = "mean",
    logit_softcapping  : float = 0,
    accuracy_threshold : str = "auto",
):
    # All Unsloth Zoo code licensed under LGPLv3
    reduction = "sum" if num_items_in_batch is not None else "mean"
    if logit_softcapping == 0: logit_softcapping = None
    with torch_cuda_device(lm_weight.device):
        loss = linear_cross_entropy(
            hidden_states.to(lm_weight.dtype),
            lm_weight,
            targets      = labels,
            ignore_index = ignore_index,
            softcap      = logit_softcapping,
            reduction    = reduction,
            shift        = True,
            filter_eps   = accuracy_threshold,
        )
    if num_items_in_batch is not None:
        if torch.is_tensor(num_items_in_batch):
            num_items_in_batch = num_items_in_batch.to(loss.device)
        loss = loss / num_items_in_batch
    return loss
pass
