Convert n_items type from torch.Tensor to int #139
Conversation
Branch force-pushed from b87adbf to d8bad22.
Thanks for the PR! Would the previous line
If there's only a single device, yes.

```python
torch_cuda_device = torch.cuda.device

def fused_linear_cross_entropy(
    hidden_states      : torch.Tensor,
    lm_weight          : torch.Tensor,
    labels             : torch.Tensor,
    num_items_in_batch : int = None,
    ignore_index       : int = -100,
    reduction          : str = "mean",
    logit_softcapping  : float = 0,
    accuracy_threshold : str = "auto",
):
    # All Unsloth Zoo code licensed under LGPLv3
    reduction = "sum" if num_items_in_batch is not None else "mean"
    if logit_softcapping == 0: logit_softcapping = None
    with torch_cuda_device(lm_weight.device):
        loss = linear_cross_entropy(
            hidden_states.to(lm_weight.dtype),
            lm_weight,
            targets = labels,
            ignore_index = ignore_index,
            softcap = logit_softcapping,
            reduction = reduction,
            shift = True,
            filter_eps = accuracy_threshold,
        )
    if num_items_in_batch is not None: loss = loss / num_items_in_batch
    return loss
pass
```
I’ve tested multi-GPU training using my fork, and everything seems to be working well. @danielhanchen, do you have any additional comments or suggestions?
@MoonRainy21 Apologies on the delay - your PR is correct, yes, but I'm worried this will make training slower due to CPU->GPU communication. @Erland366 was working on seeing if we can remove this bottleneck.
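For reference, a minimal sketch of the kind of round trip presumably being referred to here (illustrative only, not code from this PR; assumes a CUDA device, and the tensor names and sizes are made up): converting a GPU tensor count to a Python `int` goes through `.item()`, which copies the value to the host and blocks until the work already queued on that stream has finished.

```python
import time
import torch

# Illustrative only: launching kernels is asynchronous, but .item()
# (the tensor -> int conversion) blocks the host until the queue drains.
x = torch.randn(4096, 4096, device="cuda")
count = torch.tensor(1024, device="cuda")   # stand-in for num_items_in_batch

t0 = time.perf_counter()
for _ in range(50):
    y = x @ x                               # kernels are queued, host does not wait
queue_time = time.perf_counter() - t0

t0 = time.perf_counter()
n = count.item()                            # device -> host copy: waits for queued work
item_time = time.perf_counter() - t0

print(f"queueing 50 matmuls: {queue_time:.4f}s, .item(): {item_time:.4f}s")
```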
Then we might have to change the type of all
@MoonRainy21 I'm assuming
You can ignore the

Maybe better to do:

```python
with torch_cuda_device(lm_weight.device):
    loss = linear_cross_entropy(
        hidden_states.to(lm_weight.dtype),
        lm_weight,
        targets = labels,
        ignore_index = ignore_index,
        softcap = logit_softcapping,
        reduction = reduction,
        shift = True,
        filter_eps = accuracy_threshold,
    )
if num_items_in_batch is not None:
    if torch.is_tensor(num_items_in_batch):
        num_items_in_batch = num_items_in_batch.to(loss.device)
    loss = loss / num_items_in_batch
return loss
```
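(If the count stays a tensor, `.to(loss.device)` keeps the division on the device side, so it avoids the host-blocking `.item()` sync that converting to a Python int would require - presumably why this variant is suggested as the cheaper option.)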
I want to discuss a bit about this. I tested the behavior in vanilla HuggingFace and it also has the same issue. Here's my testing in a Kaggle notebook on 2 T4s -> https://www.kaggle.com/code/erlandpg/test-multigpu-bitsandbytes

I tested moving the

cc: @danielhanchen
It seems there's code for the device of
@Erland366 Regarding utilization: GPU utilization of the running GPU was pretty high for me (~80%) when I tried a higher batch size, but only one of the GPUs was running. We might need pipeline parallelism support for better utilization. cc: @danielhanchen
Branch fix/num_batch_items-type force-pushed from 4dbb0da to facf34d.
I have tested the latest commit with 8 GPUs training Qwen3 235B A22B, but I'm not really sure about some code which seems to be a string and is used for building the Unsloth compiled cache.
Communication is really, really slow, especially in my setting where communication goes through PCIe rather than NVLink or InfiniBand. If I only use 1 GPU, I get around 75% utilization or so, but with 2 GPUs it drops down to 20%. I need to investigate what the GPU utilization would be on an NVLink system, which based on this paper should be around 60% -> https://arxiv.org/abs/2505.12832v1

We cannot move forward with using an integer, since moving into
In case you are interested, my setup was 8 A100 SXM (NVLink connection) and utilization was around 50% to 70% when the tensor arrives. I'm trying to use FSDP to get pipeline parallelism in order to see the utilization on multiple GPUs.
Wait @Erland366, currently the code is:

```python
torch_cuda_device = torch.cuda.device

def fused_linear_cross_entropy(
    hidden_states      : torch.Tensor,
    lm_weight          : torch.Tensor,
    labels             : torch.Tensor,
    num_items_in_batch : int = None,
    ignore_index       : int = -100,
    reduction          : str = "mean",
    logit_softcapping  : float = 0,
    accuracy_threshold : str = "auto",
):
    # All Unsloth Zoo code licensed under LGPLv3
    reduction = "sum" if num_items_in_batch is not None else "mean"
    if logit_softcapping == 0: logit_softcapping = None
    with torch_cuda_device(lm_weight.device):
        loss = linear_cross_entropy(
            hidden_states.to(lm_weight.dtype),
            lm_weight,
            targets = labels,
            ignore_index = ignore_index,
            softcap = logit_softcapping,
            reduction = reduction,
            shift = True,
            filter_eps = accuracy_threshold,
        )
    if num_items_in_batch is not None: loss = loss / num_items_in_batch
    return loss
pass
```

Are you saying you also managed to test:

```python
torch_cuda_device = torch.cuda.device

def fused_linear_cross_entropy(
    hidden_states      : torch.Tensor,
    lm_weight          : torch.Tensor,
    labels             : torch.Tensor,
    num_items_in_batch : int = None,
    ignore_index       : int = -100,
    reduction          : str = "mean",
    logit_softcapping  : float = 0,
    accuracy_threshold : str = "auto",
):
    # All Unsloth Zoo code licensed under LGPLv3
    reduction = "sum" if num_items_in_batch is not None else "mean"
    if logit_softcapping == 0: logit_softcapping = None
    with torch_cuda_device(lm_weight.device):
        loss = linear_cross_entropy(
            hidden_states.to(lm_weight.dtype),
            lm_weight,
            targets = labels,
            ignore_index = ignore_index,
            softcap = logit_softcapping,
            reduction = reduction,
            shift = True,
            filter_eps = accuracy_threshold,
        )
    if num_items_in_batch is not None:
        if torch.is_tensor(num_items_in_batch):
            num_items_in_batch = num_items_in_batch.to(loss.device)
        loss = loss / num_items_in_batch
    return loss
pass
```
Change the type of `n_items` from a tensor to an integer.

In compiler.py, the type of `n_items` is shown as `int`, but the variable passed into the model was typed `torch.Tensor`. It may not be an issue in most cases, but it can cause device-related issues (such as `n_items` being on `cuda:0` while the `loss` it is operated with is on `cuda:7`).
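A minimal sketch of the failure mode described above (illustrative only, not code from this PR; it needs at least two CUDA devices to reproduce, and the values are made up):

```python
import torch

# Illustrative: a count tensor left on one CUDA device cannot be combined with
# a loss on another device, while a plain Python int (or a tensor moved to
# loss.device) works regardless of which device the loss lives on.
if torch.cuda.device_count() >= 2:
    loss = torch.tensor(3.5, device="cuda:1")   # stand-in for the computed loss
    n_items = torch.tensor(7, device="cuda:0")  # count left on a different device

    try:
        _ = loss / n_items                      # RuntimeError: different CUDA devices
    except RuntimeError as err:
        print("cross-device division failed:", err)

    print(loss / int(n_items))                  # works: plain Python int (this PR)
    print(loss / n_items.to(loss.device))       # works: tensor moved to loss.device
```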