Description
System Info / 系統信息
CUDA 11.5
Transformers 4.40.2
Python 3.12.2
GPU: single RTX 4090
RAM: 32 GB
OS: Windows, WSL2
Who can help? / 谁可以帮助到您?
Information / 问题信息
- The official example scripts / 官方的示例脚本
- My own modified scripts / 我自己修改的脚本和任务
Reproduction / 复现过程
In lora_finetune.ipynb, run:
!CUDA_VISIBLE_DEVICES=0 NCCL_P2P_DISABLE="1" NCCL_IB_DISABLE="1" python finetune_hf.py data/AdvertiseGen_fix /home/xiebeichen/chatglm3-6b configs/lora.yaml
Output:
Setting eos_token is not supported, use the default one.
Setting pad_token is not supported, use the default one.
Setting unk_token is not supported, use the default one.
Loading checkpoint shards: 100%|██████████████████| 7/7 [00:03<00:00, 2.02it/s]
trainable params: 1,949,696 || all params: 6,245,533,696 || trainable%: 0.031217444255383614
--> Model
--> model has 1.949696M params
train_dataset: Dataset({
features: ['input_ids', 'labels'],
num_rows: 114599
})
val_dataset: Dataset({
features: ['input_ids', 'output_ids'],
num_rows: 1070
})
test_dataset: Dataset({
features: ['input_ids', 'output_ids'],
num_rows: 1070
})
--> Sanity check
'[gMASK]': 64790 -> -100
'sop': 64792 -> -100
'<|user|>': 64795 -> -100
...
Total train batch size (w. parallel, distributed & accumulation) = 4
Gradient Accumulation steps = 1
Total optimization steps = 3,000
Number of trainable parameters = 1,949,696
0%| | 0/3000 [00:00<?, ?it/s]
Training hangs right at the start, at the 0/3000 step, with no error message. Running nvidia-smi shows the GPU is not being used at all.
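For reference, a minimal diagnostic sketch (not part of the repository's scripts) to confirm that the CUDA runtime and the 4090 are actually visible to PyTorch inside WSL2 before starting fine-tuning:

```python
import torch

# Check that PyTorch can see the CUDA runtime and the GPU from WSL2.
print("CUDA available:", torch.cuda.is_available())
print("Device count:", torch.cuda.device_count())

if torch.cuda.is_available():
    print("Device name:", torch.cuda.get_device_name(0))
    # Allocate a small tensor on the GPU so the process shows up in nvidia-smi.
    x = torch.ones(1024, 1024, device="cuda")
    print("Tensor placed on:", x.device)
```

If this prints False or hangs as well, the problem is with the CUDA/WSL2 setup rather than with finetune_hf.py itself.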
Expected behavior / 期待表现
The GPU should be used and training should proceed normally.