
finetune_hf.py hangs at the start of training #1345

@Bayson-create

Description

System Info

CUDA 11.5
Transformers 4.40.2
Python 3.12.2
GPU: single RTX 4090
RAM: 32 GB
OS: Windows, WSL2
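
For reference, a minimal check (not part of the original report) that the PyTorch build inside WSL2 actually sees the 4090; if this prints False, the trainer would have no device to schedule work on:

import torch

# Sanity check: does the PyTorch build in this WSL2 environment see the GPU?
print(torch.__version__, torch.version.cuda)   # PyTorch build and the CUDA version it was built against
print(torch.cuda.is_available())               # should be True
print(torch.cuda.get_device_name(0))           # should report the RTX 4090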

Who can help?

@Btlmd

Information

  • The official example scripts
  • My own modified scripts and tasks

Reproduction

In lora_finetune.ipynb, run: !CUDA_VISIBLE_DEVICES=0 NCCL_P2P_DISABLE="1" NCCL_IB_DISABLE="1" python finetune_hf.py data/AdvertiseGen_fix /home/xiebeichen/chatglm3-6b configs/lora.yaml

Output

Setting eos_token is not supported, use the default one.
Setting pad_token is not supported, use the default one.
Setting unk_token is not supported, use the default one.
Loading checkpoint shards: 100%|██████████████████| 7/7 [00:03<00:00, 2.02it/s]
trainable params: 1,949,696 || all params: 6,245,533,696 || trainable%: 0.031217444255383614
--> Model

--> model has 1.949696M params

train_dataset: Dataset({
features: ['input_ids', 'labels'],
num_rows: 114599
})
val_dataset: Dataset({
features: ['input_ids', 'output_ids'],
num_rows: 1070
})
test_dataset: Dataset({
features: ['input_ids', 'output_ids'],
num_rows: 1070
})
--> Sanity check
'[gMASK]': 64790 -> -100
'sop': 64792 -> -100
'<|user|>': 64795 -> -100
...
Total train batch size (w. parallel, distributed & accumulation) = 4
Gradient Accumulation steps = 1
Total optimization steps = 3,000
Number of trainable parameters = 1,949,696
0%| | 0/3000 [00:00<?, ?it/s]

Training hangs right at the start, at the 0/3000 step, with no error reported. Running nvidia-smi shows the GPU is not being used at all.
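
Since the process hangs silently, one way to see where it is stuck is to dump a Python traceback after a delay. This is only a diagnostic sketch; the two lines below are an assumption added near the top of finetune_hf.py, not part of the original script:

import faulthandler, sys

# Dump the stack of every thread after 60 s, and again every 60 s,
# so a silent hang at 0/3000 shows which call it is waiting in.
faulthandler.dump_traceback_later(60, repeat=True, file=sys.stderr)

The printed traceback should make it clear whether the process is stuck in data loading, CUDA initialization, or somewhere else before the first optimization step.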

Expected behavior

The GPU is used normally and training runs normally.
