Description
Reminder
- I have read the above rules and searched the existing issues.
System Info
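I built the image and container using the docker-npu setup. When fine-tuning qwen2.5-7B on two machines with 16 cards in total, I keep getting HCCL errors. The command I run inside the container is: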
torchrun --master_port 6001 --nproc_per_node=8 --nnodes=2 --node_rank=0 \
    --master_addr=10.0.1.30 src/train.py \
    --stage sft \
    --model_name_or_path /home/model_bin/Qwen/Qwen2___5-7B-Instruct/ \
    --do_train \
    --dataset alpaca_zh_demo \
    --template qwen \
    --finetuning_type lora \
    --output_dir saves/qwen-7b/lora/sft \
    --overwrite_cache \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 2 \
    --lr_scheduler_type cosine \
    --logging_steps 1 \
    --save_steps 500 \
    --learning_rate 1e-4 \
    --num_train_epochs 100.0 \
    --plot_loss
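For reference, only the `--node_rank=0` command is shown above. A minimal sketch of how the launcher part of the command could be generated for both machines follows, assuming the second node runs the same command with `--node_rank=1` (that is how `torchrun` multi-node launches are normally set up, but the second node's command is not included in this report). The helper name `build_torchrun_cmd` is hypothetical and only illustrates which arguments must match across nodes. It is not a fix for the HCCL error itself.

```python
import shlex

# Values copied from the command in this report.
MASTER_ADDR = "10.0.1.30"
MASTER_PORT = 6001
NNODES = 2
NPROC_PER_NODE = 8


def build_torchrun_cmd(node_rank, train_args=()):
    """Return the torchrun argv for one node (hypothetical helper).

    Every launcher argument is identical on all nodes except --node_rank.
    """
    if not 0 <= node_rank < NNODES:
        raise ValueError(f"node_rank must be in [0, {NNODES}), got {node_rank}")
    return [
        "torchrun",
        f"--master_port={MASTER_PORT}",
        f"--nproc_per_node={NPROC_PER_NODE}",
        f"--nnodes={NNODES}",
        f"--node_rank={node_rank}",
        f"--master_addr={MASTER_ADDR}",
        "src/train.py",
        *train_args,
    ]


if __name__ == "__main__":
    # Print one command per node; the training arguments are abbreviated here.
    for rank in range(NNODES):
        print(f"# node {rank}")
        print(shlex.join(build_torchrun_cmd(rank, ["--stage", "sft"])))
```

Comparing the two printed commands makes it easy to confirm that `--master_addr`, `--master_port`, `--nnodes`, and `--nproc_per_node` are the same on both machines and that only `--node_rank` differs.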
Reproduction
Others
No response