
HCCL error during two-node, multi-card fine-tuning on NPU #6646

Open
@AlbertWang001

Description

Reminder

  • I have read the above rules and searched the existing issues.

System Info

The image and container were built with docker-npu. When fine-tuning Qwen2.5-7B across two nodes (16 cards), HCCL errors are raised repeatedly. The command executed inside the container is:
```shell
torchrun --master_port 6001 --nproc_per_node=8 --nnodes=2 --node_rank=0 \
    --master_addr=10.0.1.30 src/train.py \
    --stage sft \
    --model_name_or_path /home/model_bin/Qwen/Qwen2___5-7B-Instruct/ \
    --do_train \
    --dataset alpaca_zh_demo \
    --template qwen \
    --finetuning_type lora \
    --output_dir saves/qwen-7b/lora/sft \
    --overwrite_cache \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 2 \
    --lr_scheduler_type cosine \
    --logging_steps 1 \
    --save_steps 500 \
    --learning_rate 1e-4 \
    --num_train_epochs 100.0 \
    --plot_loss
```
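For reference, HCCL connection failures in two-node setups are often related to the network interface binding or connection timeouts rather than the training arguments themselves. The following is a minimal sketch of CANN/HCCL environment variables that are commonly exported on each node before launching `torchrun`; the concrete values here (the timeout, and the IP taken from the `--master_addr` above) are assumptions for illustration, and `HCCL_IF_IP` must be set to each node's own address, not the master's, on the second machine:

```shell
# Disable the HCCL communication whitelist check, which otherwise
# rejects cross-node connections in many container setups (assumption:
# acceptable in a trusted network).
export HCCL_WHITELIST_DISABLE=1

# Lengthen the HCCL connect timeout (seconds); slow container networking
# can exceed the default during rendezvous. Value is illustrative.
export HCCL_CONNECT_TIMEOUT=1200

# Bind HCCL to the IP of THIS node's inter-node interface.
# On the rank-0 node this matches --master_addr; on the rank-1 node
# it should be that node's own address instead.
export HCCL_IF_IP=10.0.1.30
```

On the second machine the same `torchrun` command would be launched with `--node_rank=1` and an unchanged `--master_addr=10.0.1.30`, so that both nodes rendezvous with the same master.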

Reproduction

The error log is shown in the screenshots below:
[screenshots: HCCL error log]

Others

No response

Labels: bug (Something isn't working), npu (This problem is related to NPU devices), pending (This problem is yet to be addressed)
