
Multi-node training run fails: how to train without a hostfile #381

Open

Description

@xiaoyi0814

I ran the training script in a multi-node environment: training/step1_supervised_finetuning/training_scripts/multi_node/run_66b.sh
However, the nodes do not appear to launch successfully, and the log contains the warning below:

[2023-04-21 03:19:45,810] [WARNING] [runner.py:190:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-04-21 03:19:52,167] [INFO] [launch.py:229:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3]}
[2023-04-21 03:19:52,167] [INFO] [launch.py:235:main] nnodes=1, num_local_procs=4, node_rank=0
[2023-04-21 03:19:52,167] [INFO] [launch.py:246:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3]})
[2023-04-21 03:19:52,167] [INFO] [launch.py:247:main] dist_world_size=4
[2023-04-21 03:19:52,167] [INFO] [launch.py:249:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3
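
For context, my understanding is that the hostfile the DeepSpeed runner is looking for (by default /job/hostfile, or whatever path is passed via --hostfile) is a plain-text file with one line per node in the form "hostname slots=<num_gpus>", for example (hostnames here are placeholders):

node-1 slots=4
node-2 slots=4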

However, I can't get the IP of each node before training starts: the GPUs are only allocated when the task is launched, so I can't write the hostfile in advance. How can I run multi-node training the way torch does, i.e. just pass the master_addr and master_port?
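
What I have in mind is something like the standard torch launcher, run once per node with a different --node_rank (the node count, address, and port below are placeholders):

torchrun --nnodes=2 --nproc_per_node=4 --node_rank=0 \
    --master_addr=10.0.0.1 --master_port=29500 \
    main.py <usual training args>

My understanding is that torchrun sets MASTER_ADDR, MASTER_PORT, RANK, LOCAL_RANK, and WORLD_SIZE itself, and that DeepSpeed can pick these up via deepspeed.init_distributed(), so no hostfile would be needed. Please correct me if that is not supported.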

Thanks!

Metadata

Labels

deespeed chat (DeepSpeed Chat), system (an issue with an environment/system setup)
