
Multi-node training run fails: how to train without a hostfile #381

Open

Description

@xiaoyi0814

I ran the training script in a multi-node environment: training/step1_supervised_finetuning/training_scripts/multi_node/run_66b.sh
However, the nodes do not appear to launch successfully, and the log contains the warning below:

[2023-04-21 03:19:45,810] [WARNING] [runner.py:190:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-04-21 03:19:52,167] [INFO] [launch.py:229:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3]}
[2023-04-21 03:19:52,167] [INFO] [launch.py:235:main] nnodes=1, num_local_procs=4, node_rank=0
[2023-04-21 03:19:52,167] [INFO] [launch.py:246:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3]})
[2023-04-21 03:19:52,167] [INFO] [launch.py:247:main] dist_world_size=4
[2023-04-21 03:19:52,167] [INFO] [launch.py:249:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3
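
For context, my understanding is that the hostfile the DeepSpeed runner is looking for (by default /job/hostfile, or whatever path is passed via --hostfile) is a plain-text file with one line per node in the form "hostname slots=<num_gpus>", for example (hostnames here are placeholders):

node-1 slots=4
node-2 slots=4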

However, I can't get the IP of each node before training starts: the GPUs are only allocated when the task is launched, so I can't write the hostfile in advance. How can I run multi-node training the way torch does, i.e. just pass the master_addr and master_port?
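
What I have in mind is something like the standard torch launcher, run once per node with a different --node_rank (the node count, address, and port below are placeholders):

torchrun --nnodes=2 --nproc_per_node=4 --node_rank=0 \
    --master_addr=10.0.0.1 --master_port=29500 \
    main.py <usual training args>

My understanding is that torchrun sets MASTER_ADDR, MASTER_PORT, RANK, LOCAL_RANK, and WORLD_SIZE itself, and that DeepSpeed can pick these up via deepspeed.init_distributed(), so no hostfile would be needed. Please correct me if that is not supported.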

Thanks!

Metadata

Labels

deespeed chat (DeepSpeed Chat), system (an issue with an environment/system setup)
