Description
I encountered a bug when using `composer.utils.dist.get_node_signal_file_name`.
Setup
- llm-foundry==release/v0.17.1
If I execute a training script on a single node, there is no issue and training starts smoothly. When I set up the multi-node configuration, the following error occurs:
```
[rank0]: Traceback (most recent call last):
[rank0]: File "/leonardo/home/userexternal/lmoroni0/__Work/llm-foundry/scripts/train/train.py", line 9, in <module>
[rank0]: train_from_yaml(yaml_path, args_list)
[rank0]: File "/leonardo/home/userexternal/lmoroni0/__Work/llm-foundry/llmfoundry/command_utils/train.py", line 662, in train_from_yaml
[rank0]: return train(yaml_cfg)
[rank0]: ^^^^^^^^^^^^^^^
[rank0]: File "/leonardo/home/userexternal/lmoroni0/__Work/llm-foundry/llmfoundry/command_utils/train.py", line 366, in train
[rank0]: tokenizer = build_tokenizer(tokenizer_name, tokenizer_kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/leonardo/home/userexternal/lmoroni0/__Work/llm-foundry/llmfoundry/utils/builders.py", line 545, in build_tokenizer
[rank0]: os.remove(signal_file_path)
[rank0]: FileNotFoundError: [Errno 2] No such file or directory: '._signal_file_node0_KfmwNg'
```
After a bit of debugging, I noticed that `get_node_signal_file_name` returns the same name for every node, resulting in a race condition, since all nodes end up using the same file to coordinate their local-rank-0 setup.
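For illustration, here is a minimal sketch of the failure mode as I understand it; the file name is taken from the traceback above, the shared-filesystem assumption is mine, and the local-rank-0 gate is simplified rather than the actual `builders.py` code:

```python
import os

from composer.utils import dist

# Every node's local rank 0 ends up with the *same* path (the name seen in the
# traceback), and the working directory is assumed to be shared across nodes.
signal_file_path = '._signal_file_node0_KfmwNg'

if dist.get_local_rank() == 0:
    # One-time setup (e.g. downloading the tokenizer), then signal completion.
    with open(signal_file_path, 'wb') as f:
        f.write(b'local_rank0_completed_setup')

dist.barrier()  # the other ranks wait for the signal file

if dist.get_local_rank() == 0:
    # Both nodes race to remove the same file: whichever node gets here last
    # raises FileNotFoundError, matching the traceback above.
    os.remove(signal_file_path)
```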
I fixed the error by reverting to the previous naming scheme at `llmfoundry/utils/builders.py` line 497: `f'.node_{dist.get_node_rank()}_local_rank0_completed_tokenizer_setup'`
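With the node rank baked into the file name, each node creates and removes only its own file. A sketch of how the workaround slots into the same gate (simplified; the actual code in `build_tokenizer` may differ):

```python
import os

from composer.utils import dist

# Per-node name: node 0 and node 1 now operate on different files, even on a
# shared filesystem.
signal_file_path = (
    f'.node_{dist.get_node_rank()}_local_rank0_completed_tokenizer_setup'
)

if dist.get_local_rank() == 0:
    # Build the tokenizer once per node, then signal completion.
    with open(signal_file_path, 'wb') as f:
        f.write(b'local_rank0_completed_tokenizer_setup')

dist.barrier()

if dist.get_local_rank() == 0:
    os.remove(signal_file_path)  # only this node's file, so no cross-node race
```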
I think the underlying issue is in the composer library. If this workaround sounds reasonable, I can open a pull request.