Usage of composer.utils.dist.get_node_signal_file_name() #1736

Open
@Andrew-Wyn

Description

I encountered a bug while using composer.utils.dist.get_node_signal_file_name.

Setup

  • llm-foundry==release/v0.17.1

When I run the training script on a single node, training starts without any issue. When I switch to a multi-node configuration, it fails with the following error:

[rank0]: Traceback (most recent call last):
[rank0]:   File "/leonardo/home/userexternal/lmoroni0/__Work/llm-foundry/scripts/train/train.py", line 9, in <module>
[rank0]:     train_from_yaml(yaml_path, args_list)
[rank0]:   File "/leonardo/home/userexternal/lmoroni0/__Work/llm-foundry/llmfoundry/command_utils/train.py", line 662, in train_from_yaml
[rank0]:     return train(yaml_cfg)
[rank0]:            ^^^^^^^^^^^^^^^
[rank0]:   File "/leonardo/home/userexternal/lmoroni0/__Work/llm-foundry/llmfoundry/command_utils/train.py", line 366, in train
[rank0]:     tokenizer = build_tokenizer(tokenizer_name, tokenizer_kwargs)
[rank0]:                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/leonardo/home/userexternal/lmoroni0/__Work/llm-foundry/llmfoundry/utils/builders.py", line 545, in build_tokenizer
[rank0]:     os.remove(signal_file_path)
[rank0]: FileNotFoundError: [Errno 2] No such file or directory: '._signal_file_node0_KfmwNg'

After a bit of debugging, I noticed that get_node_signal_file_name returns the same name on every node. This results in a race condition, since all nodes use the same file for synchronization: one node can remove the file before another node tries to remove it, which produces the FileNotFoundError above.
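
The collision can be illustrated with a simplified sketch. It does not use composer or llm-foundry; it only assumes that two nodes resolve the same relative file name to the same path (for example, on a shared working directory). The helper names and the file name are hypothetical.

```python
import os
import tempfile


def cleanup(signal_file_path):
    """Remove the signal file; return False if it was already gone."""
    try:
        os.remove(signal_file_path)
        return True
    except FileNotFoundError:
        # This is the failure seen in the traceback: another node
        # already removed the file with the same name.
        return False


def simulate_shared_name():
    """Two 'nodes' write and then remove the same signal file."""
    with tempfile.TemporaryDirectory() as shared_dir:
        # Hypothetical name; both nodes end up with the identical path.
        path = os.path.join(shared_dir, '._signal_file_node0_example')
        for _node in range(2):
            with open(path, 'wb') as f:
                f.write(b'local_rank0_completed')
        # The first removal succeeds, the second finds nothing to remove.
        return [cleanup(path), cleanup(path)]
```

In this sketch the second cleanup is the one that would raise in the real code path, because os.remove is called there without handling a missing file.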

I fixed the error by going back to the previous approach, which builds the file name from the node rank:

llmfoundry/utils/builders.py, line 497:

 f'.node_{dist.get_node_rank()}_local_rank0_completed_tokenizer_setup'

I think the root cause is in the composer library. If this workaround is sound, I can open a pull request.
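
As a minimal standalone sketch of the node-rank naming used in the workaround: the function names below are hypothetical, and reading the rank from a NODE_RANK environment variable is an assumption made to keep the example free of composer imports. In llm-foundry the rank comes from dist.get_node_rank(), as in the line quoted above.

```python
import os


def node_rank_from_env(default=0):
    """Read the node rank from NODE_RANK (assumed to be set by the launcher)."""
    return int(os.environ.get('NODE_RANK', default))


def node_signal_file_name(node_rank, tag='tokenizer_setup'):
    """Build a signal file name that differs for every node rank."""
    return f'.node_{node_rank}_local_rank0_completed_{tag}'


if __name__ == '__main__':
    # Each node prints its own, distinct file name.
    print(node_signal_file_name(node_rank_from_env()))
```

Because the node rank is part of the name, two nodes never create or remove each other's file, even when they share a working directory.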

Labels: bug
