Description
Hi, I am following this guide: https://github.com/microsoft/DeepSpeedExamples/tree/master/applications/DeepSpeed-Chat.
I have set up the environment, but the documentation does not say which models and datasets I need to download or where to store them. My connection is slow and my storage is limited, so I want to prepare everything in advance. How can I make the script load the models from a local path?
I have prepared opt-1.3b, opt-350m, and the Dahoas dataset, and they are under the project directory:
DeepSpeedExamples/applications/DeepSpeed-Chat# ls
Dahoas  facebook

Dahoas:
rm-static

facebook:
opt-1.3b  opt-350m
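For what it's worth, this is the resolution behavior I expected the script to have (a minimal sketch of the idea, not DeepSpeed-Chat's actual code; `resolve_model_path` is a hypothetical helper):

```python
import os
import tempfile

def resolve_model_path(name_or_path, base_dir="."):
    """Return the absolute path of a locally stored checkpoint if it
    exists under base_dir; otherwise return the name unchanged, which
    transformers would then treat as a Hub repo id (needing network)."""
    candidate = os.path.join(base_dir, name_or_path)
    if os.path.isdir(candidate):
        return os.path.abspath(candidate)
    return name_or_path

# Mimic my on-disk layout: <project>/facebook/opt-350m
with tempfile.TemporaryDirectory() as project:
    os.makedirs(os.path.join(project, "facebook", "opt-350m"))
    print(resolve_model_path("facebook/opt-350m", base_dir=project))  # absolute local path
    print(resolve_model_path("facebook/opt-66b", base_dir=project))   # unchanged repo id
```

As far as I know, `from_pretrained` already accepts a local directory instead of a Hub repo id, so resolving to an absolute path like this should avoid any network access when the files are on disk.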
But when I ran `python train.py --actor-model facebook/opt-1.3b --reward-model facebook/opt-350m --deployment-type single_gpu --step 2 3`, the script still tried to download the model and failed with a connection error:
[2023-05-26 10:47:55,167] [WARNING] [runner.py:191:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-05-26 10:47:55,197] [INFO] [runner.py:541:main] cmd = /opt/conda/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None main.py --model_name_or_path facebook/opt-350m --num_padding_at_beginning 1 --weight_decay 0.1 --disable_dropout --gradient_accumulation_steps 4 --zero_stage 0 --deepspeed --output_dir /workspace/DeepSpeedExamples/applications/DeepSpeed-Chat/output/reward-models/350m
[2023-05-26 10:47:56,698] [INFO] [launch.py:222:main] 0 NV_LIBNCCL_DEV_PACKAGE=libnccl-dev=2.9.9-1+cuda11.3
[2023-05-26 10:47:56,698] [INFO] [launch.py:222:main] 0 NCCL_VERSION=2.9.9-1
[2023-05-26 10:47:56,698] [INFO] [launch.py:222:main] 0 NV_LIBNCCL_PACKAGE_VERSION=2.9.9-1
[2023-05-26 10:47:56,698] [INFO] [launch.py:222:main] 0 NV_LIBNCCL_PACKAGE=libnccl2=2.9.9-1+cuda11.3
[2023-05-26 10:47:56,698] [INFO] [launch.py:222:main] 0 NV_LIBNCCL_DEV_PACKAGE_NAME=libnccl-dev
[2023-05-26 10:47:56,698] [INFO] [launch.py:222:main] 0 NV_LIBNCCL_PACKAGE_NAME=libnccl2
[2023-05-26 10:47:56,698] [INFO] [launch.py:222:main] 0 NV_LIBNCCL_DEV_PACKAGE_VERSION=2.9.9-1
[2023-05-26 10:47:56,698] [INFO] [launch.py:229:main] WORLD INFO DICT: {'localhost': [0]}
[2023-05-26 10:47:56,698] [INFO] [launch.py:235:main] nnodes=1, num_local_procs=1, node_rank=0
[2023-05-26 10:47:56,698] [INFO] [launch.py:246:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]})
[2023-05-26 10:47:56,698] [INFO] [launch.py:247:main] dist_world_size=1
[2023-05-26 10:47:56,698] [INFO] [launch.py:249:main] Setting CUDA_VISIBLE_DEVICES=0
[2023-05-26 10:47:58,770] [INFO] [comm.py:622:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/huggingface_hub/utils/_errors.py", line 259, in hf_raise_for_status
    response.raise_for_status()
  File "/opt/conda/lib/python3.7/site-packages/requests/models.py", line 960, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 401 Client Error: Unauthorized for url: https://huggingface.co/opt-350m/resolve/main/config.json

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/transformers/utils/hub.py", line 429, in cached_file
    local_files_only=local_files_only,
  File "/opt/conda/lib/python3.7/site-packages/huggingface_hub/utils/_validators.py", line 120, in _inner_fn
    return fn(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/huggingface_hub/file_download.py", line 1199, in hf_hub_download
    timeout=etag_timeout,
  File "/opt/conda/lib/python3.7/site-packages/huggingface_hub/utils/_validators.py", line 120, in _inner_fn
    return fn(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/huggingface_hub/file_download.py", line 1541, in get_hf_file_metadata
    hf_raise_for_status(r)
  File "/opt/conda/lib/python3.7/site-packages/huggingface_hub/utils/_errors.py", line 291, in hf_raise_for_status
    raise RepositoryNotFoundError(message, response) from e
huggingface_hub.utils._errors.RepositoryNotFoundError: 401 Client Error. (Request ID: Root=1-64708e60-3fe7220d286716304dafaa08)
Repository Not Found for url: https://huggingface.co/opt-350m/resolve/main/config.json.
Please make sure you specified the correct `repo_id` and `repo_type`.
If you are trying to access a private or gated repo, make sure you are authenticated.
Invalid username or password.
During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "main.py", line 352, in <module>
    main()
  File "main.py", line 204, in main
    tokenizer = load_hf_tokenizer(args.model_name_or_path, fast_tokenizer=True)
  File "/workspace/DeepSpeedExamples/applications/DeepSpeed-Chat/training/utils/utils.py", line 53, in load_hf_tokenizer
    fast_tokenizer=True)
  File "/opt/conda/lib/python3.7/site-packages/transformers/models/auto/tokenization_auto.py", line 659, in from_pretrained
    pretrained_model_name_or_path, trust_remote_code=trust_remote_code, **kwargs
  File "/opt/conda/lib/python3.7/site-packages/transformers/models/auto/configuration_auto.py", line 928, in from_pretrained
    config_dict, unused_kwargs = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/transformers/configuration_utils.py", line 574, in get_config_dict
    config_dict, kwargs = cls._get_config_dict(pretrained_model_name_or_path, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/transformers/configuration_utils.py", line 641, in _get_config_dict
    _commit_hash=commit_hash,
  File "/opt/conda/lib/python3.7/site-packages/transformers/utils/hub.py", line 434, in cached_file
    f"{path_or_repo_id} is not a local folder and is not a valid model identifier "
OSError: opt-350m is not a local folder and is not a valid model identifier listed on 'https://huggingface.co/models'
If this is a private repository, make sure to pass a token having permission to this repo with `use_auth_token` or log in with `huggingface-cli login` and pass `use_auth_token=True`.
[2023-05-26 10:48:01,710] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 334
[2023-05-26 10:48:01,711] [ERROR] [launch.py:434:sigkill_handler] ['/opt/conda/bin/python', '-u', 'main.py', '--local_rank=0', '--model_name_or_path', 'facebook/opt-350m', '--num_padding_at_beginning', '1', '--weight_decay', '0.1', '--disable_dropout', '--gradient_accumulation_steps', '4', '--zero_stage', '0', '--deepspeed', '--output_dir', '/workspace/DeepSpeedExamples/applications/DeepSpeed-Chat/output/reward-models/350m'] exits with return code = 1
The same error also occurred when I ran step 1 earlier, but it somehow recovered and proceeded successfully; now it happens again with steps 2 and 3. One detail that may matter: the launcher passes `--model_name_or_path facebook/opt-350m`, yet the failing URL is https://huggingface.co/opt-350m/resolve/main/config.json, so the `facebook/` prefix appears to be dropped somewhere before the lookup. How can I work around this?
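For context, what I am ultimately after is: fetch everything once while connected, then run fully offline. This sketch shows my understanding of the documented offline switches (the `snapshot_download` call in the comment is what I would run while still online; I'm assuming these switches apply to DeepSpeed-Chat since it loads models through transformers):

```python
import os

# HF_HUB_OFFLINE / TRANSFORMERS_OFFLINE are the documented switches that
# stop huggingface_hub / transformers from making network requests;
# from_pretrained then resolves only local paths and the local cache,
# failing fast instead of retrying downloads. They must be set before
# the training process imports transformers.
os.environ["HF_HUB_OFFLINE"] = "1"
os.environ["TRANSFORMERS_OFFLINE"] = "1"

# While still online, I would pre-fetch each checkpoint once, e.g.:
#   from huggingface_hub import snapshot_download
#   snapshot_download("facebook/opt-350m", local_dir="facebook/opt-350m")

print(os.environ["HF_HUB_OFFLINE"], os.environ["TRANSFORMERS_OFFLINE"])
```

Is this the intended way to run the pipeline without network access, or is there a supported flag in train.py for local checkpoints?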