Hi, I am trying to follow this guide.
I have set up the environment, but I could not find in the documentation which models and datasets I need to download or where to store them. My connection is slow and storage is limited, so I want to prepare everything in advance. How can I make the script load models from a local path?
I have prepared opt-1.3b, opt-350m, and the Dahoas dataset, and they are under the project directory:
DeepSpeedExamples/applications/DeepSpeed-Chat# ls Dahoas facebook
opt-1.3b opt-350m
But when I ran python --actor-model facebook/opt-1.3b --reward-model facebook/opt-350m --deployment-type single_gpu --step 2 3
the script still tried to download the model and failed with a connection error:
[2023-05-26 10:47:55,167] [WARNING] [] Unable to find hostfile, will proceed with training with local resources only.
[2023-05-26 10:47:55,197] [INFO] [] cmd = /opt/conda/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF19 --master_addr= --master_port=29500 --enable_each_rank_log=None --model_name_or_path facebook/opt-350m --num_padding_at_beginning 1 --weight_decay 0.1 --disable_dropout --gradient_accumulation_steps 4 --zero_stage 0 --deepspeed --output_dir /workspace/DeepSpeedExamples/applications/DeepSpeed-Chat/output/reward-models/350m
[2023-05-26 10:47:56,698] [INFO] [] 0 NV_LIBNCCL_DEV_PACKAGE=libnccl-dev=2.9.9-1+cuda11.3
[2023-05-26 10:47:56,698] [INFO] [] 0 NCCL_VERSION=2.9.9-1
[2023-05-26 10:47:56,698] [INFO] [] 0 NV_LIBNCCL_PACKAGE_VERSION=2.9.9-1
[2023-05-26 10:47:56,698] [INFO] [] 0 NV_LIBNCCL_PACKAGE=libnccl2=2.9.9-1+cuda11.3
[2023-05-26 10:47:56,698] [INFO] [] 0 NV_LIBNCCL_DEV_PACKAGE_NAME=libnccl-dev
[2023-05-26 10:47:56,698] [INFO] [] 0 NV_LIBNCCL_PACKAGE_NAME=libnccl2
[2023-05-26 10:47:56,698] [INFO] [] 0 NV_LIBNCCL_DEV_PACKAGE_VERSION=2.9.9-1
[2023-05-26 10:47:56,698] [INFO] [] WORLD INFO DICT: {'localhost': [0]}
[2023-05-26 10:47:56,698] [INFO] [] nnodes=1, num_local_procs=1, node_rank=0
[2023-05-26 10:47:56,698] [INFO] [] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]})
[2023-05-26 10:47:56,698] [INFO] [] dist_world_size=1
[2023-05-26 10:47:56,698] [INFO] [] Setting CUDA_VISIBLE_DEVICES=0
[2023-05-26 10:47:58,770] [INFO] [] Initializing TorchBackend in DeepSpeed with backend nccl
Traceback (most recent call last):
File "/opt/conda/lib/python3.7/site-packages/huggingface_hub/utils/", line 259, in hf_raise_for_status
File "/opt/conda/lib/python3.7/site-packages/requests/", line 960, in raise_for_status
raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 401 Client Error: Unauthorized for url:
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/opt/conda/lib/python3.7/site-packages/transformers/utils/", line 429, in cached_file
File "/opt/conda/lib/python3.7/site-packages/huggingface_hub/utils/", line 120, in _inner_fn
return fn(*args, **kwargs)
File "/opt/conda/lib/python3.7/site-packages/huggingface_hub/", line 1199, in hf_hub_download
File "/opt/conda/lib/python3.7/site-packages/huggingface_hub/utils/", line 120, in _inner_fn
return fn(*args, **kwargs)
File "/opt/conda/lib/python3.7/site-packages/huggingface_hub/", line 1541, in get_hf_file_metadata
File "/opt/conda/lib/python3.7/site-packages/huggingface_hub/utils/", line 291, in hf_raise_for_status
raise RepositoryNotFoundError(message, response) from e
huggingface_hub.utils._errors.RepositoryNotFoundError: 401 Client Error. (Request ID: Root=1-64708e60-3fe7220d286716304dafaa08)
Repository Not Found for url:
Please make sure you specified the correct `repo_id` and `repo_type`.
If you are trying to access a private or gated repo, make sure you are authenticated.
Invalid username or password.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "", line 352, in <module>
File "", line 204, in main
tokenizer = load_hf_tokenizer(args.model_name_or_path, fast_tokenizer=True)
File "/workspace/DeepSpeedExamples/applications/DeepSpeed-Chat/training/utils/", line 53, in load_hf_tokenizer
File "/opt/conda/lib/python3.7/site-packages/transformers/models/auto/", line 659, in from_pretrained
pretrained_model_name_or_path, trust_remote_code=trust_remote_code, **kwargs
File "/opt/conda/lib/python3.7/site-packages/transformers/models/auto/", line 928, in from_pretrained
config_dict, unused_kwargs = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs)
File "/opt/conda/lib/python3.7/site-packages/transformers/", line 574, in get_config_dict
config_dict, kwargs = cls._get_config_dict(pretrained_model_name_or_path, **kwargs)
File "/opt/conda/lib/python3.7/site-packages/transformers/", line 641, in _get_config_dict
File "/opt/conda/lib/python3.7/site-packages/transformers/utils/", line 434, in cached_file
f"{path_or_repo_id} is not a local folder and is not a valid model identifier "
OSError: opt-350m is not a local folder and is not a valid model identifier listed on ''
If this is a private repository, make sure to pass a token having permission to this repo with `use_auth_token` or log in with `huggingface-cli login` and pass `use_auth_token=True`.
[2023-05-26 10:48:01,710] [INFO] [] Killing subprocess 334
[2023-05-26 10:48:01,711] [ERROR] [] ['/opt/conda/bin/python', '-u', '', '--local_rank=0', '--model_name_or_path', 'facebook/opt-350m', '--num_padding_at_beginning', '1', '--weight_decay', '0.1', '--disable_dropout', '--gradient_accumulation_steps', '4', '--zero_stage', '0', '--deepspeed', '--output_dir', '/workspace/DeepSpeedExamples/applications/DeepSpeed-Chat/output/reward-models/350m'] exits with return code = 1
This also happened when I ran step 1 earlier, but later it somehow worked and proceeded successfully. Now the same issue has come up again with steps 2 and 3.
How can I work around this?
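
For anyone reproducing this, here is a minimal sketch of how the local copies could be checked before launching. It rests on two assumptions that I have not verified. First, the per-step training script may run from a different working directory than the project root, in which case a relative path like facebook/opt-350m would not resolve as a local folder and an absolute path would be needed. Second, the Hugging Face libraries honor the HF_HUB_OFFLINE, TRANSFORMERS_OFFLINE, and HF_DATASETS_OFFLINE environment variables in the installed versions. The helper names resolve_local_model and offline_env are hypothetical and not part of DeepSpeed-Chat.

```python
import os
from pathlib import Path


def resolve_local_model(path_str, base_dir="."):
    """Return an absolute path to a local model folder.

    Hypothetical helper: raises FileNotFoundError if the folder is missing
    or has no config.json, which a Hugging Face model folder normally has.
    """
    model_dir = (Path(base_dir) / path_str).resolve()
    if not model_dir.is_dir():
        raise FileNotFoundError(f"{model_dir} is not a directory")
    if not (model_dir / "config.json").is_file():
        raise FileNotFoundError(f"{model_dir} has no config.json")
    return str(model_dir)


def offline_env(environ=None):
    """Return a copy of the environment with the offline switches set.

    Assumption: the installed huggingface_hub, transformers, and datasets
    versions read these variables and skip network requests when set to "1".
    """
    env = dict(os.environ if environ is None else environ)
    env["HF_HUB_OFFLINE"] = "1"
    env["TRANSFORMERS_OFFLINE"] = "1"
    env["HF_DATASETS_OFFLINE"] = "1"
    return env
```

The absolute path returned by resolve_local_model("facebook/opt-350m") could then be passed as --model_name_or_path (the flag visible in the launcher log above) when running a step script directly, with offline_env() supplied as the env of the subprocess. Whether the top-level launcher accepts a path for --actor-model or --reward-model is something I have not confirmed.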