
Model and data downloading #550

Open
@treya-lin

Description


Hi, I am following this guide: https://github.com/microsoft/DeepSpeedExamples/tree/master/applications/DeepSpeed-Chat.
I have set up the environment, but the documentation doesn't say which models and datasets need to be downloaded or where to store them. My connection is slow and storage is limited, so I want to prepare everything in advance. How can I make the script load models from a local path?

I have prepared opt-1.3b, opt-350m, and the Dahoas dataset; they are under the project directory:

DeepSpeedExamples/applications/DeepSpeed-Chat# ls Dahoas facebook
Dahoas:
rm-static

facebook:
opt-1.3b  opt-350m
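As far as I understand (this is my assumption, not something from the guide), `from_pretrained` only skips the Hub when its argument resolves to an existing local directory, so a relative name like `facebook/opt-350m` works offline only if that exact folder exists relative to the working directory the launcher runs `main.py` from. A minimal sketch of the resolution rule I am assuming, using a hypothetical `resolve_model_path` helper:

```python
import os

def resolve_model_path(name_or_path: str) -> str:
    """Return an absolute path if name_or_path is an existing local
    directory (transformers then loads it with no network access);
    otherwise return it unchanged, so it is treated as a Hub repo id."""
    if os.path.isdir(name_or_path):
        return os.path.abspath(name_or_path)
    return name_or_path
```

If this is right, then passing the absolute path of my local `facebook/opt-350m` folder should avoid the download entirely, while the bare repo id would still go to the Hub.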

But when I ran `python train.py --actor-model facebook/opt-1.3b --reward-model facebook/opt-350m --deployment-type single_gpu --step 2 3`, the script still tried to download the models and failed with a connection error:

[2023-05-26 10:47:55,167] [WARNING] [runner.py:191:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-05-26 10:47:55,197] [INFO] [runner.py:541:main] cmd = /opt/conda/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None main.py --model_name_or_path facebook/opt-350m --num_padding_at_beginning 1 --weight_decay 0.1 --disable_dropout --gradient_accumulation_steps 4 --zero_stage 0 --deepspeed --output_dir /workspace/DeepSpeedExamples/applications/DeepSpeed-Chat/output/reward-models/350m
[2023-05-26 10:47:56,698] [INFO] [launch.py:222:main] 0 NV_LIBNCCL_DEV_PACKAGE=libnccl-dev=2.9.9-1+cuda11.3
[2023-05-26 10:47:56,698] [INFO] [launch.py:222:main] 0 NCCL_VERSION=2.9.9-1
[2023-05-26 10:47:56,698] [INFO] [launch.py:222:main] 0 NV_LIBNCCL_PACKAGE_VERSION=2.9.9-1
[2023-05-26 10:47:56,698] [INFO] [launch.py:222:main] 0 NV_LIBNCCL_PACKAGE=libnccl2=2.9.9-1+cuda11.3
[2023-05-26 10:47:56,698] [INFO] [launch.py:222:main] 0 NV_LIBNCCL_DEV_PACKAGE_NAME=libnccl-dev
[2023-05-26 10:47:56,698] [INFO] [launch.py:222:main] 0 NV_LIBNCCL_PACKAGE_NAME=libnccl2
[2023-05-26 10:47:56,698] [INFO] [launch.py:222:main] 0 NV_LIBNCCL_DEV_PACKAGE_VERSION=2.9.9-1
[2023-05-26 10:47:56,698] [INFO] [launch.py:229:main] WORLD INFO DICT: {'localhost': [0]}
[2023-05-26 10:47:56,698] [INFO] [launch.py:235:main] nnodes=1, num_local_procs=1, node_rank=0
[2023-05-26 10:47:56,698] [INFO] [launch.py:246:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]})
[2023-05-26 10:47:56,698] [INFO] [launch.py:247:main] dist_world_size=1
[2023-05-26 10:47:56,698] [INFO] [launch.py:249:main] Setting CUDA_VISIBLE_DEVICES=0
[2023-05-26 10:47:58,770] [INFO] [comm.py:622:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/huggingface_hub/utils/_errors.py", line 259, in hf_raise_for_status
    response.raise_for_status()
  File "/opt/conda/lib/python3.7/site-packages/requests/models.py", line 960, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 401 Client Error: Unauthorized for url: https://huggingface.co/opt-350m/resolve/main/config.json

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/transformers/utils/hub.py", line 429, in cached_file
    local_files_only=local_files_only,
  File "/opt/conda/lib/python3.7/site-packages/huggingface_hub/utils/_validators.py", line 120, in _inner_fn
    return fn(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/huggingface_hub/file_download.py", line 1199, in hf_hub_download
    timeout=etag_timeout,
  File "/opt/conda/lib/python3.7/site-packages/huggingface_hub/utils/_validators.py", line 120, in _inner_fn
    return fn(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/huggingface_hub/file_download.py", line 1541, in get_hf_file_metadata
    hf_raise_for_status(r)
  File "/opt/conda/lib/python3.7/site-packages/huggingface_hub/utils/_errors.py", line 291, in hf_raise_for_status
    raise RepositoryNotFoundError(message, response) from e
huggingface_hub.utils._errors.RepositoryNotFoundError: 401 Client Error. (Request ID: Root=1-64708e60-3fe7220d286716304dafaa08)

Repository Not Found for url: https://huggingface.co/opt-350m/resolve/main/config.json.
Please make sure you specified the correct `repo_id` and `repo_type`.
If you are trying to access a private or gated repo, make sure you are authenticated.
Invalid username or password.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "main.py", line 352, in <module>
    main()
  File "main.py", line 204, in main
    tokenizer = load_hf_tokenizer(args.model_name_or_path, fast_tokenizer=True)
  File "/workspace/DeepSpeedExamples/applications/DeepSpeed-Chat/training/utils/utils.py", line 53, in load_hf_tokenizer
    fast_tokenizer=True)
  File "/opt/conda/lib/python3.7/site-packages/transformers/models/auto/tokenization_auto.py", line 659, in from_pretrained
    pretrained_model_name_or_path, trust_remote_code=trust_remote_code, **kwargs
  File "/opt/conda/lib/python3.7/site-packages/transformers/models/auto/configuration_auto.py", line 928, in from_pretrained
    config_dict, unused_kwargs = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/transformers/configuration_utils.py", line 574, in get_config_dict
    config_dict, kwargs = cls._get_config_dict(pretrained_model_name_or_path, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/transformers/configuration_utils.py", line 641, in _get_config_dict
    _commit_hash=commit_hash,
  File "/opt/conda/lib/python3.7/site-packages/transformers/utils/hub.py", line 434, in cached_file
    f"{path_or_repo_id} is not a local folder and is not a valid model identifier "
OSError: opt-350m is not a local folder and is not a valid model identifier listed on 'https://huggingface.co/models'
If this is a private repository, make sure to pass a token having permission to this repo with `use_auth_token` or log in with `huggingface-cli login` and pass `use_auth_token=True`.
[2023-05-26 10:48:01,710] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 334
[2023-05-26 10:48:01,711] [ERROR] [launch.py:434:sigkill_handler] ['/opt/conda/bin/python', '-u', 'main.py', '--local_rank=0', '--model_name_or_path', 'facebook/opt-350m', '--num_padding_at_beginning', '1', '--weight_decay', '0.1', '--disable_dropout', '--gradient_accumulation_steps', '4', '--zero_stage', '0', '--deepspeed', '--output_dir', '/workspace/DeepSpeedExamples/applications/DeepSpeed-Chat/output/reward-models/350m'] exits with return code = 1

The same thing happened earlier when I ran step 1, but at some point it started working and proceeded successfully. Now with steps 2 and 3 the same issue is back.
How can I work around this?
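For context, here is the workaround I am currently experimenting with (I'm not sure it is the intended approach): forcing offline mode through the environment variables that transformers and huggingface_hub honor, before launching, together with an absolute path so the local-directory check cannot miss:

```python
import os

# Workaround attempt (may not be the intended fix): with these set,
# transformers / huggingface_hub raise an error instead of trying
# to reach the network, so a misresolved path fails fast.
os.environ["TRANSFORMERS_OFFLINE"] = "1"
os.environ["HF_HUB_OFFLINE"] = "1"

# Pass an absolute path to the local snapshot rather than a repo id.
model_path = os.path.abspath("facebook/opt-350m")
```

Does the training script support being pointed at local folders like this, or is there a supported flag I am missing?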
