Description
I'm running the step1_supervised_finetuning script, but I get the following error. Just to check: does the step 1 training script accept models downloaded to a local directory? If it does, what could be causing this error?
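For what it's worth, here is a minimal sanity check (a sketch, assuming /PLM/opt-6.7b is the directory holding the downloaded OPT files) that I would expect to pass, since AutoTokenizer.from_pretrained does accept a local directory path when the directory exists and contains the tokenizer/config files:

```python
# Minimal sanity check (sketch): confirm the local model directory is visible
# to the process and that transformers can load the tokenizer from it directly.
import os
from transformers import AutoTokenizer

model_dir = "/PLM/opt-6.7b"  # same path passed via --model_name_or_path

# If this prints False, transformers falls back to treating the string as a
# Hugging Face Hub repo id, which is what triggers the HFValidationError below.
print("directory exists:", os.path.isdir(model_dir))

tokenizer = AutoTokenizer.from_pretrained(model_dir, use_fast=True)
print(type(tokenizer).__name__)
```

From the traceback it looks like the string ends up in hf_hub_download and fails repo-id validation, which would suggest the path is not being resolved as an existing local directory by the process that raises the error.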
[2023-05-10 15:43:59,545] [INFO] [runner.py:550:main] cmd = /anaconda3/envs/transformers/bin/python3.8 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None main.py --data_path wangrui6/Zhihu-KOL Cohere/miracl-zh-queries-22-12 Hello-SimpleAI/HC3-Chinese mkqa-Chinese --data_split 2,4,4 --model_name_or_path /PLM/opt-6.7b --per_device_train_batch_size 8 --per_device_eval_batch_size 8 --max_seq_len 512 --learning_rate 1e-3 --weight_decay 0. --num_train_epochs 16 --gradient_accumulation_steps 16 --lr_scheduler_type cosine --num_warmup_steps 0 --seed 1234 --gradient_checkpointing --zero_stage 0 --lora_dim 128 --lora_module_name decoder.layers. --deepspeed --output_dir ./output
[2023-05-10 15:44:03,919] [INFO] [launch.py:142:main] WORLD INFO DICT: {'localhost': [0]}
[2023-05-10 15:44:03,919] [INFO] [launch.py:148:main] nnodes=1, num_local_procs=1, node_rank=0
[2023-05-10 15:44:03,919] [INFO] [launch.py:161:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]})
[2023-05-10 15:44:03,919] [INFO] [launch.py:162:main] dist_world_size=1
[2023-05-10 15:44:03,919] [INFO] [launch.py:164:main] Setting CUDA_VISIBLE_DEVICES=0
[2023-05-10 15:44:19,992] [INFO] [comm.py:652:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
Traceback (most recent call last):
File "main.py", line 343, in
main()
File "main.py", line 205, in main
tokenizer = load_hf_tokenizer(args.model_name_or_path, fast_tokenizer=True)
File "/DeepSpeed-Chat/training/utils/utils.py", line 55, in load_hf_tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path,
File "/anaconda3/envs/transformers/lib/python3.8/site-packages/transformers/models/auto/tokenization_auto.py", line 619, in from_pretrained
tokenizer_config = get_tokenizer_config(pretrained_model_name_or_path, **kwargs)
File "/anaconda3/envs/transformers/lib/python3.8/site-packages/transformers/models/auto/tokenization_auto.py", line 463, in get_tokenizer_config
resolved_config_file = cached_file(
File "/anaconda3/envs/transformers/lib/python3.8/site-packages/transformers/utils/hub.py", line 409, in cached_file
resolved_file = hf_hub_download(
File "/anaconda3/envs/transformers/lib/python3.8/site-packages/huggingface_hub/utils/_validators.py", line 112, in _inner_fn
validate_repo_id(arg_value)
File "/anaconda3/envs/transformers/lib/python3.8/site-packages/huggingface_hub/utils/_validators.py", line 160, in validate_repo_id
raise HFValidationError(
huggingface_hub.utils._validators.HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/PLM/opt-6.7b'. Use repo_type argument if needed.
[2023-05-10 15:44:21,949] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 23545
[2023-05-10 15:44:21,950] [ERROR] [launch.py:324:sigkill_handler] ['/anaconda3/envs/transformers/bin/python3.8', '-u', 'main.py', '--local_rank=0', '--data_path', 'wangrui6/Zhihu-KOL', 'Cohere/miracl-zh-queries-22-12', 'Hello-SimpleAI/HC3-Chinese', 'mkqa-Chinese', '--data_split', '2,4,4', '--model_name_or_path', '/PLM/opt-6.7b', '--per_device_train_batch_size', '8', '--per_device_eval_batch_size', '8', '--max_seq_len', '512', '--learning_rate', '1e-3', '--weight_decay', '0.', '--num_train_epochs', '16', '--gradient_accumulation_steps', '16', '--lr_scheduler_type', 'cosine', '--num_warmup_steps', '0', '--seed', '1234', '--gradient_checkpointing', '--zero_stage', '0', '--lora_dim', '128', '--lora_module_name', 'decoder.layers.', '--deepspeed', '--output_dir', './output'] exits with return code = 1