I have an 8×A100 GPU setup on GCP, and when I run this demo I get the following error:
/export/home/DeepSpeedExamples/applications/DeepSpeed-Chat# python train.py --actor-model facebook/opt-1.3b --reward-model facebook/opt-350m --deployment-type single_node
---=== Running Step 1 ===---
Running:
bash /export/home/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step1_supervised_finetuning/training_scripts/single_node/run_1.3b.sh /export/home/DeepSpeedExamples/applications/DeepSpeed-Chat/output/actor-models/1.3b
---=== Finished Step 1 in 0:00:09 ===---
---=== Running Step 2 ===---
Running:
bash /export/home/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step2_reward_model_finetuning/training_scripts/single_node/run_350m.sh /export/home/DeepSpeedExamples/applications/DeepSpeed-Chat/output/reward-models/350m
---=== Finished Step 2 in 0:00:19 ===---
---=== Running Step 3 ===---
Traceback (most recent call last):
  File "train.py", line 210, in <module>
    main(args)
  File "train.py", line 194, in main
    cmd = get_cmd(args, step_num)
  File "train.py", line 160, in get_cmd
    verify_model(args, 1)  # Verify step 1 model exists
  File "train.py", line 149, in verify_model
    raise RuntimeError(error_str)
RuntimeError: Step 1 model has not been trained. Train it with:
python3 train.py --step 1 --actor-model 1.3b
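The check that fails here apparently just looks at the step-1 output, so the real question is what step 1 left behind. A quick way to inspect that (my own diagnostic snippet, not part of the repo; I am assuming the step-1 run script redirects its output to a training.log under the output directory, which is what I see on my machine) is:

```python
# Hypothetical diagnostic snippet, not part of DeepSpeed-Chat: list what step 1
# produced and show the tail of its training.log, where the real error ends up.
from pathlib import Path

out_dir = Path("/export/home/DeepSpeedExamples/applications/DeepSpeed-Chat"
               "/output/actor-models/1.3b")  # step-1 output dir used above

if out_dir.exists():
    print("step-1 output files:", sorted(p.name for p in out_dir.iterdir()))
else:
    print("step-1 output dir is missing:", out_dir)

log = out_dir / "training.log"  # assumption: the run script writes its log here
if log.exists():
    tail = log.read_text(errors="replace").splitlines()[-30:]
    print("\n".join(tail))
```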
So I guess step 1 is not actually running successfully (it "finished" in nine seconds). I then went into step 1 and ran it directly, following the tutorial, and now the log shows this:
Caught signal 7 (Bus error: nonexistent physical address)
==== backtrace (tid: 7626) ====
0 0x0000000000014420 __funlockfile() ???:0
1 0x000000000018bb41 __nss_database_lookup() ???:0
2 0x000000000006929c ncclGroupEnd() ???:0
3 0x000000000005e36d ncclGroupEnd() ???:0
4 0x0000000000008609 start_thread() ???:0
5 0x000000000011f133 clone() ???:0
=================================
[2023-04-25 22:25:44,284] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 6329
[2023-04-25 22:25:44,900] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 6330
[2023-04-25 22:25:45,556] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 6331
[2023-04-25 22:25:46,171] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 6332
[2023-04-25 22:25:46,171] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 6333
[2023-04-25 22:25:46,786] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 6334
[2023-04-25 22:25:47,361] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 6335
[2023-04-25 22:25:47,935] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 6336
[2023-04-25 22:25:48,470] [ERROR] [launch.py:434:sigkill_handler] ['/opt/conda/bin/python3.8', '-u', 'main.py', '--local_rank=7', '--data_path', 'Dahoas/rm-static', 'Dahoas/full-hh-rlhf', 'Dahoas/synthetic-instruct-gptj-pairwise', 'yitingxie/rlhf-reward-datasets', '--data_split', '2,4,4', '--model_name_or_path', 'facebook/opt-1.3b', '--per_device_train_batch_size', '8', '--per_device_eval_batch_size', '8', '--max_seq_len', '512', '--learning_rate', '9.65e-6', '--weight_decay', '0.', '--num_train_epochs', '16', '--gradient_accumulation_steps', '1', '--lr_scheduler_type', 'cosine', '--num_warmup_steps', '0', '--seed', '1234', '--zero_stage', '2', '--deepspeed', '--output_dir', './output'] exits with return code = -7
"training.log" 58L, 5376C
Meanwhile, the 8 A100 GPUs just sit there idle, doing nothing.
I am wondering whether I have a configuration issue (I installed everything successfully by following the docs), and whether these scripts are actually in working condition. Is there any unit or module test I can run before starting real experiments? If the demo cannot even run, this codebase does not feel ready for users to jump in, or it should at least give some more insightful error messages.
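For example, even a tiny multi-GPU smoke test like the one below (my own sketch, not something shipped with DeepSpeed-Chat) would have told me whether the problem is in the NCCL/communication layer or in the training scripts themselves:

```python
# nccl_smoke_test.py -- my own sketch, not part of the repo.
# Launch with: torchrun --nproc_per_node=8 nccl_smoke_test.py
import os
import torch
import torch.distributed as dist


def main():
    # torchrun sets RANK/WORLD_SIZE/LOCAL_RANK/MASTER_ADDR/MASTER_PORT for us.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # All-reduce a tiny tensor across the 8 GPUs. If this hangs or dies with
    # the same bus error, the environment is broken, not the training code.
    x = torch.ones(1, device="cuda") * dist.get_rank()
    dist.all_reduce(x)  # default op is SUM, so every rank should print 28.0
    print(f"rank {dist.get_rank()}: all_reduce result = {x.item()}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```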