I have an 8×A100 GPU setup on GCP, and when I run this demo I get the following error:
/export/home/DeepSpeedExamples/applications/DeepSpeed-Chat# python train.py --actor-model facebook/opt-1.3b --reward-model facebook/opt-350m --deployment-type single_node
---=== Running Step 1 ===---
Running:
bash /export/home/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step1_supervised_finetuning/training_scripts/single_node/run_1.3b.sh /export/home/DeepSpeedExamples/applications/DeepSpeed-Chat/output/actor-models/1.3b
---=== Finished Step 1 in 0:00:09 ===---
---=== Running Step 2 ===---
Running:
bash /export/home/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step2_reward_model_finetuning/training_scripts/single_node/run_350m.sh /export/home/DeepSpeedExamples/applications/DeepSpeed-Chat/output/reward-models/350m
---=== Finished Step 2 in 0:00:19 ===---
---=== Running Step 3 ===---
Traceback (most recent call last):
  File "train.py", line 210, in <module>
    main(args)
  File "train.py", line 194, in main
    cmd = get_cmd(args, step_num)
  File "train.py", line 160, in get_cmd
    verify_model(args, 1)  # Verify step 1 model exists
  File "train.py", line 149, in verify_model
    raise RuntimeError(error_str)
RuntimeError: Step 1 model has not been trained. Train it with:
python3 train.py --step 1 --actor-model 1.3b
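The check that fails here apparently just looks at the step-1 output, so the real question is what step 1 left behind. A quick way to inspect that (my own diagnostic snippet, not part of the repo; I am assuming the step-1 run script redirects its output to a training.log under the output directory, which is what I see on my machine) is:

```python
# Hypothetical diagnostic snippet, not part of DeepSpeed-Chat: list what step 1
# produced and show the tail of its training.log, where the real error ends up.
from pathlib import Path

out_dir = Path("/export/home/DeepSpeedExamples/applications/DeepSpeed-Chat"
               "/output/actor-models/1.3b")  # step-1 output dir used above

if out_dir.exists():
    print("step-1 output files:", sorted(p.name for p in out_dir.iterdir()))
else:
    print("step-1 output dir is missing:", out_dir)

log = out_dir / "training.log"  # assumption: the run script writes its log here
if log.exists():
    tail = log.read_text(errors="replace").splitlines()[-30:]
    print("\n".join(tail))
```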
So I guess step 1 is not actually running successfully (it "finished" in nine seconds). I then went into step 1 and ran it directly, following the tutorial, and now the log shows this:
Caught signal 7 (Bus error: nonexistent physical address)
==== backtrace (tid: 7626) ====
0 0x0000000000014420 __funlockfile() ???:0
1 0x000000000018bb41 __nss_database_lookup() ???:0
2 0x000000000006929c ncclGroupEnd() ???:0
3 0x000000000005e36d ncclGroupEnd() ???:0
4 0x0000000000008609 start_thread() ???:0
5 0x000000000011f133 clone() ???:0
=================================
[2023-04-25 22:25:44,284] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 6329
[2023-04-25 22:25:44,900] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 6330
[2023-04-25 22:25:45,556] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 6331
[2023-04-25 22:25:46,171] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 6332
[2023-04-25 22:25:46,171] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 6333
[2023-04-25 22:25:46,786] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 6334
[2023-04-25 22:25:47,361] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 6335
[2023-04-25 22:25:47,935] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 6336
[2023-04-25 22:25:48,470] [ERROR] [launch.py:434:sigkill_handler] ['/opt/conda/bin/python3.8', '-u', 'main.py', '--local_rank=7', '--data_path', 'Dahoas/rm-static', 'Dahoas/full-hh-rlhf', 'Dahoas/synthetic-instruct-gptj-pairwise', 'yitingxie/rlhf-reward-datasets', '--data_split', '2,4,4', '--model_name_or_path', 'facebook/opt-1.3b', '--per_device_train_batch_size', '8', '--per_device_eval_batch_size', '8', '--max_seq_len', '512', '--learning_rate', '9.65e-6', '--weight_decay', '0.', '--num_train_epochs', '16', '--gradient_accumulation_steps', '1', '--lr_scheduler_type', 'cosine', '--num_warmup_steps', '0', '--seed', '1234', '--zero_stage', '2', '--deepspeed', '--output_dir', './output'] exits with return code = -7
"training.log" 58L, 5376C
Meanwhile, the 8 A100 GPUs just sit there idle, doing nothing.
I am wondering whether I have a configuration issue (I installed everything successfully by following the docs), and whether these scripts are actually in working condition. Is there any unit or module test I can run before starting real experiments? If the demo cannot even run, this codebase does not feel ready for users to jump in, or it should at least give some more insightful error messages.
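For example, even a tiny multi-GPU smoke test like the one below (my own sketch, not something shipped with DeepSpeed-Chat) would have told me whether the problem is in the NCCL/communication layer or in the training scripts themselves:

```python
# nccl_smoke_test.py -- my own sketch, not part of the repo.
# Launch with: torchrun --nproc_per_node=8 nccl_smoke_test.py
import os
import torch
import torch.distributed as dist


def main():
    # torchrun sets RANK/WORLD_SIZE/LOCAL_RANK/MASTER_ADDR/MASTER_PORT for us.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # All-reduce a tiny tensor across the 8 GPUs. If this hangs or dies with
    # the same bus error, the environment is broken, not the training code.
    x = torch.ones(1, device="cuda") * dist.get_rank()
    dist.all_reduce(x)  # default op is SUM, so every rank should print 28.0
    print(f"rank {dist.get_rank()}: all_reduce result = {x.item()}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```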