Description
I was running the Stage 2 reward model training with the multi-node setup and hit the following error:
```
... truncated
0%| | 0/2 [00:00<?, ?it/s]
100%|██████████| 2/2 [00:00<00:00, 535.33it/s]
Found cached dataset webgpt_comparisons (/home/work/.cache/huggingface/datasets/openai___webgpt_comparisons/default/0.0.0/8b5d5879cdc98c4c0099af6053dffe8d504588d43d3b11f1b1ec223ab1e8db0a)
0%| | 0/1 [00:00<?, ?it/s]
100%|██████████| 1/1 [00:00<00:00, 323.93it/s]
Found cached dataset json (/home/work/.cache/huggingface/datasets/stanfordnlp___json/stanfordnlp--SHP-10ead9e54f5a107d/0.0.0/fe5dd6ea2639a6df622901539cb550cf8797e5a6b2dd7af1cf934bed8e233e6e)
0%| | 0/3 [00:00<?, ?it/s]
100%|██████████| 3/3 [00:00<00:00, 137.81it/s]
[2023-04-17 03:02:34,101] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 656
[2023-04-17 03:02:34,106] [ERROR] [launch.py:434:sigkill_handler] ['/usr/local/python3/bin/python3.10', '-u', 'main.py', '--local_rank=0', '--data_path', 'Dahoas/rm-static', 'Dahoas/full-hh-rlhf', 'Dahoas/synthetic-instruct-gptj-pairwise', 'yitingxie/rlhf-reward-datasets', 'openai/webgpt_comparisons', 'stanfordnlp/SHP', '--data_split', '2,4,4', '--model_name_or_path', 'facebook/opt-350m', '--num_padding_at_beginning', '1', '--per_device_train_batch_size', '2', '--per_device_eval_batch_size', '2', '--max_seq_len', '512', '--learning_rate', '5e-5', '--weight_decay', '0.1', '--num_train_epochs', '1', '--gradient_accumulation_steps', '1', '--lr_scheduler_type', 'cosine', '--num_warmup_steps', '0', '--seed', '1234', '--zero_stage', '0', '--deepspeed', '--output_dir', '/home/work/Deepspeed/DeepSpeedExamples/applications/DeepSpeed-Chat/output/reward-models/350m'] exits with return code = -9
```
I was using the default reward model training config. The issue does not occur when I run the single_gpu setup.
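For context on the exit status: a return code of -9 from the launcher means the worker process was killed with SIGKILL (signal 9), which on Linux is most commonly the kernel OOM killer terminating a process that ran out of host memory. This minimal Python sketch (not part of the DeepSpeed code, just an illustration) shows how a process killed by signal N is reported as return code -N, matching the `-9` in the log above:

```python
import signal
import subprocess

# Start a long-running child process, then kill it the same way the
# kernel OOM killer would (SIGKILL cannot be caught or ignored).
proc = subprocess.Popen(["sleep", "60"])
proc.send_signal(signal.SIGKILL)
proc.wait()

# Popen reports "killed by signal N" as a negative return code -N,
# which is exactly the "exits with return code = -9" in the launcher log.
print(proc.returncode)  # -9
```

If this is indeed an OOM kill, checking the kernel log (e.g. `dmesg`) on each worker node for oom-killer entries around the crash time should confirm it.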