I'm trying to reproduce the experiments from the second paper, but I've run into a problem I don't know how to solve. The full log is below:
[2024-09-27 20:21:18,768] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-27 20:21:22,006] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
/home/pr/conda/ENTER/envs/rlhf/lib/python3.8/site-packages/transformers/deepspeed.py:24: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
warnings.warn(
1
1
[2024-09-27 20:21:23,268] [INFO] [comm.py:652:init_distributed] cdb=None
[2024-09-27 20:21:23,268] [INFO] [comm.py:683:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
/home/pr/conda/ENTER/envs/rlhf/lib/python3.8/site-packages/accelerate/accelerator.py:457: FutureWarning: Passing the following arguments to Accelerator is deprecated and will be removed in version 1.0 of Accelerate: dict_keys(['split_batches']). Please pass an accelerate.DataLoaderConfiguration instead:
dataloader_config = DataLoaderConfiguration(split_batches=True)
warnings.warn(
2024-09-27 20:21:23 - INFO - Load init model from /mnt/sda/pr/new_proj/MOSS-RLHF/models/llama-2-7b-hf
2024-09-27 20:21:23 - INFO - Loading tokenizer from huggingface: /mnt/sda/pr/new_proj/MOSS-RLHF/models/llama-2-7b-hf...
2024-09-27 20:21:23 - INFO - Llama tokenizer size: 32000
2024-09-27 20:21:23 - INFO - Llama tokenizer pad token: <unk>, pad_token_id: 0
2024-09-27 20:21:23 - INFO - Llama tokenizer. special token: {'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>', 'pad_token': '<unk>'}
Loading checkpoint shards: 100%|██████████| 2/2 [00:01<00:00, 1.12it/s]
Some weights of LlamaRewardModel were not initialized from the model checkpoint at /mnt/sda/pr/new_proj/MOSS-RLHF/models/llama-2-7b-hf and are newly initialized: ['reward_head.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
2024-09-27 20:21:31 - INFO - Got 151214 samples from /mnt/sda/pr/new_proj/MOSS-RLHF/data/data_clean/hh-rlhf-strength-cleaned/train.json
2024-09-27 20:21:31 - INFO - Got 151214 samples totally from ['train.json']
[2024-09-27 20:21:31,825] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.15.1, git-hash=unknown, git-branch=unknown
[2024-09-27 20:21:31,825] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 1
E0927 20:21:54.924243 132676778354496 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: -11) local_rank: 0 (pid: 1339036) of binary: /home/pr/conda/ENTER/envs/rlhf/bin/python3.8
Traceback (most recent call last):
  File "/home/pr/conda/ENTER/envs/rlhf/bin/accelerate", line 10, in <module>
    sys.exit(main())
  File "/home/pr/conda/ENTER/envs/rlhf/lib/python3.8/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
    args.func(args)
  File "/home/pr/conda/ENTER/envs/rlhf/lib/python3.8/site-packages/accelerate/commands/launch.py", line 1159, in launch_command
    deepspeed_launcher(args)
  File "/home/pr/conda/ENTER/envs/rlhf/lib/python3.8/site-packages/accelerate/commands/launch.py", line 852, in deepspeed_launcher
    distrib_run.run(args)
  File "/home/pr/conda/ENTER/envs/rlhf/lib/python3.8/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/home/pr/conda/ENTER/envs/rlhf/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/pr/conda/ENTER/envs/rlhf/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
train_rm.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-09-27_20:21:54
  host      : ps.ps
  rank      : 0 (local_rank: 0)
  exitcode  : -11 (pid: 1339036)
  error_file: <N/A>
  traceback : Signal 11 (SIGSEGV) received by PID 1339036
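For context: exit code -11 means the worker process died on a segmentation fault before Python could raise an exception, so the launcher traceback above only shows the `accelerate` wrapper, not where `train_rm.py` actually crashed. One low-effort way to localize it is the standard-library `faulthandler` module, which dumps the Python stack of every thread to stderr when SIGSEGV arrives. This is only a sketch, and placing it at the very top of `train_rm.py` is my assumption:

```python
import faulthandler
import sys

# Assumption: this goes at the very top of train_rm.py, before the heavy
# imports (torch, deepspeed, transformers), so a segfault inside any later
# native call still produces a Python-level traceback on stderr instead of
# the bare "Signal 11 (SIGSEGV)" reported by the elastic launcher.
faulthandler.enable(file=sys.stderr, all_threads=True)

# Sanity check that the handler is installed.
assert faulthandler.is_enabled()
```

If the dumped stack ends inside a native extension, the crash is below Python; rerunning on a single GPU, or with `CUDA_LAUNCH_BLOCKING=1` set, may then help narrow down which kernel or library call is responsible.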