[ERROR] [launch.py:434:sigkill_handler] #430

Open
@TheGravityZero

When I run this:

bash training_scripts/single_node/run_1.3b.sh

I get this error:

Traceback (most recent call last):
  File "main.py", line 339, in <module>
    main()
  File "main.py", line 218, in main
    train_dataset, eval_dataset = create_prompt_dataset(
  File "/data/git/DeepSpeedExamples/applications/DeepSpeed-Chat/training/utils/data/data_utils.py", line 328, in create_prompt_dataset
    torch.save(train_dataset, train_fname)
  File "/data/anaconda3/envs/deepspeed/lib/python3.8/site-packages/torch/serialization.py", line 422, in save
    with _open_zipfile_writer(f) as opened_zipfile:
  File "/data/anaconda3/envs/deepspeed/lib/python3.8/site-packages/torch/serialization.py", line 309, in _open_zipfile_writer
    return container(name_or_buffer)
  File "/data/anaconda3/envs/deepspeed/lib/python3.8/site-packages/torch/serialization.py", line 287, in __init__
    super(_open_zipfile_writer_file, self).__init__(torch._C.PyTorchFileWriter(str(name)))
RuntimeError: File /tmp/data_files//traindata_-9071072583361525875.pt cannot be opened.
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Installed CUDA version 11.3 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination
Using /data/.cache/torch_extensions/py38_cu117 as PyTorch extensions root...
[2023-04-25 21:53:13,367] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 26822
[2023-04-25 21:53:14,277] [ERROR] [launch.py:434:sigkill_handler] ['/data/anaconda3/envs/deepspeed/bin/python', '-u', 'main.py', '--local_rank=7', '--data_path', 'Dahoas/rm-static', 'Dahoas/full-hh-rlhf', 'Dahoas/synthetic-instruct-gptj-pairwise', 'yitingxie/rlhf-reward-datasets', 'openai/webgpt_comparisons', 'stanfordnlp/SHP', '--data_split', '2,4,4', '--model_name_or_path', 'facebook/opt-1.3b', '--per_device_train_batch_size', '4', '--per_device_eval_batch_size', '4', '--max_seq_len', '512', '--learning_rate', '9.65e-6', '--weight_decay', '0.1', '--num_train_epochs', '2', '--gradient_accumulation_steps', '1', '--lr_scheduler_type', 'cosine', '--num_warmup_steps', '0', '--seed', '1234', '--zero_stage', '2', '--deepspeed', '--output_dir', './output'] exits with return code = 1
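
The RuntimeError above comes from torch.save failing to open a file under /tmp/data_files/. Possible causes (assumptions, not confirmed in this report) include the directory being missing, not writable by the current user, or on a full disk. A minimal sketch for checking these conditions, using only the Python standard library; the helper name check_save_dir is hypothetical and is not part of DeepSpeed-Chat:

```python
import os
import shutil
import tempfile


def check_save_dir(path):
    """Report whether `path` looks usable as a target directory for torch.save.

    Hypothetical diagnostic helper; not part of DeepSpeed-Chat or PyTorch.
    """
    report = {
        "exists": os.path.isdir(path),
        "writable": False,
        "free_bytes": None,
    }
    if not report["exists"]:
        return report
    # Free space on the filesystem that holds the directory.
    report["free_bytes"] = shutil.disk_usage(path).free
    try:
        # Create a real temporary file: permission bits alone can be misleading
        # (for example on read-only mounts).
        with tempfile.NamedTemporaryFile(dir=path):
            report["writable"] = True
    except OSError:
        pass
    return report


if __name__ == "__main__":
    # Directory taken from the traceback above.
    print(check_save_dir("/tmp/data_files"))
```

If "exists" is False, creating the directory before launching may be worth trying; if "writable" is False or "free_bytes" is very small, the permissions or disk usage of /tmp are the next things to look at.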

Labels: deespeed chat (DeepSpeed Chat), system (An issue with an environment/system setup.)
