run deepspeed_chat example code error #313

Open
@bestpredicts

When I run bash training_scripts/single_node/run_1.3b.sh, I get the following error:

ninja: no work to do.
Loading extension module fused_adam...
Time to load fused_adam op: 0.0961456298828125 seconds
Loading extension module fused_adam...
Time to load fused_adam op: 0.10256075859069824 seconds
Loading extension module fused_adam...
Time to load fused_adam op: 0.10253238677978516 seconds
Loading extension module fused_adam...
Time to load fused_adam op: 0.10290169715881348 seconds
Loading extension module fused_adam...
Time to load fused_adam op: 0.10215353965759277 seconds
Loading extension module fused_adam...
Time to load fused_adam op: 0.1021888256072998 seconds
Loading extension module fused_adam...
Time to load fused_adam op: 0.10500884056091309 seconds
load data done.

Using /root/.cache/torch_extensions/py310_cu117 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py310_cu117/fused_adam/build.ninja...
Building extension module fused_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module fused_adam...
Time to load fused_adam op: 0.09672141075134277 seconds
[2023-04-15 15:31:34,267] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.9.1+cc67f22f, git-hash=cc67f22f, git-branch=master
[2023-04-15 15:31:34,272] [INFO] [comm.py:580:init_distributed] Distributed backend already initialized
[2023-04-15 15:31:45,812] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 41526
[2023-04-15 15:31:46,842] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 41527
[2023-04-15 15:31:46,842] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 41528
[2023-04-15 15:31:46,844] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 41529
[2023-04-15 15:31:46,845] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 41530
[2023-04-15 15:31:46,847] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 41531
[2023-04-15 15:31:46,848] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 41532
[2023-04-15 15:31:46,849] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 41533
[2023-04-15 15:31:46,850] [ERROR] [launch.py:434:sigkill_handler] ['/opt/conda/bin/python', '-u', 'main.py', '--local_rank=7', '--model_name_or_path', '/code/tmp/pretrained_model/opt-1.3b', '--gradient_accumulation_steps', '2', '--zero_stage', '2', '--per_device_train_batch_size', '4', '--per_device_eval_batch_size', '4', '--max_seq_len', '512', '--learning_rate', '1e-5', '--deepspeed', '--output_dir', './output'] exits with return code = -7
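
A note on reading the last line: assuming the launcher reports the value from Python's subprocess module, a negative return code means the child process was terminated by a signal whose number is the absolute value of the code. The sketch below decodes such a code; `describe_return_code` is a hypothetical helper written for illustration, not part of DeepSpeed.

```python
import signal


def describe_return_code(code: int) -> str:
    """Translate a subprocess-style return code into a readable message.

    Assumption: the code follows Python's subprocess convention, where a
    negative value -N means the child was killed by signal N.
    """
    if code < 0:
        signum = -code
        try:
            # Signal numbering is platform-dependent, so look it up locally.
            name = signal.Signals(signum).name
        except ValueError:
            name = "unknown signal"
        return f"killed by signal {signum} ({name})"
    return f"exited with status {code}"


# The return code from the log above.
print(describe_return_code(-7))
```

On a typical Linux x86-64 machine, signal 7 is SIGBUS (bus error). In containers, one commonly reported cause of SIGBUS is exhausted shared memory (/dev/shm), though nothing in this log confirms that this is what happened here.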

Labels: bug, deespeed chat, hybrid engine
