Step2: memory allocation of 2097152 bytes failed #321

Open
@YukinoshitaKaren

When I run step 2 using 'bash training_scripts/single_node/run_350m.sh', I get the following error:

[2023-04-16 21:36:09,031] [INFO] [launch.py:235:main] nnodes=1, num_local_procs=8, node_rank=0
[2023-04-16 21:36:09,031] [INFO] [launch.py:246:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]})
[2023-04-16 21:36:09,031] [INFO] [launch.py:247:main] dist_world_size=8
[2023-04-16 21:36:09,031] [INFO] [launch.py:249:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
[2023-04-16 21:36:16,042] [INFO] [comm.py:586:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
memory allocation of 2097152 bytes failed
memory allocation of 2097152 bytes failed
memory allocation of 2097152 bytes failed
[2023-04-16 21:54:13,235] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 62215
[2023-04-16 21:54:16,017] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 62216
[2023-04-16 21:54:16,029] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 62217
[2023-04-16 21:54:18,477] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 62218
[2023-04-16 21:54:21,046] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 62219
[2023-04-16 21:54:21,057] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 62220
[2023-04-16 21:54:21,060] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 62221
[2023-04-16 21:54:23,710] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 62222

I have already allocated 50 GB of memory, but it still fails.
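
For triage, it may help to confirm how much memory the launched processes can actually see. The failed request is small (2097152 bytes is 2 MiB), and the log shows 8 local processes sharing the node. Below is a minimal sketch, assuming Linux and assuming the failure concerns host memory rather than GPU memory (the log does not say which). The helper names (`format_limit`, `read_meminfo`, `memory_report`) are hypothetical and not part of DeepSpeed; the per-process figure is just an even split, not a measurement.

```python
import os
import resource


def format_limit(value):
    """Render an rlimit value in GiB, or 'unlimited' for RLIM_INFINITY."""
    if value == resource.RLIM_INFINITY:
        return "unlimited"
    return f"{value / 2**30:.2f} GiB"


def read_meminfo(path="/proc/meminfo"):
    """Parse /proc/meminfo into {field: bytes}; empty dict if unavailable."""
    info = {}
    if not os.path.exists(path):
        return info
    with open(path) as f:
        for line in f:
            key, _, rest = line.partition(":")
            parts = rest.split()
            if not parts or not parts[0].isdigit():
                continue
            # Sizes carry a "kB" unit; plain counters (e.g. HugePages_Total) do not.
            scale = 1024 if len(parts) > 1 and parts[1] == "kB" else 1
            info[key.strip()] = int(parts[0]) * scale
    return info


def memory_report(world_size=8):
    """Collect the address-space limit and available host memory.

    world_size defaults to 8 to match num_local_procs in the log above.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_AS)
    report = {
        "rlimit_as_soft": format_limit(soft),
        "rlimit_as_hard": format_limit(hard),
    }
    available = read_meminfo().get("MemAvailable")
    if available is not None:
        report["mem_available"] = f"{available / 2**30:.2f} GiB"
        # Naive even split across ranks (an assumption, not a measurement).
        report["per_process_share"] = f"{available / world_size / 2**30:.2f} GiB"
    return report


if __name__ == "__main__":
    for name, value in memory_report().items():
        print(f"{name}: {value}")
```

Running this inside the same job allocation or container as the training script would show whether the 50 GB is visible there, and whether an address-space limit applies to each process.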

Labels: bug (Something isn't working), deespeed chat (DeepSpeed Chat)
