Single node multi card training failed

(deepspeed) [menkeyi@gpu1 DeepSpeed-Chat]$ python train.py --actor-model facebook/opt-13b --reward-model facebook/opt-350m --deployment-type single_node
---=== Running Step 1 ===---
Running:
bash /home/menkeyi/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step1_supervised_finetuning/training_scripts/single_node/run_13b.sh /home/menkeyi/DeepSpeedExamples/applications/DeepSpeed-Chat/output/actor-models/13b


GPU utilization (all eight GPUs report 100% with about 1 GiB of memory in use each, and no processes are listed):
(deepspeed) [menkeyi@gpu1 DeepSpeed-Chat]$ nvidia-smi
Sat Apr 15 15:02:38 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.161.03   Driver Version: 470.161.03   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100 80G...  Off  | 00000000:4F:00.0 Off |                    0 |
| N/A   35C    P0    71W / 300W |   1015MiB / 80994MiB |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100 80G...  Off  | 00000000:52:00.0 Off |                    0 |
| N/A   36C    P0    69W / 300W |   1019MiB / 80994MiB |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A100 80G...  Off  | 00000000:56:00.0 Off |                    0 |
| N/A   35C    P0    67W / 300W |   1019MiB / 80994MiB |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A100 80G...  Off  | 00000000:57:00.0 Off |                    0 |
| N/A   37C    P0    69W / 300W |   1019MiB / 80994MiB |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   4  NVIDIA A100 80G...  Off  | 00000000:CE:00.0 Off |                    0 |
| N/A   36C    P0    69W / 300W |   1019MiB / 80994MiB |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   5  NVIDIA A100 80G...  Off  | 00000000:D1:00.0 Off |                    0 |
| N/A   37C    P0    70W / 300W |   1019MiB / 80994MiB |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   6  NVIDIA A100 80G...  Off  | 00000000:D5:00.0 Off |                    0 |
| N/A   38C    P0    70W / 300W |   1019MiB / 80994MiB |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   7  NVIDIA A100 80G...  Off  | 00000000:D6:00.0 Off |                    0 |
| N/A   40C    P0    74W / 300W |    999MiB / 80994MiB |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
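The pattern in the table above (full utilization, almost no memory in use) can be checked mechanically. The following is a minimal sketch, not part of DeepSpeed: it parses text in the shape produced by `nvidia-smi --query-gpu=index,utilization.gpu,memory.used,memory.total --format=csv,noheader,nounits`. The sample rows are transcribed by hand from the table above, and the function names and thresholds (`util_min`, `mem_frac_max`) are assumptions chosen for illustration.

```python
import csv
import io

# Rows in the shape of the nvidia-smi CSV query output (index, util %, used MiB, total MiB).
# Values transcribed by hand from the table above (GPUs 0 and 7 only).
SAMPLE = """0, 100, 1015, 80994
7, 100, 999, 80994
"""


def parse_gpu_rows(text):
    """Parse 'index, util, used, total' CSV rows into a list of dicts."""
    rows = []
    for rec in csv.reader(io.StringIO(text), skipinitialspace=True):
        if not rec:
            continue  # skip blank lines
        index, util, used, total = (int(field.strip()) for field in rec)
        rows.append(
            {"index": index, "util": util, "used_mib": used, "total_mib": total}
        )
    return rows


def busy_but_empty(rows, util_min=90, mem_frac_max=0.05):
    """Return indices of GPUs that are busy but hold almost no data.

    The thresholds are illustrative assumptions, not NVIDIA or DeepSpeed defaults.
    """
    return [
        r["index"]
        for r in rows
        if r["util"] >= util_min and r["used_mib"] <= mem_frac_max * r["total_mib"]
    ]


if __name__ == "__main__":
    print(busy_but_empty(parse_gpu_rows(SAMPLE)))
```

A GPU flagged this way is spinning without having loaded model weights, which is consistent with the log stopping at distributed initialization, though it does not by itself identify the cause.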




(deepspeed) [menkeyi@gpu1 DeepSpeed-Chat]$ tail -f  output/actor-models/13b/training.log
[2023-04-15 14:56:08,436] [WARNING] [runner.py:190:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-04-15 14:56:08,601] [INFO] [runner.py:540:main] cmd = /home/menkeyi/.conda/envs/deepspeed/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgMywgNCwgNSwgNiwgN119 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None main.py --data_path Dahoas/rm-static Dahoas/full-hh-rlhf Dahoas/synthetic-instruct-gptj-pairwise yitingxie/rlhf-reward-datasets openai/webgpt_comparisons stanfordnlp/SHP --data_split 2,4,4 --model_name_or_path facebook/opt-13b --per_device_train_batch_size 4 --per_device_eval_batch_size 4 --max_seq_len 512 --learning_rate 1e-4 --weight_decay 0.1 --num_train_epochs 2 --gradient_accumulation_steps 1 --lr_scheduler_type cosine --num_warmup_steps 0 --seed 1234 --gradient_checkpointing --zero_stage 3 --lora_dim 128 --lora_module_name decoder.layers. --deepspeed --output_dir /home/menkeyi/DeepSpeedExamples/applications/DeepSpeed-Chat/output/actor-models/13b
[2023-04-15 14:56:13,294] [INFO] [launch.py:222:main] 0 NCCL_P2P_LEVEL=SYS
[2023-04-15 14:56:13,294] [INFO] [launch.py:229:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]}
[2023-04-15 14:56:13,294] [INFO] [launch.py:235:main] nnodes=1, num_local_procs=8, node_rank=0
[2023-04-15 14:56:13,294] [INFO] [launch.py:246:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]})
[2023-04-15 14:56:13,294] [INFO] [launch.py:247:main] dist_world_size=8
[2023-04-15 14:56:13,294] [INFO] [launch.py:249:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
[2023-04-15 14:56:28,848] [INFO] [comm.py:586:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
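The `--world_info` argument in the launch command above is an encoded copy of the rank layout. A small sketch for decoding it, assuming the value is URL-safe base64 over a JSON object (which matches the `WORLD INFO DICT` line in the log); the helper name `decode_world_info` is hypothetical:

```python
import base64
import json

# Value copied from the --world_info argument in the log above.
WORLD_INFO = "eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgMywgNCwgNSwgNiwgN119"


def decode_world_info(encoded):
    """Decode a base64-encoded JSON mapping of hostname -> list of GPU indices."""
    # Restore any '=' padding that may have been stripped from the argument.
    padded = encoded + "=" * (-len(encoded) % 4)
    return json.loads(base64.urlsafe_b64decode(padded))


if __name__ == "__main__":
    print(decode_world_info(WORLD_INFO))
```

Decoding confirms that the launcher planned eight local ranks on `localhost`, so the launch configuration itself looks as intended and the stall occurs later, at or after NCCL backend initialization.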
