Description
Describe the bug
I can train my model successfully with ZeRO Stage 3, but when I run the same model with ZeRO Stage 2 I hit the error below. Every rank raises the same traceback; the pdsh output interleaves them across nodes, so it is shown once here with the per-node IP prefixes removed:
Traceback (most recent call last):
  File "train.py", line 706, in <module>
    main()
  File "train.py", line 587, in main
    accelerator.backward(loss)
  File "/opt/conda/envs/progen/lib/python3.8/site-packages/accelerate/accelerator.py", line 1960, in backward
    self.deepspeed_engine_wrapped.backward(loss, **kwargs)
  File "/opt/conda/envs/progen/lib/python3.8/site-packages/accelerate/utils/deepspeed.py", line 176, in backward
    self.engine.step()
  File "/opt/conda/envs/progen/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 2169, in step
    self._take_model_step(lr_kwargs)
  File "/opt/conda/envs/progen/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 2075, in _take_model_step
    self.optimizer.step()
  File "/opt/conda/envs/progen/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1842, in step
    scaled_global_grad_norm = self.scaled_global_norm()
  File "/opt/conda/envs/progen/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1789, in scaled_global_norm
    return torch.norm(torch.stack(norm_groups), p=norm_type)
  File "/opt/conda/envs/progen/lib/python3.8/site-packages/torch/functional.py", line 1626, in norm
    return torch.linalg.vector_norm(input, _p, _dim, keepdim, dtype=dtype)
RuntimeError: linalg.vector_norm: Expected a floating point or complex tensor as input. Got Long
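The failure itself is at the PyTorch level: torch.norm() dispatches to torch.linalg.vector_norm(), which refuses integer tensors. Below is a minimal sketch of my own (not the DeepSpeed code) of what scaled_global_norm() appears to run into if any of the stacked per-group norms ends up as a Long tensor; the Long scalars are stand-ins, not values taken from my run.

import torch

# Minimal sketch of the PyTorch-level failure: torch.norm() dispatches to
# torch.linalg.vector_norm(), which rejects integer inputs.
norm_groups = [torch.tensor(0), torch.tensor(0)]   # Long (int64) stand-ins for per-group norms
try:
    torch.norm(torch.stack(norm_groups), p=2)
except RuntimeError as err:
    print(err)   # linalg.vector_norm: Expected a floating point or complex tensor as input. Got Long

# Casting the stacked norms to a floating dtype makes the same call succeed:
print(torch.norm(torch.stack(norm_groups).float(), p=2))   # tensor(0.)

This is only to show which call rejects which dtype; I have not confirmed where the Long tensor originates inside ZeRO Stage 2.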
To Reproduce
I am using a cluster of 6 machines, each equipped with 40 GB A100 GPUs (48 processes in total, i.e. 8 GPUs per machine).
Launch command:
accelerate launch --config_file config/yaml/zero.yaml train.py --with_tracking --report_to tensorboard --output_dir tblog --project_name test --checkpointing_steps epoch --per_device_train_batch_size 8 --num_train_epochs 50 --learning_rate 5e-5 --seed 42 --precision bf16
Here is my Accelerate config (config/yaml/zero.yaml):
compute_environment: LOCAL_MACHINE
deepspeed_config:
  deepspeed_config_file: ./config/zero/zero_stage2_bf16.json
  deepspeed_multinode_launcher: pdsh
  deepspeed_hostfile: ./config/yaml/hostfile
  zero3_init_flag: true
distributed_type: DEEPSPEED
fsdp_config: {}
machine_rank: 0
main_process_ip: 10.223.17.15
main_process_port: 36769
main_training_function: main
num_machines: 6
num_processes: 48
use_cpu: false
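The hostfile referenced above is not included here; for context, it follows the standard DeepSpeed "hostname slots=N" layout, roughly as sketched below. The addresses are the node IPs visible in the traceback, slots=8 follows from num_processes / num_machines = 48 / 6, and the sixth node is omitted because its output does not appear in the excerpt above. This is an assumed reconstruction, not the actual file.

10.223.17.15 slots=8
10.223.13.141 slots=8
10.78.107.139 slots=8
10.67.196.141 slots=8
10.78.121.13 slots=8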
zero_stage2_bf16.json:
{
  "bf16": {
    "enabled": true
  },
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": "auto",
      "weight_decay": "auto",
      "torch_adam": true,
      "adam_w_mode": true
    }
  },
  "scheduler": {
    "type": "WarmupDecayLR",
    "params": {
      "warmup_min_lr": "auto",
      "warmup_max_lr": "auto",
      "warmup_num_steps": "auto",
      "total_num_steps": 533770
    }
  },
  "zero_optimization": {
    "stage": 2,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "offload_param": {
      "device": "cpu",
      "pin_memory": true
    },
    "allgather_partitions": true,
    "allgather_bucket_size": 2e8,
    "overlap_comm": true,
    "reduce_scatter": true,
    "reduce_bucket_size": "auto",
    "contiguous_gradients": true
  },
  "gradient_accumulation_steps": 1,
  "gradient_clipping": "auto",
  "steps_per_print": 2000,
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "wall_clock_breakdown": false
}
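In case it helps with triage, below is a small diagnostic sketch of my own (not DeepSpeed code) that can be run on the model before accelerator.prepare() to check whether any parameter or gradient carries a non-floating dtype, since that is exactly what torch.linalg.vector_norm rejects. Here, model is a stand-in for whatever nn.Module is being trained.

import torch

def report_non_float_tensors(model: torch.nn.Module) -> None:
    """Diagnostic sketch: list parameters/gradients whose dtype is not floating point."""
    for name, param in model.named_parameters():
        if not param.dtype.is_floating_point:
            print(f"non-float parameter: {name} dtype={param.dtype} shape={tuple(param.shape)}")
        if param.grad is not None and not param.grad.dtype.is_floating_point:
            print(f"non-float gradient: {name} dtype={param.grad.dtype}")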