Description
We use DeepSpeed-Chat to train step 3 (RLHF), with a BLOOM model instead of an OPT model as the actor, and with the hybrid engine and ZeRO-3 enabled. A rough sketch of the setup is below, followed by the error we hit.
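The following is a minimal sketch (not our exact step-3 script) of the actor-side configuration that triggers the crash. The model name, batch size, learning rate, and token counts are placeholders, and the exact config keys in our launch scripts may differ slightly; the relevant part is ZeRO stage 3 combined with the hybrid engine.

```python
# Minimal sketch, not our real training script: a BLOOM actor wrapped by
# DeepSpeed with ZeRO-3 and the hybrid engine enabled.
# Launch with the deepspeed launcher, e.g.: deepspeed repro.py
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "bigscience/bloom-560m"  # placeholder; our actor is a larger BLOOM checkpoint

ds_config = {
    "train_batch_size": 8,
    "train_micro_batch_size_per_gpu": 8,
    "fp16": {"enabled": True},
    "zero_optimization": {"stage": 3},  # actor ZeRO stage 3
    "hybrid_engine": {                  # what --enable_hybrid_engine turns on
        "enabled": True,
        "max_out_tokens": 512,
    },
}

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-6)  # placeholder optimizer/lr

# With "hybrid_engine.enabled" set, deepspeed.initialize returns a hybrid
# engine that flips the ZeRO-3 partitioned actor into inference mode for
# experience generation and back to training mode for the PPO updates.
engine, *_ = deepspeed.initialize(model=model,
                                  optimizer=optimizer,
                                  config=ds_config)

prompts = tokenizer(["Hello"], return_tensors="pt").input_ids.to(engine.device)
# Same call pattern as ppo_trainer._generate_sequence in our copy of
# DeepSpeed-Chat; in our full step-3 run this is the call that dies.
seq = engine.module.generate(prompts, max_new_tokens=16)
```

With this setup, generation in step 3 fails with: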
```
Traceback (most recent call last):
File "main.py", line 518, in <module>
main()
File "main.py", line 427, in main
out = trainer.generate_experience(prompts)
File "/data/nlp/public_data/deepspeed-chat/training/step3_rlhf_finetuning/ppo_trainer.py", line 109, in generate_experience
seq = self._generate_sequence(prompts)
File "/data/nlp/public_data/deepspeed-chat/training/step3_rlhf_finetuning/ppo_trainer.py", line 76, in _generate_sequence
seq = self.actor_model.module.generate(prompts,
File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/hybrid_engine.py", line 254, in generate
generate_ret_vals = self._generate(*inputs, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/transformers/generation/utils.py", line 1563, in generate
return self.sample(
File "/usr/local/lib/python3.8/dist-packages/transformers/generation/utils.py", line 2610, in sample
outputs = self(
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1570, in _call_impl
result = forward_call(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/transformers/models/bloom/modeling_bloom.py", line 913, in forward
transformer_outputs = self.transformer(
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1570, in _call_impl
result = forward_call(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/transformers/models/bloom/modeling_bloom.py", line 786, in forward
outputs = block(
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1570, in _call_impl
result = forward_call(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/hybrid_engine.py", line 340, in run_forward
with GatheredParameters(non_active_params):
File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/zero/partition_parameters.py", line 1649, in __enter__
self.params[0].all_gather(param_list=self.params)
File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/zero/partition_parameters.py", line 873, in all_gather
return self._all_gather(param_list, async_op=async_op, hierarchy=hierarchy)
File "/usr/local/lib/python3.8/dist-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/zero/partition_parameters.py", line 1074, in _all_gather
ret_value = self._allgather_params_coalesced(all_gather_list, hierarchy)
File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/zero/partition_parameters.py", line 1286, in _allgather_params_coalesced
flat_tensor = torch.empty(tensor_size, dtype=param_list[0].dtype, device=self.local_device).view(-1)
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
```
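As the message itself notes, the illegal access is reported asynchronously, so the trace above may not point at the kernel that actually faulted. One way to get an accurate trace (at a significant speed cost) is to set CUDA_LAUNCH_BLOCKING=1 before CUDA is initialized, for example by exporting it in the launch environment or at the very top of main.py:

```python
# Sketch: force synchronous CUDA kernel launches so the Python stack trace
# points at the real failing call. Must run before torch initializes CUDA.
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # import torch (and deepspeed) only after the variable is set
```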
We have also tried hybrid engine + ZeRO-2, and ZeRO-3 with the hybrid engine disabled, and in both cases training runs normally.
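For reference, these are roughly the only parts of the actor's DeepSpeed config we vary between the three runs (key names as in the sketch above; the exact flag spellings in our launch scripts may differ):

```python
# Outcome of the three combinations we tried; only these two settings change.
zero3_hybrid    = {"zero_optimization": {"stage": 3}, "hybrid_engine": {"enabled": True}}   # crashes (this issue)
zero2_hybrid    = {"zero_optimization": {"stage": 2}, "hybrid_engine": {"enabled": True}}   # trains normally
zero3_no_hybrid = {"zero_optimization": {"stage": 3}, "hybrid_engine": {"enabled": False}}  # trains normally
```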