Does deepspeed hybrid-engine support the bloom model with zero3? #497

Open
@null-test-7

Description

We use DeepSpeed-Chat to train step 3 of the RLHF pipeline, with a BLOOM model in place of the OPT model as the actor, and with hybrid-engine and ZeRO-3 enabled. With that setup we hit the error below.
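
For context, here is a minimal sketch of the relevant parts of our actor config (the `hybrid_engine` key names follow DeepSpeed-Chat's `get_train_ds_config`; the values shown are illustrative, not our exact settings):

```python
# Sketch of the actor DeepSpeed config used in step 3 (illustrative values;
# key names match the standard DeepSpeed / DeepSpeed-Chat schema).
actor_ds_config = {
    "train_batch_size": 32,
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,  # ZeRO-3: parameters are partitioned across ranks
    },
    "hybrid_engine": {
        "enabled": True,  # the combination that triggers the crash
        "max_out_tokens": 512,
        "inference_tp_size": 1,
        "release_inference_cache": False,
        "pin_parameters": True,
        "tp_gather_partition_size": 8,
    },
}
```

The full traceback: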

Traceback (most recent call last):
  File "main.py", line 518, in <module>
    main()
  File "main.py", line 427, in main
    out = trainer.generate_experience(prompts)
  File "/data/nlp/public_data/deepspeed-chat/training/step3_rlhf_finetuning/ppo_trainer.py", line 109, in generate_experience
    seq = self._generate_sequence(prompts)
  File "/data/nlp/public_data/deepspeed-chat/training/step3_rlhf_finetuning/ppo_trainer.py", line 76, in _generate_sequence
    seq = self.actor_model.module.generate(prompts,
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/hybrid_engine.py", line 254, in generate
    generate_ret_vals = self._generate(*inputs, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/transformers/generation/utils.py", line 1563, in generate
    return self.sample(
  File "/usr/local/lib/python3.8/dist-packages/transformers/generation/utils.py", line 2610, in sample
    outputs = self(
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1570, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/transformers/models/bloom/modeling_bloom.py", line 913, in forward
    transformer_outputs = self.transformer(
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1570, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/transformers/models/bloom/modeling_bloom.py", line 786, in forward
    outputs = block(
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1570, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/hybrid_engine.py", line 340, in run_forward
    with GatheredParameters(non_active_params):
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/zero/partition_parameters.py", line 1649, in __enter__
    self.params[0].all_gather(param_list=self.params)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/zero/partition_parameters.py", line 873, in all_gather
    return self._all_gather(param_list, async_op=async_op, hierarchy=hierarchy)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/zero/partition_parameters.py", line 1074, in _all_gather
    ret_value = self._allgather_params_coalesced(all_gather_list, hierarchy)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/zero/partition_parameters.py", line 1286, in _allgather_params_coalesced
    flat_tensor = torch.empty(tensor_size, dtype=param_list[0].dtype, device=self.local_device).view(-1)
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
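
Per the error message's own suggestion, rerunning with `CUDA_LAUNCH_BLOCKING=1` should make the traceback point at the kernel that actually faults (a sketch; the variable must be set before the CUDA context is created):

```python
# Force synchronous CUDA launches so the illegal memory access is reported
# at the real call site rather than at a later, unrelated API call.
import os

os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # must precede torch/deepspeed import

import torch  # CUDA context is created lazily, after the flag is set
```

Equivalently, the variable can be exported in the shell before launching the training script.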

We have also tried hybrid-engine + ZeRO-2, and ZeRO-3 with hybrid-engine disabled; in both cases training runs normally.
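
If it helps triage, below is our guess at a minimal reproducer that isolates the failing path (the coalesced all-gather inside `GatheredParameters`, which `hybrid_engine.run_forward` enters with the non-active parameters). We have not confirmed it reproduces the crash outside the full RLHF loop, and the model sizes and config values are placeholders:

```python
# Minimal repro sketch (our guess, not a confirmed reproducer).
# Assumptions: launched with `deepspeed repro.py`, one or more GPUs, fp16.
import deepspeed
import torch
from transformers import BloomConfig, BloomForCausalLM

deepspeed.init_distributed()

ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "fp16": {"enabled": True},
    "zero_optimization": {"stage": 3},
}

# Partition parameters at construction time, as ZeRO-3 does for the actor.
with deepspeed.zero.Init(config_dict_or_path=ds_config):
    model = BloomForCausalLM(
        BloomConfig(n_layer=2, hidden_size=64, n_head=4, vocab_size=1024))

# Gather every partitioned parameter, mirroring the coalesced all-gather
# that crashes in partition_parameters.py.
params = list(model.parameters())
with deepspeed.zero.GatheredParameters(params):
    print("gathered", sum(p.numel() for p in params), "elements OK")
```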

Metadata

Labels

deespeed chat (DeepSpeed Chat) · new-config (A modified config from the given example)
