DeeperSpeed cannot support BFloat16 and PipelineParallelism

**Describe the bug**
When training with an rwkv config (to avoid the issue from #1305), I get the following error:
```
Traceback (most recent call last):
  File "/home/hatef.4/neox/gpt-neox/train.py", line 35, in <module>
    main()
  File "/home/hatef.4/neox/gpt-neox/train.py", line 31, in main
    pretrain(neox_args=neox_args)
  File "/home/hatef.4/neox/gpt-neox/megatron/training.py", line 296, in pretrain
    iteration = train(
  File "/home/hatef.4/neox/gpt-neox/megatron/training.py", line 1465, in train
    loss_dict, skipped_iter = train_step(
  File "/home/hatef.4/neox/gpt-neox/megatron/training.py", line 1277, in train_step
    reduced_loss = train_step_pipe(
  File "/home/hatef.4/neox/gpt-neox/megatron/training.py", line 1374, in train_step_pipe
    loss = model.train_batch(data_iter=data_iterator)
  File "/home/hatef.4/neox/DeeperSpeed/deepspeed/runtime/pipe/engine.py", line 362, in train_batch
    self._exec_schedule(sched)
  File "/home/hatef.4/neox/DeeperSpeed/deepspeed/runtime/pipe/engine.py", line 1345, in _exec_schedule
    self._exec_instr(**cmd.kwargs)
  File "/home/hatef.4/neox/DeeperSpeed/deepspeed/runtime/pipe/engine.py", line 277, in _exec_reduce_grads
    self.allreduce_gradients(bucket_size=MEMORY_OPT_ALLREDUCE_SIZE)
  File "/home/hatef.4/neox/DeeperSpeed/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/home/hatef.4/neox/DeeperSpeed/deepspeed/runtime/engine.py", line 1898, in allreduce_gradients
    assert not (self.bfloat16_enabled() and self.pipeline_parallelism), \
AssertionError: allreduce_gradients() is not valid when bfloat+pipeline_parallelism is enabled
```
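For anyone triaging this, the failing condition can be sketched in isolation. The snippet below is a minimal stand-in, not DeeperSpeed code: `check_allreduce_supported` and `is_affected` are hypothetical names, and the two booleans stand for what `self.bfloat16_enabled()` and `self.pipeline_parallelism` evaluate to in the traceback above. Only the assertion's condition and message are copied from the traceback.

```python
def check_allreduce_supported(bf16_enabled: bool, pipeline_parallelism: bool) -> None:
    """Hypothetical stand-in for the guard in allreduce_gradients().

    Mirrors only the condition and message shown in the traceback;
    the real method does more than this.
    """
    assert not (bf16_enabled and pipeline_parallelism), (
        "allreduce_gradients() is not valid when "
        "bfloat+pipeline_parallelism is enabled"
    )


def is_affected(bf16_enabled: bool, pipeline_parallelism: bool) -> bool:
    """Return True if the given combination would trip the assertion."""
    try:
        check_allreduce_supported(bf16_enabled, pipeline_parallelism)
    except AssertionError:
        return True
    return False


if __name__ == "__main__":
    # Print the outcome for every combination of the two flags.
    for bf16 in (False, True):
        for pipe in (False, True):
            print(f"bf16={bf16}, pipeline={pipe}: affected={is_affected(bf16, pipe)}")
```

Under this reading, only the combination of both flags fails; either one alone passes the guard.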

**To Reproduce**
Steps to reproduce the behavior:
1. Install the latest DeeperSpeed.
2. Run training with the `rwkv/170M.yml` config.

**Proposed solution**
Merging DeeperSpeed with upstream would work, but #1306 needs to be fixed first.


