Describe the bug
When using an RWKV config (to avoid running into the issue from #1305), I get the following error:
Traceback (most recent call last):
File "/home/hatef.4/neox/gpt-neox/train.py", line 35, in <module>
main()
File "/home/hatef.4/neox/gpt-neox/train.py", line 31, in main
pretrain(neox_args=neox_args)
File "/home/hatef.4/neox/gpt-neox/megatron/training.py", line 296, in pretrain
iteration = train(
File "/home/hatef.4/neox/gpt-neox/megatron/training.py", line 1465, in train
loss_dict, skipped_iter = train_step(
File "/home/hatef.4/neox/gpt-neox/megatron/training.py", line 1277, in train_step
reduced_loss = train_step_pipe(
File "/home/hatef.4/neox/gpt-neox/megatron/training.py", line 1374, in train_step_pipe
loss = model.train_batch(data_iter=data_iterator)
File "/home/hatef.4/neox/DeeperSpeed/deepspeed/runtime/pipe/engine.py", line 362, in train_batch
self._exec_schedule(sched)
File "/home/hatef.4/neox/DeeperSpeed/deepspeed/runtime/pipe/engine.py", line 1345, in _exec_schedule
self._exec_instr(**cmd.kwargs)
File "/home/hatef.4/neox/DeeperSpeed/deepspeed/runtime/pipe/engine.py", line 277, in _exec_reduce_grads
self.allreduce_gradients(bucket_size=MEMORY_OPT_ALLREDUCE_SIZE)
File "/home/hatef.4/neox/DeeperSpeed/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/home/hatef.4/neox/DeeperSpeed/deepspeed/runtime/engine.py", line 1898, in allreduce_gradients
assert not (self.bfloat16_enabled() and self.pipeline_parallelism), \
AssertionError: allreduce_gradients() is not valid when bfloat+pipeline_parallelism is enabled
To Reproduce
Steps to reproduce the behavior:
- Install the latest DeeperSpeed
- Run the rwkv/170M.yml config (the relevant settings are sketched below)
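For context, the traceback shows bf16 precision and DeepSpeed's pipeline engine active at the same time (train_step_pipe → _exec_reduce_grads → allreduce_gradients), which is exactly the combination the assertion rejects. Below is a hypothetical config fragment — not the verbatim contents of rwkv/170M.yml — illustrating the settings assumed to trigger it; the usual launch path would be something like python ./deepy.py train.py with the config passed in.
# Hypothetical fragment, not the actual rwkv/170M.yml; shown only to
# illustrate the combination that trips the assertion (bf16 + pipeline engine).
{
  "pipe_parallel_size": 1,
  "precision": "bfloat16",
}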
Proposed solution
Merging DeeperSpeed with upstream DeepSpeed would fix this, but #1306 will need to be resolved first.