Description
The __init__ method of OptimizersInBackwardContainer has a bug:
torchtitan/torchtitan/optimizer.py, lines 99 to 101 in 2a44370
The hook closure tries to capture optim_dict, but Python closures bind the loop variable by reference, not by value. If model_parts has more than one entry, which can be the case with pipeline parallelism, every hook ends up capturing the last optim_dict, throwing an error on backward because the parameters of the first model part are not contained in that dict.
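A minimal sketch of the late-binding pattern described above (hypothetical names, not torchtitan's actual code), together with the usual fix of binding the current value via a default argument:

```python
# Hypothetical sketch of the bug: each hook should capture its own
# optim_dict, but a plain closure captures the *variable*, so after the
# loop every hook sees the last dict.
def make_hooks_buggy(optim_dicts):
    hooks = []
    for optim_dict in optim_dicts:  # optim_dict is rebound each iteration
        hooks.append(lambda param: optim_dict[param])  # late binding
    return hooks

# Fix: freeze the current dict at definition time with a default argument.
def make_hooks_fixed(optim_dicts):
    hooks = []
    for optim_dict in optim_dicts:
        hooks.append(lambda param, d=optim_dict: d[param])  # d bound per iteration
    return hooks

parts = [{"w": "opt0"}, {"w": "opt1"}]
print(make_hooks_buggy(parts)[0]("w"))  # prints opt1 -- wrong dict for part 0
print(make_hooks_fixed(parts)[0]("w"))  # prints opt0 -- correct
```

The same fix applies to any per-part hook registered in a loop; alternatively, a small factory function that takes optim_dict as a parameter creates a fresh scope per iteration.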
Also, the fused backward+optim code doesn't seem to handle gradient clipping.