
BUG: early_step_in_backward with pipeline parallelism and len(model_parts) > 1 #777

Open
@cassanof

Description

The init method for the OptimizersInBackwardContainer has a bug:

def optim_hook(param) -> None:
    optim_dict[param].step()
    optim_dict[param].zero_grad()

The hook closure captures optim_dict, but when model_parts contains more than one part, which can happen with pipeline parallelism, every hook ends up capturing the last part's optim_dict. Backward then raises an error, because the parameters of the first model part are not contained in that dict.
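The failure mode is Python's late-binding of closures over a loop variable. A minimal standalone sketch (hypothetical dicts, not the actual torchtitan code) showing the bug and one common fix, binding the dict at definition time via a default argument:

```python
# Buggy version: the closure looks up optim_dict at *call* time, so after
# the loop every hook resolves to the last model part's dict.
hooks = []
for optim_dict in [{"p1": "opt_a"}, {"p2": "opt_b"}]:
    def optim_hook(param) -> None:
        return optim_dict[param]  # late binding: wrong dict for earlier parts
    hooks.append(optim_hook)

# hooks[0]("p1") raises KeyError, since both hooks see {"p2": "opt_b"}.

# Fixed version: a default argument evaluates at *definition* time, so each
# hook keeps a reference to its own model part's optim_dict.
hooks_fixed = []
for optim_dict in [{"p1": "opt_a"}, {"p2": "opt_b"}]:
    def optim_hook(param, optim_dict=optim_dict) -> None:
        return optim_dict[param]
    hooks_fixed.append(optim_hook)
```

functools.partial over a dict argument would fix it equally well.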

Also, the fused backward+optim path does not appear to handle gradient clipping.
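For reference, the usual non-fused ordering clips the global gradient norm after the full backward and before any optimizer step; stepping each parameter inside its own grad hook leaves no point where all gradients coexist. A minimal standalone sketch of the standard ordering (not torchtitan code):

```python
import torch

# Standard flow: backward -> clip over ALL grads -> step. Fused
# backward+optim breaks this, because each parameter is stepped (and its
# grad zeroed) inside its own hook, before the global norm can be taken.
params = [torch.nn.Parameter(torch.ones(3)) for _ in range(2)]
opt = torch.optim.SGD(params, lr=0.1)

loss = sum((p ** 2).sum() for p in params)
loss.backward()
torch.nn.utils.clip_grad_norm_(params, max_norm=1.0)  # needs all grads present
opt.step()
```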

Labels: bug