Description
The __init__ method of OptimizersInBackwardContainer has a bug:
torchtitan/torchtitan/optimizer.py, lines 99 to 101 in 2a44370
The hook closure tries to capture optim_dict, but Python closures bind the loop variable by reference, not by value. If model_parts has more than one entry, which can be the case with pipeline parallelism, every hook ends up capturing the last optim_dict, throwing an error on backward because the parameters of the first model part are not contained in that dict.
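A minimal sketch of the late-binding pattern described above (hypothetical names, not torchtitan's actual code), together with the usual fix of binding the current value via a default argument:

```python
# Hypothetical sketch of the bug: each hook should capture its own
# optim_dict, but a plain closure captures the *variable*, so after the
# loop every hook sees the last dict.
def make_hooks_buggy(optim_dicts):
    hooks = []
    for optim_dict in optim_dicts:  # optim_dict is rebound each iteration
        hooks.append(lambda param: optim_dict[param])  # late binding
    return hooks

# Fix: freeze the current dict at definition time with a default argument.
def make_hooks_fixed(optim_dicts):
    hooks = []
    for optim_dict in optim_dicts:
        hooks.append(lambda param, d=optim_dict: d[param])  # d bound per iteration
    return hooks

parts = [{"w": "opt0"}, {"w": "opt1"}]
print(make_hooks_buggy(parts)[0]("w"))  # prints opt1 -- wrong dict for part 0
print(make_hooks_fixed(parts)[0]("w"))  # prints opt0 -- correct
```

The same fix applies to any per-part hook registered in a loop; alternatively, a small factory function that takes optim_dict as a parameter creates a fresh scope per iteration.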
Also, the fused backward+optim code doesn't seem to handle gradient clipping.