[Auto Parallel] Add tensor_fusion and overlap in auto dy sharding #72551
PR Category
Auto Parallel
PR Types
New features
Description
`param.main_grad` replaces the old `master_grad` in auto dy. It uses inplace `add_` to save or cast grads to fp32 and store them in `param.main_grad`. Enable with `export FLAGS_enable_inplace_master_grad=1`.
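A minimal sketch of the inplace `main_grad` idea in dygraph; `accumulate_main_grad` is a hypothetical helper (not the PR's actual API) called once a parameter's low-precision grad is ready:

```python
# Sketch: keep an fp32 accumulator on the parameter itself and fold each new
# low-precision grad into it with inplace add_, instead of a separate master_grad.
import paddle

def accumulate_main_grad(param):
    """Hypothetical hook body; `param` is a dygraph parameter with a fresh grad."""
    if param.grad is None:
        return
    if getattr(param, "main_grad", None) is None:
        # first accumulation: cast the grad up to fp32 once
        param.main_grad = param.grad.cast("float32")
    else:
        # later accumulations: inplace add_ avoids allocating a new fp32 buffer
        param.main_grad.add_(param.grad.cast("float32"))
    # the original low-precision grad can be released now
    param.clear_gradient()
```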
`tensor_fusion` groups params and grads into contiguous `param_storage` and `grad_storage` buffers. `grad_storage` is used for the grads' `reduce_scatter` comm, and `param_storage` is used for the params' `all_gather` comm. Params and grads are mapped into `param_storage` and `grad_storage` using `view_slice`.
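A minimal sketch of the fused-storage bookkeeping, under simplifying assumptions (single dtype, no alignment padding); `build_grad_storage` is a hypothetical name, and in the PR the per-param slices alias the buffer through `view_slice`, which the plain slicing below only approximates:

```python
# Sketch: flatten all grads of a fusion group into one contiguous buffer so a
# single reduce_scatter covers the whole group; param_storage is built the same way.
import paddle

def build_grad_storage(params, dtype="float32"):
    total = sum(int(p.numel()) for p in params)
    grad_storage = paddle.zeros([total], dtype=dtype)
    offset, grad_views = 0, {}
    for p in params:
        n = int(p.numel())
        # in the PR each grad is a true view (view_slice) into grad_storage;
        # here the slice just illustrates the offset bookkeeping
        grad_views[p.name] = grad_storage[offset:offset + n].reshape(p.shape)
        offset += n
    return grad_storage, grad_views
```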
`grad_clip` requires calling `all_reduce` manually to collect `global_norm_var`, since each rank only holds its shard of the grads (sketch below). Enable with `export FLAGS_enable_tensor_fusion=1`.
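A minimal sketch of why the manual `all_reduce` is needed: each rank computes the squared norm of the grads it owns, and the partial sums are reduced across the sharding group to form `global_norm_var`. The function name and group handling are assumptions:

```python
# Sketch: global grad norm for grad_clip under sharding.
import paddle
import paddle.distributed as dist

def sharded_global_grad_norm(local_grad_shards, group=None):
    local_sq_sum = paddle.zeros([1], dtype="float32")
    for g in local_grad_shards:
        g32 = g.cast("float32")
        local_sq_sum += paddle.sum(g32 * g32)
    # manually collect global_norm_var across the sharding group
    dist.all_reduce(local_sq_sum, op=dist.ReduceOp.SUM, group=group)
    return paddle.sqrt(local_sq_sum)
```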
Overlap: the `reduce_scatter` comm for grads overlaps with grad computation in bwd, and the `all_gather` comm for params overlaps with opt computation. Enable with `export FLAGS_enable_tensor_fusion=1`.
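A minimal sketch of the overlap scheduling using async collectives (`sync_op=False`); in the PR this is driven by backward hooks and the optimizer loop, and the helper names and group handling here are assumptions:

```python
# Sketch: launch comm asynchronously so it runs while computation continues;
# callers keep the returned task and wait() only when the result is consumed.
import paddle
import paddle.distributed as dist

def launch_grad_reduce_scatter(grad_storage, grad_shard, nranks, group=None):
    # fired from a backward hook once a fusion group's grads are ready; the
    # reduce_scatter proceeds while backward keeps computing other groups
    shards = list(paddle.split(grad_storage, nranks))
    return dist.reduce_scatter(grad_shard, shards, group=group, sync_op=False)

def launch_param_all_gather(param_shard, group=None):
    # fired right after the optimizer updates this rank's param shard; the
    # all_gather overlaps with optimizer work on the remaining shards
    gathered = []
    task = dist.all_gather(gathered, param_shard, group=group, sync_op=False)
    return task, gathered
```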
Note: non-uniform `tensor_fusion` changes the order of `add` in `grad_clip`, introducing some loss diff. Convergence results on llama7b, 1NC8, sharding8, 50,000 steps.
Pcard-70448