Multi stream sync problem in model init #2044

@liddk


Describe the bug
Is the following a bug when DDP is used for data parallelism?
--------code1---------

with torch.cuda.stream(torch.cuda.Stream()):
    model = [
        DP(
            config=config,
            ddp_config=ddp_config,
            module=model_chunk,
            # Turn off bucketing for model_chunk 2 onwards, since communication for these
            # model chunks is overlapped with compute anyway.
            disable_bucketing=(model_chunk_idx > 0)
            or args.overlap_param_gather_with_optimizer_step,
        )
        for (model_chunk_idx, model_chunk) in enumerate(model)
    ]

--------code2---------
for param in params[::-1]:
    param_start_index, param_end_index, bucket_id = self.param_index_map[param]
    # For MXFP8 param: we only need to map weight gradients to the buffer.
    if not self.ddp_config.reuse_grad_buf_for_mxfp8_param_ag:
        # Assign param.data to appropriate segment of self.param_data.
        if self.param_data is not None:
            new_param_data = self._get(
                param.data.shape, param_start_index, buffer_type=BufferType.PARAM
            )
            if is_float8tensor(param):
                modify_underlying_storage(param, new_param_data)
            else:
                old_param_data = param.data
                param.data = new_param_data
                assert old_param_data._base is None
                # Copy tensor values (from initialization or checkpoint).
                param.data.detach().copy_(old_param_data)
                del old_param_data

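DDP creates a new bucket buffer for the model parameters and copies each parameter's existing values into it (code2).
Those existing values were written during model initialization on the current stream, whereas DDP initialization (code1) runs on a newly created stream. Nothing appears to order the copy after the initialization writes, so this looks like a synchronization problem between the two streams.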
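
To make the concern concrete, here is a minimal sketch of the usual PyTorch pattern for ordering a copy on a side stream after work already queued on the current stream. It is not the Megatron-LM code or a confirmed fix; `copy_on_side_stream` is a hypothetical helper written only for illustration. The same idea would apply to code1 by creating the stream first and calling `wait_stream` on it before entering `torch.cuda.stream(...)`.

```python
import torch


def copy_on_side_stream(src: torch.Tensor) -> torch.Tensor:
    """Hypothetical helper: copy `src` into a new buffer on a side stream.

    The copy is ordered after any work already queued on the current
    stream (for example, parameter initialization kernels).
    """
    if not src.is_cuda:
        # CPU tensors have no streams; a plain clone is enough.
        return src.clone()

    producer = torch.cuda.current_stream(src.device)
    side = torch.cuda.Stream(device=src.device)

    # Allocate the destination on the current stream.
    dst = torch.empty_like(src)

    # Without this call, the copy below could start before the kernels
    # that wrote `src` on the producer stream have finished.
    side.wait_stream(producer)

    with torch.cuda.stream(side):
        dst.copy_(src)

    # Make later work on the producer stream wait for the copy.
    producer.wait_stream(side)
    return dst
```

For example, `copy_on_side_stream(torch.ones(4))` returns an independent copy on CPU, and on a CUDA tensor it performs the copy on a new stream with both ordering edges in place.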

Steps/Code to reproduce bug

Pretrain any model.

A helpful guide on how to craft a minimal bug report: http://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports.



Labels: bug
