- 
                Notifications
    
You must be signed in to change notification settings  - Fork 3.2k
 
Description
Describe the bug
Is it a bug when using DDP as data parallel in following code?
--------code1---------
Megatron-LM/megatron/training/training.py
Lines 992 to 1004 in 9458be9
| with torch.cuda.stream(torch.cuda.Stream()): | |
| model = [ | |
| DP( | |
| config=config, | |
| ddp_config=ddp_config, | |
| module=model_chunk, | |
| # Turn off bucketing for model_chunk 2 onwards, since communication for these | |
| # model chunks is overlapped with compute anyway. | |
| disable_bucketing=(model_chunk_idx > 0) | |
| or args.overlap_param_gather_with_optimizer_step, | |
| ) | |
| for (model_chunk_idx, model_chunk) in enumerate(model) | |
| ] | 
--------code2---------
Megatron-LM/megatron/core/distributed/param_and_grad_buffer.py
Lines 736 to 753 in 9458be9
| for param in params[::-1]: | |
| param_start_index, param_end_index, bucket_id = self.param_index_map[param] | |
| # For MXFP8 param: we only need to map weight gradients to the buffer. | |
| if not self.ddp_config.reuse_grad_buf_for_mxfp8_param_ag: | |
| # Assign param.data to appropriate segment of self.param_data. | |
| if self.param_data is not None: | |
| new_param_data = self._get( | |
| param.data.shape, param_start_index, buffer_type=BufferType.PARAM | |
| ) | |
| if is_float8tensor(param): | |
| modify_underlying_storage(param, new_param_data) | |
| else: | |
| old_param_data = param.data | |
| param.data = new_param_data | |
| assert old_param_data._base is None | |
| # Copy tensor values (from initialization or checkpoint). | |
| param.data.detach().copy_(old_param_data) | |
| del old_param_data | 
DDP will create new bucket for model param which will copy value from old value.
But old value is inited when model initialization in current stream. DDP initialization with a new stream will lead to sync problems.
Steps/Code to reproduce bug
Any model pretrain.
A helpful guide on on how to craft a minimal bug report http://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports.
Expected behavior
A clear and concise description of what you expected to happen.
Additional context
Add any other context about the problem here.