Draft(M-FSDP): Demonstrate the fix for Megatron-Bridge#3161
shjwudp wants to merge 13 commits into NVIDIA:main from …on-LM into mfsdp_fix_mcore_tp_detect

Conversation
/claude review
```python
debug_slices = []
for chunk_info, local_chunk in zip(local_chunks_info, all_local_chunks):
    offset = chunk_info["offset"]
    slices = tuple(slice(o, o + s) for o, s in zip(offset, local_chunk.shape))
    debug_slices.append(slices)
```
Leftover debug code: debug_slices is computed but never used. Should be removed.
Suggested change:

```diff
- debug_slices = []
- for chunk_info, local_chunk in zip(local_chunks_info, all_local_chunks):
-     offset = chunk_info["offset"]
-     slices = tuple(slice(o, o + s) for o, s in zip(offset, local_chunk.shape))
-     debug_slices.append(slices)
  # Reconstruct the full tensor by placing chunks at their correct offsets
```
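For context, here is a minimal, dependency-free sketch of the offset-based reconstruction the kept comment describes (hypothetical names, 1-D Python lists instead of torch tensors; the real code uses multi-dimensional slices):

```python
# Hypothetical sketch: each chunk record carries an "offset", and the full
# tensor is rebuilt by writing every local chunk at its recorded offset.
def reconstruct_full(local_chunks_info, all_local_chunks, full_len):
    full = [None] * full_len
    for chunk_info, local_chunk in zip(local_chunks_info, all_local_chunks):
        offset = chunk_info["offset"]
        full[offset:offset + len(local_chunk)] = local_chunk
    return full

# Uneven shards of sizes 3 and 2 recombine into the length-5 tensor:
chunks_info = [{"offset": 0}, {"offset": 3}]
chunks = [[1, 2, 3], [4, 5]]
print(reconstruct_full(chunks_info, chunks, 5))  # [1, 2, 3, 4, 5]
```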
```python
>>> mesh = DeviceMesh("cuda", [0, 1, 2, 3])
>>> # Create a DTensor with uneven sharding
>>> dtensor = DTensor(..., placements=[Shard(0)])
>>> full_tensor = gather_uneven_dtensor_to_full_tensor(dtensor)
```
Docstring example still references the old function name gather_uneven_dtensor_to_full_tensor — should be uneven_dtensor_to_full_tensor.
Suggested change:

```diff
->>> full_tensor = gather_uneven_dtensor_to_full_tensor(dtensor)
+>>> full_tensor = uneven_dtensor_to_full_tensor(dtensor)
```
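As background for why a dedicated gather helper is needed at all: sharding along a dimension can be uneven when the dimension does not divide evenly by the number of ranks. A hedged stand-alone sketch of torch.chunk-style split sizes (an assumption about the splitting semantics, not code from this PR):

```python
import math

# Sketch of torch.chunk-style split sizes: the first ranks take
# ceil(dim / num_chunks) elements each, so the tail ranks can end up
# with smaller (or even empty) shards.
def chunk_sizes(dim_size, num_chunks):
    full = math.ceil(dim_size / num_chunks)
    sizes = []
    remaining = dim_size
    for _ in range(num_chunks):
        take = min(full, remaining)
        sizes.append(take)
        remaining -= take
    return sizes

print(chunk_sizes(10, 4))  # [3, 3, 3, 1] -> uneven, hence the gather helper
```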
```diff
 def is_mcore_tensor_parallel_duplicated(param: torch.Tensor) -> bool:
     """
     Check if the given parameter is Megatron-Core tensor model parallel and duplicated.
     """
-    return getattr(param, "_tp_duplicated", False)
+    return get_mcore_tensor_parallel_partition_dim(param) is None
```
Bug: is_mcore_tensor_parallel_duplicated now returns True for any parameter without _tensor_parallel_mode set (including completely unannotated parameters). Combined with the change in param_and_grad_buffer.py where the guard switched from is_mcore_tensor_model_parallel(param) (per-param check) to using_tensor_parallel(dist_index) (global mesh check), any unannotated parameter will now be wrapped as Replicate() DTensor when TP is enabled.
If _detect_parallelism_type returns None for a module (e.g. a custom user module), its params won't get _tensor_parallel_mode set, but they'll still enter the TP DTensor path and be treated as duplicated. This may be the intended behavior, but the docstring is misleading — this function no longer checks if the param "is tensor model parallel and duplicated"; it returns True for anything that isn't column/row partitioned.
At minimum, consider guarding on hasattr(param, "_tensor_parallel_mode"):
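To make the misclassification concrete, here is a small self-contained illustration with hypothetical stand-ins (`SimpleNamespace` as a fake parameter, a toy `get_partition_dim` that mimics the helper's "None means not column/row partitioned" behavior):

```python
from types import SimpleNamespace

def get_partition_dim(param):
    # None both for explicitly duplicated params and for params
    # that were never annotated with _tensor_parallel_mode at all.
    mode = getattr(param, "_tensor_parallel_mode", None)
    return {"column": 0, "row": 1}.get(mode)

def is_dup_unguarded(param):
    return get_partition_dim(param) is None

def is_dup_guarded(param):
    return hasattr(param, "_tensor_parallel_mode") and get_partition_dim(param) is None

annotated_dup = SimpleNamespace(_tensor_parallel_mode="duplicate")
unannotated = SimpleNamespace()  # e.g. a param from a custom user module

print(is_dup_unguarded(unannotated))  # True  -> misclassified as duplicated
print(is_dup_guarded(unannotated))    # False -> falls through as before
print(is_dup_guarded(annotated_dup))  # True  -> still detected correctly
```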
```python
def is_mcore_tensor_parallel_duplicated(param: torch.Tensor) -> bool:
    """
    Check if the given parameter is Megatron-Core tensor model parallel and duplicated.
    """
    return hasattr(param, "_tensor_parallel_mode") and get_mcore_tensor_parallel_partition_dim(param) is None
```

```python
    else:
        device_type = "cpu"
        device = torch.device("cpu")
        backend = "gloo"

    # Initialize process group if not already initialized
    if not dist.is_initialized():
        dist.init_process_group(
            backend=backend,
            rank=rank,
            world_size=world_size,
        )

    yield {
        "rank": rank,
        "world_size": world_size,
        "local_rank": local_rank,
        "device_type": device_type,
        "device": device,
    }

    # Cleanup
    if dist.is_initialized():
        dist.destroy_process_group()


# ---------- Helper: distributed setup ----------


@pytest.fixture(scope="module")
def distributed_setup():
    """Setup torch.distributed and CUDA device for torchrun + pytest."""
    if "RANK" not in os.environ or "WORLD_SIZE" not in os.environ:
        pytest.skip("Not running under torchrun. Use torchrun to run this test file.")
```
The distributed_setup fixture is defined twice (first at line 31, again here). The second definition silently shadows the first. Remove the duplicate.
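The shadowing here is ordinary Python name rebinding: a second `def` with the same name in one module silently replaces the first, so pytest's fixture collection only ever sees the later definition. A minimal sketch:

```python
# Plain-Python sketch of the shadowing: the second definition rebinds the
# module-level name, so only it remains visible at collection time.
def distributed_setup():
    return "first definition"

def distributed_setup():  # silently shadows the one above
    return "second definition"

print(distributed_setup())  # second definition
```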
What does this PR do?
Contribution process
```mermaid
flowchart LR
    A[Pre-checks] --> B[PR Tests]
    subgraph Code Review/Approval
        C1[Expert Review] --> C2[Final Review]
    end
    B --> C1
    C2 --> D[Merge]
```

Pre-checks
Core 0.8)

Code review
The following process is enforced via the CODEOWNERS file for changes into megatron/core. For changes outside of megatron/core, it is up to the PR author whether or not to tag the Final Reviewer team.

For MRs into `main` branch
Feel free to message or comment @mcore-oncall to help accelerate your merge into main. The less complex your PR is, the faster it will be approved and merged!
(Step 1): Add PR label

Add the Expert Review label when your PR is ready for review.

(Step 2): Collect the expert reviewers' reviews

Final Review might get declined if these requirements are not fulfilled.

(Step 3): Final Review

Add the Final Review label.

(Optional Step 4): Cherry-pick into release branch
If this PR also needs to be merged into core_r* release branches, then after this PR has been merged, select Cherry-pick to open a new PR into the release branch.

For MRs into `dev` branch
The proposed review process for the `dev` branch is under active discussion. MRs are mergeable after one approval by either eharper@nvidia.com or zijiey@nvidia.com.

Merging your PR
Any member of core-adlr and core-nemo will be able to merge your PR.