Draft: compatible with Megatron-FSDP TP #2299

conver334 wants to merge 22 commits into NVIDIA-NeMo:main
Conversation
Signed-off-by: conver334 <conver334@gmail.com>
📝 Walkthrough

Introduces Megatron FSDP (Fully Sharded Data Parallel) integration with Hugging Face models through conversion scripts and supporting infrastructure. Adds round-trip conversion workflows, text generation capabilities, and DTensor-aware weight handling to bridge HF model formats with distributed Megatron FSDP models.

Changes
Sequence Diagram

```mermaid
sequenceDiagram
    participant HF as HuggingFace Model
    participant Bridge as HF-Megatron Bridge
    participant Provider as Megatron Model Provider
    participant FSDP as FSDP Model
    participant Gen as Generation Loop
    participant Tokenizer as Tokenizer
    HF->>Bridge: Load via AutoBridge.from_hf_pretrained()
    Bridge->>Provider: Create model provider & configure TP/CP/EP
    Provider->>FSDP: Initialize distributed Megatron model
    Bridge->>FSDP: Load HF weights into distributed model
    FSDP->>FSDP: Move to CUDA & set eval mode
    HF->>Tokenizer: Load tokenizer from HF model
    loop For each generation step
        Gen->>FSDP: Forward pass via get_forward_backward_func()
        FSDP->>Gen: Return logits (last pipeline stage)
        Gen->>Gen: All-gather outputs across TP world
        Gen->>Gen: Select next token via argmax
        Gen->>Gen: Broadcast token IDs to all ranks
        Gen->>Gen: Update input_ids & position_ids
        Gen->>Tokenizer: Check if EOS token generated
    end
    FSDP-->>HF: Save converted model to HF format
```
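The per-step loop in the diagram (forward pass → argmax → append token → EOS check) can be sketched framework-free. This is a toy illustration only: `toy_logits`, `EOS_ID`, and the vocabulary are made up here as stand-ins for the distributed forward pass, TP all-gather, and rank broadcast that the real script performs.

```python
EOS_ID = 0  # arbitrary EOS id for this toy sketch

def toy_logits(tokens, vocab=8):
    # Stand-in for the distributed forward pass + TP all-gather + broadcast:
    # deterministic fake logits derived from the last token.
    last = tokens[-1]
    return [(last + i) % vocab for i in range(vocab)]

def greedy_generate(prompt_ids, max_new_tokens=4):
    tokens = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = toy_logits(tokens)                                # forward pass
        next_id = max(range(len(logits)), key=logits.__getitem__)  # argmax
        tokens.append(next_id)                                     # update input_ids
        if next_id == EOS_ID:                                      # EOS check
            break
    return tokens

print(greedy_generate([3]))  # → [3, 4, 3, 4, 3]
print(greedy_generate([7]))  # → [7, 0] (stops early on EOS)
```

The real loop differs mainly in that the argmax runs over gathered logits from the last pipeline stage, and the chosen token id is broadcast so every rank appends the same token.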
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested labels
Suggested reviewers
🚥 Pre-merge checks: ✅ 1 passed | ❌ 3 failed (2 warnings, 1 inconclusive)
Actionable comments posted: 10
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/megatron/bridge/models/conversion/param_mapping.py (1)
862-870: ⚠️ Potential issue | 🟡 Minor

Add a defensive comment clarifying that `_module_uses_fsdp(None)` is safe, and explain the None case on non-owning PP ranks.

The FSDP bypass assumption is correct for the primary call path in `stream_weights_megatron_to_hf` (model_bridge.py lines 1007-1012), which ensures DTensor materialization before calling `megatron_to_hf`. The adapter weight paths (peft_bridge.py) use regular tensor weights from adapters, not DTensors, so the FSDP bypass doesn't apply there.

However, `_module_uses_fsdp(megatron_module)` at line 865 lacks documentation: while `hasattr(None, "_parameters")` safely returns `False`, making the function handle None correctly, this behavior should be documented with a comment explaining that on non-owning PP ranks (where `megatron_module` is `None`), the bypass safely falls through to the gather logic. This clarifies the contract for future readers.
🤖 Fix all issues with AI agents
In `@examples/conversion/hf_fsdp_roundtrip.py`:
- Around line 178-180: Remove the redundant explicit process-group teardown:
delete the torch.distributed.is_initialized() check and the call to
torch.distributed.destroy_process_group() at the end of main(), because main()
is already decorated with `@torchrun_main` which handles process group
destruction; locate the cleanup using the symbol
torch.distributed.destroy_process_group (and main()) and remove those two lines
so the process group is only torn down once.
In `@examples/conversion/hf_to_megtron_fsdp_generate_text.py`:
- Around line 252-254: The explicit teardown calling
torch.distributed.destroy_process_group() after checking
torch.distributed.is_initialized() duplicates cleanup performed by the
`@torchrun_main` decorator and can cause double-destroy errors; remove the
explicit torch.distributed.destroy_process_group() call (or guard it behind
logic that detects whether `@torchrun_main` managed the group) so only one
teardown path runs, ensuring the code relies on `@torchrun_main` to manage process
group lifecycle instead of calling destroy_process_group() unconditionally.
- Line 165: The example prompt string assigned to the variable prompt contains a
typo ("reforcement"); update the prompt in hf_to_megtron_fsdp_generate_text.py
by changing the value of prompt from "what is reforcement learning?" to "what is
reinforcement learning?" so the example query uses the correct spelling.
- Line 1: Rename the script file named "hf_to_megtron_fsdp_generate_text.py" to
"hf_to_megatron_fsdp_generate_text.py" (insert the missing "a" in "megatron")
and update any references to that filename across the repo (examples index,
README, CI job, import or invocation sites) so they point to the new correct
name; verify the module is importable under the new name and adjust any script
entrypoints or tests that reference hf_to_megtron_fsdp_generate_text.py.
In `@src/megatron/bridge/models/conversion/model_bridge.py`:
- Around line 815-818: The DTensor branch in the weight conversion loop skips
the subsequent shape-compatibility checks because of the early continue; update
the DTensor handling (where task.param_weight is a DTensor and you use
task.param_weight.megatron_fsdp_slice to index converted_weights and copy into
task.param_weight._local_tensor) to first validate that
sliced_converted_weights.shape matches
task.param_weight._local_tensor.reshape(-1).shape (or expected local shard
shape) and raise/assert with a clear message on mismatch before performing the
copy_. Preserve the existing reshape and copy_ logic but add this defensive
assertion/check immediately after computing sliced_converted_weights and before
calling copy_, removing the silent-risk continue behavior.
- Around line 985-998: The code only sets unwrapped_model_list inside the if
use_megatron_fsdp branch, causing NameError when use_megatron_fsdp is False;
ensure unwrapped_model_list is defined for both branches by setting it to the
original megatron_model when not using FSDP (i.e., after computing
use_megatron_fsdp, if True call unwrap_model(megatron_model) else assign
unwrapped_model_list = megatron_model), then continue to call
build_conversion_tasks and build_adapter_conversion_tasks using
unwrapped_model_list and set unwrapped_model = unwrapped_model_list[0]; update
references to use unwrapped_model_list/unwrapped_model accordingly.
- Around line 848-851: The loop invoking install_optimized_model_weights on
original_megatron_model when use_megatron_fsdp is true should defensively check
for the method before calling it; update the block that iterates over
original_megatron_model (inside the use_megatron_fsdp branch) to either use
hasattr(m.module, "install_optimized_model_weights") before calling or wrap the
call in a try/except AttributeError so models without
install_optimized_model_weights are skipped and do not raise; keep the rest of
the return behavior unchanged.
In `@src/megatron/bridge/models/conversion/param_mapping.py`:
- Around line 2076-2080: The torch.chunk call that splits megatron_weights
(torch.chunk(megatron_weights, self.tp_size, dim=0)) can produce unequal shards
if megatron_weights.size(0) is not divisible by self.tp_size; add a defensive
check before that line (inside the same method where _module_uses_fsdp is used
and gather_from_tp_ranks is called) to assert megatron_weights.size(0) %
self.tp_size == 0 and raise/abort with a clear message including
megatron_weights.size(0) and self.tp_size so the conversion fails loudly instead
of producing unequal shards that corrupt GatedMLPMapping gate/split logic.
In `@tests/functional_tests/converter/test_hf_fsdp_conversion.py`:
- Around line 98-99: The test contains a duplicated call to
model.save_pretrained(model_dir, safe_serialization=True); remove the redundant
second invocation so the model is only saved once. Locate the duplicate calls to
model.save_pretrained in test_hf_fsdp_conversion.py (within the test function
where the model is prepared and saved) and delete the extra line, leaving a
single model.save_pretrained(model_dir, safe_serialization=True) call.
- Around line 145-154: Replace the use of "assert False" with pytest.fail and
add a timeout to the subprocess.run call to avoid hangs; specifically, when
running the conversion subprocess (the cmd variable passed into subprocess.run)
add a timeout argument (e.g., timeout=XXX) and on non-zero return use
pytest.fail(...) instead of assert False, including result.returncode,
result.stdout and result.stderr in the failure message so the test prints useful
diagnostics (locate the subprocess.run call and the subsequent if
result.returncode != 0 block referencing result and cmd).
🧹 Nitpick comments (6)
src/megatron/bridge/models/conversion/model_bridge.py (1)
176-200: `unwrap_model` lacks type hints and has a minimal docstring.

Per coding guidelines, functions should use type hints for arguments and return types, and use Google-style docstrings.

Proposed improvement

```diff
-def unwrap_model(model, module_instances=None):
-    """Unwrap_model to return the final model instance"""
+def unwrap_model(
+    model: nn.Module | list[nn.Module],
+    module_instances: tuple[type, ...] | None = None,
+) -> nn.Module | list[nn.Module]:
+    """Unwrap a model by stripping DDP / FSDP / Float16Module wrappers.
+
+    Args:
+        model: A single module or list of modules to unwrap.
+        module_instances: Tuple of wrapper types to strip. If ``None``,
+            a default set of known Megatron wrappers is used.
+
+    Returns:
+        The innermost module(s) after removing all wrapper layers.
+    """
```

As per coding guidelines: "Use type hints for function arguments and return types" and "Use Google style docstrings."
examples/conversion/hf_fsdp_roundtrip.py (2)
117-117: Triple-quoted strings used as inline comments within a function body.

Lines 117 and 145 use triple-quoted strings (which are evaluated as no-op expressions) instead of `#` comments. This is unconventional and may confuse linters or readers.

Proposed fix

```diff
-    """Export Megatron-FSDP model to HuggingFace format and verify the weights"""
+    # Export Megatron-FSDP model to HuggingFace format and verify the weights
     table = Table(title="Hugging Face Weights Verification")
     ...
-    """Save Megatron-FSDP model to HuggingFace format"""
+    # Save Megatron-FSDP model to HuggingFace format
     if _is_rank_zero():
```

As per coding guidelines: "Reserve comments for code within a function or interfaces that are local to a file."
Also applies to: 145-145
16-24: Usage example in docstring uses `python -m torch.distributed.run` instead of `uv run`.

As per coding guidelines for `examples/**/*.py`: "Use 'uv run' to execute scripts instead of activating a virtual environment and calling 'python' directly."

tests/functional_tests/converter/test_hf_fsdp_conversion.py (2)
127-132: Use `sys.executable` instead of `"python"` for subprocess invocation.

Hardcoding `"python"` may pick up a different interpreter than the test is running under, leading to import errors or environment mismatches.

Proposed fix

```diff
 cmd = [
-    "python",
+    sys.executable,
     "-m",
     "torch.distributed.run",
```
91-95: Silent exception swallowing when saving tokenizer.

The bare `except Exception: pass` hides real errors (e.g., disk full, permission denied). At minimum, log the failure.

Proposed fix

```diff
 try:
     tokenizer = AutoTokenizer.from_pretrained("gpt2")
     tokenizer.save_pretrained(model_dir)
-except Exception:
-    pass
+except Exception as e:
+    print(f"WARNING: Could not save tokenizer: {e}")
```

examples/conversion/hf_to_megtron_fsdp_generate_text.py (1)
83-116: Duplicated utility functions across example scripts.

`_is_rank_zero`, `_maybe_barrier`, `_get_world_size`, and `_configure_model_provider` are identical between this file and `hf_fsdp_roundtrip.py`. Consider extracting them into a shared module under `examples/conversion/` or `megatron.bridge.utils`.
```python
    if torch.distributed.is_initialized():
        torch.distributed.destroy_process_group()
```
🧩 Analysis chain
🏁 Scripts executed:

```shell
# First, examine the file structure around main() and the destroy_process_group call
cat -n examples/conversion/hf_fsdp_roundtrip.py | head -200 | tail -50

# Search for the main function and @torchrun_main decorator
rg -n "@torchrun_main|def main" examples/conversion/hf_fsdp_roundtrip.py

# Find where @torchrun_main is defined
rg -n "def torchrun_main|@torchrun_main" --type=py -B2 -A10

# Read the complete torchrun_main implementation
cat -n src/megatron/bridge/models/decorators/torchrun.py

# Verify by checking if there are any other destroy_process_group calls in the file
rg -n "destroy_process_group" examples/conversion/hf_fsdp_roundtrip.py
```
Remove redundant process group cleanup at lines 179–180.
main() is decorated with @torchrun_main, which automatically destroys the process group on successful completion (see src/megatron/bridge/models/decorators/torchrun.py:38–39). The explicit torch.distributed.destroy_process_group() call here is redundant and will cause a second destruction attempt, resulting in an error or warning.
Proposed fix

```diff
 main(
     hf_model_id=args.hf_model_id,
     output_dir=args.output_dir,
     tp=args.tp,
     cp=args.cp,
     ep=args.ep,
     trust_remote_code=args.trust_remote_code,
 )
-
-if torch.distributed.is_initialized():
-    torch.distributed.destroy_process_group()
```
```diff
@@ -0,0 +1,254 @@
+#!/usr/bin/env python3
```
Filename typo: hf_to_megtron_fsdp_generate_text.py → hf_to_megatron_fsdp_generate_text.py.
Missing "a" in "megatron". This will make the script harder to discover and is inconsistent with naming elsewhere.
🤖 Prompt for AI Agents
In `@examples/conversion/hf_to_megtron_fsdp_generate_text.py` at line 1, Rename
the script file named "hf_to_megtron_fsdp_generate_text.py" to
"hf_to_megatron_fsdp_generate_text.py" (insert the missing "a" in "megatron")
and update any references to that filename across the repo (examples index,
README, CI job, import or invocation sites) so they point to the new correct
name; verify the module is importable under the new name and adjust any script
entrypoints or tests that reference hf_to_megtron_fsdp_generate_text.py.
```python
    if torch.distributed.is_initialized():
        torch.distributed.destroy_process_group()
```
Potential double destruction of the process group.
Same issue as in hf_fsdp_roundtrip.py: @torchrun_main likely already destroys the process group, so the explicit teardown at line 253–254 may cause errors.
🤖 Prompt for AI Agents
In `@examples/conversion/hf_to_megtron_fsdp_generate_text.py` around lines 252 -
254, The explicit teardown calling torch.distributed.destroy_process_group()
after checking torch.distributed.is_initialized() duplicates cleanup performed
by the `@torchrun_main` decorator and can cause double-destroy errors; remove the
explicit torch.distributed.destroy_process_group() call (or guard it behind
logic that detects whether `@torchrun_main` managed the group) so only one
teardown path runs, ensuring the code relies on `@torchrun_main` to manage process
group lifecycle instead of calling destroy_process_group() unconditionally.
```python
if isinstance(task.param_weight, DTensor):
    sliced_converted_weights = converted_weights.reshape(-1)[task.param_weight.megatron_fsdp_slice]
    task.param_weight._local_tensor.reshape(-1).copy_(sliced_converted_weights)
    continue
```
DTensor weight copy skips shape validation.
The continue on line 818 bypasses the shape-compatibility check (lines 820–845) that exists for regular tensors. A shape mismatch between converted_weights and the DTensor's expected local shard would silently corrupt the model. Consider adding a defensive assertion before the copy.
Proposed fix

```diff
 if isinstance(task.param_weight, DTensor):
+    expected_size = task.param_weight._local_tensor.reshape(-1).numel()
     sliced_converted_weights = converted_weights.reshape(-1)[task.param_weight.megatron_fsdp_slice]
+    if sliced_converted_weights.numel() != expected_size:
+        raise ValueError(
+            f"DTensor shape mismatch for {task.mapping.megatron_param}: "
+            f"expected {expected_size} elements, got {sliced_converted_weights.numel()}"
+        )
     task.param_weight._local_tensor.reshape(-1).copy_(sliced_converted_weights)
     continue
```

Based on learnings: "Always validate tensor shapes before copying weights in weight conversion".
```python
if use_megatron_fsdp:
    for m in original_megatron_model:
        m.module.install_optimized_model_weights()
    return original_megatron_model
```
🧩 Analysis chain
🏁 Scripts executed:

```shell
#!/bin/bash
# Find where install_optimized_model_weights is defined
rg -n "def install_optimized_model_weights" --type=py

# Check the context of lines 848-851 in model_bridge.py
cat -n src/megatron/bridge/models/conversion/model_bridge.py | sed -n '840,860p'

# Find what types of objects are in original_megatron_model
rg -B 5 -A 5 "original_megatron_model" src/megatron/bridge/models/conversion/model_bridge.py | head -50

# Search for any hasattr or try-except guards around lines 848-851
rg -B 10 -A 10 "install_optimized_model_weights" src/megatron/bridge/models/conversion/model_bridge.py

# Search for similar method names that might be alternatives
rg -n "optimized.*weight|install.*weight" --type=py | head -20

# Check what FullyShardedDataParallel wraps and what unwrap_model returns
rg -n "def unwrap_model|class FullyShardedDataParallel" --type=py

# Search for any hasattr checks related to model methods
rg -n "hasattr.*module.*install|hasattr.*optimized" --type=py

# Look at the unwrap_model function to understand what it returns
cat -n src/megatron/bridge/models/conversion/model_bridge.py | sed -n '176,210p'

# Search for any model class definitions that might have this method
rg -n "class.*Model.*:" src/megatron/bridge/models/ --type=py | head -20

# Check if install_optimized_model_weights might be expected to be imported
rg -n "import.*install_optimized_model_weights|from.*install_optimized_model_weights" --type=py

# Search for any recent additions or TODOs related to this
rg -n "TODO|FIXME|XXX.*optim|XXX.*weight" src/megatron/bridge/models/conversion/model_bridge.py

# Check if install_optimized_model_weights might be defined in megatron.core
find . -type f -name "*.py" -exec grep -l "def install_optimized_model_weights" {} \;

# Check the broader context of the FSDP handling - see how models are created
rg -B 20 "use_megatron_fsdp = isinstance" src/megatron/bridge/models/conversion/model_bridge.py | head -40

# Check if there's a pattern of hasattr checks elsewhere in the file for optional methods
rg -n "hasattr.*module" src/megatron/bridge/models/conversion/model_bridge.py

# Check if this method might be expected to be in megatron.core or another external library
rg -n "install_optimized_model_weights" . --type=py 2>/dev/null | wc -l

# Look at what type m.module would be - check base model classes
rg -n "class.*GPT.*Model|class.*T5.*Model" src/megatron/bridge/models/ --type=py | head -10

# Check git history to see if this is newly added (last few commits)
git log -10 --oneline src/megatron/bridge/models/conversion/model_bridge.py 2>/dev/null || echo "Git history not available"

# Check the git diff or blame for this line
git log -p -S "install_optimized_model_weights" -- src/megatron/bridge/models/conversion/model_bridge.py 2>/dev/null | head -50 || echo "Git details not available"

# Check the pattern of methods that models are expected to have
rg -n "m\.module\." src/megatron/bridge/models/conversion/model_bridge.py | head -10

# Look at the specific commit to understand context
git show 4ac2961 --stat 2>/dev/null | head -20 || echo "Commit details not available"

# Check if there's a type annotation or comment explaining what m.module should be
cat -n src/megatron/bridge/models/conversion/model_bridge.py | sed -n '845,860p'
```
Add a defensive check before calling install_optimized_model_weights().
The method install_optimized_model_weights() is not defined anywhere in the codebase and is being called unconditionally at line 850. This will raise AttributeError if a model variant doesn't implement this method. Add a hasattr check (consistent with the pattern used elsewhere in this file, e.g., line 794) or wrap in a try-except block to handle models that don't have this method.
```python
if _module_uses_fsdp(megatron_module):
    gathered_shards = torch.chunk(megatron_weights, self.tp_size, dim=0)
else:
    # Gather shards from all TP ranks
    gathered_shards = self.gather_from_tp_ranks(megatron_weights)
```
GatedMLPMapping: torch.chunk(megatron_weights, self.tp_size, dim=0) relies on the full tensor being exactly tp_size-chunked.
If the full FSDP-materialized tensor's dim 0 is not evenly divisible by tp_size, torch.chunk will produce unequal splits, which would silently corrupt the gate/up splitting downstream. Consider adding a divisibility assertion.
Proposed defensive check

```diff
 if _module_uses_fsdp(megatron_module):
+    assert megatron_weights.shape[0] % self.tp_size == 0, (
+        f"FSDP full tensor dim 0 ({megatron_weights.shape[0]}) must be divisible by tp_size ({self.tp_size})"
+    )
     gathered_shards = torch.chunk(megatron_weights, self.tp_size, dim=0)
```

Based on learnings: "Always validate tensor shapes before copying weights in weight conversion".
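To see why the assertion matters without pulling in torch itself, the sizing rule `torch.chunk` documents (equal chunks of `ceil(n / num_chunks)`, with a smaller or missing trailing chunk when `n` is not divisible) can be mirrored in plain Python; `chunk_sizes` below is a stand-in written for this sketch, not library code:

```python
import math

def chunk_sizes(n: int, num_chunks: int) -> list[int]:
    # mirrors torch.chunk's sizing rule along one dimension: every chunk
    # has size ceil(n / num_chunks) except a smaller trailing remainder
    size = math.ceil(n / num_chunks)
    sizes = []
    remaining = n
    while remaining > 0:
        sizes.append(min(size, remaining))
        remaining -= sizes[-1]
    return sizes

print(chunk_sizes(12, 4))  # [3, 3, 3, 3] — even split, safe for TP shards
print(chunk_sizes(10, 4))  # [3, 3, 3, 1] — unequal shards the assertion would reject
```

An uneven split like `[3, 3, 3, 1]` silently breaks any downstream logic that assumes every TP rank holds an equal slice, which is exactly the gate/up corruption the comment describes.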
🤖 Prompt for AI Agents
In `@src/megatron/bridge/models/conversion/param_mapping.py` around lines 2076 -
2080, The torch.chunk call that splits megatron_weights
(torch.chunk(megatron_weights, self.tp_size, dim=0)) can produce unequal shards
if megatron_weights.size(0) is not divisible by self.tp_size; add a defensive
check before that line (inside the same method where _module_uses_fsdp is used
and gather_from_tp_ranks is called) to assert megatron_weights.size(0) %
self.tp_size == 0 and raise/abort with a clear message including
megatron_weights.size(0) and self.tp_size so the conversion fails loudly instead
of producing unequal shards that corrupt GatedMLPMapping gate/split logic.
```python
model.save_pretrained(model_dir, safe_serialization=True)
model.save_pretrained(model_dir, safe_serialization=True)
```
Duplicate model.save_pretrained call.
model.save_pretrained(model_dir, safe_serialization=True) is called twice in succession. This appears to be a copy-paste error — the second call is redundant.
Proposed fix

```diff
 model.save_pretrained(model_dir, safe_serialization=True)
-model.save_pretrained(model_dir, safe_serialization=True)
```
```python
try:
    result = subprocess.run(
        cmd, capture_output=True, text=True, cwd=Path(__file__).parent.parent.parent.parent
    )

    # Check that the conversion completed successfully
    if result.returncode != 0:
        print(f"STDOUT: {result.stdout}")
        print(f"STDERR: {result.stderr}")
        assert False, f"FSDP Roundtrip failed with return code {result.returncode}"
```
Use pytest.fail() instead of assert False and add a timeout.
assert False is stripped under python -O (as flagged by static analysis B011). Also, the subprocess has no timeout, so a hung conversion will hang CI indefinitely.
Proposed fix

```diff
 try:
     result = subprocess.run(
-        cmd, capture_output=True, text=True, cwd=Path(__file__).parent.parent.parent.parent
+        cmd, capture_output=True, text=True, timeout=600,
+        cwd=Path(__file__).parent.parent.parent.parent,
     )
     # Check that the conversion completed successfully
     if result.returncode != 0:
         print(f"STDOUT: {result.stdout}")
         print(f"STDERR: {result.stderr}")
-        assert False, f"FSDP Roundtrip failed with return code {result.returncode}"
+        pytest.fail(f"FSDP Roundtrip failed with return code {result.returncode}")
```

🧰 Tools

🪛 Ruff (0.14.14)

[error] 146-146: subprocess call: check for execution of untrusted input (S603)

[warning] 154-154: Do not assert False (python -O removes these calls), raise AssertionError() — replace assert False (B011)
🤖 Prompt for AI Agents
In `@tests/functional_tests/converter/test_hf_fsdp_conversion.py` around lines 145
- 154, Replace the use of "assert False" with pytest.fail and add a timeout to
the subprocess.run call to avoid hangs; specifically, when running the
conversion subprocess (the cmd variable passed into subprocess.run) add a
timeout argument (e.g., timeout=XXX) and on non-zero return use pytest.fail(...)
instead of assert False, including result.returncode, result.stdout and
result.stderr in the failure message so the test prints useful diagnostics
(locate the subprocess.run call and the subsequent if result.returncode != 0
block referencing result and cmd).
What does this PR do ?
In #1910, converting models between HuggingFace and Megatron-FSDP formats is not supported when TP (Tensor Parallel) is enabled.
The issue was caused by incorrect detection of TP mode for MCore model parameters in M-FSDP under certain conditions.
PR 3161, PR 3191, and PR 3287 will fix this problem. After updating, the interface `gather_uneven_dtensor_to_full_tensor` will be renamed to `uneven_dtensor_to_full_tensor`, and its return type will change from `DTensor` to `Tensor`. This PR updates the interface usage accordingly to stay compatible with those changes.

Changelog
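One generic way to stay usable across such a rename is a getattr fallback that prefers the new name. The snippet below demonstrates the pattern with a stand-in namespace, since the actual import path for the helper is not spelled out in this PR:

```python
from types import SimpleNamespace

# Stand-in for the module that exports the gather helper; pre-update,
# only the old name gather_uneven_dtensor_to_full_tensor exists.
fsdp_utils = SimpleNamespace(
    gather_uneven_dtensor_to_full_tensor=lambda dtensor: ("full", dtensor)
)

# Prefer the post-rename name, fall back to the pre-rename one.
uneven_dtensor_to_full_tensor = getattr(
    fsdp_utils,
    "uneven_dtensor_to_full_tensor",
    getattr(fsdp_utils, "gather_uneven_dtensor_to_full_tensor", None),
)

print(uneven_dtensor_to_full_tensor("shard"))  # → ('full', 'shard')
```

Note the return-type change (`DTensor` → `Tensor`) still has to be handled at the call sites, as this PR does; an alias only bridges the name.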
GitHub Actions CI
See the CI section in the Contributing doc for how to trigger the CI. An NVIDIA developer will need to approve and trigger the CI for external contributors.
Before your PR is "Ready for review"
Pre checks:
If you haven't finished some of the above items, you can still open a "Draft" PR.
Additional Information
cc: @ISEEKYAN @shjwudp
Summary by CodeRabbit
Release Notes
New Features
Tests