
Conversation

@HollowMan6 commented Nov 21, 2025

PR type

  • Bug Fix
  • New Feature
  • Document Updates
  • More Models or Datasets Support

PR information

offload_bridge is now a supported option that stores the Megatron-exported HF-format weights used for vLLM weight updates in CPU main memory, reducing GPU memory usage. The default is False.
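
For illustration, a minimal sketch of what enabling the option is meant to achieve (dummy tensors and names, not the actual bridge code):

    import torch

    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    # Stand-in for one entry of the Megatron-exported HF-format state dict.
    hf_weights = {'model.layers.0.self_attn.q_proj.weight': torch.randn(8, 8, device=device)}

    # With offload_bridge=True, the exported copies live in CPU main memory, so
    # they do not compete with model states and activations for GPU memory.
    exported = {name: w.detach().to('cpu', copy=True) for name, w in hf_weights.items()}
    assert all(w.device.type == 'cpu' for w in exported.values())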

Experiment results

Paste your experiment result here (if needed).

✨ Presented to you with Mind Lab - A Lab for Experiential Intelligence.

@gemini-code-assist (Contributor) commented

Summary of Changes

Hello @HollowMan6, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request focuses on improving memory management within the GPTBridge code to prevent OOM errors. It introduces CPU cloning for weights and adds explicit garbage collection to free up GPU memory during LoRA merging and weight manipulation, enhancing the stability and efficiency of these operations.

Highlights

  • OOM Prevention: Addresses potential Out-of-Memory (OOM) errors by cloning weights to the CPU during GPTBridge operations.
  • Memory Management: Adds torch.cuda.empty_cache() calls to free up GPU memory during LoRA merge and weight setting/getting processes.
  • CPU Cloning: Introduces a _cpu_clone method to move tensors to the CPU, ensuring weights are cloned on the CPU to reduce GPU memory usage (a minimal sketch of such a helper is shown below).
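
For illustration only, here is a minimal sketch of what such a CPU-cloning helper might look like; the actual _cpu_clone in gpt_bridge.py may differ in signature and edge-case handling.

    import torch

    def _cpu_clone(tensor: torch.Tensor) -> torch.Tensor:
        """Return an independent copy of `tensor` stored in CPU main memory."""
        # detach() drops any autograd history; copy=True forces a real copy even
        # when the source already lives on the CPU, matching clone() semantics.
        return tensor.detach().to('cpu', copy=True)

Once the original GPU tensor is released, the CPU copy no longer ties up GPU memory, which is the point of routing the bridge's exported weights through a helper like this.
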
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

  • Code Review (/gemini review): Performs a code review for the current pull request in its current state.
  • Pull Request Summary (/gemini summary): Provides a summary of the current pull request in its current state.
  • Comment (@gemini-code-assist): Responds in comments when explicitly tagged, both in pull request comments and review comments.
  • Help (/gemini help): Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@gemini-code-assist (bot) left a comment

Code Review

This pull request aims to reduce GPU memory usage and prevent out-of-memory (OOM) errors by cloning weights to the CPU and strategically clearing the CUDA cache. The introduction of the _cpu_clone method is a good approach to manage memory for tensor copies. The widespread replacement of .clone() with _cpu_clone() and the addition of torch.cuda.empty_cache() are consistent with this goal. My review includes a fix for a potential memory leak and a suggestion to improve performance by adjusting the placement of a torch.cuda.empty_cache() call.

@HollowMan6 force-pushed the cpu_bridge branch 2 times, most recently from 063e48c to 8cafd68 on November 21, 2025 at 22:24
@gemini-code-assist (bot) left a comment

Code Review

This pull request aims to fix Out-Of-Memory (OOM) errors by moving cloned weights to the CPU and explicitly clearing the CUDA cache. The introduction of the _cpu_clone method is a solid approach to reduce GPU memory pressure. However, the placement of the torch.cuda.empty_cache() calls is sometimes excessive, inconsistent, or redundant, which can impact performance and maintainability. I've provided specific suggestions to refactor these calls for better code quality and to fix a potential memory leak. Additionally, there's a minor point of clarification regarding the use of non_blocking=True in _cpu_clone. Overall, the changes are beneficial for memory management.

@gemini-code-assist (bot) left a comment

Code Review

This pull request introduces several good optimizations to reduce GPU memory usage during model weight conversion and LoRA merging. The main strategies are moving cloned tensors to CPU memory using a new _cpu_clone helper function and adding torch.cuda.empty_cache() calls after deleting large temporary tensors. These changes should effectively mitigate out-of-memory errors. The implementation is solid, but I've found one minor inconsistency where an additional torch.cuda.empty_cache() call could be added for better memory management.

@gemini-code-assist (bot) left a comment

Code Review

This pull request addresses potential out-of-memory (OOM) errors by cloning weights to the CPU instead of the GPU, and by adding explicit calls to torch.cuda.empty_cache() to free GPU memory sooner. The introduction of the _cpu_clone helper function is a good approach and it's applied consistently. The changes in swift/megatron/tuners/lora.py to clean up memory during LoRA merge/unmerge are also well-aligned with the PR's goal. I have a few suggestions to improve consistency and readability regarding the placement of torch.cuda.empty_cache() and memory cleanup calls.

@gemini-code-assist (bot) left a comment

Code Review

This pull request effectively addresses potential out-of-memory (OOM) errors during model weight conversion and LoRA merging. The changes are well-targeted and consist of two main strategies:

  1. A new _cpu_clone helper method is introduced in gpt_bridge.py to ensure that temporary weight copies are created on the CPU instead of the GPU. This is a crucial change to reduce GPU memory pressure.
  2. Several calls to torch.cuda.empty_cache() are added in both gpt_bridge.py and lora.py to proactively free unused cached GPU memory after large tensor operations.

The implementation is solid, and these changes should significantly improve the robustness of memory-intensive operations. I have one suggestion to simplify the _cpu_clone method for better maintainability.
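
As a rough illustration of the second strategy (not the exact code added in this PR), releasing a large temporary tensor and returning its cached blocks looks like:

    import gc
    import torch

    if torch.cuda.is_available():
        merged = torch.randn(4096, 4096, device='cuda')  # stand-in for a large temporary
        # ... use `merged` ...
        del merged                # drop the last reference to the temporary
        gc.collect()              # reclaim any lingering reference cycles
        torch.cuda.empty_cache()  # return now-unused cached blocks to the driver

Note that empty_cache() only helps once no live tensor still owns the memory, which is why deleting the temporary (and, where cycles are possible, collecting garbage) comes first.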

@Jintao-Huang (Collaborator) commented

Hello 😊

What situations would cause OOM? Could you provide a script?

@HollowMan6 (Author) commented Nov 22, 2025

Hi @Jintao-Huang! We tried to fine-tune DeepSeek V3 with Megatron GRPO LoRA, and the OOM happened when syncing weights between vLLM and Megatron. In that situation it's better to create those clones on the CPU to mitigate the OOM issue, given the size of DeepSeek V3, but I think it would still be beneficial even for smaller models running on GPUs with limited memory.

The script we use is modified based on: https://github.com/modelscope/ms-swift/blob/main/examples/megatron/grpo/moe_colocate_lora.sh

@hjh0119 (Collaborator) commented Nov 24, 2025

Thank you for your contribution!

Could we add a parameter to control whether to export the weights to CPU or GPU? By default, exporting to GPU would be preferable, as exporting to CPU may slow down the process.

@gemini-code-assist (bot) left a comment

Code Review

This pull request introduces a new --offload_bridge flag to offload cloned weights to CPU, effectively mitigating Out-of-Memory (OOM) errors during weight conversion. The changes also include strategically placed torch.cuda.empty_cache() and gc.collect() calls to more aggressively manage GPU memory. The implementation is solid and the documentation updates are comprehensive. My review includes a few suggestions to further enhance memory management consistency by pairing gc.collect() with torch.cuda.empty_cache() and ensuring temporary tensors are explicitly deleted.

@HollowMan6 (Author) commented

Thank you for your feedback @hjh0119 ! I've just added offload_bridge for controlling this and updated the documentation. Feel free to directly modify this PR if you have other preferences!

@Jintao-Huang (Collaborator) commented

The export_weights function has a target_device parameter; setting it to cpu enables offloading.

def export_weights(self,
                   mg_models,
                   target_device=None,
                   only_last_rank: bool = False,
                   is_peft_format: bool = False,
                   tqdm_desc: str = 'Exporting: '):
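
A hedged usage sketch based on this signature (here `bridge` and `mg_models` are placeholders for the GPTBridge instance and the Megatron model chunks; what the method returns is not shown above):

    # Pass target_device='cpu' so the exported HF-format copies are created in
    # host memory rather than on the GPU.
    bridge.export_weights(mg_models, target_device='cpu')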

@HollowMan6 (Author) commented

Ah, I thought it was used for other purposes. Thanks for pointing this out, @Jintao-Huang! I will test this to see if it works as expected. Is there an existing argument that controls the target device for exporting weights?

@HollowMan6 changed the title from "[megatron] fix: make bridge exported cloned weights store on CPU" to "[megatron] feat: add arg to offload bridged weights to CPU" on Nov 24, 2025
@gemini-code-assist (bot) left a comment

Code Review

This pull request introduces a new argument --offload_bridge to offload bridged weights to the CPU, which is a useful feature for preventing out-of-memory errors on the GPU during weight synchronization with vLLM. The implementation is correct, adding the argument and using it to move tensors to the CPU before they are cloned. The documentation and example scripts have been updated accordingly. I have one minor suggestion to improve the wording in one of the English documentation files for better clarity.

@Jintao-Huang (Collaborator) commented Nov 24, 2025

[Screenshot: 2025-11-24 20:01:30]

https://docs.pytorch.org/tutorials/intermediate/pinmem_nonblock.html

Hello 😊. Using non_blocking is generally not recommended when transferring to CPU.
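
For context, a small self-contained example of the caveat described in that tutorial (illustrative only, not code from this PR):

    import torch

    if torch.cuda.is_available():
        x = torch.randn(1024, 1024, device='cuda')

        # Blocking transfer: when this returns, `y` is fully populated on the CPU.
        y = x.to('cpu')

        # non_blocking=True mainly benefits host-to-device copies from pinned
        # memory; for device-to-host copies it may return before the data has
        # landed, so reading `y_async` right away is unsafe without a sync.
        y_async = x.to('cpu', non_blocking=True)
        torch.cuda.synchronize()  # required before `y_async` can be trusted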

Now `offload_bridge` is a supported option to store Megatron
exported HF format weights for vLLM updates in CPU main
memory to reduce GPU memory usage. Default is False.

Signed-off-by: Hollow Man <[email protected]>
@HollowMan6 (Author) commented

Thanks again @Jintao-Huang! I just updated the code and switched offload_bridge to use the target_device param. I've also removed non_blocking. Please let me know if you have further suggestions! I'll test this out once I have enough resources here.

@hjh0119 (Collaborator) commented Nov 24, 2025

LGTM

@HollowMan6 (Author) commented

Tested the new change and it works well on my side!
