
Conversation

@HollowMan6 commented Nov 21, 2025

PR type

  • Bug Fix
  • New Feature
  • Document Updates
  • More Models or Datasets Support

PR information

offload_bridge is now a supported option that stores the Megatron-exported HF-format weights used for vLLM weight updates in CPU main memory, reducing GPU memory usage. The default is False.
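
For illustration, a minimal sketch of what enabling the option is meant to achieve (dummy tensors and names, not the actual bridge code):

    import torch

    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    # Stand-in for one entry of the Megatron-exported HF-format state dict.
    hf_weights = {'model.layers.0.self_attn.q_proj.weight': torch.randn(8, 8, device=device)}

    # With offload_bridge=True, the exported copies live in CPU main memory, so
    # they do not compete with model states and activations for GPU memory.
    exported = {name: w.detach().to('cpu', copy=True) for name, w in hf_weights.items()}
    assert all(w.device.type == 'cpu' for w in exported.values())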

Experiment results

Paste your experiment result here (if needed).

✨ Presented to you with Mind Lab - A Lab for Experiential Intelligence.

@gemini-code-assist (Contributor) commented

Summary of Changes

Hello @HollowMan6, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request focuses on improving memory management within the GPTBridge code to prevent OOM errors. It introduces CPU cloning for weights and adds explicit garbage collection to free up GPU memory during LoRA merging and weight manipulation, enhancing the stability and efficiency of these operations.

Highlights

  • OOM Prevention: Addresses potential Out-of-Memory (OOM) errors by cloning weights to the CPU during GPTBridge operations.
  • Memory Management: Adds torch.cuda.empty_cache() calls to free up GPU memory during LoRA merge and weight setting/getting processes.
  • CPU Cloning: Introduces a _cpu_clone method to move tensors to the CPU, ensuring weights are cloned on the CPU to reduce GPU memory usage (a minimal sketch of such a helper is shown below).
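
For illustration only, here is a minimal sketch of what such a CPU-cloning helper might look like; the actual _cpu_clone in gpt_bridge.py may differ in signature and edge-case handling.

    import torch

    def _cpu_clone(tensor: torch.Tensor) -> torch.Tensor:
        """Return an independent copy of `tensor` stored in CPU main memory."""
        # detach() drops any autograd history; copy=True forces a real copy even
        # when the source already lives on the CPU, matching clone() semantics.
        return tensor.detach().to('cpu', copy=True)

Once the original GPU tensor is released, the CPU copy no longer ties up GPU memory, which is the point of routing the bridge's exported weights through a helper like this.
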
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

  • Code Review (/gemini review): Performs a code review for the current pull request in its current state.
  • Pull Request Summary (/gemini summary): Provides a summary of the current pull request in its current state.
  • Comment (@gemini-code-assist): Responds in comments when explicitly tagged, both in pull request comments and review comments.
  • Help (/gemini help): Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@gemini-code-assist (bot) left a comment

Code Review

This pull request aims to reduce GPU memory usage and prevent out-of-memory (OOM) errors by cloning weights to the CPU and strategically clearing the CUDA cache. The introduction of the _cpu_clone method is a good approach to manage memory for tensor copies. The widespread replacement of .clone() with _cpu_clone() and the addition of torch.cuda.empty_cache() are consistent with this goal. My review includes a fix for a potential memory leak and a suggestion to improve performance by adjusting the placement of a torch.cuda.empty_cache() call.

@HollowMan6 force-pushed the cpu_bridge branch 2 times, most recently from 063e48c to 8cafd68 on November 21, 2025 at 22:24
@gemini-code-assist (bot) left a comment

Code Review

This pull request aims to fix Out-Of-Memory (OOM) errors by moving cloned weights to the CPU and explicitly clearing the CUDA cache. The introduction of the _cpu_clone method is a solid approach to reduce GPU memory pressure. However, the placement of the torch.cuda.empty_cache() calls is sometimes excessive, inconsistent, or redundant, which can impact performance and maintainability. I've provided specific suggestions to refactor these calls for better code quality and to fix a potential memory leak. Additionally, there's a minor point of clarification regarding the use of non_blocking=True in _cpu_clone. Overall, the changes are beneficial for memory management.

@gemini-code-assist (bot) left a comment

Code Review

This pull request introduces several good optimizations to reduce GPU memory usage during model weight conversion and LoRA merging. The main strategies are moving cloned tensors to CPU memory using a new _cpu_clone helper function and adding torch.cuda.empty_cache() calls after deleting large temporary tensors. These changes should effectively mitigate out-of-memory errors. The implementation is solid, but I've found one minor inconsistency where an additional torch.cuda.empty_cache() call could be added for better memory management.

@gemini-code-assist (bot) left a comment

Code Review

This pull request addresses potential out-of-memory (OOM) errors by cloning weights to the CPU instead of the GPU, and by adding explicit calls to torch.cuda.empty_cache() to free GPU memory sooner. The introduction of the _cpu_clone helper function is a good approach and it's applied consistently. The changes in swift/megatron/tuners/lora.py to clean up memory during LoRA merge/unmerge are also well-aligned with the PR's goal. I have a few suggestions to improve consistency and readability regarding the placement of torch.cuda.empty_cache() and memory cleanup calls.

@gemini-code-assist (bot) left a comment

Code Review

This pull request effectively addresses potential out-of-memory (OOM) errors during model weight conversion and LoRA merging. The changes are well-targeted and consist of two main strategies:

  1. A new _cpu_clone helper method is introduced in gpt_bridge.py to ensure that temporary weight copies are created on the CPU instead of the GPU. This is a crucial change to reduce GPU memory pressure.
  2. Several calls to torch.cuda.empty_cache() are added in both gpt_bridge.py and lora.py to proactively free unused cached GPU memory after large tensor operations.

The implementation is solid, and these changes should significantly improve the robustness of memory-intensive operations. I have one suggestion to simplify the _cpu_clone method for better maintainability.
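
As a rough illustration of the second strategy (not the exact code added in this PR), releasing a large temporary tensor and returning its cached blocks looks like:

    import gc
    import torch

    if torch.cuda.is_available():
        merged = torch.randn(4096, 4096, device='cuda')  # stand-in for a large temporary
        # ... use `merged` ...
        del merged                # drop the last reference to the temporary
        gc.collect()              # reclaim any lingering reference cycles
        torch.cuda.empty_cache()  # return now-unused cached blocks to the driver

Note that empty_cache() only helps once no live tensor still owns the memory, which is why deleting the temporary (and, where cycles are possible, collecting garbage) comes first.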

@Jintao-Huang (Collaborator) commented

Hello 😊

What situations would cause OOM? Could you provide a script?

@HollowMan6 (Author) commented Nov 22, 2025

Hi @Jintao-Huang! We tried to fine-tune DeepSeek V3 with Megatron GRPO LoRA, and the OOM happened when syncing weights between vLLM and Megatron. In that situation it's better to create those clones on the CPU to mitigate the OOM issue, given the size of DeepSeek V3, but I think it would still be beneficial even for smaller models running on GPUs with limited memory.

The script we use is modified based on: https://github.com/modelscope/ms-swift/blob/main/examples/megatron/grpo/moe_colocate_lora.sh

@hjh0119 (Collaborator) commented Nov 24, 2025

Thank you for your contribution!

Could we add a parameter to control whether to export the weights to CPU or GPU? By default, exporting to GPU would be preferable, as exporting to CPU may slow down the process.

@gemini-code-assist (bot) left a comment

Code Review

This pull request introduces a new --offload_bridge flag to offload cloned weights to CPU, effectively mitigating Out-of-Memory (OOM) errors during weight conversion. The changes also include strategically placed torch.cuda.empty_cache() and gc.collect() calls to more aggressively manage GPU memory. The implementation is solid and the documentation updates are comprehensive. My review includes a few suggestions to further enhance memory management consistency by pairing gc.collect() with torch.cuda.empty_cache() and ensuring temporary tensors are explicitly deleted.

@HollowMan6 (Author) commented

Thank you for your feedback @hjh0119 ! I've just added offload_bridge for controlling this and updated the documentation. Feel free to directly modify this PR if you have other preferences!

@Jintao-Huang (Collaborator) commented

The export_weights function has a target_device parameter; setting it to cpu enables offloading.

def export_weights(self,
                   mg_models,
                   target_device=None,
                   only_last_rank: bool = False,
                   is_peft_format: bool = False,
                   tqdm_desc: str = 'Exporting: '):
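
A hedged usage sketch based on this signature (here `bridge` and `mg_models` are placeholders for the GPTBridge instance and the Megatron model chunks; what the method returns is not shown above):

    # Pass target_device='cpu' so the exported HF-format copies are created in
    # host memory rather than on the GPU.
    bridge.export_weights(mg_models, target_device='cpu')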

@HollowMan6 (Author) commented

Ah, I thought it was used for other purposes. Thanks for pointing this out, @Jintao-Huang! I will test this to see if it works as expected. Is there an existing argument that controls the target device for exporting weights?

@HollowMan6 changed the title from "[megatron] fix: make bridge exported cloned weights store on CPU" to "[megatron] feat: add arg to offload bridged weights to CPU" on Nov 24, 2025
@gemini-code-assist (bot) left a comment

Code Review

This pull request introduces a new argument --offload_bridge to offload bridged weights to the CPU, which is a useful feature for preventing out-of-memory errors on the GPU during weight synchronization with vLLM. The implementation is correct, adding the argument and using it to move tensors to the CPU before they are cloned. The documentation and example scripts have been updated accordingly. I have one minor suggestion to improve the wording in one of the English documentation files for better clarity.

@Jintao-Huang (Collaborator) commented Nov 24, 2025

[Screenshot: 2025-11-24 20:01:30]

https://docs.pytorch.org/tutorials/intermediate/pinmem_nonblock.html

Hello 😊. Using non_blocking is generally not recommended when transferring to CPU.
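
For context, a small self-contained example of the caveat described in that tutorial (illustrative only, not code from this PR):

    import torch

    if torch.cuda.is_available():
        x = torch.randn(1024, 1024, device='cuda')

        # Blocking transfer: when this returns, `y` is fully populated on the CPU.
        y = x.to('cpu')

        # non_blocking=True mainly benefits host-to-device copies from pinned
        # memory; for device-to-host copies it may return before the data has
        # landed, so reading `y_async` right away is unsafe without a sync.
        y_async = x.to('cpu', non_blocking=True)
        torch.cuda.synchronize()  # required before `y_async` can be trusted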

Now `offload_bridge` is a supported option to store Megatron
exported HF format weights for vLLM updates in CPU main
memory to reduce GPU memory usage. Default is False.

Signed-off-by: Hollow Man <[email protected]>
@HollowMan6 (Author) commented

Thanks again @Jintao-Huang! I just updated the code and switched offload_bridge to use the target_device param. I've also removed non_blocking. Please let me know if you have further suggestions! I'll test this out once I have enough resources here.

@hjh0119 (Collaborator) commented Nov 24, 2025

LGTM

@HollowMan6 (Author) commented

Tested the new change and it works well on my side!
