[megatron] feat: add arg to offload bridged weights to CPU #6714
base: main
Conversation
Summary of Changes
Hello @HollowMan6, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request focuses on improving memory management within the GPTBridge code to prevent OOM errors. It introduces CPU cloning for weights and adds explicit garbage collection to free up GPU memory during LoRA merging and weight manipulation, enhancing the stability and efficiency of the model.
Code Review
This pull request aims to reduce GPU memory usage and prevent out-of-memory (OOM) errors by cloning weights to the CPU and strategically clearing the CUDA cache. The introduction of the _cpu_clone method is a good approach to manage memory for tensor copies. The widespread replacement of .clone() with _cpu_clone() and the addition of torch.cuda.empty_cache() are consistent with this goal. My review includes a fix for a potential memory leak and a suggestion to improve performance by adjusting the placement of a torch.cuda.empty_cache() call.
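For readers following the discussion, here is a minimal sketch of what a CPU-clone helper of this kind could look like; the actual `_cpu_clone` in `gpt_bridge.py` may differ in signature and device handling.

```python
import torch

# Illustrative sketch only; the real _cpu_clone in gpt_bridge.py may differ.
# The goal is that temporary weight copies land in host memory instead of
# consuming additional GPU memory.
def _cpu_clone(tensor: torch.Tensor) -> torch.Tensor:
    """Return a detached copy of `tensor` stored in CPU memory."""
    if tensor.is_cuda:
        # .to(..., copy=True) always materializes a new host-side tensor.
        return tensor.detach().to("cpu", copy=True)
    return tensor.detach().clone()
```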
Force-pushed 063e48c to 8cafd68
Code Review
This pull request aims to fix Out-Of-Memory (OOM) errors by moving cloned weights to the CPU and explicitly clearing the CUDA cache. The introduction of the _cpu_clone method is a solid approach to reduce GPU memory pressure. However, the placement of torch.cuda.empty_cache() calls is sometimes excessive, inconsistent, or redundant, which can impact performance and maintainability. I've provided specific suggestions to refactor these calls for better code quality and to fix a potential memory leak. Additionally, there's a minor point of clarification regarding the use of non_blocking=True in _cpu_clone. Overall, the changes are beneficial for memory management.
Force-pushed 8cafd68 to 9ae6302
Code Review
This pull request introduces several good optimizations to reduce GPU memory usage during model weight conversion and LoRA merging. The main strategies are moving cloned tensors to CPU memory using a new _cpu_clone helper function and adding torch.cuda.empty_cache() calls after deleting large temporary tensors. These changes should effectively mitigate out-of-memory errors. The implementation is solid, but I've found one minor inconsistency where an additional torch.cuda.empty_cache() call could be added for better memory management.
Force-pushed 9ae6302 to def3502
Code Review
This pull request addresses potential out-of-memory (OOM) errors by cloning weights to the CPU instead of the GPU, and by adding explicit calls to torch.cuda.empty_cache() to free GPU memory sooner. The introduction of the _cpu_clone helper function is a good approach and it's applied consistently. The changes in swift/megatron/tuners/lora.py to clean up memory during LoRA merge/unmerge are also well-aligned with the PR's goal. I have a few suggestions to improve consistency and readability regarding the placement of torch.cuda.empty_cache() and memory cleanup calls.
Force-pushed def3502 to 0b15715
Code Review
This pull request effectively addresses potential out-of-memory (OOM) errors during model weight conversion and LoRA merging. The changes are well-targeted and consist of two main strategies:
- A new `_cpu_clone` helper method is introduced in `gpt_bridge.py` to ensure that temporary weight copies are created on the CPU instead of the GPU. This is a crucial change to reduce GPU memory pressure.
- Several calls to `torch.cuda.empty_cache()` are added in both `gpt_bridge.py` and `lora.py` to proactively free unused cached GPU memory after large tensor operations.

The implementation is solid, and these changes should significantly improve the robustness of memory-intensive operations. I have one suggestion to simplify the `_cpu_clone` method for better maintainability.
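As an illustration of the cleanup pattern behind the second strategy, the sketch below shows how a large temporary produced during a LoRA merge might be released promptly; the function and variable names are placeholders, not the actual code in `lora.py`.

```python
import torch

# Hypothetical sketch of the del + empty_cache pattern; names are placeholders.
def merge_lora_delta(base_weight: torch.Tensor,
                     lora_a: torch.Tensor,
                     lora_b: torch.Tensor,
                     scaling: float) -> None:
    delta = (lora_b @ lora_a) * scaling           # large temporary on the GPU
    base_weight.data += delta.to(base_weight.dtype)
    del delta                                     # drop the only reference
    torch.cuda.empty_cache()                      # release the cached block back to the device
```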
Force-pushed 0b15715 to 6f4a17f
Hello 😊 What situations would cause OOM? Could you provide a script?
Hi @Jintao-Huang! We tried to fine-tune DeepSeek V3 with Megatron GRPO LoRA, and the OOM happened when weights were synced between vLLM and Megatron. In that situation it's better to make those clones on the CPU to mitigate the OOM issue, given the size of DeepSeek V3, but I think it would still be beneficial even for smaller models running on GPUs with limited memory. The script we use is modified from: https://github.com/modelscope/ms-swift/blob/main/examples/megatron/grpo/moe_colocate_lora.sh
Force-pushed 6f4a17f to 9db6e1d
Thank you for your contribution! Could we add a parameter to control whether to export the weights to CPU or GPU? By default, exporting to GPU would be preferable, as exporting to CPU may slow down the process.
Code Review
This pull request introduces a new --offload_bridge flag to offload cloned weights to CPU, effectively mitigating Out-of-Memory (OOM) errors during weight conversion. The changes also include strategically placed torch.cuda.empty_cache() and gc.collect() calls to more aggressively manage GPU memory. The implementation is solid and the documentation updates are comprehensive. My review includes a few suggestions to further enhance memory management consistency by pairing gc.collect() with torch.cuda.empty_cache() and ensuring temporary tensors are explicitly deleted.
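For illustration, the suggested pairing might look like the following at a call site; the tensor name is a placeholder rather than a variable from the actual code.

```python
import gc

import torch

# Placeholder standing in for a bridged HF-format weight on the GPU.
hf_weight = torch.randn(8192, 8192, device="cuda")

# ... hand hf_weight to its consumer (e.g. the inference engine) ...

del hf_weight             # drop the Python reference first
gc.collect()              # reclaim any lingering unreachable objects
torch.cuda.empty_cache()  # then release cached CUDA blocks back to the driver
```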
Thank you for your feedback @hjh0119! I've just added the `--offload_bridge` argument.
The export_weights function has a target_device parameter; setting it to cpu enables offloading. (See swift/megatron/model/gpt_bridge.py, lines 1054 to 1059 at f4493a3.)
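Conceptually, a `target_device` parameter of this kind just controls where the exported copy is materialized; the sketch below only illustrates that idea and is not the real `export_weights` from `gpt_bridge.py`.

```python
import torch

# Concept sketch only; the actual export_weights in gpt_bridge.py has a
# different signature and handles many weights, not a single tensor.
def export_weight(weight: torch.Tensor, target_device: str = "cuda") -> torch.Tensor:
    # Passing target_device="cpu" keeps the exported copy in host memory,
    # which is what enables offloading.
    return weight.detach().to(target_device, copy=True)
```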
Ah, I thought it was used for other purposes. Thanks for pointing this out, @Jintao-Huang. I will test it to see whether it works as expected. Is there any argument that controls the target device for exporting weights?
Force-pushed 31fd585 to a277622
Code Review
This pull request introduces a new argument --offload_bridge to offload bridged weights to the CPU, which is a useful feature for preventing out-of-memory errors on the GPU during weight synchronization with vLLM. The implementation is correct, adding the argument and using it to move tensors to the CPU before they are cloned. The documentation and example scripts have been updated accordingly. I have one minor suggestion to improve the wording in one of the English documentation files for better clarity.
Force-pushed a277622 to 532908e
Hello 😊. Using non_blocking is generally not recommended when transferring to CPU: https://docs.pytorch.org/tutorials/intermediate/pinmem_nonblock.html
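To illustrate the hazard: a GPU-to-CPU copy issued with `non_blocking=True` can return before the data has actually arrived in host memory, so it must be synchronized before the result is read. A minimal sketch:

```python
import torch

gpu_tensor = torch.randn(4, 4, device="cuda")

# Asynchronous device-to-host copy: may return before the bytes land in host memory.
cpu_copy = gpu_tensor.to("cpu", non_blocking=True)
torch.cuda.synchronize()  # required before cpu_copy is safe to read

# A plain blocking copy avoids the ordering hazard.
safe_copy = gpu_tensor.to("cpu")
```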
Now `offload_bridge` is a supported option to store Megatron-exported HF-format weights for vLLM weight updates in CPU main memory, reducing GPU memory usage. Default is False. Signed-off-by: Hollow Man <[email protected]>
Force-pushed 532908e to 7984a23
Thanks again @Jintao-Huang! I just updated the code and switched accordingly.
LGTM
Tested the new change and it works well on my side!

PR type
PR information
Now `offload_bridge` is a supported option to store Megatron-exported HF-format weights for vLLM weight updates in CPU main memory, reducing GPU memory usage. Default is False.

Experiment results
Paste your experiment result here (if needed).