Reduce peak gpu memory usage and support moe estimation #981
base: main
Conversation
Signed-off-by: He, Xin3 <[email protected]>
This PR is not a suitable solution; the memory estimation for MoE is not reasonable.
Pull Request Overview
This PR reduces peak GPU memory usage and adds memory estimation support for MoE (Mixture of Experts) models. It optimizes memory by moving the best parameters to CPU when GPU memory runs low and by clearing memory more selectively based on detected memory pressure.
Key changes:
- Enhanced memory management with CPU offloading of the best parameters when GPU memory is low
- Added MoE memory estimation that counts only the active experts rather than all experts
- Improved memory-clearing logic with risk-based thresholds and better performance monitoring
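The MoE estimation change above can be sketched as follows. This is a hypothetical illustration, not the PR's actual code: the function name `estimate_moe_block_memory` and its parameters are assumptions, with `num_experts` and `num_experts_per_tok` borrowed from common Hugging Face MoE config naming. The key idea is that only the experts routed per token are resident at once, so the estimate scales with the active-expert count.

```python
def estimate_moe_block_memory(expert_param_bytes: int,
                              num_experts: int,
                              num_experts_per_tok: int,
                              shared_param_bytes: int = 0) -> int:
    """Estimate peak memory (bytes) for one MoE block.

    Only num_experts_per_tok experts are active per token, so the
    estimate uses that count instead of the full num_experts.
    """
    if num_experts_per_tok > num_experts:
        raise ValueError("active experts cannot exceed total experts")
    active_bytes = expert_param_bytes * num_experts_per_tok
    return shared_param_bytes + active_bytes

# Example: 64 experts of 100 units each, 6 active per token, 50 shared.
naive = 50 + 64 * 100                                     # all experts: 6450
active_only = estimate_moe_block_memory(100, 64, 6, 50)   # active only: 650
```

Counting all experts (the naive estimate) overstates peak usage roughly by the ratio `num_experts / num_experts_per_tok`, which for sparse MoE models is often 8-10x.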
Reviewed Changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| auto_round/wrapper.py | Move best parameters from CPU back to device during unwrapping |
| auto_round/utils/device.py | Add MoE memory estimation, improve memory clearing logic, and enhance device memory calculations |
| auto_round/compressors/utils.py | Add CPU offloading option for best parameters collection |
| auto_round/compressors/base.py | Integrate risk-based memory clearing and CPU offloading throughout quantization process |
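The risk-based clearing integrated in `auto_round/compressors/base.py` can be sketched as a simple threshold check. This is a minimal illustration under assumed names (`should_clear_cache`, the 0.85 threshold); the PR's actual thresholds and pressure detection may differ. The point of gating on utilization is that cache-clearing calls such as `torch.cuda.empty_cache()` are expensive, so they should run only when memory pressure is genuinely high.

```python
RISK_THRESHOLD = 0.85  # assumed: clear cache only above 85% utilization

def should_clear_cache(used_bytes: int, total_bytes: int,
                       threshold: float = RISK_THRESHOLD) -> bool:
    """Return True when memory pressure justifies a (slow) cache clear."""
    return total_bytes > 0 and used_bytes / total_bytes >= threshold

# 70 GB used of 80 GB is 87.5% utilization, above the threshold.
high_pressure = should_clear_cache(70 << 30, 80 << 30)  # True
low_pressure = should_clear_cache(10 << 30, 80 << 30)   # False
```

On CUDA the used/total figures would typically come from `torch.cuda.mem_get_info()`; they are passed in here to keep the sketch device-agnostic.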
Got an OOM when quantizing block 11/61 of DeepSeek on 3x 80 GB CUDA cards.