
Conversation


@xin3he xin3he commented Nov 3, 2025

  • Reduce peak memory usage by calling clear_memory, weighing the performance cost of each call.
  • Move best_params to CPU and make sure memory is cleared before moving them back (a sketch of this offload pattern follows this list).
  • Move the loss device to the second card when card_0_in_high_risk.
  • Support DeepSeek R1 W4A16 tuning with 3 CUDA cards (80GB) (--enable_torch_compile).
  • Support Llama 3.3 70B W4A16 tuning with 2 Intel GPU cards (24GB) (--enable_torch_compile).
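
A minimal sketch, assuming PyTorch, of the offload-and-clear pattern the bullets describe. clear_memory mirrors the helper named in the description; offload_best_params and restore_best_params are hypothetical names for illustration, not the PR's actual code:

```python
import gc

import torch


def clear_memory() -> None:
    """Release cached allocator blocks (name mirrors the helper in the description)."""
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()


def offload_best_params(best_params: dict[str, torch.Tensor]) -> dict[str, torch.Tensor]:
    """Move the best parameters to CPU to lower peak GPU memory."""
    return {name: p.detach().to("cpu") for name, p in best_params.items()}


def restore_best_params(best_params: dict[str, torch.Tensor], device: str) -> dict[str, torch.Tensor]:
    """Clear cached memory first so the copy back does not add to the peak."""
    clear_memory()
    return {name: p.to(device) for name, p in best_params.items()}
```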

@xin3he xin3he changed the title Xinhe/fix Reduce peak gpu memory usage and support moe estimation Nov 3, 2025
@xin3he xin3he requested review from n1ck-guo, wenhuach21 and yiliu30 and removed request for wenhuach21 November 3, 2025 08:23
Signed-off-by: He, Xin3 <[email protected]>
Signed-off-by: He, Xin3 <[email protected]>
Signed-off-by: He, Xin3 <[email protected]>
Signed-off-by: He, Xin3 <[email protected]>

xin3he commented Nov 3, 2025

This PR is not the right solution; the memory estimation for MoE is not reasonable.
#976

@xin3he xin3he requested a review from Copilot November 4, 2025 03:21
Copilot AI left a comment


Pull Request Overview

This PR reduces peak GPU memory usage and adds memory estimation for MoE (Mixture of Experts) models. It focuses on memory optimization by moving best parameters to CPU when GPU memory runs low and by clearing memory more selectively based on memory pressure detection.

Key changes:

  • Enhanced memory management with CPU offloading of best parameters when GPU memory runs low
  • Added MoE model memory estimation that counts only the active experts rather than all experts (see the sketch after this list)
  • Improved memory-clearing logic with risk-based thresholds and better performance monitoring
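
A minimal sketch of the active-experts-only estimate described above. The parameter names (shared_bytes, expert_bytes, num_experts_per_tok) follow common Hugging Face MoE config naming and are assumptions, not the PR's implementation:

```python
def estimate_moe_block_memory(
    shared_bytes: float, expert_bytes: float, num_experts_per_tok: int
) -> float:
    """Estimate runtime memory for one MoE block.

    Every expert's weights exist, but only the routed experts run per
    token, so the activation-side cost counts num_experts_per_tok
    experts instead of the full expert count.
    """
    return shared_bytes + expert_bytes * num_experts_per_tok
```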

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

Changed files:
auto_round/wrapper.py: Move best parameters from CPU back to the device during unwrapping.
auto_round/utils/device.py: Add MoE memory estimation, improve memory-clearing logic, and enhance device memory calculations.
auto_round/compressors/utils.py: Add a CPU-offloading option for best-parameters collection.
auto_round/compressors/base.py: Integrate risk-based memory clearing and CPU offloading throughout the quantization process (risk check sketched below).
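
The risk-based clearing mentioned for base.py could look like the following minimal sketch; the 0.9 threshold and the helper names card_in_high_risk and maybe_clear_memory are illustrative assumptions, not the PR's actual code:

```python
import torch


def card_in_high_risk(device: int, threshold: float = 0.9) -> bool:
    """Flag a card whose used-memory fraction crosses the threshold."""
    free, total = torch.cuda.mem_get_info(device)
    return (total - free) / total >= threshold


def maybe_clear_memory(device: int) -> None:
    """Pay the cost of emptying the cache only under real memory pressure."""
    if card_in_high_risk(device):
        torch.cuda.empty_cache()
```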


Signed-off-by: He, Xin3 <[email protected]>
@xin3he xin3he requested a review from wenhuach21 November 4, 2025 05:37
Signed-off-by: He, Xin3 <[email protected]>

xin3he commented Nov 4, 2025

Got OOM when quantizing block 11/61 for DeepSeek on 3x 80GB CUDA cards.
Got OOM when quantizing block 14/80 for Llama 70B on 3x 24GB Intel GPU cards.

Signed-off-by: He, Xin3 <[email protected]>
Signed-off-by: He, Xin3 <[email protected]>
