
Conversation


@xin3he xin3he commented Nov 3, 2025

  • Reduce peak memory usage by calling clear_memory, weighing the performance cost of each call.
  • Move best_params to CPU and make sure memory is cleared before moving them back (a sketch of this offload pattern follows this list).
  • Move the loss device to the second card when card_0_in_high_risk.
  • Support DeepSeek R1 W4A16 tuning with 3 CUDA cards (80GB) (--enable_torch_compile).
  • Support Llama 3.3 70B W4A16 tuning with 2 Intel GPU cards (24GB) (--enable_torch_compile).
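
A minimal sketch, assuming PyTorch, of the offload-and-clear pattern the bullets describe. clear_memory mirrors the helper named in the description; offload_best_params and restore_best_params are hypothetical names for illustration, not the PR's actual code:

```python
import gc

import torch


def clear_memory() -> None:
    """Release cached allocator blocks (name mirrors the helper in the description)."""
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()


def offload_best_params(best_params: dict[str, torch.Tensor]) -> dict[str, torch.Tensor]:
    """Move the best parameters to CPU to lower peak GPU memory."""
    return {name: p.detach().to("cpu") for name, p in best_params.items()}


def restore_best_params(best_params: dict[str, torch.Tensor], device: str) -> dict[str, torch.Tensor]:
    """Clear cached memory first so the copy back does not add to the peak."""
    clear_memory()
    return {name: p.to(device) for name, p in best_params.items()}
```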

@xin3he xin3he changed the title Xinhe/fix Reduce peak gpu memory usage and support moe estimation Nov 3, 2025
@xin3he xin3he requested review from n1ck-guo, wenhuach21 and yiliu30 and removed request for wenhuach21 November 3, 2025 08:23
Signed-off-by: He, Xin3 <[email protected]>
Signed-off-by: He, Xin3 <[email protected]>
Signed-off-by: He, Xin3 <[email protected]>
Signed-off-by: He, Xin3 <[email protected]>

xin3he commented Nov 3, 2025

This PR is not the right solution; the memory estimation for MoE is not reasonable.
#976

@xin3he xin3he requested a review from Copilot November 4, 2025 03:21
Copilot AI left a comment


Pull Request Overview

This PR reduces peak GPU memory usage and adds memory estimation for MoE (Mixture of Experts) models. It focuses on memory optimization by moving best parameters to CPU when GPU memory runs low and by clearing memory more selectively based on memory pressure detection.

Key changes:

  • Enhanced memory management with CPU offloading of best parameters when GPU memory runs low
  • Added MoE model memory estimation that counts only the active experts rather than all experts (see the sketch after this list)
  • Improved memory-clearing logic with risk-based thresholds and better performance monitoring
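
A minimal sketch of the active-experts-only estimate described above. The parameter names (shared_bytes, expert_bytes, num_experts_per_tok) follow common Hugging Face MoE config naming and are assumptions, not the PR's implementation:

```python
def estimate_moe_block_memory(
    shared_bytes: float, expert_bytes: float, num_experts_per_tok: int
) -> float:
    """Estimate runtime memory for one MoE block.

    Every expert's weights exist, but only the routed experts run per
    token, so the activation-side cost counts num_experts_per_tok
    experts instead of the full expert count.
    """
    return shared_bytes + expert_bytes * num_experts_per_tok
```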

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

Changed files:
auto_round/wrapper.py: Move best parameters from CPU back to the device during unwrapping.
auto_round/utils/device.py: Add MoE memory estimation, improve memory-clearing logic, and enhance device memory calculations.
auto_round/compressors/utils.py: Add a CPU-offloading option for best-parameters collection.
auto_round/compressors/base.py: Integrate risk-based memory clearing and CPU offloading throughout the quantization process (risk check sketched below).
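
The risk-based clearing mentioned for base.py could look like the following minimal sketch; the 0.9 threshold and the helper names card_in_high_risk and maybe_clear_memory are illustrative assumptions, not the PR's actual code:

```python
import torch


def card_in_high_risk(device: int, threshold: float = 0.9) -> bool:
    """Flag a card whose used-memory fraction crosses the threshold."""
    free, total = torch.cuda.mem_get_info(device)
    return (total - free) / total >= threshold


def maybe_clear_memory(device: int) -> None:
    """Pay the cost of emptying the cache only under real memory pressure."""
    if card_in_high_risk(device):
        torch.cuda.empty_cache()
```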


Signed-off-by: He, Xin3 <[email protected]>
@xin3he xin3he requested a review from wenhuach21 November 4, 2025 05:37
Signed-off-by: He, Xin3 <[email protected]>

xin3he commented Nov 4, 2025

Got OOM when quantizing block 11/61 for DeepSeek on 3x 80GB CUDA cards.
Got OOM when quantizing block 14/80 for Llama 70B on 3x 24GB Intel GPU cards.

Signed-off-by: He, Xin3 <[email protected]>
Signed-off-by: He, Xin3 <[email protected]>
