Conversation
Pull request overview
Refactors AutoRound toward a new “context + compressor + algorithm” architecture, introducing new compressors_new/ and context/ modules and updating scheme parsing/export helpers to support the new flow.
Changes:
- Added new context singletons (`ModelContext`, `CompressContext`) and a new `compressors_new` implementation path.
- Expanded scheme parsing to reconcile `bits`/`data_type` and support user overrides + AutoScheme integration.
- Added new calibration utilities and algorithm scaffolding for quantization backends (AutoRound/RTN).
Reviewed changes
Copilot reviewed 26 out of 26 changed files in this pull request and generated 18 comments.
| File | Description |
|---|---|
| auto_round/utils/model.py | Avoids runtime import cycles via TYPE_CHECKING for QuantizationScheme. |
| auto_round/schemes.py | Adds scheme override + parsing helpers and bits/dtype reconciliation. |
| auto_round/formats.py | Switches divisibility checks to global supported-layer constants. |
| auto_round/context/model_context.py | Introduces model lifecycle/loading + AMP setup and forward-hook management. |
| auto_round/context/compress_context.py | Introduces device/device_map and memory-usage knobs as shared context. |
| auto_round/context/base.py | Adds simple singleton context base. |
| auto_round/context/__init__.py | Package init for new context module. |
| auto_round/compressors_new/utils.py | New utility module (layer config, gguf mapping, caching helpers, forward helpers). |
| auto_round/compressors_new/shard_writer.py | New shard-based saver with optional safetensors support. |
| auto_round/compressors_new/config.py | Introduces extra/legacy config dataclasses for the new compressor path. |
| auto_round/compressors_new/base.py | New “BaseCompressor” implementation wiring contexts, formats, caching, quant loop. |
| auto_round/compressors_new/__init__.py | Package init for compressors_new. |
| auto_round/compressors/utils.py | Extends legacy layer-config resolution to include safetensors-only tensors and skip missing modules. |
| auto_round/calibration/utils.py | Adds helpers for “early stop” caching and input reshaping for block tuning. |
| auto_round/calibration/__init__.py | Package init for calibration. |
| auto_round/algorithms/quantization/rtn/rtn.py | Adds placeholder RTN quantization module file. |
| auto_round/algorithms/quantization/rtn/config.py | Adds RTN algorithm config stub. |
| auto_round/algorithms/quantization/rtn/__init__.py | Package init for RTN quantization. |
| auto_round/algorithms/quantization/base.py | Adds base quantization class stub. |
| auto_round/algorithms/quantization/auto_round/quantize.py | Adds new AutoRound quantizer implementation (algorithm object). |
| auto_round/algorithms/quantization/auto_round/config.py | Adds new AutoRound algorithm config. |
| auto_round/algorithms/quantization/auto_round/__init__.py | Package init for AutoRound quantization algorithm. |
| auto_round/algorithms/quantization/__init__.py | Package init for quantization algorithms. |
| auto_round/algorithms/base.py | Adds base algorithm stub. |
| auto_round/algorithms/alg_config.py | Adds base algorithm config stub. |
| auto_round/algorithms/__init__.py | Package init for algorithms. |
If there is already an algorithm folder, what is the purpose of the compressor folder?
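The table above mentions a "simple singleton context base" in `auto_round/context/base.py` that both `ModelContext` and `CompressContext` build on. A minimal sketch of what such a base could look like, assuming a per-class shared instance with a one-time-init guard (the names `SingletonContext` and the `device` argument here are illustrative, not the actual API):

```python
class SingletonContext:
    """Hypothetical singleton base: one shared instance per concrete subclass."""

    _instances: dict = {}

    def __new__(cls, *args, **kwargs):
        # Reuse the existing instance for this class if one was already created.
        if cls not in cls._instances:
            cls._instances[cls] = super().__new__(cls)
        return cls._instances[cls]


class CompressContext(SingletonContext):
    """Illustrative context; the real one holds device/device_map and memory knobs."""

    def __init__(self, device: str = "cpu"):
        # Guard against re-running __init__ on repeated construction,
        # so later calls do not silently reset shared state.
        if getattr(self, "_initialized", False):
            return
        self.device = device
        self._initialized = True


ctx_a = CompressContext(device="cuda:0")
ctx_b = CompressContext()
assert ctx_a is ctx_b
assert ctx_b.device == "cuda:0"
```

Any module that constructs `CompressContext()` then sees the same shared state, which is the point of making the context a singleton rather than passing it through every call.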
…uo/new_ar_arch
Signed-off-by: n1ck-guo <heng.guo@intel.com>
for more information, see https://pre-commit.ci
yiliu30 left a comment:

Do we have any E2E tests for sequential quantizers?
    @@ -0,0 +1,13 @@
    # Copyright (c) 2026 Intel Corporation
/azp run Unit-Test-CUDA-AutoRound

Azure Pipelines successfully started running 1 pipeline(s).
    self._immediate_pack_and_save_module(name)

    def _immediate_pack_and_save_module(self, module_name):
        shard_writer = ShardWriter.get_shard_writer()
Could packing and saving be decoupled from the quantization process?
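One way to act on this suggestion: instead of the quantize loop calling `ShardWriter` directly, it could emit finished module names through an optional callback, so saving becomes a pluggable concern. A small sketch under that assumption (`QuantizeLoop`, `on_module_done`, and `_finish_module` are hypothetical names, not the AutoRound API):

```python
from typing import Callable, Optional


class QuantizeLoop:
    """Illustrative loop that notifies an optional callback per finished module,
    rather than packing/saving inline. Names are hypothetical."""

    def __init__(self, on_module_done: Optional[Callable[[str], None]] = None):
        self._on_module_done = on_module_done

    def _finish_module(self, name: str) -> None:
        # Quantization work for `name` would happen here; persistence is
        # delegated to whoever registered the callback (e.g. a shard writer).
        if self._on_module_done is not None:
            self._on_module_done(name)


saved = []
loop = QuantizeLoop(on_module_done=saved.append)
loop._finish_module("model.layers.0.self_attn.q_proj")
assert saved == ["model.layers.0.self_attn.q_proj"]

# With no callback registered, nothing is saved and nothing breaks.
QuantizeLoop()._finish_module("model.layers.1.mlp.up_proj")
assert saved == ["model.layers.0.self_attn.q_proj"]
```

This keeps the quantization loop testable without any filesystem interaction, and the shard writer becomes one possible consumer among others.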
    enable_norm_bias_tuning (bool): Whether to enable fast norm/layer_bias tuning
    """

    _alg_cls = "SignRoundQuantizer"
Is there a better way to map these two? Would it be better to provide a clear function that developers are required to implement?
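One concrete shape for the "clear function that developers are required to implement" suggestion: replace the string-valued `_alg_cls` attribute with an abstract classmethod returning the quantizer class, so a subclass that forgets the mapping fails loudly instead of failing on a string lookup later. A sketch, assuming hypothetical names (`BaseAlgorithmConfig`, `quantizer_cls`, and the stub `SignRoundQuantizer` stand in for the real classes):

```python
from abc import ABC, abstractmethod


class BaseAlgorithmConfig(ABC):
    """Illustrative alternative to a string-valued `_alg_cls` attribute."""

    @classmethod
    @abstractmethod
    def quantizer_cls(cls) -> type:
        """Return the quantizer class this config drives."""


class SignRoundQuantizer:
    """Stand-in stub for the real quantizer class."""


class SignRoundConfig(BaseAlgorithmConfig):
    @classmethod
    def quantizer_cls(cls) -> type:
        # The mapping is explicit Python, so IDEs and type checkers see it.
        return SignRoundQuantizer


assert SignRoundConfig.quantizer_cls() is SignRoundQuantizer
```

A config subclass that omits `quantizer_cls` cannot be instantiated (ABC raises `TypeError`), which surfaces the missing mapping much earlier than resolving `"SignRoundQuantizer"` by name at quantize time.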
wenhuach21 left a comment:

Thank you very much for the great effort!
    dynamic_max_gap: int = -1,
    enable_quanted_input: bool = True,
    optimizer: str = None,
    enable_adam: bool = False,
As Adam is decoupled, could we remove this argument from the config?
    # Subclasses that support diffusion models should override this with the
    # appropriate output key mapping, e.g.:
    # DIFFUSION_OUTPUT_CONFIGS = {"FluxTransformerBlock": ["encoder_hidden_states", "hidden_states"]}
    DIFFUSION_OUTPUT_CONFIGS: dict = {}
This argument should be added to the AutoRound interface instead of this one.
    @property
    def amp_dtype(self):
        import torch
AMP is only used by tuning algorithms, so it's better to refine this. No need to do it in this PR.
    return getattr(self.model_context, "amp_dtype", torch.float32)

    def _register_act_max_hook(self, model):
We should provide an interface to support customized hooks and should not register act_max_hook by default, since it is not required by most algorithms.
    @torch.inference_mode()
    def _quantize_embedding_layer(self):
        """Quantizes embedding layers in the model according to the configuration.
To align this function with the others, it should be changed to _quantize_embedding_layer(self, layer), and it should also be designed to be overridden by subclasses. If that's difficult, feel free to support it in the future.
    output keys. Subclasses override ``DIFFUSION_OUTPUT_CONFIGS`` to add
    support for new diffusion architectures.
    """
    output = defaultdict(list)
I would prefer to move this one to utils and decouple the quantizer from model types.
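Following this suggestion, the model-type knowledge could live in a utils-level lookup instead of on the quantizer class. A minimal sketch, reusing the `FluxTransformerBlock` mapping shown in the diff above (the function name `get_block_output_keys` and the `default` parameter are hypothetical):

```python
# Module-level table in a utils file: maps a block class name to the output
# keys that should be cached for it. The Flux entry mirrors the example in
# the diff; other entries would be added per diffusion architecture.
DIFFUSION_OUTPUT_CONFIGS = {
    "FluxTransformerBlock": ["encoder_hidden_states", "hidden_states"],
}


def get_block_output_keys(block_cls_name: str,
                          default: tuple = ("hidden_states",)) -> list:
    """Return the output keys to cache for a block type; fall back to a
    single hidden_states output for standard transformer blocks."""
    return DIFFUSION_OUTPUT_CONFIGS.get(block_cls_name, list(default))


assert get_block_output_keys("FluxTransformerBlock") == [
    "encoder_hidden_states", "hidden_states"
]
assert get_block_output_keys("LlamaDecoderLayer") == ["hidden_states"]
```

The quantizer then calls `get_block_output_keys(type(block).__name__)` and stays agnostic of model families; supporting a new architecture becomes a one-line table edit rather than a subclass override.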
This PR will not make any further feature changes. I will collect all relevant comments and address them in future PRs.

/azp run Unit-Test-CUDA-AutoRound

Azure Pipelines successfully started running 1 pipeline(s).
…ntext init

- _hardware_setup: apply act-quantize/alg-ext guard before compile_func, matching _resolve_block_forward() and old-arch behavior. On HPU, where enable_torch_compile stays True for FP8_STATIC, this avoids creating a compiled graph that wastes ~264 MB of HPU memory.
- ModelContext.__init__: gc.collect + malloc_trim after model/tokenizer loading to reclaim C heap fragmentation (~96 MB).

Signed-off-by: n1ck-guo <heng.guo@intel.com>
…init reorder

- Add _force_trim_malloc() in device.py that unconditionally calls malloc_trim(0), bypassing the counter-based throttle in _maybe_trim_malloc(), which was skipping critical lifecycle trim points.
- ClearMemory HPU path: replace _maybe_trim_malloc() with _force_trim_malloc() so heap pages are reclaimed before each MemoryMonitor RSS sample, preventing inflated peak_ram readings.
- ModelContext._load_model: add gc.collect + _force_trim_malloc before llm_load_model to reclaim temporary HTTP/config objects from is_mllm_model/is_diffusion_model/AutoConfig.from_pretrained calls.
- ModelContext.__init__: use _force_trim_malloc at the end so the trim actually fires (previously _maybe_trim_malloc was a no-op at counter=1).
- BaseCompressor.__init__: reorder context creation so ModelContext (large model allocation) is created before CompressContext (small), matching the old-arch allocation order to reduce heap fragmentation.
- BaseCompressor.post_init: add gc.collect + _force_trim_malloc after the five init phases to start the quantize loop from a tighter baseline.
- CalibCompressor.quantize: use _force_trim_malloc at loop start.
xin3he left a comment:

LGTM, please get the approval from Wenhua and Liang.
Description
Main entry point responsible for orchestrating the workflow, invoking different algorithms, and handling model persistence. Supports block-wise or layer-wise quantization strategies. Primary subclasses include TuneCompressor and ZeroShotCompressor.
Usage of new api:
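The PR description above names `TuneCompressor` and `ZeroShotCompressor` as the primary subclasses of the new entry point. As a rough illustration only, usage might look like the following; the constructor arguments, method names, and the `quantize_and_save` call here are assumptions standing in for whatever the final API exposes, so treat this as pseudo-usage backed by a local stub, not the actual interface:

```python
class ZeroShotCompressor:
    """Local stand-in stub mirroring the described entry point; the real
    class lives in the new compressors module and its signature may differ."""

    def __init__(self, model: str, scheme: str = "W4A16"):
        self.model = model
        self.scheme = scheme

    def quantize_and_save(self, output_dir: str) -> str:
        # The real implementation would run the quantize loop and persist
        # shards; the stub just reports what it would have done.
        return f"{self.model} -> {output_dir} ({self.scheme})"


compressor = ZeroShotCompressor(model="facebook/opt-125m")
result = compressor.quantize_and_save("./qmodel")
assert result == "facebook/opt-125m -> ./qmodel (W4A16)"
```

The intent is just to show the shape of the flow (construct a compressor around a model, then run one quantize-and-persist call); consult the merged code for the real argument names.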
Type of Change
Related Issues
Fixes or relates to #
Checklist Before Submitting