Minor refactor for LLMC #993
Draft: yiliu30 wants to merge 33 commits into main from llmc (+107 −64)
Commits (33)
5ee2c2d  init moe support (yiliu30)
c278f9d  add test (yiliu30)
418e6a0  fix import (yiliu30)
184783f  clean envs (yiliu30)
b9da06f  add script for apply ext (yiliu30)
187f38d  clean docs (yiliu30)
4031724  fix license (yiliu30)
5fe01ef  fix (yiliu30)
73f1e9b  fix import and sitecustomize (yiliu30)
8495854  move to ext (yiliu30)
c473934  update mxfp4 (yiliu30)
9f65bd1  fix (yiliu30)
8038a5f  fix model name (yiliu30)
e0872b6  Merge branch 'main' into vllm-ext (yiliu30)
c82bce1  fix (yiliu30)
19e18c7  Merge branch 'vllm-ext' of https://github.com/intel/auto-round into v… (yiliu30)
adf7ebf  use absolute path (yiliu30)
59f5cd2  Merge branch 'main' into vllm-ext (yiliu30)
8f27041  Merge branch 'main' into vllm-ext (yiliu30)
ad8537c  fix (yiliu30)
77844f6  mark round method as todo (yiliu30)
ce985ef  tmp wa for llmc (yiliu30)
8832530  tmp wa for llmc (yiliu30)
361491f  return ds (yiliu30)
db65d74  add more log (yiliu30)
60a0023  refine code (yiliu30)
2f96c13  Merge branch 'llmc' of https://github.com/intel/auto-round into llmc (yiliu30)
7a1716e  refactor
a20f9df  refactor
553ee5c  fix offloaf
2bd3c4b  fix
b992c31  remove time (yiliu30)
0354c2b  update (yiliu30)
@@ -20,7 +20,7 @@
 import traceback
 from collections import defaultdict
 from dataclasses import asdict, fields
-from typing import Any, Callable, Union
+from typing import Any, Callable, Optional, Union

 import accelerate
 import torch
@@ -85,6 +85,7 @@
     is_hpex_available,
     llm_load_model,
     mv_module_from_gpu,
+    normalize_input,
     set_amax_for_all_moe_layers,
     set_module,
     to_device,
@@ -351,7 +352,8 @@ def __init__(
         # Some helpers
         if "hpu" in str(self.device):
             self.inner_supported_types = tuple(x for x in INNER_SUPPORTED_LAYER_TYPES if x != "FP8Linear")
-        self.batch_dim = None
+        # TODO: check with heng/weiwei
+        self.batch_dim = 0
         self.infer_bs_coeff = 1

         self.block_forward = compile_func(block_forward, self.device) if self.enable_torch_compile else block_forward
@@ -1495,6 +1497,21 @@ def _update_inputs(self, inputs: dict, q_inputs: dict) -> tuple[dict, torch.Tens
         q_inputs = q_inputs.pop(input_id_str[0], None)
         return inputs, q_inputs

+    def configure_layer_config(self, enable_gguf_official_mixed: None | bool = False):
+        self.layer_config, self.has_qlayer_outside_block, self.regex_config = set_layer_config(
+            self.model,
+            self.layer_config,
+            self.scheme,
+            self.scale_dtype,
+            self.supported_types,
+            self.inner_supported_types,
+            self.quant_block_list,
+            self.fp_layers,
+            self.quant_lm_head,
+            enable_gguf_official_mixed=enable_gguf_official_mixed,
+            is_mllm=self.mllm,
+        )
+
     def quantize(self) -> tuple[torch.nn.Module, dict[str, Any]]:
         """Quantize the model and return the quantized model along with layer configurations.The entry of AutoRound.
         Returns:

Review comment (on the new `configure_layer_config`): better to set `enable_gguf_official_mixed` to `True` by default.
@@ -1513,20 +1530,8 @@ def quantize(self) -> tuple[torch.nn.Module, dict[str, Any]]:
             enable_gguf_official_mixed = True
         else:
             enable_gguf_official_mixed = False
-        self.layer_config, self.has_qlayer_outside_block, self.regex_config = set_layer_config(
-            self.model,
-            self.layer_config,
-            self.scheme,
-            self.scale_dtype,
-            self.supported_types,
-            self.inner_supported_types,
-            self.quant_block_list,
-            self.fp_layers,
-            self.quant_lm_head,
-            enable_gguf_official_mixed=enable_gguf_official_mixed,
-            is_mllm=self.mllm,
-        )
+        self.configure_layer_config(enable_gguf_official_mixed=enable_gguf_official_mixed)

         if not hasattr(self, "formats"):
             logger.warning("this API is deprecated, please use `quantize_and_save` instead")
         else:
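The extracted step can now be run on its own before block-wise tuning. A rough usage sketch (the `ar` instance and its construction are illustrative assumptions, not code from this PR):

# Sketch only: assumes an already-constructed AutoRound compressor `ar`
# (model, scheme and calibration data set up elsewhere).
ar.configure_layer_config(enable_gguf_official_mixed=False)

# The resolved settings are then available on the instance:
print(len(ar.layer_config))         # per-layer quantization config
print(ar.has_qlayer_outside_block)  # whether quantized layers exist outside blocks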
@@ -2420,13 +2425,14 @@ def _get_current_num_elm(
         current_input_ids = [input_ids[i] for i in indices]
         return sum(id.numel() for id in current_input_ids)

-    def _quantize_block(
+    def quantize_block(
         self,
         block: torch.nn.Module,
-        input_ids: Union[list[torch.Tensor], dict],
-        input_others: dict,
+        inputs: tuple[Union[list[torch.Tensor], dict, Any], Optional[dict]],
         q_input: Union[torch.Tensor, dict, None] = None,
+        normalize_inputs: bool = False,
         device: Union[str, torch.device] = "cpu",
         auto_offload=True,
     ):
         """Quantize the weights of a given block of the model.
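For reference, a minimal call sketch against the new signature (names are illustrative; `input_ids` and `input_others` are assumed to be the cached block inputs from the calibration pass):

# Sketch only, not code from this PR: the two positional input arguments
# are now packed into a single tuple.
q_input, output = ar.quantize_block(
    block,                      # one decoder block of the model
    (input_ids, input_others),  # cached inputs, already in normalized form
    q_input=q_input,            # outputs of the previously quantized block, or None
    device="cuda:0",            # auto_offload keeps its default of True
)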
@@ -2445,30 +2451,34 @@ def _quantize_block(
             if is_fp8_linear(m):
                 new_layer = convert_fp8_layer_to_linear(m, self.amp_dtype).to(device)
                 set_module(block, n, new_layer)
+        if normalize_inputs:
+            input_ids, input_others = normalize_input(inputs)
+        else:
+            input_ids, input_others = inputs
+        if auto_offload:
+            if self.device_map == "auto" or (isinstance(self.device_map, str) and "," in self.device_map):
+                set_auto_device_map_for_block_with_tuning(
+                    block, self.device_map, input_ids, self.low_gpu_mem_usage, self.mem_per_param_scale
+                )
-
-        if self.device_map == "auto" or (isinstance(self.device_map, str) and "," in self.device_map):
-            set_auto_device_map_for_block_with_tuning(
-                block, self.device_map, input_ids, self.low_gpu_mem_usage, self.mem_per_param_scale
-            )
-
-        if self.device_map is not None:
-            for n, m in block.named_modules():
-                if len(list(m.children())) != 0 or not hasattr(m, "tuning_device"):
-                    continue
-                from accelerate.hooks import AlignDevicesHook, add_hook_to_module
+            if self.device_map is not None:
+                for n, m in block.named_modules():
+                    if len(list(m.children())) != 0 or not hasattr(m, "tuning_device"):
+                        continue
+                    from accelerate.hooks import AlignDevicesHook, add_hook_to_module

-                hook = AlignDevicesHook(m.tuning_device, io_same_device=True)
-                add_hook_to_module(m, hook, True)
+                    hook = AlignDevicesHook(m.tuning_device, io_same_device=True)
+                    add_hook_to_module(m, hook, True)

         if q_input is None:
             hook_handles = self._register_act_max_hook(block)

             output = self._get_block_outputs(
                 block, input_ids, input_others, self.batch_size * self.infer_bs_coeff, device, self.cache_device
             )

-            for handle in hook_handles:
-                handle.remove()
+            if auto_offload:
+                for handle in hook_handles:
+                    handle.remove()
         else:
             output = self._get_block_outputs(
                 block, input_ids, input_others, self.batch_size * self.infer_bs_coeff, device, self.cache_device

Review comment (on the `normalize_inputs` handling): why not move these changes to the LLMC side?
@@ -2565,6 +2575,7 @@ def _quantize_block(
         best_params = {}
         total_loss = 0
         for i in range(self.iters):
+            logger.trace(f"Quant block iteration {i}/{self.iters}, best loss so far: {best_loss}")
             total_loss = 0
             if self.sampler == "rand":
                 whole_indices = torch.randperm(nsamples)[:pick_samples]
@@ -2587,7 +2598,7 @@ def _quantize_block(
                 else:
                     tmp_attention_mask = 1.0
                 if self.amp:
-                    with autocast(device_type=device.split(":")[0], dtype=self.amp_dtype):
+                    with autocast(device_type=str(device).split(":")[0], dtype=self.amp_dtype):
                         loss = mse_loss(  # pylint: disable=not-callable
                             output_q * tmp_attention_mask, current_output * tmp_attention_mask
                         )
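The `str(device)` change matters because `device` can arrive here as a `torch.device` rather than a plain string; `torch.device` has no `.split` method, while `autocast` only needs the bare device type. A small illustration:

import torch

dev = torch.device("cuda:0")
# dev.split(":")            # would raise AttributeError on a torch.device object
device_type = str(dev).split(":")[0]
print(device_type)          # "cuda" -> a valid device_type for torch.autocast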
@@ -2636,7 +2647,7 @@ def _quantize_block(
         if is_nv_fp(self.act_data_type):
             # enable moe experts act_max automatic generation for WrapperWALayer
             set_amax_for_all_moe_layers(block, attr_name="orig_layer.act_max")
-
+        q_outputs = None
         if self.enable_quanted_input:
             clear_memory()
             q_outputs = self._get_block_outputs(
@@ -2647,19 +2658,13 @@ def _quantize_block(
                 device,
                 cache_device=self.cache_device,
             )
-            if self.device_map is not None:
-                accelerate.hooks.remove_hook_from_submodules(block)
-            mv_module_from_gpu(block)
-            clear_memory(input_ids)
-
-            return q_outputs, output
-
-        else:
+        if auto_offload:
             if self.device_map is not None:
                 accelerate.hooks.remove_hook_from_submodules(block)
             mv_module_from_gpu(block)
-            clear_memory(input_ids)
-            return None, output
+        clear_memory(input_ids)
+
+        return q_outputs, output

     def _split_inputs(self, inputs: dict) -> tuple[torch.Tensor, dict]:
         input_ids = inputs["input_ids"]
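Together with the unified return path (`q_outputs` defaults to `None` and both cases return `(q_outputs, output)`), the `auto_offload=False` mode lets an external runner, such as an LLMC-style pipeline, keep control of device placement itself. A hedged sketch of such a driver loop; the surrounding objects (`ar`, `blocks`, `cached_inputs`) and the exact layout of the cached inputs are assumptions, not part of this PR:

q_input = None
for block in blocks:
    block.to("cuda:0")                 # caller handles placement/offload itself
    q_input, _ = ar.quantize_block(
        block,
        cached_inputs,                 # raw captured inputs for this block
        q_input=q_input,
        normalize_inputs=True,         # let quantize_block unpack them via normalize_input
        auto_offload=False,            # skip AutoRound's device-map tuning and GPU eviction
        device="cuda:0",
    )
    block.to("cpu")                    # input propagation between blocks elided here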
@@ -2733,9 +2738,9 @@ def _quantize_blocks(
                 else:
                     logger.info("using algorithm extension for quantization.")
             except (ImportError, ModuleNotFoundError):
-                quantize_block = self._quantize_block
+                quantize_block = self.quantize_block
         else:
-            quantize_block = self._quantize_block
+            quantize_block = self.quantize_block

         if pbar is None:
             pbar = tqdm(range(0, len(block_names), nblocks))
@@ -2756,8 +2761,7 @@ def _quantize_blocks(
                 m = m.to(device)
             q_input, input_ids = quantize_block(
                 m,
-                input_ids,
-                input_others,
+                (input_ids, input_others),
                 q_input=q_input,
                 device=device,
             )
Review comment: if this is required, hide it in kwargs and add comments.